1 Using Database Connections
This section introduces this library’s basic features and covers some practical issues with database programming in general and with this library in particular.
1.1 Introduction to Using Database Connections
The following annotated program demonstrates how to connect to a database and perform simple queries. Some of the SQL syntax used below is PostgreSQL-specific, such as the syntax of query parameters ($1 rather than ?).
> (require db)
First we create a connection. Replace user, db, and password below with the appropriate values for your configuration (see Base Connections for other connection examples):
> (define pgc (postgresql-connect #:user user #:database db #:password password))
Use query-exec to execute a SQL statement for effect.
> (query-exec pgc "create temporary table the_numbers (n integer, d varchar(20))")
> (query-exec pgc "insert into the_numbers values (0, 'nothing')")
> (query-exec pgc "insert into the_numbers values (1, 'the loneliest number')")
When a query contains a SQL value that isn’t constant, refer to it through a “query parameter” rather than by dynamically computing the SQL query string (see also SQL Injection). Just provide the parameter values after the SQL statement in the query function call:
> (query-exec pgc "insert into the_numbers values ($1, $2)" (+ 1 1) "company")
Every standard query function accepts query parameters. The SQL syntax for query parameters depends on the database system (see Statements). Other options for running parameterized queries are discussed below.
The query function is a more general way to execute a statement. It returns a structure encapsulating information about the statement’s execution. (But some of that information varies from system to system and is subject to change.)
> (query pgc "insert into the_numbers values (3, 'a crowd')")
(simple-result '((insert-id . #f) (affected-rows . 1)))
> (query pgc "select n, d from the_numbers where n % 2 = 0")
(rows-result
'(((name . "n") (typeid . 23) (type-size . 4) (type-mod . -1))
((name . "d") (typeid . 1043) (type-size . -1) (type-mod . 24)))
'(#(0 "nothing") #(2 "company")))
When the query is known to return rows and when the field descriptions are not needed, it is more convenient to use the query-rows function.
> (query-rows pgc "select n, d from the_numbers where n % 2 = 0")
'(#(0 "nothing") #(2 "company"))
Use query-row for queries that are known to return exactly one row.
> (query-row pgc "select * from the_numbers where n = 0")
'#(0 "nothing")
Similarly, use query-list for queries that produce rows of exactly one column.
> (query-list pgc "select d from the_numbers order by n")
'("nothing" "the loneliest number" "company" "a crowd")
When a query is known to return a single value (one row and one column), use query-value.
> (query-value pgc "select count(*) from the_numbers")
4
> (query-value pgc "select d from the_numbers where n = 5")
query-value: query returned wrong number of rows
statement: "select d from the_numbers where n = 5"
expected: 1
got: 0
When a query may return zero or one rows, as in the last example, use query-maybe-row or query-maybe-value instead.
> (query-maybe-value pgc "select d from the_numbers where n = 5")
#f
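query-maybe-row behaves the same way for whole rows; for example, applied to the same query it returns #f rather than raising an error:
> (query-maybe-row pgc "select * from the_numbers where n = 5")
#f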
The in-query function produces a sequence that can be used with Racket’s iteration forms:
> (for ([(n d) (in-query pgc "select * from the_numbers where n < 4")])
    (printf "~a: ~a\n" n d))
0: nothing
1: the loneliest number
2: company
3: a crowd
> (for/fold ([sum 0]) ([n (in-query pgc "select n from the_numbers")])
    (+ sum n))
6
Errors in queries generally do not cause the connection to disconnect.
> (begin
    (with-handlers [(exn:fail? (lambda (e) (printf "~a~n" (exn-message e))))]
      (query-value pgc "select NoSuchField from NoSuchTable"))
    (query-value pgc "select 'okay to proceed!'"))
query-value: relation "nosuchtable" does not exist
SQLSTATE: 42P01
"okay to proceed!"
Queries may contain parameters. The easiest way to execute a parameterized query is to provide the parameters “inline” after the SQL statement in the query function call.
> (query-value pgc "select d from the_numbers where n = $1" 2)
"company"
> (query-list pgc "select n from the_numbers where n > $1 and n < $2" 0 3)
'(1 2)
Alternatively, a parameterized query may be prepared in advance and executed later. Prepared statements can be executed multiple times with different parameter values.
> (define get-less-than-pst (prepare pgc "select n from the_numbers where n < $1"))
> (query-list pgc get-less-than-pst 1)
'(0)
> (query-list pgc (bind-prepared-statement get-less-than-pst '(2)))
'(0 1)
When a connection’s work is done, it should be disconnected.
> (disconnect pgc)
1.2 Database Security
Database security requires both that the database back end be secured against unauthorized use and that authorized clients are not tricked or subverted into violating the database’s security.
Securing database back ends is mostly beyond the scope of this manual. In brief: choose sufficiently strong authentication methods and keep credentials secure, and follow the principle of least privilege: create and use roles that have the minimum permissions needed.
The following is an incomplete list of security issues related to database client programming.
1.2.1 SQL Injection
SQL injection happens when part of a SQL statement that was intended as SQL literal data is instead interpreted as SQL code, changing the meaning of the statement.
Avoid dynamically creating SQL query strings by string concatenation or interpolation (e.g., with string-append or format). In most cases, it is possible to use parameterized queries instead. For example, instead of this:
; WRONG! DANGER!
(query-exec c
            (format "UPDATE users SET passwd='~a' WHERE user='~a'"
                    new-passwd user))
write one of the following instead:
; for PostgreSQL, SQLite
(query-exec c "UPDATE users SET passwd=$1 WHERE user=$2" new-passwd user)
; for MySQL, SQLite, ODBC
(query-exec c "UPDATE users SET passwd=? WHERE user=?" new-passwd user)
The first form would choke on names like "Patrick O'Connor". Worse, it would be susceptible to attack by malicious input like "me' OR user='root'", which yields the following SQL statement:
UPDATE users SET passwd='whatever' WHERE user='me' OR user='root'
In contrast, using a parameterized query causes the parameterized SQL and its arguments to be submitted to the back end separately; the back end then combines them safely.
Only SQL literal values can be replaced with parameter placeholders; a SQL statement cannot be parameterized over a column name or a sort order, for example. In such cases, constructing the query dynamically may be the only feasible solution. But while the query construction may be influenced by external input, it should never directly incorporate external input without validation. That is, don’t do the following:
; WRONG! DANGER!
(query-rows c (format "SELECT name, ~a FROM contestants" column))
(query-list c (format "SELECT name FROM contestants ORDER BY score ~a" direction))
Instead, select the inserted SQL from known good alternatives:
; BETTER
(query-rows c (format "SELECT name, ~a FROM contestants"
                      (cond [(member column '("wins" "losses")) column]
                            [else (error ....)])))
(query-list c (format "SELECT name FROM contestants ORDER BY score ~a"
                      (if ascending? "ASC" "DESC")))
1.2.2 Cross-site Scripting (XSS)
Cross-site scripting happens when text from an untrusted source is included in a web page without proper escaping, so that it is interpreted as markup or script rather than as data.
This issue has little to do with databases per se except that such text is often stored in a database. This issue is mitigated by using structured markup representations like SXML or X-expressions (xexprs), since they automatically escape “markup” characters found in embedded text.
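As a small illustration (not part of the example session above), rendering untrusted text through an x-expression escapes markup characters automatically; here comment-text stands in for a hypothetical value read from the database, and xexpr->string comes from the xml library:
(require xml)
; comment-text is hypothetical untrusted input retrieved from the database.
(define comment-text "<script>alert('xss')</script>")
; Embedding the text in an xexpr and rendering it escapes the markup characters,
; producing something like "<p>&lt;script&gt;alert('xss')&lt;/script&gt;</p>".
(xexpr->string `(p ,comment-text))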
1.3 Database Performance
Achieving good database performance mostly consists of good database design and intelligent client behavior.
On the database design side, most important are wise use of indexes and choosing appropriate data representations. As an example of the latter, a substring search using LIKE will probably be slower than a dedicated full-text search feature for large data sets. Consult your database back end’s manual for additional performance advice.
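For instance (a hypothetical sketch using the contestants/contests schema from the examples in the next subsection), an index supporting lookups of contests by winner could be created with an ordinary statement; the exact syntax and naming conventions vary by back end:
; Hypothetical: support "WHERE winner = ..." lookups on the contests table.
(query-exec c "CREATE INDEX contests_winner_idx ON contests (winner)")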
The following sections describe a few client-side aspects of performance.
1.3.1 The N+1 Selects Problem
A common mistake is to fetch a large amount of data by running a query to get a set of initial records and then running another query inside a loop with an iteration for each of the initial records. This is sometimes called the “n+1 selects problem.” For example:
(for/list ([(name id) (in-query c "SELECT name, id FROM contestants")])
  (define wins
    (query-list c "SELECT contest FROM contests WHERE winner = $1" id))
  (make-contestant-record name wins))
The same information can be retrieved in a single query by performing a LEFT OUTER JOIN and grouping the results:
(for/list ([(name id wins)
            (in-query c
                      (string-append "SELECT name, id, contest "
                                     "FROM contestants LEFT OUTER JOIN contests "
                                     "ON contestants.id = contests.winner")
                      #:group '(#("name" "id"))
                      #:group-mode '(list))])
  (make-contestant-record name wins))
The one-query form will perform better when database communication has high latency. On the other hand, it may duplicate the contents of the non-key name column, using more bandwidth. Another approach is to perform two queries:
(let ([id=>name
       (rows->dict #:key "id" #:value "name"
                   (query c "SELECT id, name FROM contestants"))])
  (for/list ([(id wins)
              (in-query c
                        (string-append "SELECT id, contest "
                                       "FROM contestants LEFT OUTER JOIN contests "
                                       "ON contestants.id = contests.winner")
                        #:group '(#("id"))
                        #:group-mode '(list))])
    (make-contestant-record (dict-ref id=>name id) wins)))
Compared with the one-query form, the two-query form requires additional communication, but it avoids duplicating name values in the OUTER JOIN results. If additional non-key contestant fields were to be retrieved, the bandwidth savings of this approach would be even greater.
See also Testing Performance of Database-Backed Programs.
1.3.2 Updates and Transactions
Using transactions can dramatically improve the performance of bulk database operations, especially UPDATE and INSERT statements. As an extreme example, on commodity hardware in 2012, SQLite is capable of executing thousands of INSERT statements per second within a transaction, but it is capable of only dozens of single-INSERT transactions per second.
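For example, a bulk insert can be wrapped in a single transaction with call-with-transaction. The following is a sketch: insert-all and rows are hypothetical names, rows is a list of lists such as '((4 "four") (5 "five")), and the $1/$2 placeholders assume PostgreSQL-style parameters:
; A sketch: all the inserts run in one transaction; if the body raises an
; exception, the transaction is rolled back.
(define (insert-all c rows)
  (call-with-transaction c
    (lambda ()
      (for ([row (in-list rows)])
        (query-exec c "insert into the_numbers values ($1, $2)"
                    (first row) (second row))))))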
1.3.3 Statement Caching
Connections cache implicitly prepared statements (that is, statements given in string form directly to a query function). The effect of the cache is to eliminate an extra round-trip to the server (to send the statement and receive a prepared statement handle), leaving just a single round-trip (to send parameters and receive results) per execution.
Currently, prepared statements are only cached within a transaction. The statement cache is flushed when entering or leaving a transaction and whenever a DDL statement is executed.
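For example, in the following sketch (assuming c is a connection to a table like the_numbers above), the SQL string is prepared once and the cached handle is reused for each execution because the loop runs inside a transaction; outside a transaction, each execution would pay the extra prepare round-trip:
; A sketch: repeated execution of the same statement string within a transaction
; benefits from the statement cache.
(call-with-transaction c
  (lambda ()
    (for/list ([n (in-range 100)])
      (query-maybe-value c "select d from the_numbers where n = $1" n))))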
1.3.4 Testing Performance of Database-Backed Programs
When testing the performance of database-backed programs, remember to test them in environments with realistic latency and bandwidth. High-latency environments may be roughly approximated with the high-latency-connection function, but there’s no substitute for the real thing.
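As a sketch, a connection can be wrapped so that every round trip incurs extra simulated latency; this assumes high-latency-connection is provided by db/util/testing and takes the latency in seconds (check the reference for the exact signature in your version):
(require db db/util/testing)
; Wrap a real connection so each server round trip takes roughly 200 ms longer.
(define slow-c
  (high-latency-connection (postgresql-connect ....) 0.2))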
1.3.5 Transactions and Concurrency
Database systems use transactions to guarantee properties such as atomicity and isolation while accommodating concurrent reads and writes by the database’s clients. Within a transaction a client is insulated from the actions of other clients, but the transaction may be aborted and rolled back if the database system cannot reconcile it with other concurrent interactions. Some database systems are more adept at reconciling transactions than others, and most allow reconciliation to be tuned through the specification of isolation levels.
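For example, an isolation level can be requested when a transaction is started (a sketch, assuming the back end supports 'serializable; see start-transaction and call-with-transaction for the accepted values):
; A sketch: run a transaction at serializable isolation.
(call-with-transaction c
  #:isolation 'serializable
  (lambda ()
    (query-value c "select count(*) from the_numbers")))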
PostgreSQL supports very fine-grained reconciliation: two transactions that both read and modify the same table concurrently might both be allowed to complete if they involve disjoint sets of rows. However, clients should be prepared to retry transactions that fail with an exn:fail:sql exception whose SQLSTATE matches #rx"^40...$", the class of transaction-rollback errors such as serialization failures.
MySQL’s transaction behavior varies based on the storage engines in use. Clients should likewise be prepared to retry transactions that fail with an exn:fail:sql exception whose SQLSTATE matches #rx"^40...$".
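One way to implement the retry advice above is a small wrapper around call-with-transaction that re-runs the body when the failure’s SQLSTATE falls in class 40. The following is a sketch; call-with-retrying-transaction is a hypothetical helper name, and a real version might add a back-off delay:
; A sketch: retry serialization failures (SQLSTATE class 40) a limited number of times.
(define (call-with-retrying-transaction c thunk [attempts 3])
  (with-handlers ([(lambda (e)
                     (and (exn:fail:sql? e)
                          (> attempts 1)
                          (let ([s (exn:fail:sql-sqlstate e)])
                            (and (string? s) (regexp-match? #rx"^40...$" s)))))
                   (lambda (e)
                     (call-with-retrying-transaction c thunk (sub1 attempts)))])
    (call-with-transaction c thunk)))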
SQLite enforces a very coarse-grained policy: only one transaction is allowed to write to the database at a time, and thus concurrent writers are very likely to conflict. Clients should be prepared to retry transactions that fail with an exn:fail:sql exception with SQLSTATE of 'busy.
An alternative to retrying whole SQLite transactions is to start each transaction with the appropriate locking level, since a transaction usually fails when it is unable to upgrade its lock level. Start a transaction that only performs reads in the default mode, and start a transaction that may perform writes in 'immediate mode (see start-transaction). That converts the problem of retrying whole transactions into the problem of retrying the initial BEGIN TRANSACTION statement, and this library already automatically retries individual statements that fail with 'busy errors. Depending on the length and frequency of the transactions, you may need to adjust busy-retry-limit (see sqlite3-connect).
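For example (a sketch, assuming c is a SQLite connection and that your version of the library supports the #:option argument described with start-transaction), a read-modify-write transaction can acquire the write lock up front:
; A sketch: a transaction that will write, started in 'immediate mode.
(call-with-transaction c
  #:option 'immediate
  (lambda ()
    (define n (query-value c "select count(*) from the_numbers"))
    (query-exec c "insert into the_numbers values (?, ?)" n "another")))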
ODBC’s behavior varies depending on the driver and back end. See the appropriate database system’s documentation.
1.4 Databases and Web Servlets
Using database connections in a web servlet is more complicated than in a standalone program. A single servlet potentially serves many requests at once, each in a separate request-handling thread. Furthermore, the use of send/suspend, send/suspend/dispatch, etc. means that there are many places where a servlet may start and stop executing to service a request.
Why not use a single connection to handle all of a servlet’s requests? That is, create the connection with the servlet instance and never disconnect it. Such a servlet would look something like the following:
"bad-servlet.rkt"
#lang web-server
(require db)
(define db-conn (postgresql-connect ....))
(define (serve req)
  .... db-conn ....)
The main problem with using one connection for all requests is that multiple threads accessing the same connection are not properly isolated. For example, if one thread is accessing the connection within a transaction and another thread issues a query, the second thread may see invalid data or even disrupt the work of the first thread.
A secondary problem is performance. A connection can only perform a single query at a time, whereas most database systems are capable of concurrent query processing.
The proper way to use database connections in a servlet is to create a connection for each request and disconnect it when the request has been handled. But since a request thread may start and stop executing in many places (due to send/suspend, etc), inserting the code to connect and disconnect at the proper places can be challenging and messy.
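For a handler that never suspends, the per-request pattern might look like the following sketch (handle-request is a hypothetical helper); once send/suspend enters the picture, the disconnect bookkeeping becomes much harder to get right:
; A sketch: connect per request and always disconnect, assuming the handler
; runs to completion without suspending.
(define (serve req)
  (define c (postgresql-connect ....))
  (dynamic-wind
    void
    (lambda () (handle-request req c))
    (lambda () (disconnect c))))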
A better solution is to use a virtual connection, which automatically creates a request-specific (that is, thread-specific) “actual connection” by need and disconnects it when the request has been handled (that is, when the thread terminates). Different request-handling threads using the same virtual connection are assigned different actual connections, so the requests are properly isolated.
"better-servlet.rkt"
#lang web-server
(require db)
(define db-conn
  (virtual-connection
   (lambda () (postgresql-connect ....))))
(define (serve req)
  .... db-conn ....)
This solution preserves the simplicity of the naive solution and fixes the isolation problem but at the cost of creating many short-lived database connections. That cost can be eliminated by using a connection pool:
"best-servlet.rkt"
#lang web-server
(require db)
(define db-conn
  (virtual-connection
   (connection-pool
    (lambda () (postgresql-connect ....)))))
(define (serve req)
  .... db-conn ....)
By using a virtual connection backed by a connection pool, a servlet can achieve simplicity, isolation, and performance all at the same time.