Canonical SQL

jump to table of contents

If you spend much time online in places where SQL is discussed and questions about SQL are asked, and especially if you have more than a passing acquaintance with the relational model, you'll soon notice that many of the questions are the same. They go like this:

I am programming in environment E using tools T for application Y. I have tables A and B with N thousand rows and need to do X with the database. It has to be fast and efficient. What is the best way?

This page is about operation X. You can plug in your favorite values for E, T, Y, A, B, and N.

The question doesn't seem commonplace to the person asking; it seems very specific, very special. The fact is, though, that every such question is of one of two kinds:

  1. those that cannot be solved with relational algebra using SQL
  2. those that can

The first kind of question lies outside the relational model. Some questions leave very few clues about the data and the hoped-for result. Some tables were designed without regard for the advantages of BCNF normal form. Some questions aren't about SQL at all, really, but about the programming environment. The second kind can be answered with a generic example if the person asking it can make the mental leap of substituting his tables and columns for those in the example.

This page is devoted to the person who understands that every query he writes has been written before in some form, that a query is nothing more or less than a logical construct, an equation. Distilled from the application, every select is a demand of the DBMS: Join these tables on these columns subject to these constraints.

If you have have come to that realization, you're halfway there. If you can state the problem algebraically in terms of sets, then perhaps an example can be adapted to your purpose. Perhaps, too, the possibilities posed by the examples you find here will influence your design. The ability to derive data using select reduces the need to maintain them with update.

The Pattern and the Anti-Pattern

It's fashionable these days to speak of programming patterns. I resist the temptation to join that parade. SQL, for all its defects, is based on math and can be thought of in mathematical terms. In centuries of mathematics, no one has ever suggested that equations have “patterns”. The mathematical pattern, if you will, is to use operators and operands to formulate new derivations. Because SQL rests on a mathematical foundation (unlike, say, object orientation) we can discuss it in mathematical terms instead of programming terms.

That said, there are patterns and anti-patterns when it comes to database application programming. The programming languages we use today have nothing like the expressive power of SQL, and most DBMS access libraries and frameworks succeed best at reaching for the lowest common denominator. Rather than exploiting the DBMS's capabilities, they thwart it. When it comes to Object Relational Management frameworks, the pattern is the anti-pattern. If you are experiencing performance problems with your DBMS and you're using some kind of framework that relieves you from writing your own SQL, chances are the problem stems not from the database design, but from the application design.

You might even call that a pattern.

Taxonomy of the Examples

The organization of the examples is a work in progress. SQL tutorials often begin with a simple select statement from one table, then to introduce where, then join and so on. But we don't need to start at the very beginning.

The best solution I've come up with so far is to write answers to what I see as Frequently Asked Questions. As the library of answers accumulates, perhaps some system of organization will suggest itself. So far, it's a matter of improvisation.

This page provides examples of SQL queries that answer questions frequently posted on the Web. The sample schema and data are the same as those used by C.J. Date in his popular textbook, An Introduction to Database Systems, now in its 8th edition. Please refer to Suppliers and Parts [pdf] for a description of the tables and data.

Table of contents

Good Questions

Broken Questions

Ranks and Running Sums

To say anything about a row relative to other rows in the same table, it is necessary to do two things:

  1. establish an order: what does “before” mean?
  2. perform a self-join

Self-joins — where a table is joined to itself — is usually presented as a special case, and on first encounter it often seems weird. But it's not, any more than 2 + 2 = 4 or y = x + x is weird. join is a binary operator (that is, it operates on two tables). Nothing says those two tables can't be the same one.

Rank Rows

This question takes two forms:

  1. Generate an index number for rows in a table “in order”
  2. Count rows before “this row”
Query Results
select count(lesser.SID) as RANK, S.SID
from S as S
left outer join S as lesser
on   S.SID > lesser.SID
group by S.SID
order by S.SID;
 
RANK SID
0 S1
1 S2
2 S3
3 S4
4 S5

An outer join is required to accommodate the first row, which has no lesser.

Running Totals

Show accumulated Y as X increases (usually over time). In this example we accumulate STATUS as SID increases.

Query Results
select S.SID, S.STATUS, sum(totals.STATUS) as 'TOTAL'
from S as S
join S as totals
on   S.SID >= totals.SID
group by S.SID
order by S.SID;
 
SID STATUS TOTAL
S1 20 20
S2 10 30
S3 30 60
S4 20 80
S5 30 110

Existence

Are there any news?
Horace Greeley

join answers the question, which rows in table B meet criteria in table A. Often, though, we don't care about which rows, but only whether any rows match, a/k/a an existence test. For that, SQL has the operator exists.

exists comes from relational calculus. It's formulated as a correlated subquery. Many people find the correlated subquery syntax odd and shy away from it, preferring good old trustworthy join. Others are under the mistaken impression that correlated subqueries are slow, something that was once true of many implementations, and is less often the case today.

My advice: never send a join to do an exists job.

Why use exists? First, it embodies the question as you might phrase it logically, e.g. show rows in A where value V appears in B or show users that logged on last night. Second, it avoids duplicates. When we join A to B, the A columns are repeated for every matching B row. When we use exists, we get only as many rows in A as are there, regardless of how many (more than one) appear in B.

Where X is true in another table

This might be expressed as Show parts with a supplier in the same city. A join on CITY would produce too many many rows: we don't want to show all supplier-part pairs, but just those parts that have any supplier in the same city.

Query Results
select * 
from S
where exists (
      select 1
      from P
      where CITY = S.CITY
);
 
SID SNAME STATUS CITY
S1 Smith 20 London
S2 Jones 10 Paris
S3 Blake 30 Paris
S4 Clark 20 London

Notes:

Rows missing from another table

This might be expressed as Show parts with no supplier in the same city.

Query Results
select * 
from S
where not exists (
      select 1
      from P
      where CITY = S.CITY
);
 
SID SNAME STATUS CITY
S5 Adams 30 Athens

First of N

For example, Show the supplier with the largest quantity of each part.

Query Results
select PID, SID, QTY
from SP as quantities
where exists (
      select 1
      from SP
      where PID = quantities.PID
      group by PID
      having max(QTY) = quantities.QTY
)
order by PID, SID;

PID SID QTY
P1 S1 300
P1 S2 300
P2 S2 400
P3 S1 400
P4 S4 300
P5 S4 400
P6 S1 100

Placing the order-by columns first in the select clause makes it easier to spot duplicates or missing rows.

Note the “tie for first” for P1. Sometimes ties are OK; sometimes additional criteria need to be applied. To filter the results yet further, convert the base query into a view — or a temporary view, a/k/a “Common Table Expression” — and apply the same where-exists logic to the view.

Time Series

Time-series data deserve honorable mention for the database design and query issues they raise. Time-series data

Each of these aspects presents challenges. It's not uncommon to find TS tables with millions, or hundreds of millions, or even billions of rows. Pricing history on the stock exchange, for instance, accumulates at the rate of millions per second.

The question of when needs to be clearly defined. The high temperature yesterday (at 3:00 PM, in Central Park) was 36° Fahrenheit. That's true of a point in time. The accumulated rainfall this month as of today is 1.5 inches. That's true as of a point in time for a time interval i.e. month-to-date.

These difference are sometimes manifested in the table design. Point-in-time data need only one column to describe when. Time interval data need two, start time and end time. If end time is not yet known, it must be able to represent now somehow, either by a special value or by no value i.e. NULL.

Because the Suppliers & Parts database has no dates, we're forced to use something else. This example comes from Stack Overflow.

Tables with Dates*
Table A
ID StartDate EndDate Flag
1 01 Jan 2008 23 Mar 2008 1
1 23 Mar 2008 15 Jun 2008 0
1 15 Jun 2008 18 Aug 2008 1
 
Table B
ID StartDate EndDate Flag
1 19 Jan 2008 17 Feb 2008 1
1 17 Feb 2008 15 Jun 2008 0
1 15 Jun 2008 18 Aug 2008 1
Output
ID StartDate EndDate A_Flag B_Flag
1 01 Jan 2008 19 Jan 2008 1 0
1 19 Jan 2008 17 Feb 2008 1 1
1 17 Feb 2008 23 Mar 2008 1 0
1 23 Mar 2008 15 Jun 2008 0 0
1 15 Jun 2008 18 Aug 2008 1 1

* Primary key columns are underlined and highlighted.

(It may bear noting here that the end date need not be stored if it can always be computed from the next latest start date. In this example, most of the EndDate values are redundant.)

The question is, how to show the unique values of flags A and B over time?

That raises another question: what does end date mean? Sometimes it's an inclusive date i.e. up to and including that date. Better — easier to program and easier to manipulate in the database — is an exclusive end date. An exclusive end date matches the next start date, making for a simple and efficient join. In this example we can infer the end date exclusive because several end dates match other start dates.

Before we go too far, let's make sure the data are internally consistent. In each table,

To verify that constraint holds — that no row in either table represents a time period overlapping any other row — test for existence:

	select count(*) as nrows
	from so.B as b 
	where exists (
	  select 1 from so.B 
	  where ID = b.ID 
	  and b.StartDate <= StartDate and StartDate < b.EndDate 
	   -- not the very same record
	  and not (StartDate = b.StartDate and EndDate = b.EndDate)
	)
	nrows
	-----------
	           0

The astute reader will note that the where clause is over-general: because the primary key prevents two start dates from being equal for a given ID, the <= test can be simply <. For the same reason, the last and not is unnecessary. Simplified:

	select count(*) as nrows
	from so.B as b 
	where exists (
	  select 1 from so.B 
	  where ID = b.ID 
	  and b.StartDate < StartDate and StartDate < b.EndDate 
	)
	nrows
	-----------
	           0

By the way, when working with ranges, it helps to use the idiom

	A < B and B <= C

because not only does it mimic the mathematical notation

	A < B ≤ C

but it also emphasizes the range by putting the extremes at the edges.

OK, back to coalescing our dates. Having patrolled for duplicates, we can join tables A and B. Here's a naïve attempt:

	select a.ID
		, a.StartDate, b.StartDate as bstart
		, a.EndDate, b.EndDate as bend
		, a.Flag as A_Flag, b.Flag as B_Flag 
	from so.A as a join so.B as b 
	on a.StartDate <= b.StartDate and b.StartDate < a.EndDate

 ID          StartDate   bstart      EndDate     bend        A_Flag B_Flag 
 ----------- ----------- ----------- ----------- ----------- ------ ------
           1 Jan  1 2008 Jan 19 2008 Mar 23 2008 Feb 17 2008      1      1
           1 Jan  1 2008 Feb 17 2008 Mar 23 2008 Jun 15 2008      1      0
           1 Jun 15 2008 Jun 15 2008 Aug 18 2008 Aug 18 2008      1      1
  

Problems

Solution

Construct the set of all date ranges, and then join back to the tables to get the flags whose dates surround each range. First the SQL, then an explication:

	with D (ID, bound) as (
	    select   ID 
		   , case T when 's' then StartDate else EndDate end as bound
	    from  (
		select ID, StartDate, EndDate from so.A 
		UNION
		select ID, StartDate, EndDate from so.B
	    ) as U
	    cross join (select 's' as T union select 'e') as T
	)
	select P.*, a.Flag as A_Flag, b.Flag as B_Flag
	from (
	    select s.ID, s.bound as StartDate, min(e.bound) as EndDate
	    from D as s join D as e 
	    on s.ID = e.ID 
	    and s.bound < e.bound
	    group by s.ID, s.bound
	) as P
	left join so.A as a
	on  P.ID = a.ID 
	and a.StartDate <= P.StartDate and P.EndDate <= a.EndDate
	left join so.B as b
	on  P.ID = b.ID 
	and b.StartDate <= P.StartDate and P.EndDate <= b.EndDate
	order by P.ID, P.StartDate, P.EndDate
ID  StartDate  EndDate A_Flag B_Flag
1  Jan 1 2008  Jan 19 2008 1
1  Jan 19 2008  Feb 17 2008 1 1
1  Feb 17 2008  Mar 23 2008 1 0
1  Mar 23 2008  Jun 15 2008 0 0
1  Jun 15 2008  Aug 18 2008 1 1

Discussion

The with D clause starts a common table expression, which is purely a matter of convenience. The CTE acts as a temporary view; it can be referred to subsequently in the query as a view. If your DBMS doesn't support CTEs, you can create a real view, or just repeat the SQL as a virtual table.

Fine, but what's the CTE doing? It puts all the dates in one column. The crucial insight is that every pair of successive dates is a distinct interval. Except that an end date may never precede its start date, a period might exist for any permutation (4 items, choose 2):

Period Start Period End
A start A end
A start B start
A start B end
B start B end
B start A start
B start A end
A end B end
A end B start
B end A end
B end A start

The query begins with the list of dates in the CTE. Virtual table P joins the CTE to itself, getting the minimum (i.e. first) date after each. That produces a list of start-end date pairs.

From there, there's nothing fancy. Each minimal pair must “nest” inside the dates for both tables' ranges. We join P to A such that P's dates start after A's start date and end before A's end date. Then we do the same with B.

Performance

Someone will ask, but is it fast? It should be. Although there's quite a bit of SQL, it all operates on the same two tables, and mentions each table only twice. From the point of view of the DBMS, we query tables A and B, and sort their dates to create pairs. If StartDate is in the primary key (as it should be), most DBMSs will construct a suitable index, meaning that the query is always operating on indexed values. Those indexes are probably B+ trees, providing O(n log N) performance, which is about as good as it gets.

I've used queries like this on millions of rows with very acceptable performance. On occasion I've had to resort to materialized views or a temporary table. If you're writing queries like this one and it's too slow (by some definition) the first place to look is to your server hardware. DBMSs like RAM, and usually RAM is cheaper than your time. Get a fast enough server, normalize your tables, and write your queries based on the relational algebra. If you do those three things, you won't have performance problems. And you will have time to answer questions on Stack Overflow, instead of asking them.

Inserting some rows, updating the rest

In his first paper on the relational model, Codd spoke of “time-varying” data. Databases need to updated, and often those updates are a modification of what's already there, or partly already there. For some table T there is a set of rows S to be applied. For rows whose key is not in T, that means an INSERT. For rows already present in T, that means an UPDATE from S.

But how to tell which rows need an INSERT and which an UPDATE?

When first confronted with this issue, many people are tempted to fall back on row-by-row processing. Worse, read-test-write processing. Not only is it slow, it completely undermines the transactional guarantees of the DBMS. An algorithm like

	select C from T
	if (C > 1) {
		insert into T values (V)
	}
creates a “last writer wins” scenario: two processes might read the same row at the same time, each find C > 1, and each set it to the current value of V. If they're both processing the same set S, the result is very likely to be an inconsistent mix of the Vs from the two processes.

SQL defines the MERGE operator exactly for this purpose. MERGE is necessarily complicated, being multi-pronged as it is, but has the advantage of being atomic (as all SQL statements are). Not all DBMSs support MERGE, though, and the insert+update operation can be effected consistently — and almost atomically — without it.

Consider an update to the SP table. For supplier S3, the table holds just one row:

SP
SID PID QTY
S3 P2 200

We want to modify that row and add another for S3,

SP2
SID PID QTY
S3 P1 100
S3 P2 300

The solution is an existence test: two statements, INSERT and UPDATE, each one doing its part. INSERT those that do not exist, and UPDATE those that do.

BEGIN TRANSACTION
update SP
set QTY = (select QTY from SP2 
           where SP.SID = SP2.SID 
	     and SP.PID = SP2.PID)
where exists (
  select 1 from SP2
  where SP.SID = SP2.SID 
  and SP.PID = SP2.PID
);
insert into SP (SID, PID, QTY)
select SID, PID, QTY 
from SP2 
where not exists (
  select 1 from SP
  where SP.SID = SP2.SID 
  and SP.PID = SP2.PID
);
COMMIT TRANSACTION

select * from SP
where SID = 'S3';
SID PID QTY
S3 P1 100
S3 P2 300

Because the UPDATE and INSERT affect a disjoint sets — the UPDATE rows that do exist; the INSERT rows that don't exist — they can logically be done in either order. If the INSERT is done first, though, UPDATE will touch both sets: those that existed before the INSERT, and those that INSERT inserted. So it's usually more efficient to run the UPDATE first. However, as explained below, running UPDATE last is the only way to guarantee consistency.

Most DBMSs report the number of rows affected by any INSERT or UPDATE statement. If UPDATE affects all the rows in the source set, then it's safe to skip the INSERT. It's also safe not to skip it, because WHERE NOT EXISTS will return zero rows.

Note: The UPDATE syntax above is ANSI-standard. Many flavors of SQL sport a nonstandard UPDATE FROM variation that can be simpler to use, especially if more than one column is updated. Beware, though, that they're not always correct. Microsoft SQL Server has such a form, and when more than one source row matches a target row, they're all applied, one after the other. The DBMS will not report an error in that case, whereas the ANSI form will.

Consistency

As previously mentioned, this approach isn't perfectly atomic. But it's very close. Let's look at what might go wrong, and how to deal with it.

Consider two processes, P1 and P2, both executing the above procedure. You might think that's absolutely no problem, because P2 will have to wait once P1 begins the transaction. But that's not the case: the DBMS will lock some portion of the table, based on the particular rows affected, after the UPDATE commences. Depending on vendor and settings, the locked portion may only be the affected rows, or the whole table, or something in between.

If P1 and P2 UPDATE any of the same rows, the DBMS will force one to wait. The two transactions will run sequentially, and the results will be consistent. A little hard to predict, perhaps, but consistent.

If P1 and P2 are updating different rows, they might both succeed concurrently, and continue on to the INSERT. Here again, if they are inserting different rows, there may be some resource contention, but each INSERT will succeed, and both transactions will commit.

The trouble comes if there's an overlap in the rows P1 and P2 wish to insert. Suppose P1 goes first, and inserts its rows. P2 is constrained to wait until P1 commits. Then P2 inserts N rows, of which (say) two were among P1's set. The WHERE NOT EXISTS clause causes P2 to insert N-2 rows.

Does that matter? If P1 inserted the rows, shouldn't P2 be satisfied? Sometimes, perhaps, but not in general.

Suppose, for example, P1 and P2 each represent books being returned to a library. P1 returns Gone with the Wind. Because all copies had been lent out, none remained in the library, and the act of returning one required inserting a new row. Then P2 comes along with another copy of the same book. Does it suffice that P1 inserted the row? No, because P2's duty is to reflect its separate copy in the database, by incrementing the book count.

If P2 had, say, 12 rows in its SP2 input row set, and it updated 5 and inserted only 5, we know 2 rows from SP2 were not applied. We could ROLLBACK on that basis alone, or we could test for existence:

select 1 from SP2 
where not exists (
  select 1 from SP
  where SP.SID = SP2.SID
  and   SP.PID = SP2.PID
  and   SP.QTY = SP2.QTY
);

If that returns a row, something wasn't applied. If it doesn't, it means that P1 happened to insert the very same two rows P2 intended to insert.

There is another option, though: UPDATE. The rows are still in SP2. Re-running the UPDATE statement will apply the last two rows. The result will be consistent: all of P1's rows will have been applied, and all of P2's rows will have been applied, in that order.

Conclusion


Broken Queries

Columns with lists

Questions are frequently asked about table designs that are hopelessly wrong. The solution to the question is not to write the query, but to re-write the table, after which the query will practically write itself.

Perhaps the most egregious example is a column whose value is a list or, in SQL terms, a repeating group. The elements in the list are perhaps comma-separated, and some poor schlep has the task of selecting or joining on the the nth element in the list.

There are no such columns in the Suppliers & Parts database. Before we get into why that is, let's create an example of what the sp table would look like (leaving out qty for now).

Table with Repeating Groups
SID PIDS
S1 P1,P2,P3,P4,P5,P6
S2 P1,P2
S3 P2
S4 P2,P4,P5

This table is not in First Normal Form (1NF). The PIDs column doesn't specify one thing, but a list of things. A 1NF design eliminates such lists by putting each list element in a separate row, just as was done with the sp table.

Problems arising with tables not in 1NF

The answer to these questions is always redesign the table. Put the table in 1NF. Eliminate repeating groups and, with them, the problems they create.

At first blush, it may seem as though a “split” function, such as exists in programming languages such as Perl, is just the ticket. When will old-fashioned SQL catch up with the 21st century? But SQL gets the last laugh, because in addition to “split” we'd need all the functions already provided for rows: not only min, max, avg, and count, but the ability to join lists (and parts of lists) to rows in other tables (e.g. P). Not to mention inventing ways to enforce constraints such as those listed above.

Yes, it could be done, but nothing would be gained. List functionality exists in SQL already for tables (and table-like objects, such as views).


James K. Lowden March 2013.

Valid HTML 4.01 Strict