jump to table of contents
If you spend much time online in places where SQL is discussed and questions about SQL are asked, and especially if you have more than a passing acquaintance with the relational model, you'll soon notice that many of the questions are the same. They go like this:
I am programming in environment E using tools T for application Y. I have tables A and B with N thousand rows and need to do X with the database. It has to be fast and efficient. What is the best way?
This page is about operation X. You can plug in your favorite values for E, T, Y, A, B, and N.
The question doesn't seem commonplace to the person asking; it seems very specific, very special. The fact is, though, that every such question is of one of two kinds:
The first kind of question lies outside the relational model. Some questions leave very few clues about the data and the hoped-for result. Some tables were designed without regard for the advantages of BCNF normal form. Some questions aren't about SQL at all, really, but about the programming environment. The second kind can be answered with a generic example if the person asking it can make the mental leap of substituting his tables and columns for those in the example.
This page is devoted to the person who understands that every query he writes has been written before in some form, that a query is nothing more or less than a logical construct, an equation. Distilled from the application, every select is a demand of the DBMS: Join these tables on these columns subject to these constraints.
If you have have come to that realization, you're halfway there. If you can state the problem algebraically in terms of sets, then perhaps an example can be adapted to your purpose. Perhaps, too, the possibilities posed by the examples you find here will influence your design. The ability to derive data using select reduces the need to maintain them with update.
It's fashionable these days to speak of programming patterns. I resist the temptation to join that parade. SQL, for all its defects, is based on math and can be thought of in mathematical terms. In centuries of mathematics, no one has ever suggested that equations have “patterns”. The mathematical pattern, if you will, is to use operators and operands to formulate new derivations. Because SQL rests on a mathematical foundation (unlike, say, object orientation) we can discuss it in mathematical terms instead of programming terms.
That said, there are patterns and anti-patterns when it comes to database application programming. The programming languages we use today have nothing like the expressive power of SQL, and most DBMS access libraries and frameworks succeed best at reaching for the lowest common denominator. Rather than exploiting the DBMS's capabilities, they thwart it. When it comes to Object Relational Management frameworks, the pattern is the anti-pattern. If you are experiencing performance problems with your DBMS and you're using some kind of framework that relieves you from writing your own SQL, chances are the problem stems not from the database design, but from the application design.
You might even call that a pattern.
The organization of the examples is a work in progress. SQL tutorials often begin with a simple select statement from one table, then to introduce where, then join and so on. But we don't need to start at the very beginning.
The best solution I've come up with so far is to write answers to what I see as Frequently Asked Questions. As the library of answers accumulates, perhaps some system of organization will suggest itself. So far, it's a matter of improvisation.
This page provides examples of SQL queries that answer questions frequently posted on the Web. The sample schema and data are the same as those used by C.J. Date in his popular textbook, An Introduction to Database Systems, now in its 8th edition. Please refer to Suppliers and Parts [pdf] for a description of the tables and data.
To say anything about a row relative to other rows in the same table, it is necessary to do two things:
Self-joins — where a table is joined to itself — is usually presented as a special case, and on first encounter it often seems weird. But it's not, any more than 2 + 2 = 4 or y = x + x is weird. join is a binary operator (that is, it operates on two tables). Nothing says those two tables can't be the same one.
This question takes two forms:
select count(lesser.SID) as RANK, S.SID from S as S left outer join S as lesser on S.SID > lesser.SID group by S.SID order by S.SID;
An outer join is required to accommodate the first row, which has no lesser.
Show accumulated Y as X increases (usually over time). In this example we accumulate STATUS as SID increases.
select S.SID, S.STATUS, sum(totals.STATUS) as 'TOTAL' from S as S join S as totals on S.SID >= totals.SID group by S.SID order by S.SID;
Are there any news?
join answers the question, which rows in table B meet criteria in table A. Often, though, we don't care about which rows, but only whether any rows match, a/k/a an existence test. For that, SQL has the operator exists.
exists comes from relational calculus. It's formulated as a correlated subquery. Many people find the correlated subquery syntax odd and shy away from it, preferring good old trustworthy join. Others are under the mistaken impression that correlated subqueries are slow, something that was once true of many implementations, and is less often the case today.
My advice: never send a join to do an exists job.
Why use exists? First, it embodies the question as you might phrase it logically, e.g. show rows in A where value V appears in B or show users that logged on last night. Second, it avoids duplicates. When we join A to B, the A columns are repeated for every matching B row. When we use exists, we get only as many rows in A as are there, regardless of how many (more than one) appear in B.
This might be expressed as Show parts with a supplier in the same city. A join on CITY would produce too many rows: we don't want to show all supplier-part pairs, but just those parts that have any supplier in the same city.
select * from S where exists ( select 1 from P where CITY = S.CITY );
This might be expressed as Show parts with no supplier in the same city.
select * from S where not exists ( select 1 from P where CITY = S.CITY );
For example, Show the supplier with the largest quantity of each part.
select PID, SID, QTY from SP as quantities where exists ( select 1 from SP where PID = quantities.PID group by PID having max(QTY) = quantities.QTY ) order by PID, SID;
Placing the order-by columns first in the select clause makes it easier to spot duplicates or missing rows.
Note the “tie for first” for P1. Sometimes ties are OK; sometimes additional criteria need to be applied. To filter the results yet further, convert the base query into a view — or a temporary view, a/k/a “Common Table Expression” — and apply the same where-exists logic to the view.
Time-series data deserve honorable mention for the database design and query issues they raise. Time-series data
Each of these aspects presents challenges. It's not uncommon to find TS tables with millions, or hundreds of millions, or even billions of rows. Pricing history on the stock exchange, for instance, accumulates at the rate of millions per second.
The question of when needs to be clearly defined. The high temperature yesterday (at 3:00 PM, in Central Park) was 36° Fahrenheit. That's true of a point in time. The accumulated rainfall this month as of today is 1.5 inches. That's true as of a point in time for a time interval i.e. month-to-date.
These difference are sometimes manifested in the table design. Point-in-time data need only one column to describe when. Time interval data need two, start time and end time. If end time is not yet known, it must be able to represent now somehow, either by a special value or by no value i.e. NULL.
Because the Suppliers & Parts database has no dates, we're forced to use something else. This example comes from Stack Overflow.
* Primary key columns are underlined and highlighted.
(It may bear noting here that the end date need not be stored if it can always be computed from the next latest start date. In this example, most of the
EndDate values are redundant.)
The question is, how to show the unique values of flags A and B over time?
That raises another question: what does end date mean? Sometimes it's an inclusive date i.e. up to and including that date. Better — easier to program and easier to manipulate in the database — is an exclusive end date. An exclusive end date matches the next start date, making for a simple and efficient join. In this example we can infer the end date exclusive because several end dates match other start dates.
Before we go too far, let's make sure the data are internally consistent. In each table,
EndDateshould always come after
StartDate(on the same row), and
StartDateshould ever fall between any other
StarDate-EndDatepair, because that would constitute overlapping dates with potentially different non-key values.
To verify that constraint holds — that no row in either table represents a time period overlapping any other row — test for existence:
select count(*) as nrows from so.B as b where exists ( select 1 from so.B where ID = b.ID and b.StartDate <= StartDate and StartDate < b.EndDate -- not the very same record and not (StartDate = b.StartDate and EndDate = b.EndDate) ) nrows ----------- 0
The astute reader will note that the where clause is over-general: because the primary key prevents two start dates from being equal for a given ID, the <= test can be simply <. For the same reason, the last
and not is unnecessary. Simplified:
select count(*) as nrows from so.B as b where exists ( select 1 from so.B where ID = b.ID and b.StartDate < StartDate and StartDate < b.EndDate ) nrows ----------- 0
By the way, when working with ranges, it helps to use the idiom
A < B and B <= C
because not only does it mimic the mathematical notation
A < B ≤ C
but it also emphasizes the range by putting the extremes at the edges.
OK, back to coalescing our dates. Having patrolled for duplicates, we can join tables A and B. Here's a naïve attempt:
select a.ID , a.StartDate, b.StartDate as bstart , a.EndDate, b.EndDate as bend , a.Flag as A_Flag, b.Flag as B_Flag from so.A as a join so.B as b on a.StartDate <= b.StartDate and b.StartDate < a.EndDate ID StartDate bstart EndDate bend A_Flag B_Flag ----------- ----------- ----------- ----------- ----------- ------ ------ 1 Jan 1 2008 Jan 19 2008 Mar 23 2008 Feb 17 2008 1 1 1 Jan 1 2008 Feb 17 2008 Mar 23 2008 Jun 15 2008 1 0 1 Jun 15 2008 Jun 15 2008 Aug 18 2008 Aug 18 2008 1 1
Construct the set of all date ranges, and then join back to the tables to get the flags whose dates surround each range. First the SQL, then an explication:
with D (ID, bound) as ( select ID , case T when 's' then StartDate else EndDate end as bound from ( select ID, StartDate, EndDate from so.A UNION select ID, StartDate, EndDate from so.B ) as U cross join (select 's' as T union select 'e') as T ) select P.*, a.Flag as A_Flag, b.Flag as B_Flag from ( select s.ID, s.bound as StartDate, min(e.bound) as EndDate from D as s join D as e on s.ID = e.ID and s.bound < e.bound group by s.ID, s.bound ) as P left join so.A as a on P.ID = a.ID and a.StartDate <= P.StartDate and P.EndDate <= a.EndDate left join so.B as b on P.ID = b.ID and b.StartDate <= P.StartDate and P.EndDate <= b.EndDate order by P.ID, P.StartDate, P.EndDate
|1||Jan 1 2008||Jan 19 2008||1|
|1||Jan 19 2008||Feb 17 2008||1||1|
|1||Feb 17 2008||Mar 23 2008||1||0|
|1||Mar 23 2008||Jun 15 2008||0||0|
|1||Jun 15 2008||Aug 18 2008||1||1|
with D clause starts a common table expression, which is purely a matter of convenience. The CTE acts as a temporary view; it can be referred to subsequently in the query as a view. If your DBMS doesn't support CTEs, you can create a real view, or just repeat the SQL as a virtual table.
Fine, but what's the CTE doing? It puts all the dates in one column. The crucial insight is that every pair of successive dates is a distinct interval. Except that an end date may never precede its start date, a period might exist for any permutation (4 items, choose 2):
|Period Start||Period End|
|A start||A end|
|A start||B start|
|A start||B end|
|B start||B end|
|B start||A start|
|B start||A end|
|A end||B end|
|A end||B start|
|B end||A end|
|B end||A start|
The query begins with the list of dates in the CTE. Virtual table P joins the CTE to itself, getting the minimum (i.e. first) date after each. That produces a list of start-end date pairs.
From there, there's nothing fancy. Each minimal pair must “nest” inside the dates for both tables' ranges. We join P to A such that P's dates start after A's start date and end before A's end date. Then we do the same with B.
Someone will ask, but is it fast? It should be. Although there's quite a bit of SQL, it all operates on the same two tables, and mentions each table only twice. From the point of view of the DBMS, we query tables A and B, and sort their dates to create pairs. If StartDate is in the primary key (as it should be), most DBMSs will construct a suitable index, meaning that the query is always operating on indexed values. Those indexes are probably B+ trees, providing O(n log N) performance, which is about as good as it gets.
I've used queries like this on millions of rows with very acceptable performance. On occasion I've had to resort to materialized views or a temporary table. If you're writing queries like this one and it's too slow (by some definition) the first place to look is to your server hardware. DBMSs like RAM, and usually RAM is cheaper than your time. Get a fast enough server, normalize your tables, and write your queries based on the relational algebra. If you do those three things, you won't have performance problems. And you will have time to answer questions on Stack Overflow, instead of asking them.
Questions are frequently asked about table designs that are hopelessly wrong. The solution to the question is not to write the query, but to re-write the table, after which the query will practically write itself.
Perhaps the most egregious example is a column whose value is a list or, in SQL terms, a repeating group. The elements in the list are perhaps comma-separated, and some poor schlep has the task of selecting or joining on the the nth element in the list.
There are no such columns in the Suppliers & Parts database. Before we get into why that is, let's create an example of what the sp table would look like (leaving out qty for now).
This table is not in First Normal Form (1NF). The
PIDs column doesn't specify one thing, but a list of things. A 1NF design eliminates such lists by putting each list element in a separate row, just as was done with the sp table.
The answer to these questions is always redesign the table. Put the table in 1NF. Eliminate repeating groups and, with them, the problems they create.
At first blush, it may seem as though a “split” function, such as exists in programming languages such as Perl, is just the ticket. When will old-fashioned SQL catch up with the 21st century? But SQL gets the last laugh, because in addition to “split” we'd need all the functions already provided for rows: not only min, max, avg, and count, but the ability to join lists (and parts of lists) to rows in other tables (e.g. P). Not to mention inventing ways to enforce constraints such as those listed above.
Yes, it could be done, but nothing would be gained. List functionality exists in SQL already for tables (and table-like objects, such as views).
James K. Lowden March 2013.