A SQL Server Correlated Subquery Gotcha

Posted by on Feb 26, 2011 in Geekery, programming, SQL | No Comments

Warning: exceptionally geeky post coming up. Those reading for Kindle reviews and other fluffy stuff will want to skip this one…

I was hanging out Stack Overflow the other day, and saw a question title that resonated with me: Column does not exist in the IN clause, but SQL runs. It resonated because I’d fallen victim to the same gotcha just the day before, for probably about the fifteenth time in a fifteen-year SQL Server career, though this gotcha isn’t limited to SQL Server.

Here’s an admittedly-contrived schema, relating TV shows and channels to genres:

TABLE: tv_show
 
show_id title          show_genre
======= ============== ==========
      1 Scrubs         comedy
      2 The Mentalist  drama
      3 CSI            drama
 
TABLE: tv_channel
 
channel_id channel_name    genre
========== =============== ======
         1 Comedy Central  comedy
         2 The Cop Channel drama

Given those two tables, I want to find out which channels are likely to show Scrubs. So, I write a query:

SELECT 
  * 
FROM 
  tv_channel 
WHERE 
  genre IN (
    SELECT genre FROM tv_show WHERE title = 'Scrubs'
  )

What does this query return? Well, this query returns the following result, perhaps unexpectedly, if you weren’t reading very carefully:

channel_id channel_name    genre
========== =============== ======
         1 Comedy Central  comedy
         2 The Cop Channel drama

Yup. It brings back all the rows in the tv_channel table. Buh? But Scrubs is a comedy, and The Cop Channel only shows drama, right? Right.

What we’ve got here is a failure to, er, correlate. If you haven’t spotted what’s going on by now, it might be enlightening to run just the subquery on its own:

SELECT genre FROM tv_show WHERE title = 'Scrubs'
 
Msg 207, Level 16, State 1, Line 1
Invalid COLUMN name 'genre'.

Has the penny dropped? Yes, dopily, I’ve used the wrong column name for the genre of a TV show. It should be show_genre, not genre. But hang on, if that’s the case, then how come the earlier query, which includes this invalid column name, even compiled, let alone ran and produced results?

And here’s the gotcha: Subqueries have access to all of the columns of the outer query. In a subquery, if you don’t qualify a column name with a table name, then SQL Server (and other SQL engines) will first look for that column name in the subquery, and if it doesn’t find it, it’ll look for it in the main query as well. Going back to our example:

SELECT 
  * 
FROM 
  tv_channel 
WHERE 
  genre IN (SELECT genre FROM tv_show WHERE title = 'Scrubs')

Look at the subquery. title isn’t qualified, so SQL Server has to figure out which table it’s for. In this case, it’s not got far to look — tv_show has a title column, so that’s what it uses.

Then look at genre. In this case, the genre column isn’t in the tv_show table, because I’ve made a dumb mistake. The tv_show column I wanted was show_genre, not genre. And this is where things get funky. Because we’re in a subquery, SQL Server doesn’t give up at that point and dump an error message. No, it looks up a level, to the main query. And in there, it finds a genre column, in the tv_channel table, and it uses that column. Because it’s perfectly valid SQL, and it might even have been what you were intending to do. SQL Server can’t tell.

So, the query we’ve written, fully qualified, effectively boils down to this:

SELECT 
  *
FROM
  tv_channel 
WHERE 
  tv_channel.genre IN (SELECT tv_channel.genre FROM tv_show WHERE tv_show.title = 'Scrubs')

Yah. SQL Server will now be comparing the tv_channel.genre column with itself in the ‘IN‘ clause.

The main query does a SELECT * FROM tv_channel.... So SQL Server will look through the rows in the tv_channel table. For each row, it looks at the results of the WHERE clause, to decide whether the row should be included in the final results. For each row, what it’s checking is:

...WHERE tv_channel.genre IN (SELECT tv_channel.genre FROM tv_show WHERE tv_show.title = 'Scrubs')

Which boils down to: If there are any TV shows at all with the title Scrubs, then TRUE, else FALSE.

In fact, because the column we’re comparing with our IN clause is the same in both cases, we might as well be asking:

SELECT * FROM tv_channel WHERE 'wibble' IN (SELECT 'wibble' FROM tv_show WHERE tv_show.title = 'Scrubs')

You’ll get the same result back either way — all rows from tv_channel if Scrubs is in the tv_show table at all, none of them if it’s not.

This kind of gotcha can produce some really freaky results, depending on the exact mistake you’ve made. Especially if it’s a big query with lots of joins and this mistake is just a tiny part of it that you’re expecting either to work, or to give you a syntax error.

So, the moral. Try to get this pattern in your head: if you’re getting frustrated debugging a query with an IN-claused subquery, the first quick diagnostic to try is to run the subquery on its own, and see if you get any unexpected errors. The next is to fully qualify the column names in the subquery, just in case.

Either way you might save yourself a lot of time and trouble figuring out why you’re getting really odd results.

« Pretty Poor Security, Apple! | How a Live Pet Project Can Help You Learn »