A SQL Server Correlated Subquery Gotcha
Warning: exceptionally geeky post coming up. Those reading for Kindle reviews and other fluffy stuff will want to skip this one…
I was hanging out Stack Overflow the other day, and saw a question title that resonated with me: Column does not exist in the IN clause, but SQL runs. It resonated because I’d fallen victim to the same gotcha just the day before, for probably about the fifteenth time in a fifteen-year SQL Server career, though this gotcha isn’t limited to SQL Server.
Here’s an admittedly-contrived schema, relating TV shows and channels to genres:
Table: tv_show
show_id title show_genre
======= ============== ==========
1 Scrubs comedy
2 The Mentalist drama
3 CSI drama
Table: tv_channel
channel_id channel_name genre
========== =============== ======
1 Comedy Central comedy
2 The Cop Channel drama
Given those two tables, I want to find out which channels are likely to show Scrubs. So, I write a query:
SELECT
*
FROM
tv_channel
WHERE
genre IN (
SELECT genre FROM tv_show WHERE title = 'Scrubs'
)
What does this query return? Well, this query returns the following result, perhaps unexpectedly, if you weren’t reading very carefully:
channel_id channel_name genre
========== =============== ======
1 Comedy Central comedy
2 The Cop Channel drama
Yup. It brings back all the rows in the tv_channel
table. Buh? But Scrubs is a comedy, and The Cop Channel only shows drama, right? Right.
What we’ve got here is a failure to, er, correlate. If you haven’t spotted what’s going on by now, it might be enlightening to run just the subquery on its own:
SELECT genre FROM tv_show WHERE title = 'Scrubs'
Msg 207, Level 16, State 1, Line 1
Invalid column name 'genre'.
Has the penny dropped? Yes, dopily, I’ve used the wrong column name for the genre of a TV show. It should be show_genre
, not genre
. But hang on, if that’s the case, then how come the earlier query, which includes this invalid column name, even compiled, let alone ran and produced results?
And here’s the gotcha: Subqueries have access to all of the columns of the outer query. In a subquery, if you don’t qualify a column name with a table name, then SQL Server (and other SQL engines) will first look for that column name in the subquery, and if it doesn’t find it, it’ll look for it in the main query as well. Going back to our example:
SELECT
*
FROM
tv_channel
WHERE
genre IN (SELECT genre FROM tv_show WHERE title = 'Scrubs')
Look at the subquery. title
isn’t qualified, so SQL Server has to figure out which table it’s for. In this case, it’s not got far to look — tv_show
has a title
column, so that’s what it uses.
Then look at genre
. In this case, the genre
column isn’t in the tv_show
table, because I’ve made a dumb mistake. The tv_show
column I wanted was show_genre
, not genre
. And this is where things get funky. Because we’re in a subquery, SQL Server doesn’t give up at that point and dump an error message. No, it looks up a level, to the main query. And in there, it finds a genre
column, in the tv_channel
table, and it uses that column. Because it’s perfectly valid SQL, and it might even have been what you were intending to do. SQL Server can’t tell.
So, the query we’ve written, fully qualified, effectively boils down to this:
SELECT
*
FROM
tv_channel
WHERE
tv_channel.genre IN (SELECT tv_channel.genre FROM tv_show WHERE tv_show.title = 'Scrubs')
Yah. SQL Server will now be comparing the tv_channel.genre column with itself in the ‘IN‘ clause.
The main query does a SELECT * FROM tv_channel...
. So SQL Server will look through the rows in the tv_channel
table. For each row, it looks at the results of the WHERE clause, to decide whether the row should be included in the final results. For each row, what it’s checking is:
...WHERE tv_channel.genre IN (SELECT tv_channel.genre FROM tv_show WHERE tv_show.title = 'Scrubs')
Which boils down to: If there are any TV shows at all with the title Scrubs, then TRUE, else FALSE.
In fact, because the column we’re comparing with our IN
clause is the same in both cases, we might as well be asking:
SELECT * FROM tv_channel WHERE 'wibble' IN (SELECT 'wibble' FROM tv_show WHERE tv_show.title = 'Scrubs')
You’ll get the same result back either way — all rows from tv_channel
if Scrubs is in the tv_show
table at all, none of them if it’s not.
This kind of gotcha can produce some really freaky results, depending on the exact mistake you’ve made. Especially if it’s a big query with lots of joins and this mistake is just a tiny part of it that you’re expecting either to work, or to give you a syntax error.
So, the moral. Try to get this pattern in your head: if you’re getting frustrated debugging a query with an IN
-claused subquery, the first quick diagnostic to try is to run the subquery on its own, and see if you get any unexpected errors. The next is to fully qualify the column names in the subquery, just in case.
Either way you might save yourself a lot of time and trouble figuring out why you’re getting really odd results.