Adventures In Foreign Keys 4: How Should I Index These Things?

This week, we’re all about foreign keys. So far, we set up the Stack Overflow database to get ready, then tried to set up relationships, and encountered cascading locking issues.

Dawn Of The Data

I don’t know how long the recommendation to index your foreign keys has been a thing, but I generally find it useful to abide by, depending a bit on how they’re used.

Without foreign key indexes, when SQL Server needs to validate data, or cascade actions, well, this could end up being pretty inefficient for large data sets. I’m not saying it’s the end of the world, but it’s an avoidable problem.

There’s also been this recommendation to create single key column indexes on your foreign keys, and to keep them around even if it doesn’t look like they’re being used. I’ve always found that curious. Do you really need that, if you’ve got a multi-column index that leads with the foreign key column?

That’s what I wanted to find out when I started writing this post.

Starting Point

To make things relatively easy, we’re going to use a single key column foreign key between Users and Badges.

Why? Because they’re two of the smaller tables, so it takes less time to add different things.

This is how most decisions get made. Don’t act disappointed.

ALTER TABLE dbo.Badges 
    ADD CONSTRAINT fk_badges_users_id FOREIGN KEY (UserId) 
        REFERENCES dbo.Users(Id) ON DELETE CASCADE;

I’m going to start with no nonclustered indexes, and then I’m going to create some with different columns, and column orders. The point here is that we don’t have an index just on the foreign key column, but we do have an index that contains it.

No Indexes

First, let’s test a query that should be eligible for join elimination. Do we need indexed foreign keys for that?

SELECT b.UserId, 
       COUNT(*) AS records
FROM dbo.Users AS u
JOIN dbo.Badges AS b
ON u.Id = b.UserId
GROUP BY b.UserId
HAVING COUNT(*) > 1000;

Here’s the query plan:

Hmm.

Well, it’s not that this query couldn’t benefit from some index help, but we totally eliminate the join to Users. That’s a good sign, though. Our foreign key is strong. Strong like lab rat.

We Learned Something

Even with no supporting indexes, we can still get join elimination. And this makes total sense. The constraint is there, and it’s trusted.

Maybe the advice to always index foreign keys is more like “usually” or “it depends”. If we don’t need to index foreign keys for join elimination, what do we need them for? Let’s look at a query that doesn’t get join elimination.

We’re going to break that optimization by selecting a column from the Users table. When we need data from a table, the join can’t be eliminated.

We’re also going to filter on the Date column in the Badges table.

SELECT b.UserId,
       u.Reputation,
       COUNT(*) AS records
FROM dbo.Users AS u
JOIN dbo.Badges AS b
ON u.Id = b.UserId
WHERE b.Date >= '20120101'
GROUP BY b.UserId, u.Reputation
HAVING COUNT(*) > 1000;

Here’s the query plan:

PEEKABOO

This query runs in about a second On My Machine®

Is it the greatest plan of all time? No. Could an index help? Yes.

But would I count on an index just on my foreign key column being a huge performance win? No, and SQL Server is with me on this. When I tested that, the optimizer totally ignored a single column index on UserId. We got the same plan as above. Let’s throw a couple wider indexes into the mix, and see which one the optimizer picks.

CREATE INDEX ix_apathy ON dbo.Badges(UserId, Date) INCLUDE(Name);
CREATE INDEX ix_ennui ON dbo.Badges(Date, UserId) INCLUDE(Name);

Here’s the query plan:

Songs About Internet Explorer

We use the index on Date, UserId. The wider index is helpful, here.

Summary So Far

We don’t need to index foreign keys for join elimination
A single column index on the foreign key was ignored by the optimizer
We needed to add a wider index that was more helpful to our query

It seems like the wiser choice after these experiments is to index for your queries, not your keys. Queries, after all, tend to be more complicated than keys, and need more indexing support.

Whatabouts

I know, I know. Modifications. Cascading things.

Here’s the thing: For a single row insert, I couldn’t get any permutation to behave much worse than the others.

BEGIN TRAN
INSERT dbo.Badges ( Name, UserId, Date )
SELECT 'WICKED AWESOME THING', 22656, GETDATE()
ROLLBACK

For a larger insert of about 20k rows, well, it was a lot like other modifications. It got a little worse with each index I added — including the single column index on UserId.

One indexes:

--CREATE INDEX ix_ennui ON dbo.Badges(Date, UserId) INCLUDE(Name);
Table 'Badges'. Scan count 0, logical reads 10040, physical reads 0, read-ahead reads 20, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Users'. Scan count 1, logical reads 44477, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Posts'. Scan count 1, logical reads 51, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 500 ms,  elapsed time = 521 ms.

Two indexes:

--CREATE INDEX ix_apathy ON dbo.Badges(UserId, Date) INCLUDE(Name);
--CREATE INDEX ix_ennui ON dbo.Badges(Date, UserId) INCLUDE(Name);
Table 'Users'. Scan count 0, logical reads 43099, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Badges'. Scan count 0, logical reads 149526, physical reads 0, read-ahead reads 30, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 43175, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Posts'. Scan count 1, logical reads 51, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 641 ms,  elapsed time = 669 ms.

Three indexes:

Table 'Users'. Scan count 0, logical reads 43099, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Badges'. Scan count 0, logical reads 192433, physical reads 0, read-ahead reads 21, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 3, logical reads 43372, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Posts'. Scan count 1, logical reads 51, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 625 ms,  elapsed time = 646 ms.

With every extra index, we do a lot more work on Badges, and the Worktable used to spool rows throughout the plan. We don’t do any more on reads on Users or Posts. This is just like other data modifications: The more indexes you have, the more overhead you have.

Quadruples!

Chasing Waterfalls

The same thing goes for when modifications cascade. With more indexes, you have more overhead. In this case, with a delete, we now have to remove data from four indexes.

Piggy.

Let’s Get Out Of Here

In general, I don’t find single column indexes very appealing. They’re just not terribly helpful to many queries. I can’t recall seeing too many queries that just selected one column from one table, with some relational operation on just that one column. And when I have, most of the time it was already the PK/CX of the table anyway.

So, do you have to create an index on all your foreign key columns, and just your foreign key columns, and keep them regardless of if they’re used?

It doesn’t look like it to me. I’d much rather have people index for their queries than index just to satisfy a “rule”.

Just as an example, you have a table with a high rate of data modification, and you…