
Learning How To Learn: Setting Priorities

Learning is hard

I hate command lines.

It’s rare that I get things right the first, or even tenth time. I have a horrible memory. Seriously. Most of the time I wouldn’t know what day of the week it is if it weren’t printed on my vitamin case. When it comes to SQL, especially commands or complicated syntax, I can only remember concepts. It’s rare that I don’t have to refer to notes or search for things. Ask me how many times I’ve restored a database and had to move files to different drives.

A thousand?

Probably.

But I can’t remember the with/move/whatever commands.
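
For posterity, and for future me, this is the kind of thing I mean. A quick sketch of that restore-with-move syntax, with completely made-up paths and logical file names (grab the real ones from RESTORE FILELISTONLY first):

RESTORE DATABASE [StackOverflow]
FROM DISK = N'D:\Backups\StackOverflow.bak'
WITH MOVE N'StackOverflow' TO N'E:\Data\StackOverflow.mdf',
     MOVE N'StackOverflow_log' TO N'F:\Logs\StackOverflow_log.ldf',
     STATS = 10;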

What was I saying? Oh yeah. Learning.

Right now I’m putting my best feet forward and trying to learn more about Availability Groups, Oracle’s database platform, and R.

At the same time I’m trying to keep up with the latest and greatest from SQL Server. I also have a full time job and a wife and kid.

In case you’re wondering: yes, my blood type is Cafe Bustelo.

Prioritizing is key

Being a consultant, I have to know a lot about a lot of things. I don’t know what problems a client is going to come to us with. I also don’t know what the root cause of that problem is going to be. It’s a good thing I work with such smart people.

So how do I choose what I want to pursue next? I categorize things into buckets:

  • Current
  • Future
  • What-if

Current is stuff I have to know to stay good at what I’m doing now.

Future is stuff I have to know to stay ahead of where SQL Server is going.

What-if is what I want to know if SQL Server ever goes the way of white jeans.

That’s why training is great!

Our in-person and in-video training is a great mix of current and future. You need to know more about SQL Server, and you need the important stuff front and center.

Right now there’s SO MUCH to be excited about with SQL Server, and even more to learn. 2016 is going to introduce a lot of new features, and like most new things, there are going to be problems and limitations. Columnstore indexes finally look ready for the main stage, and Availability Groups are coming to Standard Edition. And of course, the Query Store will be upon us with a rather interesting limitation.

Now if only people would install it…

Thanks for reading!

Brent says: take care of the current issues on your Database Hierarchy of Needs, and then you’ll feel much more comfortable taking the time to learn the future and what-if stuff. This is what I get the most excited about around consulting – we get to take the time to learn stuff before folks start to deploy it.


Unique Indexes and Row Modifications: Weird

Confession time

This started off with me reading a blurb in the release notes about SQL Server 2016 CTP 3.3. The blurb in question is about statistics. They’re so cool! Do they get fragmented? NO! Stop trying to defragment them, you little monkey.

Autostats improvements in CTP 3.3
Previously, statistics were automatically recalculated when the change exceeded a fixed threshold. As of CTP 3.3, we have refined the algorithm such that it is no longer a fixed threshold, but in general will be more aggressive in triggering statistics scans, resulting in more accurate query plans.

I got unnaturally excited about this, because it sounds like the behavior of Trace Flag 2371. Anyone who has taken a bite out of a terabyte database probably knows about this one. Ever try waiting for statistics to automatically update on a billion row table? You’re gonna need a crate of Snickers bars. I’m still going to write about the 2016 stuff, but I caught something weird when I was working on a way to demonstrate those thresholds. And that something was how SQL tracks modifications to unique indexes. It freaked me out for, like, days.
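
If you haven’t met Trace Flag 2371, it makes that auto-update threshold scale down as tables get bigger, instead of sitting at a flat percentage. A minimal sketch of turning it on globally, if you want to try the behavior on a dev box (the CTP 3.3 note above suggests 2016 is headed toward similar behavior out of the box):

DBCC TRACEON (2371, -1); -- -1 applies it globally; requires sysadmin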

We’re gonna need a couple tables

But they’ll be slightly different. It’s the only way to really show you how weird it gets inside SQL’s head.

Table 1 has a clustered PK on the ID column. It has a non-unique, nonclustered index on DateFiller and TextFiller.

IF OBJECT_ID('[dbo].[Nuisance]') IS NOT NULL
   DROP TABLE [dbo].[Nuisance];
GO 

CREATE TABLE [dbo].[Nuisance]
       (
         [ID] BIGINT NOT NULL ,
         [DateFiller] DATETIME2 DEFAULT SYSDATETIME() NOT NULL ,
         [TextFiller] VARCHAR(10) DEFAULT 'A' NOT NULL
       );

ALTER TABLE [dbo].[Nuisance] ADD CONSTRAINT [PK_Nuisance] PRIMARY KEY CLUSTERED ([ID]);

CREATE NONCLUSTERED INDEX [ix_Nuisance] ON [dbo].[Nuisance] ([DateFiller], [TextFiller])

Table 2 has the same structure, but the clustered PK is on ID and DateFiller. Same nonclustered index, though.

IF OBJECT_ID('[dbo].[Nuisance2]') IS NOT NULL
   DROP TABLE [dbo].[Nuisance2];
GO 

CREATE TABLE [dbo].[Nuisance2]
       (
         [ID] BIGINT NOT NULL ,
         [DateFiller] DATETIME2 DEFAULT SYSDATETIME() NOT NULL ,
         [TextFiller] VARCHAR(10) DEFAULT 'A' NOT NULL
       );

ALTER TABLE [dbo].[Nuisance2] ADD CONSTRAINT [PK_Nuisance2] PRIMARY KEY CLUSTERED ([ID], [DateFiller]);

CREATE NONCLUSTERED INDEX [ix_Nuisance2] ON [dbo].[Nuisance2] ([DateFiller], [TextFiller])

All this code works, I swear. Let’s drop a million rows into each.

INSERT  [dbo].[Nuisance] WITH ( TABLOCK )
        ( [ID] ,
          [DateFiller] ,
          [TextFiller] )
SELECT TOP 1000000
    ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ) ) ,
    DATEADD(SECOND, [sm1].[message_id], SYSDATETIME()) ,
    SUBSTRING([sm1].[text], 0, 9)
FROM
    [sys].[messages] AS [sm1] ,
    [sys].[messages] AS [sm2] ,
    [sys].[messages] AS [sm3]; 

INSERT  [dbo].[Nuisance2] WITH ( TABLOCK )
        ( [ID] ,
          [DateFiller] ,
          [TextFiller] )
SELECT TOP 1000000
    ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ) ) ,
    DATEADD(SECOND, [sm1].[message_id], SYSDATETIME()) ,
    SUBSTRING([sm1].[text], 0, 9)
FROM
    [sys].[messages] AS [sm1] ,
    [sys].[messages] AS [sm2] ,
    [sys].[messages] AS [sm3];

Now let’s take a basic look at what’s going on in our indexes and statistics. We just created tables! And inserted a million rows! Each! That has to count for something, right? Here’s a query to check that kind of thing.

SELECT
    [t].[name] AS [table_name] ,
    [si].[name] AS [index_name] ,
    [si].[dpages] AS [data_pages] ,
    [si].[rowcnt] AS [index_row_count] ,
    [si].[rows] AS [index_rows] ,
    [ddsp].[rows] AS [stats_rows] ,
    [ddsp].[rows_sampled] AS [stats_rows_sampled] ,
    [si].[rowmodctr] AS [index_row_modifications] ,
    [ddsp].[modification_counter] AS [stats_modification_counter] ,
    [ddsp].[last_updated] AS [last_stats_update]
FROM
    [sys].[sysindexes] [si]
JOIN [sys].[stats] [s]
ON  [si].[id] = [s].[object_id]
    AND [si].[indid] = [s].[stats_id]
JOIN [sys].[tables] [t]
ON  [t].[object_id] = [si].[id]
CROSS APPLY [sys].[dm_db_stats_properties]([s].[object_id], [s].[stats_id]) AS [ddsp]
WHERE
    [t].[name] LIKE 'Nuisance%'
ORDER BY
    [t].[name] ,
    [si].[indid];

Holy heck why don’t we have any statistics? The indexes tracked our million modifications from the insert, but the statistics aren’t showing us anything. They’re all NULL! Right now, SQL has no idea what’s going on in here.

Empty inside

At least, until it has to. If we ran a query with a WHERE clause, an initial statistics update would fire off. Hooray. SQL is lazy. We can skip all that fuss and just update manually. I want a FULLSCAN! No fullscan, no peace. Or something.

UPDATE STATISTICS [dbo].[Nuisance] WITH FULLSCAN;
UPDATE STATISTICS [dbo].[Nuisance2] WITH FULLSCAN;

If we go back to our DMV query, the stats columns will at least not be NULL now. It will show 1,000,000 rows sampled, and no modifications, and the last stats update column will have a date in it. Wonderful. You don’t need a picture of that. Conceptualize. Channel your inner artist.

Weir it all gets whered

Let’s think back to our indexes.

  • Nuisance has the clustered PK on ID
  • Nuisance2 has the clustered PK on ID, DateFiller
  • They both have non-unique nonclustered indexes on DateFiller, TextFiller

One may posit, then, that they could let their workloads run wild and free, and that SQL would dutifully track modifications, and trigger automatic updates when necessary. This is being run on 2014, so we don’t expect the dynamic threshold stuff. The rule that applies to us here, since our table is >500 rows, is that if 20% of the table + 500 rows changes, SQL will consider the statistics stale, and trigger an update the next time a query that uses those statistics runs against the table.
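
If you want to put a number on “stale” for our million row tables, the napkin math is a one-liner:

-- 20% of the table + 500 rows, per the classic threshold
SELECT (1000000 * 0.20) + 500 AS [modifications_before_stale]; -- 200,500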

But, but, but! It does not treat all modifications equally. Let’s look at some examples, and then buckle in for the explanation. No TL;DR here. You must all suffer as I have suffered.

We’ll start with an update of the nonclustered index on Nuisance.

UPDATE
    [n]
SET
    [n].[DateFiller] = DATEADD(MICROSECOND, 1, [n].[DateFiller]),
  [n].[TextFiller] = REPLACE(n.[TextFiller], ' ', '')
FROM
    [dbo].[Nuisance] AS [n]
WHERE
    [n].[ID] >= 1 AND ID <= 100000
  AND 
  [n].[DateFiller] >= '0001-01-01' AND [n].[DateFiller] <= '9999-12-31'

SELECT @@ROWCOUNT AS [Rows Modified]

We use @@ROWCOUNT to verify the number of rows that were updated in the query. Got it? Good. It should show you that 100,000 rows were harmed during the filming of that query. Poor rows.

Here’s the execution plan for it. Since we don’t have a kajillion indexes on the table, we get a narrow plan. There are some compute scalars to come up with the date adding, the replace, and the predicates in our WHERE clause. It’s all in the book. You should get the book.

ACTUAL EXE-CUTIE-PIE

At this point, if you run the DMV query, you should see 100,000 modifications to the nonclustered index on Nuisance. Not enough to trigger an update, but we don’t care about that in this post. It makes sense though, right? We updated 100k rows, SQL tracked 100k modifications.

What if we run the same update on Nuisance2? We still only update 100k rows, but our execution plan changes a little bit…

Split! Sort! Collapse! Fear! Fire! Foes!

And now we have TWO HUNDRED THOUSAND MODIFICATIONS?

What in the wide world of sports?

This is how SQL handles updates on columns with unique constraints, which we’ll get to. But let’s look at a couple other updates first!

UPDATE
    [n]
SET
    [n].[ID] += 1
FROM
    [dbo].[Nuisance] AS [n]

SELECT @@ROWCOUNT AS [Rows Modified]

If we go back and update just the ID column of Nuisance, something really cool happens.

Two is the loneliest number

It only took two modifications to update one million rows in the clustered index. We still had to update all million rows of the nonclustered index (+1, I’m guessing, to insert the new row for ID 1,000,001).

That’s because, if you’ve been paying attention, nonclustered indexes carry all the key columns of your clustered index. We updated the clustered index, so we had to update our nonclustered index. If we had multiple nonclustered indexes, we’d have to update them all. This is why many sane and rational people will tell you to not pick columns you’re going to update for your clustered index.
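
If you want to see the “nonclustered indexes carry the clustered key” bit for yourself, here’s a quick sketch against Nuisance. The index hint is only there to make the point obvious: ID comes back with no key lookup, even though it isn’t declared anywhere in ix_Nuisance.

SELECT [ID], [DateFiller]
FROM [dbo].[Nuisance] WITH ( INDEX ( [ix_Nuisance] ) )
WHERE [DateFiller] >= '0001-01-01';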

If you’re still looking at execution plans, you’ll see the split/sort/collapse operators going into the clustered index again, but only split and sort going into the nonclustered index update.

Oh, yeah. That update.

If we run the same update on Nuisance2, and check back in on the DMVs, it took a million modifications (+5 this time; due to the data distribution, there are net 5 new rows, since there are exactly five unique values in DateFiller). But at least it didn’t take 2 million modifications to update it, right?

I still can’t do math.

Bring it on home

Why are there such big differences in the modification counts?

For the update to the ID column of Nuisance, it only took two modifications. This is because of the split/sort/collapse operations.

Split takes the update, and, as the name implies, splits it into inserts and deletes. If you think about what it would look like to change 1 through 1,000,000 to 2 through 1,000,001, it really is only two modifications:

  1. Delete row 1
  2. Insert row 1,000,001

All the other numbers in the range already exist, in order. That’s basically what the sort does: it orders the values, along with whether each one needs an insert or a delete to occur. The final operation, collapse, removes duplicate actions. You don’t need to delete and re-insert every number.

Unfortunately, for Nuisance2, it results in doubling the modifications required. This is true for the clustered index update, where DateFiller is the second column, and the nonclustered index update, where DateFiller is the leading column.

It doesn’t appear to be the data distribution, or the data type of the column, that causes the doubled work. As things stand in this demo, there are only five unique values in DateFiller. I tried it with every value unique, and I also tried DateFiller as DATE and as BIGINT, but in each scenario, SQL tracked 2x the number of modifications to each index.

Takeaways

I’m all for unique indexes! I’m even okay with two column PK/clustered indexes. But be really careful when assigning constraints, and make sure you test your workload against them. While they may obviously help read queries, there’s some cost to maintaining them when modifying data.

What I didn’t mention this whole time, because I didn’t want it to get in the way up there, was how long each update query took. So I’ll leave you with the statistics time and IO results for each one.

Thanks for reading!

Nuisance nonclustered index update
/*
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 3 ms.
Table 'Nuisance'. Scan count 1, logical reads 630970, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 1062 ms,  elapsed time = 1129 ms.
*/

Nuisance2 nonclustered index update
/*
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 2 ms.
Table 'Nuisance2'. Scan count 5, logical reads 1231177, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 2484 ms,  elapsed time = 2633 ms.
*/

Nuisance clustered index update
/*
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 1 ms.
Table 'Nuisance'. Scan count 1, logical reads 9191793, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 20625 ms,  elapsed time = 20622 ms.
*/

Nuisance2 clustered index update
/*
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 1 ms, elapsed time = 1 ms.
Table 'Nuisance2'. Scan count 1, logical reads 12191808, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 36141 ms,  elapsed time = 36434 ms.
*/

Brent says: go back and read this again, because you didn’t digest it the first time. Plus, trust me, the time it takes you to read is nowhere near what it took for Erik to get to the root cause on this. (We saw the play-by-play unfold in the company chat room.)

Why most of you should leave Auto-Update Statistics on

Oh God, he’s talking about statistics again

Yeah, but this should be less annoying than the other times. And much shorter.

You see, I hear grousing.

Updating statistics was bringin’ us down, man. Harshing our mellow. The statistics would just update, man, and it would take like… Forever, man. Man.

But no one would actually be able to tell me how long it took. Though many stacks of Necronomicons were wheel-barrowed out and sworn upon for various incantations of “it was faster, we timed it” and “yes, performance improved” and “no, nothing else was different”.

What I would believe, because it’s totally believable, is that perhaps statistics updated, and some plans recompiled, and that recompiling the plans made things take longer.

Okay, fair enough. But no one ever says that. I wish someone would so I could take one look at an execution plan, verify that it looks like Nyarlathotep eating the NYC skyline, and say “yeah, that’d probably take a few to compile a plan for, let’s try to figure out how to break that into smaller pieces”.

Or, you know, something else reasonable.

Where am I going with this? Oh yeah. I measured. With Extended Events. So I’m extra angry about having to use those things again. XML is a hostile datatype. Don’t let the cute and cuddly creatures on those O’Reilly books fool you. Here’s the setup for the XE session.

CREATE EVENT SESSION [StatsGather] ON SERVER 
ADD EVENT sqlserver.auto_stats(
 ACTION(sqlos.task_time,sqlserver.database_id,sqlserver.database_name,sqlserver.plan_handle,sqlserver.session_id,sqlserver.tsql_stack)
    WHERE ([package0].[equal_uint64]([sqlserver].[session_id],(52)) AND [sqlserver].[is_system]=(0)))
ADD TARGET package0.event_file(SET filename=N'C:\temp\StatsGather',max_rollover_files=(10))
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF)
GO

Then, I ran the same type of workload that I ran to get my statistics thresholds for automatic updates. Except, of course, this time I’m only looking at how long each update took. Not when it happened. We already know that. If you want the query I used to parse the session data, it’ll be at the end of the post. I’d rather spend your ever-shortening attention spans getting to the point.

Here’s the point:

OOH! COLORS!

Updating statistics, even on some pretty good sized tables, didn’t really take that long. Everything is color-coded, so you can see the row count, how many rows were modified, etc. right next to the corresponding event time and timing. The statistics in red text have nothing to do with our tests, but I left them in there for completeness. They took absolutely minuscule amounts of time.

For the really big tables, which were all in the 10 million to 100 million row range, the statistics update itself never took more than 1 second. It stuck right around the half second mark aside from a few times, in the middle oddly, which I’m blaming on:

  1. AWS
  2. Clouds
  3. Disks
  4. Juggalos.

Now, how you proceed depends on a few things.

  • Do you update statistics often enough to not need to have automatic updates?
  • Are your update routines using FULLSCAN? (auto stats updates sample a percentage of the table)
  • Do you not have data movement during the day (presumably when recompiling queries would be user-facing)?
  • Can you not afford an occasional half second statistics update?
  • Do your queries not benefit from updated statistics?

If you answer yes, yes, no, no, no, you’re not just singing about a dozen Amy Winehouse songs at once, you also might be in a situation crazy enough to warrant turning auto-update stats off.
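
And if you really are that special snowflake, the switch itself is a one-liner per database. The database name here is hypothetical, obviously.

ALTER DATABASE [YourDatabase] SET AUTO_UPDATE_STATISTICS OFF;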

Thanks for reading!

Begin code!

IF OBJECT_ID('tempdb..#StatsGather') IS NOT NULL
   DROP TABLE [#StatsGather];

CREATE TABLE [#StatsGather]
       (
         [ID] INT IDENTITY(1, 1)
                  NOT NULL ,
         [WaitsXML] XML ,
         CONSTRAINT [PK_StatsGather] PRIMARY KEY CLUSTERED ( [ID] )
       );

INSERT  [#StatsGather]
        ( [WaitsXML] )
SELECT  CONVERT(XML, [event_data]) AS [WaitsXML]
FROM    [sys].[fn_xe_file_target_read_file]('C:\temp\StatsGather*.xel', NULL, NULL, NULL)

CREATE PRIMARY XML INDEX [StatsGatherXML] ON [#StatsGather]([WaitsXML]);

CREATE XML INDEX [StatsGatherXMLPath] ON [#StatsGather]([WaitsXML])
USING XML INDEX [StatsGatherXML] FOR VALUE;



WITH x1 AS (
SELECT
[sw].[WaitsXML].[value]('(event/action[@name="session_id"]/value)[1]', 'BIGINT') AS [session_id] ,	
DB_NAME([sw].[WaitsXML].[value]('(event/action[@name="database_id"]/value)[1]', 'BIGINT')) AS [database_name] ,				
[sw].[WaitsXML].[value]('(/event/@timestamp)[1]', 'DATETIME2(7)') AS [event_time],
[sw].[WaitsXML].[value]('(/event/@name)[1]', 'VARCHAR(MAX)') AS [event_name],
[sw].[WaitsXML].[value]('(/event/data[@name="index_id"]/value)[1]', 'BIGINT') AS [index_id],
[sw].[WaitsXML].[value]('(/event/data[@name="job_type"]/text)[1]', 'VARCHAR(MAX)') AS [job_type],
[sw].[WaitsXML].[value]('(/event/data[@name="status"]/text)[1]', 'VARCHAR(MAX)') AS [status],
[sw].[WaitsXML].[value]('(/event/data[@name="duration"]/value)[1]', 'BIGINT') / 1000000. AS [duration],
[sw].[WaitsXML].[value]('(/event/data[@name="statistics_list"]/value)[1]', 'VARCHAR(MAX)') AS [statistics_list]
FROM [#StatsGather] AS [sw]  
)
SELECT	[x1].[session_id] ,
        [x1].[database_name] ,
        [x1].[event_time] ,
        [x1].[event_name] ,
        [x1].[index_id] ,
        [x1].[job_type] ,
        [x1].[status] ,
        [x1].[duration] ,
        [x1].[statistics_list]
FROM x1
WHERE [x1].[duration] > 0
ORDER BY [x1].[event_time]

Brent says: if you’re thinking, “Oh, but my tables have more than a hundred million rows, so stats updates would take way longer,” then it’s time to think about regularly updating your stats during maintenance windows anyway. As long as you’re doing that say, weekly, then what are the odds that you’re going to trip the stats update threshold during the week on a billion row table? And if you do, it’s time to think about partitioned views.

Splitting Strings: SQL Server 2016 To The Rescue!

Hold me now!

Richie had to break out the fainting couch when I read the release notes for SQL Server 2016 RC0. Why? WHY? Because of one thing. One thing I have watched so many people butcher and bungle over the years (even myself, in about 2009).

New Built-in Table-Valued Function STRING_SPLIT
STRING_SPLIT is a T-SQL function that splits input character expression by specified separator and outputs result as a table.

Oh. My. God. Becky.

Here’s the thing. I see enough people still killing parallelism with scalar valued functions. How do you make it worse? I dunno, maybe stick a WHILE LOOP in a scalar valued function and use it to split strings. Over the years, different methods have popped up. Jeff Moden has DelimitedSplit8k, through which all things are possible. Adam Machanic has a CLR version that’s pretty awesome too, if you can use CLR.

Q: How many DBAs does it take to compile a .dll?
A: Most of them

If you’re still stuck on a version of SQL lower than 2016 RC0, which is basically everyone, you should totally still explore those as options to help you split strings. If you’re in the future, you should give the new STRING_SPLIT function a go. It works pretty well! This wouldn’t be much of a post without some demo code, so here goes.

If you just want to pass in a string, it’s easy enough.

SELECT * 
FROM STRING_SPLIT('The lousy DBA splits strings with WHILE loops', ' ')

Most people who need to split strings don’t just need to split one. They’ll need to do it a whole bunch of times. To call the function over a table column, you need to use CROSS APPLY.

SELECT p.[Id], [p].[Body], [ss].[value]
FROM [dbo].[Posts] AS [p]
CROSS APPLY STRING_SPLIT([p].[Body], ' ') AS [ss]
WHERE [p].[Id] = 4

Like most internal functions, you can’t readily get the code. That’s why I’m not going to do a benchmark test against other options here. It might be unfair if MS leveraged some fancy-pants super-secret-squirrel-sauce. One thing I’ll point out, though, is that if you’re used to using DelimitedSplit8K, you might miss the fact that it returns a row number, which STRING_SPLIT doesn’t. That’s particularly useful if you’re always interested in the Nth element of a returned string. In order to get that here, you need to call it like so.

SELECT p.[Id], [p].[Body], [x].[value], x.[RN]
FROM [dbo].[Posts] AS [p]
CROSS APPLY (
SELECT ss.[value], 
     ROW_NUMBER() OVER (PARTITION BY [p].[Id] ORDER BY [p].[Id]) AS RN
FROM STRING_SPLIT([p].[Body], ' ') AS [ss]
      ) AS x
WHERE [p].[Id] = 4
AND x.[RN] = 2

Since this function is of the inline table valued variety, you’re free to put all sorts of different predicates on the results it returns, which come back in a column called value. You can use ranges and not equal to constructs just as easily. For brevity I’m just throwing out a couple examples for equality and IN.

SELECT p.[Id], [p].[Body], [ss].[value]
FROM [dbo].[Posts] AS [p]
CROSS APPLY STRING_SPLIT([p].[Body], ' ') AS [ss]
WHERE [p].[Id] = 4
AND [ss].[value] = 'change'


SELECT p.[Id], [p].[Body], [ss].[value]
FROM [dbo].[Posts] AS [p]
CROSS APPLY STRING_SPLIT([p].[Body], ' ') AS [ss]
WHERE [p].[Id] = 4
AND [ss].[value] IN ('change', 'trans', 'convert')

Where a lot of people will likely find this useful is for passing lists of values into a stored procedure or other piece of code. You can also perform normal JOINs to it.

SELECT p.[Id], [p].[Body], [ss].[value]
FROM [dbo].[Posts] AS [p]
JOIN STRING_SPLIT('4,5,6,7,8,9,10,11,12,13,14,15,16', ',') AS [ss]
ON [ss].[value] = p.[Id]

SELECT p.[Id], [p].[Body], [ss].[value]
FROM [dbo].[Posts] AS [p]
RIGHT JOIN STRING_SPLIT('4,5,6,7,8,9,10,11,12,13,14,15,16', ',') AS [ss]
ON [ss].[value] = p.[Id]
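
And here’s a rough sketch of the “pass a list into a stored procedure” idea. The procedure name and parameter are made up for illustration.

CREATE PROCEDURE [dbo].[GetPostsByIdList]
    @IdList VARCHAR(8000)
AS
BEGIN
    SET NOCOUNT ON;

    SELECT p.[Id], [p].[Body], [ss].[value]
    FROM [dbo].[Posts] AS [p]
    JOIN STRING_SPLIT(@IdList, ',') AS [ss]
    ON [ss].[value] = p.[Id];
END
GO

EXEC [dbo].[GetPostsByIdList] @IdList = '4,5,6,7,8';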

A couple words of warning here, though. Even though this works, if the string is passed in with spaces, there will be spaces in your results. That doesn’t break the join for numbers, but it may for text data. The other issue, and the reason the numeric join works at all, is that STRING_SPLIT returns an NVARCHAR column, which gets implicitly converted to match the integer Id column. That’s why you’ll see implicit conversion warnings in your plan when you join like this.

My favorite band is The Monkees.
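
If the conversion bugs you, one workaround (a sketch, not gospel) is to make it explicit on the split side, since Id is an integer column. Whether the optimizer likes this any better is something you should test against your own data.

SELECT p.[Id], [p].[Body], [ss].[value]
FROM [dbo].[Posts] AS [p]
JOIN STRING_SPLIT('4,5,6,7,8,9,10,11,12,13,14,15,16', ',') AS [ss]
ON TRY_CONVERT(INT, [ss].[value]) = p.[Id]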

But hey, don’t let that stop you from having a good time. You can use the returned value in CASE statements as well.

SELECT p.[Id], [p].[Body], [ss].[value], 
CASE WHEN [ss].[value] LIKE '[^a-zA-Z]%' THEN 1 ELSE 0 END AS [Doesn't Start With A Letter],
CASE WHEN [ss].[value] LIKE '[^0-9]%' THEN 1 ELSE 0 END AS [Doesn't Start With A Number]
FROM [dbo].[Posts] AS [p]
CROSS APPLY STRING_SPLIT([p].[Body], ' ') AS [ss]
WHERE [p].[Id] = 4

If your needs are bit more exotic, and you need to split on CHARACTERS WHOSE NAME SHALL NOT BE PRINTED, you can pass in N/CHAR values as well.

SELECT p.[Id], [p].[Body], [ss].[value]
FROM [dbo].[Posts] AS [p]
CROSS APPLY STRING_SPLIT([p].[Body], CHAR(10)) AS [ss]
WHERE [p].[Id] = 4

And, of course, you can perform regular Inserts, Updates, and Deletes, with reference to STRING_SPLIT’s value column. Quick example with SELECT…INTO!

SELECT p.[Id], [p].[Body], [ss].[value]
INTO #temp
FROM [dbo].[Posts] AS [p]
CROSS APPLY STRING_SPLIT([p].[Body], ' ') AS [ss]
WHERE [p].[Id] = 4

That’s about as much as I’ve done with it so far. If you have any cool use cases, leave a comment.

Thanks for reading!

The Joy of Joining on NULLs

With all the trouble NULLs cause…

You’d think people would be more inclined to avoid them. Slap a NOT NULL constraint and a default value on your column and call it a day. I’m perfectly fine with bizarro world canary values. If it’s an integer column, some really high (low?) negative number. If it’s date-based, why not have it be the lowest value your choice accommodates?
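
If you want to go the canary route, the fix looks something like this. The table and column names here are hypothetical.

-- Backfill the NULLs with the canary, then lock the door behind them
UPDATE [dbo].[Orders] SET [OrderDate] = '1900-01-01' WHERE [OrderDate] IS NULL;

ALTER TABLE [dbo].[Orders] ALTER COLUMN [OrderDate] DATE NOT NULL;

ALTER TABLE [dbo].[Orders] ADD CONSTRAINT [DF_Orders_OrderDate] DEFAULT '1900-01-01' FOR [OrderDate];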

But no, no, no. Instead of stopping the bleeding, developers tend to just avoid getting blood on the new carpet. Some of the worst performing queries I see have ISNULL(something, '') = ISNULL(somethingelse, '') either in the JOIN or WHERE clause. Still. To this day. In the age of the hoverboard, people are still putting functions where they don’t belong. And I know they have the internet.

But did you really mean that?

Most people will slap together their ISNULL manifesto and call it the dog end of a day gone by. But most of them don’t realize how that changes the logic of the query.

For instance, I see ISNULL(intcol, 0) = 0 pretty often. Most people are surprised that rows with 0 in them will also be included here.
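
Don’t believe me? Here’s a tiny self-contained example.

-- ISNULL(intcol, 0) = 0 catches the NULL *and* the real zero: two rows come back
SELECT [x].[intcol]
FROM ( VALUES ( NULL ), ( 0 ), ( 1 ) ) AS [x] ( [intcol] )
WHERE ISNULL([x].[intcol], 0) = 0;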

But for joins, well, let’s look at what’s really going on.

Teeny tiny tables, appear!

--100 rows, every 3rd date is null
SELECT TOP 100
 [x].[Rn]  AS [ID],
CASE WHEN [x].[Rn] % 3 = 0 THEN NULL ELSE CAST(DATEADD(HOUR, [x].[Rn], GETDATE() ) AS DATE) END AS [OrderDate]
INTO #t1
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL) ) Rn
FROM sys.[messages] AS [m],
     sys.[messages] AS [m2],
     sys.[messages] AS [m3]
) x

CREATE CLUSTERED INDEX [cxt1] ON [#t1] ([ID], [OrderDate])

--100 rows, every 5th date is null
SELECT TOP 100
[y].[Rn]  AS [ID],
CASE WHEN [y].[Rn] % 5 = 0 THEN NULL ELSE CAST(DATEADD(HOUR, [y].[Rn], GETDATE() ) AS DATE) END AS [OrderDate]
INTO #t2
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL) ) Rn
FROM sys.[messages] AS [m],
     sys.[messages] AS [m2],
     sys.[messages] AS [m3]
) y

CREATE CLUSTERED INDEX [cxt2] ON [#t2] ([ID], [OrderDate])

We don’t even need 100 rows for this, but whatever. Let’s join these muddy truckers. Here’s a pretty average inner join, on both ID and OrderDate. It returns 53 rows; all 100 IDs match, but the NULL dates prohibit some joins from occurring.

SELECT *
FROM [#t1] AS [t]
JOIN [#t2] AS [t2]
ON [t2].[ID] = [t].[ID]
AND [t2].[OrderDate] = [t].[OrderDate]

The results will be something like this. IDs are missing where dates are NULL. For us, that’s every 3rd row in t1, and every 5th row in t2.

Missing links

SELECT *
FROM [#t1] AS [t]
JOIN [#t2] AS [t2]
ON [t2].[ID] = [t].[ID]
AND ISNULL([t2].[OrderDate], '1900-01-01') = ISNULL([t].[OrderDate], '1900-01-01')

Most people will think this will get them out of the woods, but it only fixes six of the NULL joins.

Hey! Something happened!

We’ll call them the FizzBuzz six, because…

SELECT *
FROM [#t1] AS [t]
JOIN [#t2] AS [t2]
ON [t2].[ID] = [t].[ID]
WHERE [t].[OrderDate] IS NULL
AND [t2].[OrderDate] IS NULL
Jammin’ on the 3. And 5.

You guessed it, only numbers divisible by three AND five get joined on our canary value. If we switch over to a left join, we’ll at least see where the gaps are.

SELECT *
FROM [#t1] AS [t]
LEFT JOIN [#t2] AS [t2]
ON [t2].[ID] = [t].[ID]
AND [t2].[OrderDate] = [t].[OrderDate]
ORDER BY [t].[ID]
And uh, carry the… Whatever.

Since t1 is the left side, the gaps are every 3rd ID. With t2 on the right side of the join, every 3rd and 5th ID is NULL. To really get the full picture, we’ll need to switch to a full join. This is instructive because it gives us ALL the unmatched rows.

Check out the next three queries, and then why each is important.

SELECT *
FROM [#t1] AS [t]
FULL JOIN [#t2] AS [t2]
ON [t2].[ID] = [t].[ID]
AND [t2].[OrderDate] = [t].[OrderDate]
ORDER BY [t].[ID], [t2].[ID]

SELECT *
FROM [#t1] AS [t]
FULL JOIN [#t2] AS [t2]
ON [t2].[ID] = [t].[ID]
AND [t2].[OrderDate] = [t].[OrderDate]
WHERE [t].[OrderDate] IS NOT NULL
ORDER BY [t].[ID], [t2].[ID]

SELECT *
FROM [#t1] AS [t]
FULL JOIN [#t2] AS [t2]
ON [t2].[ID] = [t].[ID]
AND [t2].[OrderDate] = [t].[OrderDate]
WHERE [t2].[OrderDate] IS NOT NULL
ORDER BY [t].[ID], [t2].[ID]

For the full join with no WHERE clause, we get back 147 rows. In the next two queries, we filter out NULLs from each of the date columns.

In the second query, 33 rows are missing. That makes sense, because there are 33 rows between 1 and 100 that are evenly divisible by 3.

For the third query, there are 20 rows missing. If you’re not a Mathlete, yes, 20 rows between 1 and 100 are evenly divisible by 5.

This goes back to our unfiltered query: 200 – 20 – 33 = 147.

There are your missing 53 rows.

So what did you really need?

Think carefully about what you’re trying to return with your query. It’s highly unlikely that ISNULL solved your problem, and that you needed all matching results, plus matching results where BOTH are NULL. If you did, this would be a better way to do that:

SELECT *
FROM [#t1] AS [t]
JOIN [#t2] AS [t2]
ON [t2].[ID] = [t].[ID]
WHERE [t].[OrderDate] IS NULL
AND [t2].[OrderDate] IS NULL

UNION ALL

SELECT *
FROM [#t1] AS [t]
JOIN [#t2] AS [t2]
ON [t2].[ID] = [t].[ID]
AND [t2].[OrderDate] = [t].[OrderDate]
ORDER BY [t].[ID]

Killing Join

Remember that joins describe the relationship between your tables, and the where clause describes what you want (or don’t want) to keep from the table relationship.

Think about the first time you stupidly got an apartment with someone you just started dating. You joined your furniture based on the rooms in the apartment, and then filtered out what you weren’t keeping.

If you were me, all your cool guy stuff ended up at the curb. Heckuva where clause, that was.

This post goes out to Paul White, who also wants his t-shirts back. You know who you are!

Thanks for reading!


Make Extended Events Great… Finally

SQL Server 2016 is adding some really awesome stuff

Not the least of which is a ton of super cool information in execution plans, and an Extended Events session to expose the same information for all queries. The details are detailed in detail over on this blog post from MS.

For those of you with clicking allergies, a whole mess o’ run-time information is being recorded in execution plans, and by a new Extended Events session called query_thread_profile. Microsoft may be changing, but they still pick horrible names for things. Query Thread Profile sounds like one of those stores the size of a closet that sells four t-shirts that all cost $1000.

Let’s play with Extended Events

You can’t see any of this stuff in the graphical execution plan yet, only the plan XML. It’s only in actual execution plans, and in the XE session. If you read that, paused, and made a face, yeah, the run-time stats aren’t in cached or estimated plans. With Query Store’s capabilities, it’s probably less necessary to jam this stuff into every DMV and cache, so I’m not all that bummed about it.

Anywho! If you’ve got SQL Server 2016 RC0 or later installed, you can plug this in and run some queries. You will probably need to change the session ID I’m filtering on here.

CREATE EVENT SESSION [RunTimeStats] ON SERVER 
ADD EVENT sqlserver.query_thread_profile(
    ACTION(sqlos.task_time,sqlserver.database_name,sqlserver.plan_handle,sqlserver.session_id,sqlserver.sql_text)
    WHERE ([sqlserver].[session_id]=(55)))
ADD TARGET package0.event_file(SET filename=N'C:\temp\RunTimeStats',max_rollover_files=(10))
GO

I’ll be running stuff against StackOverflow, as usual. If you write dumb enough queries, you can really keep a server (laptop) busy for a while. A quick word of warning: this session will store one row per node, per thread, per query. You can really quickly end up with a lot of data in here. You’ll see in a minute what I mean, but let’s run some queries first. You will be sorely disappointed with the data if you don’t run any queries.

ALTER EVENT SESSION [RunTimeStats] ON SERVER STATE = START;

SELECT [DisplayName], [p].[Title]
FROM [dbo].[Users] AS [u]
JOIN [dbo].[Posts] AS [p]
ON [p].[OwnerUserId] = [u].[Id]
WHERE [u].[ID] = 26837
AND [p].[PostTypeId] = 1
GO

SELECT [DisplayName], [p].[Title]
FROM [dbo].[Users] AS [u]
JOIN [dbo].[Posts] AS [p]
ON [p].[OwnerUserId] = [u].[Id]
WHERE [u].[ID] = 483467
AND [p].[PostTypeId] = 1
GO

SELECT [u].[DisplayName]
FROM [dbo].[Users] AS [u]
WHERE [u].[Id] = 26837
GO 

SELECT [u].[DisplayName]
FROM [dbo].[Users] AS [u]
WHERE [u].[Id] = 483467
GO 

ALTER EVENT SESSION [RunTimeStats] ON SERVER STATE = STOP;

Code to parse the session data will be at the end, as usual. Good grief, I have an “as usual” for Extended Events posts. Here’s what we get back.

Information Society

That thing I said about nodes and threads? Here’s a good example of it. The query plan and text columns will all look about the same, because we ran two pairs of nearly identical queries, just with different IDs. Node three of the first query used four threads. Remember that in every parallel execution, there’s one coordinator thread that doesn’t do any actual query work; that’s why you have 0-4 for thread IDs, totaling five.

Looking at the plan from our XML column confirms most of this. But again, cached plans offer less information overall. If you get the actual plan here, you’ll be able to see degree of parallelism information.

We’re oh so parallel.

Checking out the rest of the results, all the promised information is in there: CPU time, total time, rewinds, rebinds, row information, IO information, and so forth. Very powerful stuff. I hope this turns into an easy way to detect imbalances in parallel plans — when one or more threads process way more rows than the rest. That can really hurt you if, say, one thread ends up processing a million rows, and the rest aren’t doing anything. Not very parallel, that.

But one thing I think is really cool (and it’s not just information you can get from this XE session; you can get it from a couple of others that I’ll blog about in the future) is seeing what input(s) queries ran with when they’ve been parameterized, or run as stored procedures with parameters passed in. Not just the compiled value, the run-time value! Hooray. Think of all the nights and weekends you can spend fixing things! Just like I pictured the 21st century.

For our simple queries against the user table, here’s what you get for the execution plan. You can see it’s been, simple… simply… parameter… ized? There’s been simple parameterization. There we go. That’s the @1 smallint business going on in there.

You are going to need much bigger guns!

And here’s what the XE query brings back for query text.

Faced!

More trouble coming every day

SQL Server 2016 is looking pretty promising in quite a few areas. I’ll be blogging about all the stuff that catches my eye, so stay tuned.

Thanks for reading!

Begin code!

-- Dump the session file into a temp table first, same pattern as the StatsGather query earlier.
IF OBJECT_ID('tempdb..#RunTimeStats') IS NOT NULL
   DROP TABLE [#RunTimeStats];

CREATE TABLE [#RunTimeStats]
       (
         [ID] INT IDENTITY(1, 1) NOT NULL ,
         [WaitsXML] XML ,
         CONSTRAINT [PK_RunTimeStats] PRIMARY KEY CLUSTERED ( [ID] )
       );

INSERT  [#RunTimeStats]
        ( [WaitsXML] )
SELECT  CONVERT(XML, [event_data]) AS [WaitsXML]
FROM    [sys].[fn_xe_file_target_read_file]('C:\temp\RunTimeStats*.xel', NULL, NULL, NULL);

SELECT 
[deqp].[query_plan],
WaitsXML.value('(event/action[@name="sql_text"]/value)[1]', 'VARCHAR(MAX)') AS [sql_text],
WaitsXML.value('(event/action[@name="database_name"]/value)[1]', 'VARCHAR(MAX)') AS [database_name],
WaitsXML.value('(event/action[@name="session_id"]/value)[1]', 'VARCHAR(MAX)') AS [session_id],
DATEADD(MINUTE, DATEDIFF(MINUTE, GETUTCDATE(), GETDATE()), WaitsXML.value('(event/@timestamp)[1]', 'DATETIME2')) AS [event_time] ,
WaitsXML.value('(event/data[@name="node_id"]/value)[1]', 'BIGINT') AS [node_id],
WaitsXML.value('(event/data[@name="thread_id"]/value)[1]', 'BIGINT') AS [thread_id],
WaitsXML.value('(event/data[@name="cpu_time_us"]/value)[1]', 'BIGINT') / 1000 AS [cpu_time_ms],
WaitsXML.value('(event/data[@name="total_time_us"]/value)[1]', 'BIGINT') / 1000 AS [total_time_ms],
WaitsXML.value('(event/data[@name="actual_rewinds"]/value)[1]', 'BIGINT') AS [actual_rewinds],
WaitsXML.value('(event/data[@name="actual_rebinds"]/value)[1]', 'BIGINT') AS [actual_rebinds],
WaitsXML.value('(event/data[@name="actual_execution_mode"]/text)[1]', 'VARCHAR(MAX)') AS [actual_execution_mode],
WaitsXML.value('(event/data[@name="actual_rows"]/value)[1]', 'BIGINT') AS [actual_rows],
WaitsXML.value('(event/data[@name="actual_batches"]/value)[1]', 'BIGINT') AS [actual_batches],
WaitsXML.value('(event/data[@name="io_reported"]/value)[1]', 'VARCHAR(MAX)') AS [io_reported],
WaitsXML.value('(event/data[@name="actual_logical_reads"]/value)[1]', 'BIGINT') AS [actual_logical_reads],
WaitsXML.value('(event/data[@name="actual_physical_reads"]/value)[1]', 'BIGINT') AS [actual_physical_reads], 
WaitsXML.value('(event/data[@name="actual_ra_reads"]/value)[1]', 'BIGINT') AS [actual_ra_reads], 
WaitsXML.value('(event/data[@name="actual_writes"]/value)[1]', 'BIGINT') AS [actual_writes]
FROM [#RunTimeStats]
OUTER APPLY [sys].[dm_exec_query_plan](WaitsXML.value('xs:hexBinary((event/action[@name="plan_handle"]/value)[1])', 'VARBINARY(64)')) AS [deqp]
ORDER BY WaitsXML.value('(event/@timestamp)[1]', 'DATETIME2'), WaitsXML.value('(event/data[@name="node_id"]/value)[1]', 'BIGINT'), WaitsXML.value('(event/data[@name="thread_id"]/value)[1]', 'BIGINT')
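
And since I mentioned hoping to spot imbalances in parallel plans, here’s a rough sketch of that idea against the same #RunTimeStats table. It lumps every captured query together, so treat it as a starting point rather than a finished report: for each node, compare the busiest and laziest worker threads.

WITH [per_thread] AS (
SELECT
WaitsXML.value('(event/data[@name="node_id"]/value)[1]', 'BIGINT') AS [node_id],
WaitsXML.value('(event/data[@name="thread_id"]/value)[1]', 'BIGINT') AS [thread_id],
WaitsXML.value('(event/data[@name="actual_rows"]/value)[1]', 'BIGINT') AS [actual_rows]
FROM [#RunTimeStats]
)
SELECT [node_id],
       MIN([actual_rows]) AS [fewest_rows_on_a_thread],
       MAX([actual_rows]) AS [most_rows_on_a_thread]
FROM [per_thread]
WHERE [thread_id] > 0 -- thread 0 is the coordinator; it doesn't process rows
GROUP BY [node_id]
ORDER BY [node_id];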


Generating test data without complicated T-SQL

Sometimes you need garbage

Not because DBAs are the IT equivalent of Oscar the Grouch, but maybe you want to post a question on a forum, and you don’t want to use your own data. At least, you probably shouldn’t just post your own data without masking it. But masking data is annoying, and by the time you get everything in order, someone’s breathing down your neck for an answer.

Your other option is to write T-SQL to generate random data of your choosing, but that can be daunting too. Generating random numbers and dates between ranges, creating relationships, etc. isn’t a ton of fun. Adding more tables makes the whole process exponentially caustic.

Enter the website

A website I really like to help generate data is over here. It’s really easy to set up data the way you want, and you get a lot of neat customization options for a wide variety of data types.

Cold Gettin’ Dumb

You can have it generate the data to just about any format you can think of: CSV, Excel, SQL, JSON, HTML, XML, and you have all sorts of options for each type. It’s really awesome.

It’s Like That

Next time you need to generate some test data, consider trying this site out. It’s been really helpful for me on a bunch of occasions. If you like it, donate. There’s no such thing as a free table.

Thanks for reading!


Minimal Logging when you can’t change the code

What a gem!

Minimal logging is super cool. If you follow all the rules, there’s a pretty good chance it will work! Unfortunately, it’s not always up to you. All sorts of ISVs put out all sorts of bad ideas in form of code. Sort of like in Ghostbusters, when Gozer tells the crew to choose the form of the destructor, and Ray thinks of the Stay-Puft Marshmallow Man. Except in your case, the form of your destructor is a flat file bulk loading process that you can’t change any part of.

So here’s the test setup. Usual caveats about doing this in development, alright? Deal? Deal!

Create a table and dump a million rows of crap into it. A million rows is a good starting place for table sizes. I’m leaving it as a HEAP here, because this is one of the few occasions I’m cool with people having HEAPs. Nothing is faster for dumping data into.

Good ol’ HEAPs. Everything is faster when you don’t care.

USE [Sample]
SET NOCOUNT ON 

IF OBJECT_ID('dbo.BulkLoadTest') IS NOT NULL
   DROP TABLE [dbo].[BulkLoadTest];
USE [Sample]
GO

CREATE TABLE [dbo].[BulkLoadTest](
  [ID] [bigint] IDENTITY (1,1) NOT NULL,
  [DumbGUID] [uniqueidentifier] NOT NULL,
  [PO] [varchar](9) NOT NULL,
  [OrderDate] [date] NOT NULL,
  [ProcessDate] [date] NOT NULL,
  [ShipDate] [date] NOT NULL,
  [SalesPersonID] [int] NULL,
  [CustomerID] [int] NOT NULL,
  [SalesOrder] [varchar](10) NOT NULL,
  [PurchaseOrder] [varchar](10) NOT NULL,
  [OrderSubTotal] [numeric](18, 2) NOT NULL,
  [OrderTax] [numeric](18, 2) NOT NULL,
  [OrderShipCost] [numeric](18, 2) NOT NULL,
  [SalesNotes] [nvarchar](MAX) NULL,
  [isBit] [int] NOT NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]


;WITH E1(N) AS (
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL  ),                          
E2(N) AS (SELECT NULL FROM E1 a, E1 b, E1 c, E1 d, E1 e, E1 f, E1 g, E1 h, E1 i, E1 j),
Numbers AS (SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM E2)
INSERT [dbo].[BulkLoadTest] WITH (TABLOCK)
( [DumbGUID] ,[PO] ,[OrderDate] ,[ProcessDate] ,[ShipDate] ,[SalesPersonID] ,[CustomerID] ,[SalesOrder] ,[PurchaseOrder] , [OrderSubTotal], [OrderTax] ,[OrderShipCost] ,[SalesNotes] ,[isBit] )
SELECT  
        NEWID() AS [DumbGUID] , 
        SUBSTRING(CONVERT(VARCHAR(255), NEWID()), 0, 9) AS [PO] ,
        CONVERT(DATE, DATEADD(HOUR, -[N].[N], GETDATE())) AS [OrderDate] ,
        CONVERT(DATE, DATEADD(HOUR, -[N].[N], GETDATE() + 1)) AS [ProcessDate] ,
        CONVERT(DATE, DATEADD(HOUR, -[N].[N], GETDATE() + 3)) AS [ShipDate] ,
        ABS(CHECKSUM(NEWID()) % 100) + 1 AS [SalesPersonID], 
        ABS(CHECKSUM(NEWID()) % 100000000) + 1 AS [CustomerID], 
        'S' + SUBSTRING(CONVERT(VARCHAR(255), NEWID()), 0, 9) AS [SalesOrder], 
        'P' + SUBSTRING(CONVERT(VARCHAR(255), NEWID()), 0, 9) AS [PurchaseOrder], 
        ABS(CONVERT(NUMERIC(18,2), (CHECKSUM(NEWID()) % 50000.50))) AS [OrderSubTotal], 
        ABS(CONVERT(NUMERIC(18,2), (CHECKSUM(NEWID()) % 100.99))) AS [OrderTax], 
        ABS(CONVERT(NUMERIC(18,2), (CHECKSUM(NEWID()) % 500.88))) AS [OrderShipCost], 
        CASE WHEN [N].[N] % 19 = 0  THEN REPLICATE (CONVERT(NVARCHAR(MAX), 'BOU'), 8000) 
        WHEN [N].[N] % 17 = 0 THEN NULL
        ELSE REPLICATE(CAST(NEWID() AS NVARCHAR(MAX)), CEILING(RAND() / 10 + 1)) END AS [SalesNotes], 
        CASE WHEN [N].[N] % 17 = 0 THEN 1 ELSE 0 END AS [isBit]
FROM    [Numbers] [N]
ORDER BY [N] DESC

Since we need to test a bulk process, we’re using BCP. If you’re cool with xp_cmdshell, you can run this. If not, you’re gonna have to crack open a command prompt and run the bcp commands on their own. You’ll probably have to change the file destinations and server names, unless you hacked into my laptop.

EXEC xp_cmdshell 'bcp Sample.dbo.BulkLoadTest out C:\temp\blt.csv -w -t, -S localhost\NADA2014 -T'

EXEC xp_cmdshell 'bcp Sample.dbo.BulkLoadTest format nul -w -t, -f C:\temp\blt.fmt -S localhost\NADA2014 -T'

This gives us a comma delimited file of all our table data, and a format file to feed to the bulk insert process. You don’t need the format file for this to work, I just prefer to use them when possible. You can do all sorts of cool stuff with format files!

If we had full control, we could very easily load data in through BULK INSERT like this:

BULK INSERT dbo.BulkLoadTest 
   FROM 'C:\temp\blt.csv'
   WITH (
         BATCHSIZE      = 1000000
        ,FORMATFILE     =  'C:\temp\blt.fmt'
        ,MAXERRORS      = 2147483647
        ,ERRORFILE      = 'C:\temp\blt.errlog'
        ,TABLOCK
        )

That takes about 16 seconds on my laptop, dumping in a million row batch from the file we output. If you’re working with more rows, you may want to break the batch size down into smaller chunks.

Do anything for TABLOCK

The magic happens with the TABLOCK option. Without TABLOCK specified, this runs for around 40 seconds. That’s a bit more than double the time it takes when we use TABLOCK and minimal logging kicks in.

But of course, we’re working with a file loading GUI, or something that just doesn’t let us make those kind of changes. So what do we do?

There’s a silly old system stored procedure out there that lets us change certain default options for tables. It is called, shockingly, sys.sp_tableoption. How does it help us here? One of the options is “table lock on bulk load”. That would have been more dramatic if I didn’t give you the link first, huh?

EXEC [sys].[sp_tableoption]
    @TableNamePattern = N'dbo.BulkLoadTest' ,
    @OptionName = 'table lock on bulk load' , 
    @OptionValue = 'ON'

This buys us just about the same real estate as if we used TABLOCK in the BULK INSERT statement. Here are the run times for each!

No TABLOCK:

/*
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 5 ms.
Table 'BulkLoadTest'. Scan count 0, logical reads 1029650, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 19671 ms,  elapsed time = 38798 ms.
Table 'BulkLoadTest'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 16 ms,  elapsed time = 3 ms.
*/

TABLOCK:

/*
 SQL Server Execution Times:
   CPU time = 15688 ms,  elapsed time = 16595 ms.
*/

No TABLOCK, table option set:

/*
SQL Server parse and compile time: 
   CPU time = 15 ms, elapsed time = 192 ms.

 SQL Server Execution Times:
   CPU time = 15297 ms,  elapsed time = 16490 ms.
*/

You don’t have to be an internals wizard

When minimal logging works, the difference in timing and I/O is quite apparent. There are undocumented functions you can use to further prove your point, just don’t run them unless you’re on SQL Server 2014, or 2012 SP2 or higher. On all the other versions, there’s a bug where the function creates threads that won’t go away until you restart. So, yeah. Don’t do that. Unless you’re into thread starvation.
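
The paragraph above doesn’t name names, but sys.fn_dblog is the undocumented function people usually reach for to count log records, so here’s a sketch of that kind of check. Dev servers only, please, and treat the filter as an approximation.

-- Minimal logging should mean far fewer log records for the same load.
-- Run this in the staging/dev database right after the BULK INSERT.
SELECT COUNT(*) AS [log_record_count]
FROM [sys].[fn_dblog](NULL, NULL)
WHERE [AllocUnitName] LIKE '%BulkLoadTest%';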

If you’re a customer, this is a good way to prove to your vendor that minimal logging is cool thing to get working. Lots of applications rely on flat file loading. I’m a big proponent of using staging databases that sit in simple recovery model for stuff like this, so you’re not whomping your client-facing database, and you don’t have to switch it from full to bulk logged to use minimal logging.

Thanks for reading!



Psst… Hey buddy, you want some E-Discovery?

One year ago today!

Well, sort of one year ago. Who knows when this thing will get published? Only God and Brent. And part of the calendar year fell during a leap year, which was just plain cruel, like a two part Buffy episode where the second part is the first episode of the next season DAMN YOU JOSS WHEDON!

Anyway, I started working here, annoying you guys with blog posts, giving Doug rub-downs in between video takes, and walking Ernie when the Ozar family was too full of illegal caviar and albino truffles to move. I also started running down the clock on being able to work with my favorite piece of software again. You can probably guess.

Relativity!

Seriously, I love this thing. Not just because many of the databases I worked with under the software were hundreds of gigs, on up to 9 terabytes, but because the people behind the software really do care about the product. The customer support is aces (Hello, Pod One), and the developers are super helpful and responsive.

Plus, it’s just plain interesting. You give lawyers this crazy interface that lets them build just about any search query they can dream of, including some really, really bad ones, and see how SQL Server reacts.

If you’re a DBA who has read the execution plan of a Relativity query where saved searches reference saved searches that reference saved searches that… you get the point! I feel your pain.

It’s not just the hardware

You can’t fix every saved search and workflow immediately, which makes right-sizing hardware super important, but that’s not the only thing. Every case is different, and they often need custom indexes.

If you’re a DBA who has watched performance tank because some new search suddenly started scanning the clustered index of your 50 million row, one terabyte Documents table with a wildcard search on Email Body and Email Subject and Email Sender and Email Recipients and Email Metadata for ‘insert super common word here’, I feel your pain.

Best in Service

My other favorite part about Relativity is that they have standards. Not last call for alcohol standards, either. They keep you, as a DBA, honest. No backups? Ding. No DBCC CHECKDB? Ding. Perf in the tank? Ding.

The challenges that you’re presented with at scale are immense. You have 100 terabytes of data and you need to check it for corruption weekly. How’s that gonna work?

Index and statistics maintenance can be super important, too. Fragmentation may not matter if you’re reading 1500 pages instead of 1000 pages, but it can sure as heck matter when you’re reading 1,500,000 pages rather than 1,000,000 pages. And all those ascending keys? SQL Server is gonna make some real bad judgement calls on those, especially prior to 2014.

It’s a wild, wild life

You have users bulk loading thousands to millions of documents, users updating records during review, and then searches running on top of all that.

I am thrilled to be able to work with my favorite product again. If you’re experiencing Relativity pains, drop us a line.

Thanks for reading!

Brent says: For a refresher on this app and how we work with it, check out our past posts on The SQL Server Components of kCura Relativity, Performance Tuning kCura Relativity, Using Partitioning to Make kCura Relativity Faster, and Tiering Relativity Databases.


Indexes Helping Indexes

This post isn’t going to solve any problems for you

It was just something I stumbled on that struck me as funny, while working on a demo for something else. Brent didn’t believe it at first, so I thought I’d share with the class. Blog. You, Your Name Here. Hiya!

So there I was, creating some indexes. Since I was in demo mode, I had execution plans turned on. I’m actually terrible about remembering to turn them off. There have been times when I’ve run loops to create some REALLY big tables, and left them turned on. SSMS crashed pretty quickly. Apparently generating execution plans isn’t free! Keep that in mind when you’re timing/tuning queries.

Since they were there, I took a look at them. They were basically what I expected. The only way to index data is to read it and sort it and blah blah blah. But there was a little something extra!

Cycle of life.

Monkey’s uncle

Well would you look at that? Missing index requests. While creating indexes. If that isn’t the ultimate cry for help, I don’t know what is. The first time I created them, it took 1:07. Which isn’t even bad for five indexes on a table the size of Posts in Stack Overflow.

But I figured, hey, science. Let’s see how much an index can help when creating indexes, and I created a fairly ridiculous index based on all the missing index requests for the five I’m making.

CREATE NONCLUSTERED INDEX [ix_helper] ON dbo.[Posts] ([PostTypeId]) 
INCLUDE ([Id], [OwnerUserId], [AnswerCount], [CommentCount], [FavoriteCount], [Score], [ViewCount])

So it’s one key column, and seven includes. SQL doesn’t care about the order of includes. They’re not sorted anywhere, they just hang out at the leaf level of the index and make the optimizer happy when it doesn’t have to do a million key lookups back to the clustered index.

Better than fingers.

Then I went back and created my indexes again. It took 44 seconds this time. That’s an exciting drop of, like… 16 seconds plus 7 seconds. Math math math. 23 seconds! It only took about 66% as long as before.

So not only was there a decent drop in time, but SQL did this crazy index seek on my god awful one key column and seven include index.

I think it’s pretty neat that SQL is smart enough to do this, and not just strangle your clustered index every time it creates a new index. I mean, it still may, if you don’t have any other helpful indexes, but that’s to be expected.

Just in case you don’t believe me, here are the execution plans from after I added my silly index. Of course, you probably shouldn’t go out and create indexes to help you create indexes to help you create indexes… You get my drift. It’s not worth the time and effort.

My point here is that it’s nice that SQL will naturally use appropriate indexes during index creation.

Sorting is horrible.

What did we learn?

SQL cares! Not about you, really. But it cares about resources. It will take all sorts of stuff into account when it needs to fetch you some data. If it can use smaller structures and hog up less resources, it will do that.

Thanks for reading!

Brent says: SQL Server says: “Creating this index sure would be faster if I had an index, man.”


ISNULL vs. COALESCE: What The XML?

This isn’t about performance

If you’re interested in performance tests, you can get in the way back machine and read a couple posts by Adam Machanic here and here. I’m also not talking about the difference between them. There are a million articles about that. They’re obviously different, but how does SQL handle that internally?

We’re going to run the four queries below separately, and look at the XML of the execution plan for each.

SELECT TOP 1000
    COALESCE([u].[Id], [u].[Age]) 
FROM
    [dbo].[Users] AS [u];
GO 

SELECT TOP 1000
    COALESCE([u].[Age], [u].[Id]) 
FROM
    [dbo].[Users] AS [u];
GO 

SELECT TOP 1000
    ISNULL([u].[Id], [u].[Age])
FROM
    [dbo].[Users] AS [u];
GO 

SELECT TOP 1000
    ISNULL([u].[Age], [u].[Id])
FROM
    [dbo].[Users] AS [u];
GO

Even though COALESCE is the ANSI standard, I still see most people using ISNULL. Spelling? Laziness? Who knows? Under the covers, it’s just a CASE expression. You could write the same thing yourself.
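If you want to see it for yourself, here’s roughly the CASE expression that first COALESCE boils down to. This is a sketch of the equivalent logic, not a paste from the plan XML.

SELECT TOP 1000
    CASE WHEN [u].[Id] IS NOT NULL THEN [u].[Id]
         ELSE [u].[Age]
    END
FROM
    [dbo].[Users] AS [u];

Write it either way and the optimizer ends up holding the same thing.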

No, I don’t think JSON is a good idea either.

The second query gives you almost the same thing, but the columns are reversed. What’s the point, then?

ISNULL ISDIFFERENT

SQL does something rather smart here. The Id column is NOT NULL, so when ISNULL is applied to it, SQL doesn’t bother evaluating the replacement value at all; the expression collapses down to just the column.

That seems reasonable.

Reversed, ISNULL (and whatever magical behind-the-scenes code Microsoft has thrown in there) has to run against the Age column, which is NULLable, so the check actually happens.

Buzzkill.

Pros and Cons

ISNULL Pros

  • Easy to spell
  • Shortcuts if you do something dumb
  • Not the name of a horrible band

ISNULL Cons

  • Only two inputs
  • Not ANSI standard
  • Maybe doesn’t shower regularly

Coalesce Pros

  • ANSI Standard
  • Multiple inputs
  • Drinks coffee black

Coalesce Cons

  • Spelling
  • Just a case expression
  • Is the name of a horrible band

Thanks for reading!


Another Hidden Parallelism Killer: Scalar UDFs In Check Constraints


Every single time

Really. Every single time. It started off kind of funny. Scalar functions in queries: no parallelism. Scalar functions in computed columns: no parallelism, even if you’re not selecting the computed column. Every time I think of a place where someone could stick a scalar function into some SQL, it ends up killing parallelism. Now it’s just sad.

This is (hopefully. HOPEFULLY.) a less common scenario, since uh… I know most of you aren’t actually using any constraints. So there’s that! Developer laziness might be a saving grace here. But if you read the title, you know what’s coming. Here’s a quick example.

USE [tempdb]
SET NOCOUNT ON

SELECT TOP 10000
    ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ) ) [ID] ,
    DATEADD(MINUTE, [m].[message_id], SYSDATETIME()) [SomeDate]
INTO
    [dbo].[constraint_test]
FROM
    [sys].[messages] AS [m] ,
    [sys].[messages] AS [m2];
GO 

CREATE FUNCTION [dbo].[DateCheck] ( @d DATETIME2(7) )
RETURNS BIT
       WITH RETURNS NULL ON NULL INPUT
AS
    BEGIN
        DECLARE @Out BIT;
        SELECT
            @Out = CASE WHEN @d < DATEADD(DAY, 30, SYSDATETIME()) THEN 1
                        ELSE 0
                   END;
        RETURN @Out;
    END;
GO 

ALTER TABLE [dbo].[constraint_test]  ADD CONSTRAINT [ck_cc_dt] CHECK([dbo].[DateCheck](SomeDate) = 1)

SELECT *
FROM [dbo].[constraint_test] 
OPTION(QUERYTRACEON 8649, MAXDOP 0, RECOMPILE)

Parallelism appears to be rejected for maintenance operations as well as queries, just like with computed columns.

Interestingly, if we look in the plan XML (the execution plan itself just confirms that the query didn’t go parallel) we can see SQL tried to get a parallel plan, but couldn’t.

Garbagio

There’s a short list of possible reasons for plans not going parallel here from a while back. A quick search didn’t turn up a newer or more complete list.
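If you want to go fishing for this in your own plan cache, something along these lines should work on 2012 and up. It’s a rough sketch; the attribute only shows up when SQL wanted a parallel plan and couldn’t build one, so it won’t flag every serial plan.

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (50)
    [qs].[execution_count] ,
    [qp].[query_plan]
FROM [sys].[dm_exec_query_stats] AS [qs]
CROSS APPLY [sys].[dm_exec_query_plan]([qs].[plan_handle]) AS [qp]
WHERE [qp].[query_plan].exist('//QueryPlan[@NonParallelPlanReason]') = 1; -- plans that stayed serial against their will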

Check yourself, etc. and so forth

How do you know if this is happening to you? Here’s a simple query to look at constraint definitions and search them for function names. This query is dumb and ugly, but my wife is staring at me because it’s 5:30 on a Saturday and I’m supposed to be getting ready. If you have a better idea, feel free to share in the comments.

WITH    [c1] AS ( 
  SELECT [name] , [definition]
  FROM [sys].[check_constraints]
  UNION ALL
  SELECT [name] , [definition]
  FROM [sys].[default_constraints] 
)
SELECT *
FROM [c1], [sys].[objects] AS [o] 
WHERE [o].[type] IN ('FN', 'TF') 
AND [c1].[definition] LIKE '%' + o.[name] + '%'

Thanks for reading!


Old and Busted: DBCC Commands in 2016


I hate DBCC Commands

Not what they do, just that the syntax isn’t consistent (do I need quotes around this string or not?), the results are a pain to get into a usable table, and you need to write absurd loops to perform object-at-a-time data gathering. I’m not talking about running DBCC CHECKDB (necessarily), or turning on Trace Flags, or any cache-clearing commands — you know, things that perform actions — I mean things that spit tabular results at you.

Stuff like this puts the BC in DBCC. It’s a dinosaur.

In SQL Server 2016

DBCC INPUTBUFFER got its own DMV after, like, a million decades. Commands like DBCC SQLPERF, DBCC DBINFO, and DBCC LOGINFO should probably get their own. Pinal Dave has a whole list of DBCC Commands that you can break your server with here.
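If you’re curious, the new kid is sys.dm_exec_input_buffer, and something like this gets you the same information as DBCC INPUTBUFFER without looping over every session. Sketch only, columns trimmed down.

SELECT
    [es].[session_id] ,
    [ib].[event_info]
FROM [sys].[dm_exec_sessions] AS [es]
CROSS APPLY [sys].[dm_exec_input_buffer]([es].[session_id], NULL) AS [ib] -- second parameter is request_id; NULL is allowed
WHERE [es].[is_user_process] = 1;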

But truly, the most annoying one to me is DBCC SHOW_STATISTICS. It’s insane that there’s no DMV or function to expose histogram information. That’s why I filed this Connect item.

UPDATE: It looks like Greg Low beat me to it by about 6 years. Too bad searching Connect items is so horrible. I urge you to vote for Greg’s item, instead.

Statistics are the intel SQL Server uses to make query plan choices.

Making this information easier to retrieve, aggregate, join to other information, and analyze would put a powerful performance tuning tool into the hands of SQL Server users, and it would help take some of the mystery surrounding statistics away.

Please consider voting for Greg’s Connect item.

Thanks for reading!


Stats Week: Statistics Terminology Cheatsheet


These things used to confuse me so much

Despite having worked at a Market Research company for a while, I know nothing about statistics, other than that project managers have all sorts of disagreeably subjective phrases for describing them. Vast majority, convincing plurality, dwindling minority, et al. Less talky, more picture.

When I started getting into SQL Server, and learning about statistics, I heard the same phrases over and over again, but wasn’t exactly sure what they meant.

Here are a few of them:

Selectivity

This tells you how special your snowflakes are. When a column is called “highly selective” that usually means values aren’t repeating all that often, if at all. Think about order numbers, identity or sequence values, GUIDs, etc.

Density

This is sort of the anti-matter to selectivity. Highly dense columns aren’t very unique. They’ll return a lot of rows for a given value. Think about Zip Codes, Gender, Marital Status, etc. If you were to select all the people in 10002, a densely (there’s that word again) populated zip code in Chinatown, you’d probably wait a while, kill the query, and add another filter.

Cardinality

If you mash selectivity and density together, you end up with cardinality. This is the number of rows that satisfy a given predicate. This is very important, because poor cardinality estimation can arise from a number of places, and when it goes wrong, it can really ruin query performance.

Here’s a quick example of each for a 10,000 row table with three columns.

USE [tempdb];

WITH x AS (
SELECT TOP 10000
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS [rn]
FROM sys.[messages] AS [m]
)
SELECT
[x].[rn],
CASE WHEN [x].[rn] % 2 = 0 THEN 'M' ELSE 'F' END AS [Gender],
CASE WHEN [x].[rn] % 2 = 0 THEN 'Married' WHEN [x].[rn] % 3 = 0 THEN 'Divorced' WHEN [x].[rn] % 5 = 0 THEN 'Single' ELSE 'Dead' END AS [MaritalStatus]
INTO #xgen
FROM [x]

/*Selectivity*/
SELECT COUNT_BIG(DISTINCT [x].[rn])
FROM [#xgen] AS [x]

SELECT COUNT_BIG(DISTINCT [x].[Gender])
FROM [#xgen] AS [x]

SELECT COUNT_BIG(DISTINCT [x].[MaritalStatus])
FROM [#xgen] AS [x]

/*Density*/
SELECT (1. / COUNT_BIG(DISTINCT [x].[rn]))
FROM [#xgen] AS [x]

SELECT (1. / COUNT_BIG(DISTINCT [x].[Gender]))
FROM [#xgen] AS [x]

SELECT (1. / COUNT_BIG(DISTINCT [x].[MaritalStatus]))
FROM [#xgen] AS [x]

/*Reverse engineering Density*/
SELECT 1.0 / 0.00010000000000000000

SELECT 1.0 / 0.50000000000000000000

SELECT 1.0 / 0.25000000000000000000

/*Cardinality*/
SELECT COUNT_BIG(*) / COUNT_BIG(DISTINCT [x].[rn])
FROM [#xgen] AS [x]

SELECT COUNT_BIG(*) / COUNT_BIG(DISTINCT [x].[Gender])
FROM [#xgen] AS [x]

SELECT COUNT_BIG(*) / COUNT_BIG(DISTINCT [x].[MaritalStatus])
FROM [#xgen] AS [x]

DROP TABLE [#xgen]


Bigger by the day

A lot has been written about cardinality estimation. SQL Server 2014 saw a total re-write of the cardinality estimation guts that had been around since SQL Server 7.0, build-to-build tinkering notwithstanding.

In my examples, it’s all pretty cut and dry. If you’re looking at a normal sales database that follows the 80/20 rule, where 80 percent of your business comes from 20 percent of your clients, the customer ID columns may be highly skewed towards a small group of clients. It’s good for SQL to know this stuff so it can come up with good execution plans for you. It’s good for you to understand how parameter sniffing works so you understand why that execution plan was good for a small client, but not good for any big clients.

That’s why you should go see Brent in person. He’ll tell you all this stuff, feed you, give you prizes, and then you go home and get a raise because you can fix problems. Everyone wins!

Thanks for reading!

Brent says: wanna learn more about statistics? Check out Dave Ballantyne’s past SQLbits videos, including the one about the new 2014 CE.


Stats Week: Do Query Predicates Affect Histogram Step Creation?


Auto Create Statistics is your friend

It’s not perfect, but 99% of the time I’d rather have imperfect statistics than no statistics. This question struck me as interesting, because the optimizer will totally sniff parameters to compile an initial plan. If you don’t have index statistics, or system statistics already on a column in a WHERE clause, SQL is generally kind enough to create a statistics object for you when the query is compiled.

So I thought to myself: Would SQL create an initial histogram based on the compile-time parameter? It might be nice if it did, since it could potentially get the best possible information about predicate cardinality from a direct hit on a histogram step.

Here’s a quick test that shows, no, SQL doesn’t give half a care about that. It creates the same histogram no matter what. 1000 rows should do the trick. I’m making both columns NOT NULL here, because I want to make one my PK, and I want to make sure there’s no histogram step for NULL values in the other. I’m not going to index my date column here, I’m going to let SQL generate statistics automatically.

SELECT
    ISNULL([x].[ID], 0) AS [ID] ,
    ISNULL(CAST([x].[DateCol] AS DATE), '1900-01-01') AS [HireDate]
INTO
    [dbo].[AutoStatsTest]
FROM
    ( SELECT TOP 1000
        ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ) ) ,
        DATEADD(HOUR, [m].[message_id], GETDATE())
      FROM
        [sys].[messages] AS [m] ) [x] ( [ID], [DateCol] );

ALTER TABLE [dbo].[AutoStatsTest] ADD CONSTRAINT [pk_t1_id] PRIMARY KEY CLUSTERED ([ID]);

First, let’s check in on what values we have

I’m going to run one query that will generate a histogram, but it’s guaranteed to return all of the table data. I want to see what SQL comes up with for histogram hits and misses here.

SELECT *
FROM [dbo].[AutoStatsTest] AS [ast]
WHERE [ast].[HireDate] >= '1900-01-01'

We have our histogram, and I’ll use a clunky DBCC command to show it to me.
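For the record, the clunky command looks something like this. The statistic gets a system-generated name, so look yours up first; the _WA_Sys name below is a hypothetical placeholder.

SELECT [s].[name]
FROM [sys].[stats] AS [s]
WHERE [s].[object_id] = OBJECT_ID('dbo.AutoStatsTest')
AND [s].[auto_created] = 1;

DBCC SHOW_STATISTICS ('dbo.AutoStatsTest', '_WA_Sys_00000002_0EA330E9') -- swap in whatever name came back above

Below is a partial screen cap, up to a point of interest.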

Pay attention to the rectangle.

SQL created a histogram with direct hits on 04/30, and then 05/02. That means it doesn’t have a step for 05/01, but it does postulate that there are 22 rows with a date of 05/01 in the RANGE_ROWS column (RANGE_ROWS is the estimated count of rows that fall between a step and the one before it, not counting the step values themselves).

I went ahead and dropped that table and re-created it. Next we’ll run the same query, but we’ll pass in 05/01 as an equality value.

SELECT *
FROM [dbo].[AutoStatsTest] AS [ast]
WHERE [ast].[HireDate] = '2016-05-01'

And, long story short, it creates the exact same histogram as before.

Is this good? Is this bad?

Well, at least it’s reliable. I’m not sure how I feel about it otherwise.

You can try creating filtered indexes or statistics on really important segments of data, if you really need SQL to have granular information about it. Otherwise, you’ll have to trust in the secret, and sometimes not so secret sauce, behind the cardinality estimator.

Thanks for reading!

Brent says: the more I work with SQL Server, the more I’m filled with optimism about the oddest things. When I read Erik’s idea about the exact histogram step, though, I thought, “Nopetopus.”



Stats Week: Messin’ With Statistics


If there’s one thing living in Texas has taught me

It’s that people are very paranoid that you may Mess With it. Even in Austin, where the citizenry demand weirdness, they are vehemently opposed to any form of Messing, unless it results in mayonnaise-based dipping sauce.

Me? I like Messing With stuff. Today we’re going to look at one way you can make SQL think your tables are much bigger than they actually are, without wasting a bunch of disk space that has nearly the same price as wall to wall carpeting.

To do this, we’re going to venture deep into the Undocumented Command Zone. It’s like the Twilight Zone, except if you go there on your production server, you’ll probably end up getting fired. So, dev servers only here, people.

Creatine

Let’s make a table, stuff a little bit of data in it, and make some indexes.

IF OBJECT_ID('dbo.Stats_Test') IS NOT NULL
DROP TABLE [dbo].[Stats_Test];

;WITH E1(N) AS (
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL  ),                          
E2(N) AS (SELECT NULL FROM E1 a, E1 b, E1 c, E1 d, E1 e, E1 f, E1 g, E1 h, E1 i, E1 j),
Numbers AS (SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM E2)
SELECT  ISNULL([N].[N], 0) AS [ID] ,
    ISNULL(CONVERT(DATE, DATEADD(HOUR, -[N].[N], GETDATE())),     '1900-01-01') AS [OrderDate] ,
        ABS(CONVERT(NUMERIC(18,2), (CHECKSUM(NEWID()) % 10000.00))) AS [Amt1]
INTO [Stats_Test]
FROM    [Numbers] [N]
ORDER BY [N];

CREATE UNIQUE CLUSTERED INDEX [cx_id] ON [dbo].[Stats_Test] ([ID])
CREATE UNIQUE NONCLUSTERED INDEX [ix_test1] ON [dbo].[Stats_Test] ([OrderDate], [ID])
CREATE UNIQUE NONCLUSTERED INDEX [ix_test2] ON [dbo].[Stats_Test] ([Amt1], [ID])

There’s our thousand rows. If you’re dev testing against 1000 rows, your production data better only have 1001 rows in it, or you’re really gonna be screwed when your code hits real data. How do we cheat and make our data bigger without sacrificing disk space?

Eat Clen, Tren Hard

You can update all statistics on the table at once, or target specific indexes with the following commands.

UPDATE STATISTICS [dbo].[Stats_Test] WITH ROWCOUNT = 10000000000, PAGECOUNT = 1000000000

UPDATE STATISTICS [dbo].[Stats_Test] ([cx_id]) WITH ROWCOUNT = 10000000000, PAGECOUNT = 1000000000
UPDATE STATISTICS [dbo].[Stats_Test] ([ix_test1]) WITH ROWCOUNT = 10000000000, PAGECOUNT = 1000000000
UPDATE STATISTICS [dbo].[Stats_Test] ([ix_test2]) WITH ROWCOUNT = 10000000000, PAGECOUNT = 1000000000

This will set your table row count to uh… 10 billion, and your page count to 1 billion. This makes sense, since usually a bunch of rows fit on a page. You can be more scientific about it than I was; this is just to give you an idea.
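If you do want to be scientific about it, a rough back-of-the-napkin calculation like this lines the PAGECOUNT up with whatever ROWCOUNT you’re faking, using the table’s real average row size. The 8096 is the usable bytes on a data page, and the 10 billion is the same number from above.

SELECT CAST(10000000000 / (8096.0 / [ps].[avg_record_size_in_bytes]) AS BIGINT) AS [rough_page_count]
FROM [sys].[dm_db_index_physical_stats](DB_ID(), OBJECT_ID('dbo.Stats_Test'), 1, NULL, 'DETAILED') AS [ps]
WHERE [ps].[index_level] = 0; -- leaf level of the clustered index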

So let’s check in on our statistics! Sup with those?

DBCC SHOW_STATISTICS('dbo.Stats_Test', cx_id)

DBCC SHOW_STATISTICS('dbo.Stats_Test', ix_test1)

DBCC SHOW_STATISTICS('dbo.Stats_Test', ix_test2)

Hint: these commands will not show inflated page or row counts in them. They actually won’t show page counts at all. Hah. That’s kinda silly, though. Hm.

Anyway, what we should grab from the statistics histograms are some middling values we can play with. For me, that’s an ID of 500, a date of 2016-03-18, and an amount of 4733.00.

One thing I’ve found is that the inflated counts don’t seem to change anything for Identities, or Primary Keys. You’ll always get very reasonable plans and estimates regardless of how high you set row and page counts for those. Regular old clustered indexes are fair game.

Some really interesting things can start to happen to execution plans when SQL thinks there’s this many rows in a table. The first is that SQL will use a rare (in my experience) plan choice: Index Intersection. You can think of it like a Key Lookup, except it joins two nonclustered indexes together instead of going from one nonclustered index back to the clustered index.

SELECT *
FROM [dbo].[Stats_Test] AS [st]
WHERE [st].[ID] = 500

SELECT *
FROM [dbo].[Stats_Test] AS [st]
WHERE [st].[OrderDate] = '2016-03-18'

SELECT *
FROM [dbo].[Stats_Test] AS [st]
WHERE [st].[Amt1] = 4733.00

For these equality queries, we get the following plans:

Bizarre Love Parallel

SQL isn’t fooled with an equality on 500. We get a little plan. We’ll examine inequality plans in a moment. For now let’s look at the middle plan. That’s where the Index Intersection is occurring. The bottom plan has a regular Key Lookup.

Blood Everywhere.

The costs and estimates here are Banana Town crazy. And right there down the bottom, we can see SQL using the Clustered Index key to join our Nonclustered Indexes together. If you’ve been reading this blog regularly, you should know that Clustered Index key columns are carried over to all your Nonclustered Indexes.

If we switch to inequality queries, well…

SELECT *
FROM [dbo].[Stats_Test] AS [st]
WHERE [st].[ID] > 500

SELECT *
FROM [dbo].[Stats_Test] AS [st]
WHERE [st].[OrderDate] > '2016-03-18'

SELECT *
FROM [dbo].[Stats_Test] AS [st]
WHERE [st].[Amt1] > 4733.00

All Hell Breaks Loose

The top query that SQL wasn’t fooled by before now has the same insane estimates as the others. Our two bottom queries get missing index requests due to the amount of work the Index Intersection takes.

It’s happening because of the SELECT * query pattern. This will go away if we stick to only selecting columns that are in our Nonclustered Indexes. For example, SELECT ID will result in some pretty sane index seeks occurring. The estimated rows are still way up there.

Unfortunately, STATISTICS TIME and IO are not fooled by our statistical tomfoolery.

Womp womp womp

They use reality-based measurements of our query activity. This trick is really only useful to see what happens to execution plans. But hey it’s a lot cheaper, easier, and faster than actually inserting 10 billion rows.

So what?

Like a grandma in a grocery store, SQL Server makes all its decisions based on cost. Whatever is cheapest is the choice. If SQL Server were a person, it would probably wash and dry used aluminum foil, save old bread ties, and use clothes pins for the right thing.

I forget what I was going to say. Probably something smart about testing your queries against sets of data commensurate with what you have in production (or larger), so that you don’t get caught flatfooted by perf issues on code releases, or if your company finally starts getting customers. This is one technique to see how SQL will treat your code as you get more rows and pages involved.

Just don’t forget to set things back when you’re done. A regular stats update will take care of that.
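If you want that spelled out, it’s about as plain as it gets. No ROWCOUNT or PAGECOUNT options this time:

UPDATE STATISTICS [dbo].[Stats_Test];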

Thanks for reading!


Stats Week: Only Updating Statistics With Ola Hallengren’s Scripts


I hate rebuilding indexes

There. I said it. It’s not fun. I don’t care all that much for reorgs, either. They’re less intrusive, but man, that LOB compaction stuff can really be time consuming. What I do like is updating statistics. Doing that can be the kick in the bad plan pants that you need to get things running smoothly again.

I also really like Ola Hallengren’s free scripts for all your DBA broom and dustpan needs. Backups, DBCC CHECKDB, and Index and Statistics maintenance. Recently I was trying to only update statistics, and I found it a little trickier than I first imagined. So tricky, in fact, that I emailed Ola, and got a response that I printed and framed. Yes, the frame is made out of hearts. So what?

What was tricky about it?

Well, the IndexOptimize stored procedure has default values built in for index maintenance. This isn’t a bad thing, and I could have altered the stored procedure, but that would be mean. I set about trying to figure out how to get it to work on my own.

First, I tried only passing in statistics parameters.

EXEC [master].[dbo].[IndexOptimize]
    @Databases = N'USER_DATABASES' ,
    @UpdateStatistics = N'ALL' ,
    @OnlyModifiedStatistics = N'Y' ,
    @LogToTable = N'Y';

But because of the default values, it would also perform index maintenance. Sad face. So I tried being clever. Being clever gets you nowhere. What are the odds any index would be 100% fragmented? I mean, not even GUIDs… Okay, maybe GUIDs.

EXEC [master].[dbo].[IndexOptimize]
    @Databases = N'USER_DATABASES' ,
    @FragmentationMedium = N'INDEX_REORGANIZE' ,
    @FragmentationHigh = N'INDEX_REORGANIZE' ,
    @FragmentationLevel1 = 100 ,
    @FragmentationLevel2 = 100 ,
    @UpdateStatistics = N'ALL' ,
    @OnlyModifiedStatistics = N'Y' ,
    @LogToTable = N'Y';

But this throws an error. Why? Well, two reasons. First, 100 isn’t valid here, and second, you can’t have the same fragmentation level twice. It would screw up how commands get processed, and the routine wouldn’t know whether to use @FragmentationMedium or @FragmentationHigh. This makes sense.

Okay, so I can’t use 100, and I can’t set them both to 99. What to do? Let’s bring another parameter in: @PageCountLevel.

EXEC [master].[dbo].[IndexOptimize]
    @Databases = N'USER_DATABASES' ,
    @FragmentationMedium = N'INDEX_REORGANIZE' ,
    @FragmentationHigh = N'INDEX_REORGANIZE' ,
    @FragmentationLevel1 = 98 ,
    @FragmentationLevel2 = 99 ,
    @UpdateStatistics = N'ALL' ,
    @OnlyModifiedStatistics = N'Y' ,
    @PageCountLevel = 2147483647,
    @LogToTable = N'Y'; 

This seems safe, but it’s still not 100%. Even with the integer maximum passed in for the page count, it still felt hacky. Hackish. Higgity hack. The other part of the equation is that I don’t even want this thing THINKING about indexes. It will still look for indexes that meet these requirements. If your tables are big, you know, sys.dm_db_index_physical_stats can take foreeeeeeeeeeeeeeeeeeeeeeeeeeeever to run. That seems wasteful, if I’m not going to actually do anything with the information.

Hola, Ola

This is where I emailed Ola for advice. He responded pretty quickly, and here’s how you run stats only updates.

EXECUTE [dbo].[IndexOptimize]
    @Databases = 'USER_DATABASES' ,
    @FragmentationLow = NULL ,
    @FragmentationMedium = NULL ,
    @FragmentationHigh = NULL ,
    @UpdateStatistics = 'ALL' ,
    @OnlyModifiedStatistics = N'Y' ,
    @LogToTable = N'Y';

Moral of the story

NULLs aren’t all bad! Sometimes they can be helpful. Other times, developers.

Thanks for reading!

Brent says: Subtitle: How many DBAs does it take to think of NULL as a usable option? Seriously, we all banged our heads against this one in the company chat room.


A quick tip for working with large scripts in SSMS


Sometimes great things stare you in the face for years

Sometimes they stare at you for so long that you stop noticing them. This isn’t a romdramadey tag line. There are just so many buttons to push. Sometimes pressing them is a horrible idea and it breaks everything. I lose my mind every time I go to move a tab and I end up undocking a window. This won’t save you from that, but it will at least save your mouse scroll wheel.

This has helped me out a whole bunch of times, especially recently when contributing code to our Blitz* line of stored procedures. Navigating all around large scripts to change variables or whatever is a horrible nuisance. Or it can be. When the stored procedure is like 3000 lines and there’s a bunch of dynamic SQL and… yeah. Anyway. Buttons!

Personality Crisis

There’s a little button in the top right corner of SSMS. Rather unassuming. Blends right in. What is it? Fast scroll? Some kind of ruin all your settings and crash your computer button? Delete all the scripts you’ve been working on for the past 6 months?

No! It’s a splitter, screen splitter! Guaranteed to blow your mind! Anytime!

This is a huge picture. Sorry.

If you drag it up and down, you can alter the visible portion of the screen, and scroll and zoom in each pane independently.

Now there is world peace.

There you have it

Next time you need to work with a huge script and find yourself scrolling around like a lunatic, remember this post!

Thanks for reading!


One weird trick for managing a bunch of servers


Let’s face it, most people don’t have just one SQL Server

How many they tell Microsoft they have is another matter, but let the record show that I don’t condone licensing dishonesty. But going one step further, most places… Well, they’re ‘lucky’ if they have one DBA, never mind a team.

Everyone else: Give me your network people, your sysadmin, your huddled SAN group yearning to breathe free, the wretched refuse of your teeming developers.

Doing things on one server is aggravating enough. Doing things on a bunch of servers is even worse. Given some of today’s HA/DR features (I’m looking at you, Availability Groups, with your lack of a mechanism to sync anything outside of user databases. Rude.) people are more and more likely to have lots of SQL Servers that they need to tend to.

Sometimes just keeping track of them is impossible. If you’re one guy with 20 servers, have fun scrolling through the connection list in SSMS trying to remember which one is which. Because people name things well, right? Here’s SQLVM27\Instance1, SQLVM27\Instance2, SQLVM27\Instance3, and that old legacy accounting database is around here somewhere.

Register it and forget it

But don’t actually forget it. If you forget it and it goes offline, people will look at you funny. Turns out people don’t like offline servers much.

So what’s someone to do with all these servers? Register them! Hidden deep in the View menu of SSMS is the Registered Servers window.

Hi there, handsome.

It will look pretty barren at first, just an empty folder. But you’ll fill it up quick, I’m sure. Can never have enough servers around, you know.

It’s pretty easy to populate, you can right click on the Local Server Group folder, or on servers you’re connected to in Object Explorer.

This

That

Either way, you get the same dialog box to add a server in. You can give it a friendly name if you want! Maybe WIN03-SQL05\Misc doesn’t tell a good story.

Joy of joys

And if you hip and hop over to the Connection Properties tab, you can set all sorts of nifty stuff up. The biggest one for me was to give different types of servers different colored tabs that the bottom of SSMS is highlighted with. It’s the one you’re probably looking at now that’s a putrid yellow-ish color and tells you you’re connected and that your query has been executing for three hours. Reassuring. Anyway, I’d use this to differentiate dev from prod servers. Just make sure to choose light colors, because the black text doesn’t show up on dark colors too well.

Wonder of wonders

Another piece of advice here is not to mix servers on different major (and sometimes minor) versions. The reason is that this feature gives you the ability to query multiple servers at once. If you’re looking at DMVs, they can have different columns in them, and you’ll just get an error. Even a simple query to sys.databases will throw you a bonk between 2012 and 2014.

By the planets!

I changed my mind. I hate planets.

Even if you’re running 2008R2, there are some pretty big differences in DMVs between SP1 and SP3. Microsoft has been known to change stuff in CUs (I’m looking at you, Extended Events).

On the plus side, you can use your multi-server connection to SELECT @@VERSION to help you decide how you should group them. If they have something better in common, like participating in Log Shipping, Mirroring, an AG, etc., all the better.
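Something simple like this, run against the whole group, is a decent starting point. SSMS tacks a Server Name column onto multi-server results, so sorting servers into sensible groups afterwards is pretty painless.

SELECT
    SERVERPROPERTY('ProductVersion') AS [product_version] ,
    SERVERPROPERTY('ProductLevel') AS [product_level] ,
    @@VERSION AS [version_info];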

Insight

But my favorite thing, because I was a devotee to the Blitz line of stored procedures even before I got paid to like them, was that I could install them on ALL OF MY SERVERS AT ONCE! This was especially useful when updates came out. You know what it’s like to put a stored proc on 20 servers one at a time? Geeeeeet outta here!

Peanuts.

Check that out. It’s on both of my servers. At once. That means simultaneously, FYI. If you have a DBA or Admin database that you keep on all your servers to hold your fancy pants scripts and tools, this is an awesome way to make sure they all have the latest and greatest.

You’re already better at your job

Even though this feature came out in 2008, I hardly see anyone using it. I found it really helpful comparing indexes and query plans across app servers that held different client data across them. It also exposes far less than Linked Servers; you need to worry less about access and level of privilege.

Just don’t forget to export your list if you change laptops!

Thanks for reading!


Where Clauses and Empty Tables


Sometimes SQL is the presentation layer

And when it is, you end up doing a lot of concatenating. This isn’t about performance, or trying to talk you out of SQL as the presentation layer, this is just something you should keep in mind. SQL is a confusing language when you’re just starting out. Heck, sometimes it’s even confusing when you’ve been doing it for a long time.

Let’s say you have a website that stores files, and when a user logs in you use a temp table to track session actions as a sort of audit trail, which you dump out into a larger table when they log out. Your audit only cares about folders they have files stored in, not empty ones.

Here’s a couple tables to get us going.

IF OBJECT_ID('tempdb..#aggy') IS NOT NULL
DROP TABLE #aggy;

WITH x1 AS (
SELECT TOP (100)
ROW_NUMBER() OVER (ORDER BY(SELECT NULL)) AS ID
FROM sys.[messages] AS [m], sys.[messages] AS [m2])
SELECT ID, 
    DATEADD(DAY, [x1].[ID] * -1, CAST(GETDATE() AS DATE ) ) [CreateDate],
    'C:\temp\' + CONVERT(VARCHAR(32), HASHBYTES('MD5', NCHAR([x1].[ID])), 2) + '.gif' [Path] -- style 2 gives a readable hex string instead of raw bytes
INTO #aggy
FROM [x1];

IF OBJECT_ID('tempdb..#usersessioninfo') IS NOT NULL
DROP TABLE #usersessioninfo;
CREATE TABLE #usersessioninfo 
(LastActionID INT IDENTITY(1,1), UserID INT, UserMessage VARCHAR(100), MessageDetails VARCHAR(100))

And then we’ll stick some data into our session table like this.

INSERT [#usersessioninfo]
( [UserID] , [UserMessage] , [MessageDetails] )
SELECT 
@@SPID AS [UserID],
'Welcome to your folder!' AS [UserMessage],
'You have stored #' +
CAST(COUNT(*) AS VARCHAR(100)) +
' files in the last 30 days, starting on ' + 
CAST(MIN([a].[CreateDate]) AS VARCHAR(20)) + 
' ending on ' +
CAST(MAX([a].[CreateDate]) AS VARCHAR(20)) +
'.' AS [MessageDetails]
FROM [#aggy] AS [a]
WHERE [a].[CreateDate] >= GETDATE() -30

Everything looks great!

Select max blah blah blah

But if your table is empty…

You may find yourself with a bunch of junk you don’t care about! Empty folders. Contrived examples. Logic problems. Stay in school.

TRUNCATE TABLE [#aggy]

INSERT [#usersessioninfo]
( [UserID] , [UserMessage] , [MessageDetails] )
SELECT 
@@SPID AS [UserID],
'Welcome to your folder!' AS [UserMessage],
'You have stored #' +
CAST(COUNT(*) AS VARCHAR(100)) +
' files in the last 30 days, starting on ' + 
CAST(MIN([a].[CreateDate]) AS VARCHAR(20)) + 
' ending on ' +
CAST(MAX([a].[CreateDate]) AS VARCHAR(20)) +
'.' AS [MessageDetails]
FROM [#aggy] AS [a]
WHERE [a].[CreateDate] >= GETDATE() -30

What do you think is going to happen? We truncated the table, so there’s nothing in there. Our WHERE clause should just skip everything because there are no dates to qualify.

NULLs be here!

Darn. Dang. Gosh be hecked. These are words I really say when writing SQL.

That obviously didn’t work! An aggregate query with no GROUP BY always returns exactly one row, even when the table is empty, so COUNT(*) comes back as 0, MIN and MAX come back as NULL, and concatenating a NULL turns the whole message into NULL. You’re gonna need to do something a little different.
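If you want to see the culprit on its own, here’s the bare-bones version, run while the table is still empty:

SELECT
    COUNT(*) AS [file_count] ,
    MIN([a].[CreateDate]) AS [first_file]
FROM [#aggy] AS [a]
WHERE [a].[CreateDate] >= GETDATE() - 30
-- Still returns one row: file_count = 0, first_file = NULL

One row, a zero, and a NULL, and that NULL poisons the whole concatenated message.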

Having having bo baving banana fana fo faving

One of the first things I was ever really proud of was using the HAVING clause to show my boss duplicate records. This was quickly diminished by him asking me to then remove duplicates based on complicated logic.

HAVING is also pretty cool, because it’s processed after the WHERE clause and the aggregation, so it can filter on the aggregate results themselves. For our purposes, it will keep anything from being inserted, because our COUNT is a big fat 0. Zero. Zer-roh.

INSERT [#usersessioninfo]
( [UserID] , [UserMessage] , [MessageDetails] )
SELECT 
@@SPID AS [UserID],
'Welcome to your folder!',
'You have # ' +
CAST(COUNT(*) AS VARCHAR(100)) +
' files, starting on ' + 
CAST(MIN([a].[CreateDate]) AS VARCHAR(20)) + 
' ending on ' +
CAST(MAX([a].[CreateDate]) AS VARCHAR(20)) +
' in the last 30 days.'
FROM [#aggy] AS [a]
WHERE [a].[CreateDate] >= GETDATE() -30
HAVING COUNT(*) > 0

This inserts 0 rows, which is what we wanted. No longer auditing empty folders! Hooray! Everybody dance drink now!

Mom will be so proud

Not only did you stay out of jail, but you wrote some SQL that worked correctly.

Thanks for reading!
