The SQL Server High Performance Series

Inefficiency is a gluttonous thief. It burglarizes your server rooms at all hours of the day and night, demanding virtually limitless hardware sacrifices to satiate its endless thirst for clock cycles and disk rotations. In return it punishes your users with reduced performance and reduced satisfaction, and devastates your solution's scalability.

This inefficiency, materialized in the form of slow performance, is one of the primary causes of system abandonment. This is particularly troublesome in the SQL Server world where many systems servicing large user bases often run on low cost server boxes that leave little margin for performance waste. Many organizations have tossed out their SQL Server solution running on a $3000 PC because the performance wasn’t satisfactory (not achieving so-called “Enterprise” performance), to replace it with a multi-million dollar mainframe solution, overcoming embarrassing inefficiency with brute force.

Reward

Several years back, in a moment of nerdish bravado, I made a foolish blanket statement that I could reduce the runtime of virtually any element of a non-trivial SQL Server database solution by 95% (thus improving the performance by about 20x), doing so through some rudimentary changes requiring nothing more than some analysis, minor code changes (changing the underlying code, but not the functionality), indexing, and file group changes. To my surprise, and even greater dismay, this number actually proved to be remarkably accurate: From giant multi-hour organization wide reports, to simple security procedures run hundreds of times a minute, the obvious low hanging fruit alone often improved performance by 10x or more. With a little bit of elbow grease it has proven extraordinarily common to improve performance by 20x or more, significantly improving responsiveness and load handling of the respective systems at minimal cost.

The remarkable thing is that these weren’t systems implemented by bad developers – many of them were extraordinary developers who implemented a lot of tricks and techniques that I’ve co-opted and added to my own bag of techniques. Instead there seems to be a dearth of real information on developing for performance in SQL Server, leaving many to guess about the best approach, not to mention that there isn’t enough attention paid to performance efficiency in enterprise solutions. Many seem to be under the false impression that gross inefficiency requiring massive clusters to perform trivial tasks merits a capital-E Enterprise designation.

Motivation

In software development there’s an oft-referenced vice known as “premature optimization”. This is the tendency to prematurely focus on code performance while code is still young and awkwardly growing, before the critical performance weaknesses have been identified and measured. The end result of this misguided effort is often convoluted code that is difficult to understand and maintain (for instance code including inline assembly or using specialized system hacks in seldom called edge functions). This is often a mistake of inexperienced programmers that haven’t had the perfectionist engineering streak beaten out of them.

Consider also that performance truly isn't a concern for the vast majority of code in most client-side applications – it likely doesn't matter if the code that validates an input box in a Windows Forms application takes 3ms or 70ms to complete. As the processing is decentralized and isn’t impacting other users who might be running the application elsewhere on the planet, it is basically making use of “free” clock cycles available on the client PC, and generally is imperceptible to the user. If one thousand different users were running the application simultaneously, they’re running it on a thousand powerful PCs, effectively throwing a massive “cluster” at the problem. In other words, you can overcome application inefficiency on the client side through massive computational excess and a endless ability to scale-out. Even in cases where there is worthwhile performance issues identified, for example an image processing algorithms that takes several seconds to perform an operation, it's often best to wait until the project nears a release and the code has settled, at which point you can send a commando performance team to profile and then selectively improve the slowest sections of code that will have the most beneficial impact, focusing on the lowest hanging fruit, yielding a bounty of quick wins. (Taking one for the team because there's no I in team, and no cliché unworthy)

Enterprise databases, or any centralized system for that matter, are entirely different beasts – performance is one of the critical elements of these systems, and performance problems are one of the primary reasons why solutions are abandoned or re-architected. Consider that every clock cycle wasted on a shared resource, such as a database server, impacts the performance of the overall system and every other user. In most environments there is a massive asymmetry between the computational capability of client machines, and the computational capability of a shared system, such as a database server. There are usually some fixed financial and technological limits to the amount of hardware that a system can scale to, so your database server running on a lowly Dell two-way server is desperately trying to keep up with the demands of 500 user workstations pounding away at it. Even though Google is clustered on purportedly thousands of machines, they still have to develop efficiently to be able to economically service millions of users in a timely manner.

Thus, while it might seem irrelevant when taken alone that your stored procedure saturates the resource, taking 200ms to return a simple list of values to populate a drop list for Joe User, imagine 100 users all opening that form at the same time putting a shared demand on the database system. The performance impact starts to become significant and adversely affects the usability (and credibility) of the system. This is exacerbated by the fact that simultaneous performance demands aren’t merely additive on shared resources, but rather contention and task sharing often means that these issues snowball into much more than the sum of the parts.

Wisdom

You should consider the performance of your database from day one with every table you add, every index you create, every trigger you concoct, and every relationship you define. While the misguided will argue that this amounts to premature optimization (as Ralph Waldo Emerson observed, a foolish consistency is the hobgolbin of little minds, and the belief that any performance concerns are premature is just such a foolish consistency), the reality is that the performance of a database system is largely defined by the fundamental design of the system, and as the system grows it becomes much more difficult and costly to solve fundamental performance problems. Furthermore, once an enterprise system reaches production even the simplest performance change, such as adding an index, requires complex analysis to determine how it impacts other parts of the system, or that it satisfies what could be hundreds of procedures accessing the object.

The cynical will wonder how one can predict the future when designing a database system, but the reality is that the access patterns are usually obvious by the time you're starting designing tables – you know how the tables relate, what data will be searched, how often you'll be selecting the records versus modifying them, and how big the fields and records should be. Use this information effectively when developing the tables to choose the appropriate clustered and secondary indexes, to minimize the size of each record, and to write efficient SQL. Don't leave it for a maintenance programmer to reverse engineer the system and apply best guesses in a moment of crisis in the future.

The Basics? We’re All Professionals Here!

Much of the material covered in this outing will be old hat for a lot of developers, but is nonetheless worth a recap – Even among the pros there remains misunderstandings and conflicting information about the fundamentals of databases, and the true magnitude of impact they have on systems. I’ve intentionally authored this series conversationally, as opposed to a “high impact hit list!”, however if you’d just like a brief summary list you can find one at the end.

Data Size

This needs to be addressed as there is a growing camp of “data wasters” that erroneously believe that the larger the amount of waste, the more Enterprise Ready a solution is (capitalization used derisively). I step into this quagmire knowing full well that this section will yield me some “you’re a dummy!” responses from some newly earned adversaries, however that’s a price I’m willing to pay if I can save but one byte tree.

Minimize the size of your data. Don’t use GUIDs where they aren't necessary (e.g. where you don’t really need global uniqueness/replication), and don’t use a bigint where an int or a smallint will suffice. Don’t use a smallint where a tinyint will suffice or an nvarchar where a varchar would be fine. Use the smallest type that is reasonable for the field. Don’t invent vague packing technicalities or native type size issues in an attempt to justify oversized data.

Of course you should plan for realistic growth, and I’m not advocating that you use a tinyint to store your CustomerID field, but keep the rational, real world in mind when designing your applications – are you really going to exceed 2 billion users? Is there going to be more than 32767 languages in your application? Is it likely that we’re going to a new calendaring system that might have 2 billion months?

Evaluate if it might be an acceptable compromise to simply use these large types in your façade while actually storing smaller types in the actual database. This would give you improved performance, and would allow you to easily upsize your data types in the future in the unlikely event that it becomes necessary.

Clearly there are cases where large data types are legimately warranted, however too many database architects abscond themselves of any responsibility for efficiency by making everything a GUID or a bigint “just in case” (GUIDs have a substantial creation cost as well, in addition to the obvious storage and I/O costs. While GUIDs once used the available network card MAC address as the foundation and generally sequentially increased in value on each new GUID, in current Windows variants GUIDs are basically random numbers — used as a clustered primary key they can lead to endless data reordering).

Why does it matter? In real enterprise apps before you know it there are tens or hundreds of millions of rows throughout your database, and these rows need to be read from, and written to, the glacially slow storage subsystem constantly – given this finite resource, doesn’t it seem logical that 1MB of I/O carrying 20,000 records is better than 1MB of I/O carrying only 5,000 records? Of course it is. Isn’t it better that 10% of your database can fit in the memory cache rather than just 5%? Of course.

 

Don’t be lulled into a false justification of large data types by running ridiculously small benchmarks, where all of the data exists in the memory cache and the I/O is dwarfed by the computational element of the query, yielding "only" a performance hit of 10% or so with larger types – when your database gets to real enterprise size, size really does matter. That cluster indexed GUID primary key not only makes the row bigger, it makes every non-clustered index bigger (and thus slower) as well, and when the weakest link SAN is running at 100%, you'll regret every wasteful byte.

Indexes

Many SQL Server performance problems are rooted in missing or inappropriate indexes, or alternately unused indexes. This is often true for databases thrown together by front-end experts unhappily tasked with supplying the back-end database, just as it’s often the case for those carefully crafted by highly-focused database professionals.

An understanding of indexes, and a focus on their application, is paramount for high performance databases. Not only is it critical to create the right indexes, it’s important to craft your access to properly utilize the indexes that are there.

A Bookshelf Full of Examples

Non-Clustered Indexes

Indexes in reference books serve the same purpose (and share many of the same traits) as those in the database world – by referencing the index you can follow a shortcut to a particular piece of information, seeking directly to that specific page, versus going from page to page scanning the contents. In the SQL Server world these sorts of indexes are called non-clustered indexes (or secondary indexes) – they are a subset of the data ordered for a specific purpose, containing a pointer to where the real data row can be found.

In the Northwind database an example of a non-clustered index is ShipPostalCode on the Orders table. This index sorts by the ShipPostalCode, and may be used for a query such as the following.

SELECT * FROM Orders WHERE ShipPostalCode = '05022'

If you take a look at the execution plan (Ctrl-K or choose the option “Show Execution Plan” under the Query drop-down in Query Analyzer. When you execute a query with this option enabled a new tab, Execution Plan, will appear beside the results tab), you can see that an index seek took place, and then a bookmark lookup to find the actual data row. If we run SET STATISTICS IO ON on the connection before running the above query, we’ll get some statistics on the IO used to satisfy the query, which will be as follows.

Table 'Orders'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0.

Compare this to the following query, which simulates having no index.

SELECT * FROM Orders WITH(INDEX(0)) WHERE ShipPostalCode = '05022'

In this case the execution shows a full table scan, and our IO statistics reports the following.

Table 'Orders'. Scan count 1, logical reads 21, physical reads 0, read-ahead reads 0.

Without the index there was significantly more I/O, and the differential would be vastly worse if this were a large enterprise-sized table instead of a small sample table. To make matters much, much worse such a table scan will block on every single exclusive page or row lock on the table, waiting for the data to be unlocked just to be able to verify if the wanted information is contained within, while the index seek knows that the locked data isn't the data that it's looking for and is unaffected. Try running the above two queries in two separate query analyzer windows while the following script is running in yet another (increase the WAITFOR delay if you can't jump between them all within 30 seconds).

BEGIN TRAN

UPDATE Orders SET OrderDate = '1997-08-27'
WHERE OrderID = 10647

WAITFOR DELAY '00:00:30'

ROLLBACK TRAN

The first query, using the index, instantly returns the result regardless of the row lock, which is logical given that the row being locked is not what the query is looking for, while the second query, not using the index (which can happen because no index exists, or the index isn’t deemed the best choice) blocks until the other connection’s lock is released. Not using an index is not only vastly less efficient, it can significantly worsen blocking problems as a database scales (or rather tries to scale).

Clustered Indexes

Returning to our analogy, many books take it a step further and order the content, making the data itself an index of sorts. A cookbook might sort by main category and then dish name, while a phone book famously sorts by [city, last name, first name] (in that order). Thus if you want to search based upon the sorted data, it’s extremely efficient – in the example of the phone book you can very quickly seek to the desired city and last name, scanning a small number of records for the desired person. In the SQL Server world this sort of index is called a clustered index (the sorting of the data itself), and for obvious reasons you can only have one clustered index in a book, or on a table. The primary benefit of a clustered index is that all of the table data is immediately available for every index match – no dereferencing is necessary.

In the Northwind database consider the following query.

SELECT * FROM Orders JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID WHERE Orders.ShipPostalCode = '05022'

If you look at the execution plan, the Order Details data is grabbed via a very efficient clustered index seek.

Clustered indexes aren’t all milk and honey, though. For instance imagine that you’re the hard working typesetter maintaining the layout of the phone book, and you’ve carefully arranged all of the entries on the respective pages. Every time a new entry comes in and doesn’t coincidentally fit right at the end of the sort order, or someone changes the information on an existing row in a way that alters the sort order (“Smith” changes his name to “Jones”), you need to reorganize some pages to make space. This same data-churn problem occurs with both non-clustered and clustered indexes, but clustered indexes exacerbate the problem given that it contains the entirety of the row data.

Of course you could plan for this by keeping a bit of blank space on each of your pages to facilitate at least a couple of changes, which is what the fillfactor is used for in SQL Server (a lower fillfactor leaves more empty space but reduces the true data density – insert performance is improved, but read performance is diminished. A high fill factor increases the real data density and thus read performance, but increases the likelihood of inserts requiring page splits. Note that fillfactor only applies on index creation, and whenever you defragment/rebuild your indexes), however this can be a serious performance issue for out-of-order inserted data, or frequently changing cluster indexed fields. It’s for this reason that many developers use the monotonically increasing identity field as their cluster index. Historically there was a worry that having multiple inserts all going to the same "place" in the index, at the end, would lead to scalability killing contention at this hot spot. SQL Server has logic to deal with identity fields and effectively eliminates this hot spot issue.

Another problem with clustered indexes is their girth (they contain the entire row data). This is largely irrelevant if you’re seeking to specific records, or where you actually plan on using all of the data after a lookup, however if you are querying a range of data (for instance the first name of all of the people with the last name “Forbes” in the city of “Oakville”) the query engine will read in the entire row contents for each matching record using a range scan, extracting only the requested data. In our example phone book there is so little extra data that it’s a minor overhead, however in many large real-world tables this can have a serious performance impact.

Consider if instead we had a secondary index that was sorted by City, Last Name, First Name, the query engine could very efficiently scan past only the small index entries.

Covering Indexes

This brings up a very important point – Some indexes contain enough information that you don’t even need to go to the content, your query being satisfied by the index itself. Consider an index in a tour book that sorts famous attractions by their name, and the country and city that they were located in, pointing to the page where further information regarding it could be found. If you just want to know what city the Accademia dell' Arte del Disegno in Italy is found in, a quick seek through the index will tell you that it’s in Florence. In this instance the index was a “covering index”, in that it fully covered our request and we didn’t need to dereference to the complete topical information. This is often the most efficient query mechanism of all.

Consider the following variation of a query we ran earlier.

SELECT OrderID FROM Orders WHERE ShipPostalCode = '05022'

This will efficiently use the ShipPostalCode index to find the specific record, and because the query is fully satisfied by the index itself, the costly bookmark lookup is avoided, and IO is minimal.

Table 'Orders'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0.

The observant will note that OrderID isn’t actually in the index ShipPostalCode, or at least it doesn’t appear to be. The trick is that all of the fields that are the sort fields for the clustered index, if one is defined, are automatically added as data fields to every other index on the table. This can be blessing in cases where you want one of the clustered fields and suddenly the non-clustered index is a covering index, but it also needs to be weighed against the fact that it makes every other index larger, and thus less “data dense”.

Small covering index seeks are the most efficient method of pulling data, and it’s a good reason to ensure that you are only pulling the specific fields that you actually need from any given table, preferably with a covering index. Range scans are also highly efficient in many situations, and are usually used when the wanted entries in an index are consecutive, such as when you search BETWEEN two dates against a date index, though because of the previously mentioned bookmark lookup costs range scans are generally only seen if the index is fully covering, or against the clustered index.

Even in the case where a full scan is necessary, indexes might still be fully covering. Consider if the following index were added to the Customer table.

CREATE INDEX CountryCity ON Customers(Country,City,Address)

This index of course contains Country, City, Address, but as mentioned above it also contains CustomerID because it’s in the clustered index. Of course if we query on Country, or Country and City, or Country, City, and Address, an efficient seek or range scan might be used to pull the matching records.

SELECT CustomerID FROM Customers WHERE Country = 'Canada'

What if instead we wanted to search only on the address? In that case the index can’t be searched in order because it sorts by country and then city and address, and thus a particular address could exist anywhere in the index.

SELECT CustomerID FROM Customers WHERE Address = '43 rue St. Laurent'

You might be surprised to see that it still used the index, albeit this time it’s an inefficient scan rather than a seek. The index was used because it covered the query (all predicates and returned columns), and because the index is only a subset of the data it requires less I/O to scan the entire index than it does to scan the entire table data.

Statistics and Bookmarks

In a prior example we ran a query that required a bookmark lookup to satisfy the query (thus it did not have a covering index). The query was as follows.

SELECT * FROM Orders WHERE ShipPostalCode = '05022'

If you look at the execution plan for this query you can see that it seeks the “lookups” in the index, and then does a bookmark lookup against the clustered index (which is the actual table data). In this case there is only a single row to return, but even still the bookmark lookup cost is estimated to account for 50% of the cost of the query.

The cost of bookmark lookups, where an item is found in the index but it isn’t a covering index, is the reason why many people are surprised to find that SQL Server has ignored what they believe are perfect indexes and instead table scanned (“Why isn’t it using my index! ARGHHH!!!”). Consider the following query.

SELECT * FROM Orders WHERE ShipPostalCode = '24100'

Looking at the query plan you can see that it actually did a cluster index scan (which is a table scan on a table with a clustered index) instead of using our index, and the subtree cost is 0.0530.

This might seem perplexing because we seem to have a perfectly satisfactory index, however let’s do the same query again, this time using a query hint to force it to use our index.

SELECT * FROM Orders WITH(INDEX(ShipPostalCode)) WHERE ShipPostalCode = '24100'

If you look at the query plan you can see that it used our index, as demanded, but this time the bookmark lookups account for 80% of the query time. Our subtree cost comes in at 0.0564 – more than it was doing a table scan!

In this case this was only 10 records of a total of 830 (1.2%) yet still it opted to do a full table scan rather than using our index. Many developers have been perplexed in this situation, wondering why SQL Server was avoiding their beautiful index, but it’s doing it for a very valuable reason – it was cheaper than indirectly looking up each piece of data through bookmarks.

Of course we could have avoided bookmark lookups by using the index as a covering index if possible by using a query like the following, presuming this was all the data we needed to extract from the table.

SELECT OrderID FROM Orders WHERE ShipPostalCode = '24100'

Now it uses our index, is super efficient, and has a subtree cost of only 0.0064. In larger databases the difference can be the tremendous by avoiding both the bookmark lookup and the table scan.

So how did the query engine guess how many rows would match a criteria to choose which method (whether by index and bookmark lookup, or table scan) to use to most efficiently satisfy the query? That’s where something called distribution statistics comes into play. Statistics are a representative set of the data that are used by the query engine to make a best-guess plan for how to most efficiently serve the data. You can view the statistics for a given index, in the following case for the index ShipPostalCode, via the DBCC SHOW_STATISTICS command.

DBCC SHOW_STATISTICS('Orders','ShipPostalCode')

Due to the limited number of discrete ShipPostalCode values in the table, the statistics are entirely accurate in this case. In a more realistic database, with thousands or millions of rows, statistics start to become much more of an estimation (with an ever increasing margin of error). These estimations can lead to entirely wrong assumptions by the query engine in some edge cases, such as where it thinks a given set of predicates will yield thousands of rows when really it might yield only a couple.

Statistics can also fail for multi-field indexes. In this case the selectivity of the first field is used, so in the case of the index we created earlier (Country, City, Address), due to the fact that the country has a low selectivity (there are lots of entries for each country), the index will often be ignored, even though the city and address combination is highly unique. For this reason it is generally recommended that the most selective field comes first in your index, so in the case of that index the fields would be Address, City, and then Country. This is debateable because it also makes the index more single purpose – it no longer serves an efficient purpose for less-granular searches like just Country, or Country/City. This needs to be evaluated on a case by case basis, and truly wouldn’t be an issue for fully-normalized tables.

It should also be noted that ensuring that your statistics are as accurate as possible is critical. SQL Server includes automatic statistic updates, on by default, where it will attempt to do data sampling and update statistics when it feels they are out of date. Nonetheless it is a best practice to schedule full statistic updating at regular intervals (at a minimum weekly), preferably using the WITH FULLSCAN option so it is as accurate as possible. The standard database maintenance plan includes a step for statistic updating, and allows you to choose the amount of data to sample.

Regardless of all of the above, there will be cases where you may find that your statistics are up to date, your indexes are optimal, yet SQL Server is still incorrectly choosing not to use your index. In this case it may be an unfortunate reality that an index hint needs to be added to the query to politely (or rather sternly) request that it reconsider. Obviously this should be a last resort.

Actually Using Indexes

So you’ve created beautiful indexes, and you’ve ensured that your query only pulls the necessary data from each table, using covering indexes where possible to avoid costly bookmark lookups. You pull up the execution plan to find…that the query engine is entirely ignoring your index. There are several reasons why this could happen.

Consider the following query.

SELECT OrderID FROM Orders WHERE LEFT(ShipPostalCode,4) = '0502'

Fairly simple query, and from the looks of it one might think it’d be an efficient covered index seek. Upon execution you’ll discover that actually it was an inefficient scan. Consider the following instead.

SELECT OrderID FROM Orders WHERE ShipPostalCode LIKE '0502%'

In this case the query is executed as a highly efficient index seek. I have had cases where this tiny difference reduced an enterprise report from running for literally hours to a matter of seconds.

The reason is that in the former the indexed field was hidden within a function. The query engine can’t predict what the result of the function will be, so it’s forced to evaluate it for every row to see what pops out. LIKE is a first class comparison, it knows how it behaves, so the query engine can actually optimize against it. There are countless cases where people hide criteria fields in functions unnecessarily, and the result is massive, unexpected inefficiency.

The most common example of this mistake is using DATEADD/DATEDIFF to pull rows within a certain period of time – instead of pre-calculated a fixed demarcation (i.e. precalculating GetDate() – 3 years) and then doing a direct comparison with the row data, developers are forcing whole table scans with wasteful date computations on every single row. For instance consider a query to report news items that have occurred within the past 12 hours from a hypothetical news table .

DECLARE @CurrentTime

SET @CurrentTime = GetDate()

SELECT * FROM NewsStories WHERE DATEDIFF(hh,NewsDate,@CurrentTime)<12

Guaranteed to be terribly inefficient, yet it's overwhelmingly common. The query engine can much more effectively optimize the following variant.

DECLARE @StartTime

SET @StartTime = DATEADD(hh,-12,GetDate())

SELECT * FROM NewsStories WHERE NewsDate > @StartTime

Database Cheat Sheet

Indexed Computed Columns

At the outset I advocated that you minimize space usage (increasing real data density). The goal wasn’t to try to fit that database on a floppy disk, but rather to minimize the amount of I/O necessary to satisfy a given query, as I/O is the weakest link of most enterprise systems. There are design choices, such as adding additional indexes, that actually increase the size of your database on disk yet reduce the I/O necessary for certain queries, and these are usually very worthwhile trade-offs.

Another powerful technique you can use to trade disk space for improved database performance is indexed calculated columns. There are countless variations, but I’ll cover one scenario that is fairly commonly used – report counts by month for a given year. In the case of the Orders table this could be achieved via the following query.

SELECT YEAR(OrderDate) AS [Year], MONTH(OrderDate) AS [Month], COUNT(*) AS [Monthly Orders]
FROM Orders
WHERE YEAR(OrderDate)=1997
GROUP BY YEAR(OrderDate), MONTH(OrderDate)

Instead of adhoc decomposing the date into month and year constituents, consider adding them as computed columns.

ALTER TABLE dbo.Orders ADD
            OrderDateYear  AS CONVERT(smallint,YEAR(OrderDate)),
            OrderDateMonth  AS CONVERT(tinyint,MONTH(OrderDate))

Now we can change our query to the following.

SELECT OrderDateYear AS [Year], OrderDateMonth AS [Month], COUNT(*) AS [Monthly Orders]
FROM Orders
WHERE OrderDateYear=1997
GROUP BY OrderDateYear, OrderDateMonth

By itself we’ve done nothing for the query efficiency (in fact it is actually less efficient as it’s applying the where predicate after building the set), though we’ve achieved a bit of “code re-use”. However we now have the foundations for some powerful indexed computed columns.

CREATE NONCLUSTERED INDEX IX_OrderDateDecomposed ON dbo.Orders
            (
            OrderDateYear,
            OrderDateMonth
            ) ON [PRIMARY]
GO

Now the query referencing these computed columns is dramatically more efficient. Even better, these indexed computed columns haven’t decreased the real data density of the table because they’re only materialized in the index. NOTE: Ensure that queries that don’t need these computed fields don’t pull them explicitly or implicitly via the wasteful * column selector, as it’ll unnecessarily calculate each of the computed fields for each row.

Indexed Views

[This functionality only exists in the Enterprise and Developer edition of SQL Server 2000]

An indexed view, sometimes referred to as a materialized view, is a sort of indexed computed columns on steroids, taking the idea of storing computed results to the next level. In a previous example we improved the performance of some data aggregation logic by grouping on some indexed computed columns. We can take a variant of that and create a view out of it.

CREATE VIEW dbo.OrdersByMonth
WITH SCHEMABINDING
AS
SELECT OrderDateYear AS [Year], OrderDateMonth AS [Month], COUNT_BIG(*) AS [Monthly Orders]
FROM dbo.Orders
GROUP BY OrderDateYear, OrderDateMonth

As it is the view is acting as nothing more than a template for queries against the underlying table, and is only of benefit for code reuse (which itself is a very worthwhile goal). We can take that a step further by materializing this view so the actual results are stored, and changes to the underlying table is automatically reflected in the aggregates. The following command creates the indexed view.

CREATE UNIQUE CLUSTERED INDEX IX_OrdersByMonth ON OrdersByMonth(Year,Month)

Now the following query will be satisfied by the indexed view, using the pre-computed values, rather than recalculating for every query.

SELECT [Year],[Month],[Monthly Orders] FROM OrdersByMonth WHERE Year=1997

In the end the resource usage for our monthly order count query has dropped by about 90% over the original query, and this is for a tiny sample database. In the real-world the differential can be extraordinary.

Indexed views do have some downsides, such as the automatic maintenance that occurs whenever the underlying data changes, however they can be an extraordinarily powerful tool in your arsenal and deserve further research if your platform supports it.

Summary

  • Keep your rows as small as possible to maximize real data density.
  • All tables should have a clustered index except in rare exception situations. Normally this will be a single field primary key.
  • Small clustered indexes keep non-clustered indexes small, increasing real data density.
  • Clustered indexes help make other indexes covering indexes.
  • Queries serviced by covering indexes are extremely efficient.
  • Avoid hiding criteria fields in functions – indexes will not be used for them.
  • Consider indexed computed columns where appropriate
  • If you shelled out the cash for Enterprise Edition, seriously evaluate how indexed views fit in your solutions
  • Index, index, index! Only in extremely rare cases are the additional update and insertion costs associated with maintaining indexes heavier than the benefits.
  • Understand execution plans. Evaluate them regularly.
Examples given in this series will reference the sample database “Northwind”. It can be installed via the script found on the Microsoft website
Add This! Blinkbits Blinklist Blogmarks BlogMemes BlueDot BlogLines co.mments Connotea del.icio.us de.lirio.us Digg Diigo DZone Facebook FeedMeLinks Folkd.com Fleck Furl Google Google Reader icio.de IndianPad Leonaut LinkaGoGo Linkarena Linkter Magnolia Mister Wong MyShare Ask.com MyStuff Ask.com Yahoo! MyWeb Netscape Netvouz Newsgator Newsvine Oneview.de RawSugar reddit Rojo Segnalo Shadows Simpy SlashDot Smarking Sphere Spurl Startaid StumbleUpon TailRank Technorati ThisNext yigg.de Webnews.de ReadMe.ru Dobavi.com Dao.bg Lubimi.com Ping.bg Pipe.bg Svejo.net Web-bg.com Plugin by Dichev.com

Speak Your Mind

Tell us what you're thinking...
and oh, if you want a pic to show with your comment, go get a gravatar!

You must be logged in to post a comment.