Performance and tuning are built on top of good database design. They aren't panaceas. If you start with a bad database design, the information in the other chapters of this book may help you speed up your queries a little, but good overall performance starts with good design.
This chapter doesn't attempt to discuss all of the material presented in database design courses. It cannot teach you nearly as much as the many excellent books available on relational database design. This chapter presents some of the major design concepts and a few additional tips to help you move from a logical database design to a physical design on SQL Server.
Database design is the process of moving from real-world business models and requirements to a database model that meets these requirements. For relational databases such as SQL Server, the standard design creates tables in Third Normal Form.
When you translate an Entity-Relationship model, in Third Normal Form (3NF), to a relational model:
Relations become tables.
Attributes become columns.
Relationships become data references (primary and foreign key references).
Figure 2-1: Database design
![[IMAGE]](fig2_1.gif)
Based on access requirements and constraints, implement your physical database design as follows:
Denormalize where appropriate
Partition tables where appropriate
Group tables into databases where appropriate
Determine use of segments
Determine use of devices
Implement referential integrity of constraints
When a table is normalized, the non-key columns depend on the key, the whole key, and nothing but the key.
From a relational model point of view, it is standard to have tables that are in Third Normal Form. Normalized physical design provides the greatest ease of maintenance, and databases in this form are clearly understood by teams of developers.
However, a fully normalized design may not always yield the best performance. It is recommended that you design for Third Normal Form, and then, as performance issues arise, denormalize to solve them.
Each level of normalization relies on the previous level, as shown in Figure 2-2. For example, to conform to 2NF, entities must be in 1NF.
Figure 2-2: Levels of normalization
![[IMAGE]](fig2_2.gif)
When determining if a database is in a normal form, start with the assumption that the relation (or table) is not normalized. Then apply the rigor of each normal form level to it.
Normalization produces smaller tables with smaller rows:
More rows per page (less logical I/O)
More rows per I/O (more efficient)
More rows fit in cache (less physical I/O)
The benefits of normalization include:
Searching, sorting, and creating indexes are faster, since tables are narrower, and more rows fit on a data page.
You usually wind up with more tables. You can have more clustered indexes (you get only one per table) so you get more flexibility in tuning queries.
Index searching is often faster, since indexes tend to be narrower and shorter.
More tables allow better use of segments to control physical placement of data.
You usually wind up with fewer indexes per table, so data modification commands are faster.
You wind up with fewer null values and less redundant data, making your database more compact.
Triggers execute more quickly if you are not maintaining redundant data.
Data modification anomalies are reduced.
Normalization is conceptually cleaner and easier to maintain and change as your needs change.
While fully normalized databases require more joins, joins are generally very fast if indexes are available on the join columns. SQL Server is optimized to keep higher levels of the index in cache, so each join performs only one or two physical I/Os for each matching row. The cost of finding rows already in the data cache is extremely low.
The rules for First Normal Form are:
Every column must be atomic. It cannot be decomposed into two or more subcolumns.
You cannot have multivalued columns or repeating groups.
Every row and column position can have only one value.
The table in Figure 2-3 violates first normal form, since the dept_no column contains a repeating group:
Figure 2-3: A table that violates first normal form
![[IMAGE]](fig2_3.gif)
Normalization creates two tables and moves dept_no to the second table:
Figure 2-4: Correcting first normal form violations by creating two
tables
![[IMAGE]](fig2_4.gif)
For a table to be in Second Normal Form, every non-key field must depend on the entire primary key, not on part of a composite primary key. If a database has only single-field primary keys, it is automatically in Second Normal Form.
In the table in Figure 2-5, the primary key is a composite key on emp_num and dept_no. But the value of dept_name depends only on dept_no, not on the entire primary key.
Figure 2-5: A table that violates second normal form
![[IMAGE]](fig2_5.gif)
To normalize this table, move dept_name to a second table as shown in Figure 2-6.
Figure 2-6: Correcting second normal form violations by creating
two tables
![[IMAGE]](fig2_6.gif)
For a table to be in Third Normal Form, a non-key field cannot depend on another non-key field. The table in Figure 2-7 violates Third Normal Form because the mgr_lname field depends on the mgr_emp_num field, which is not a key field.
Figure 2-7: A table that violates Third Normal Form
![[IMAGE]](fig2_7.gif)
The solution is to split the Dept table into two tables, as shown in Figure 2-8. In this case, the Employees table, shown in Figure 2-4 already stores this information, so removing the mgr_lname field from Dept brings the table into Third Normal Form.
Figure 2-8: Correcting Third Normal Form violations by creating two
tables
![[IMAGE]](fig2_8.gif)
Once you have created your database in normalized form, you can perform benchmarks and decide to back away from normalization to improve performance for specific queries or applications.
The process of denormalizing:
Can be done with tables or columns
Assumes prior normalization
Requires a thorough knowledge of how the data is being used
Good reasons for denormalizing are:
All or nearly all of the most frequent queries require access to the full set of joined data
A majority of applications perform table scans when joining tables
Computational complexity of derived columns requires temporary tables or excessively complex queries
Denormalization should be based on thorough knowledge of the application, and it should be performed only if performance issues indicate that it is needed. For example, the ytd_sales column in the titles table of the pubs2 database is a denormalized column that is maintained by a trigger on the salesdetail table. The same values can be obtained using this query:
select title_id, sum(qty)
from salesdetail
group by title_id
To obtain the summary values and the document title requires a join with the titles table:
select title, sum(qty)
from titles t, salesdetail sd
where t.title_id = sd.title_id
group by title
It makes sense to denormalize this table if the query is run frequently. But there is a price to pay: you must create an insert/update/delete trigger on the salesdetail table to maintain the aggregate values in the titles table. Executing the trigger and performing the changes to titles adds processing cost to each data modification of the qty column value.
This situation is a good example of the tension between decision support applications, which frequently need summaries of large amounts of data, and transaction processing applications, which perform discrete data modifications. Denormalization usually favors one form of processing at a cost to others.
Figure 2-9: Balancing denormalization issues
![[IMAGE]](fig2_9.gif)
Whatever form of denormalization you choose, it has the potential for data integrity problems that must be carefully documented and addressed in application design.
Denormalization has these disadvantages:
It usually speeds retrieval but can slow data modification.
It is always application-specific and needs to be re-evaluated if the application changes.
It can increase the size of tables.
In some instances, it simplifies coding; in others, it makes coding more complex.
Denormalization can improve performance by:
Minimizing the need for joins
Reducing the number of foreign keys on tables
Reducing the number of indexes, saving storage space and reducing data modification time
Precomputing aggregate values, that is, computing them at data modification time rather than at select time
Reducing the number of tables (in some cases)
When deciding whether to denormalize, you need to analyze the data access requirements of the applications in your environment, and their actual performance characteristics. Often, good indexing and other solutions solve many performance problems.
Some of the issues to examine when considering denormalization include:
What are the critical transactions, and what is the expected response time?
How often are the transactions executed?
What tables or columns do the critical transactions use? How many rows do they access each time?
What is the mix of transaction types: select, insert, update, and delete?
What is the usual sort order?
What are the concurrency expectations?
How big are the most frequently accessed tables?
Do any processes compute summaries?
Where is the data physically located?
The most prevalent denormalization techniques are:
Adding redundant columns
Adding derived columns
Collapsing tables
In addition, you can duplicate or split tables to improve performance. While these are not denormalization techniques, they achieve the same purposes and require the same safeguards.
You can add redundant columns to eliminate frequent joins. For example, if frequent joins are performed on the titleauthor and authors tables in order to retrieve the author's last name, you can add the au_lname column to titleauthor.
Figure 2-10: Denormalizing by adding redundant columns
![[IMAGE]](fig2_10.gif)
Adding redundant columns eliminates joins for many queries. The problems with this solution are that it:
Requires maintenance of new column. All changes must be made to two tables, and possibly to many rows in one of the tables.
Requires more disk space, since au_lname is duplicated.
Adding derived columns can help eliminate joins and reduce the time needed to produce aggregate values. The total_sales column in the titles table of the pubs2 database provides one example of a derived column used to reduce aggregate value processing time.
The example in Figure 2-11 shows both benefits. Frequent joins are needed between the titleauthor and titles tables to provide the total advance for a particular book title.
Figure 2-11: Denormalizing by adding derived columns
![[IMAGE]](fig2_11.gif)
You can create and maintain a derived data column in the titles table, eliminating both the join and the aggregate at run time. This increases storage needs, and requires maintenance of the derived column whenever changes are made to the titles table.
If most users need to see the full set of joined data from two tables, collapsing the two tables into one can improve performance by eliminating the join.
For example, users frequently need to see the author name, author ID, and the blurbs copy data at the same time. The solution is to collapse the two tables into one. The data from the two tables must be in a one-to-one relationship to collapse tables.
Figure 2-12: Denormalizing by collapsing tables
![[IMAGE]](fig2_12.gif)
Collapsing the tables eliminates the join, but loses the conceptual separation of the data. If some users still need access to just the pairs of data from the two tables, this access can be restored by queries that select only the needed columns or by using views.
If a group of users regularly needs only a subset of data, you can duplicate the critical table subset for that group.
Figure 2-13: Denormalizing by duplicating tables
![[IMAGE]](fig2_13.gif)
The kind of split shown in Figure 2-13 minimizes contention, but requires that you manage redundancy and possible latency.
Sometimes, splitting normalized tables can improve performance. You can split tables in two ways:
Horizontally, by placing rows in two separate tables, depending on data values in one or more columns
Vertically, by placing the primary key and some columns in one table, and placing other columns and the primary key in another table.
Figure 2-14: Horizontal and vertical partitioning of tables
![[IMAGE]](fig2_14.gif)
Splitting tables--either horizontally or vertically--adds complexity to your applications. There usually needs to be a very good performance reason.
Use horizontal splitting in the following circumstances:
A table is large, and reducing its size reduces the number of index pages read in a query. B-tree indexes, however, are generally very flat, and you can add large numbers of rows to a table with small index keys before the B-tree requires more levels. An excessive number of index levels may be an issue with tables that have very large keys.
The table split corresponds to a natural separation of the rows, such as different geographical sites or historical vs. current data. You might choose horizontal splitting if you have a table that stores huge amounts of rarely used historical data, and your applications have high performance needs for current data in the same table.
Table splitting distributes data over the physical media (there are other ways to accomplish this goal, too).
Generally, horizontal splitting adds a high degree of complication to applications. It usually requires different table names in queries, depending on values in the tables. This complexity alone usually far outweighs the advantages of table splitting in most database applications. As long as the index keys are short, and the indexes are used for queries on the table (rather than table scans being used), doubling or tripling the number of rows in the table may increase the number of disk reads required for a query by only one index level.
Figure 2-15 shows how the authors table might be split to separate active and inactive authors:
Figure 2-15: Horizontal partitioning of active and inactive data
![[IMAGE]](fig2_15.gif)
Use vertical splitting in the following circumstances:
Some columns are accessed more frequently than other columns.
The table has wide rows, and splitting the table reduces the number of pages that need to be read.
Vertical table splitting makes even more sense when both of the above conditions are true. When a table contains very long columns that are not accessed frequently, placing them in a separate table can greatly speed the retrieval of the more frequently used columns. With shorter rows, more data rows fit on a data page, so fewer pages can be accessed for many queries.
Figure 2-16 shows how the authors table can be partitioned.
Figure 2-16: Vertically partitioning a table
![[IMAGE]](fig2_16.gif)
Whatever denormalization techniques you use, you need to develop management techniques to ensure data integrity. Choices include:
Triggers, which can update derived or duplicated data anytime the base data changes
Application logic, using transactions in each application that updates denormalized data to be sure that changes are atomic
Batch reconciliation, run at appropriate intervals to bring the denormalized data back into agreement
From an integrity point of view, triggers provide the best solution, although they can be costly in terms of performance.
In Figure 2-17, the sum_adv column in the titles table stores denormalized data. A trigger updates the sum_adv column whenever the advance column in titleauthor changes.
Figure 2-17: Using triggers to maintain normalized data
![[IMAGE]](fig2_17.gif)
If your application has to ensure data integrity, it will have to ensure that the inserts, deletes, or updates to both tables occur in a single transaction.
Figure 2-18: Maintaining denormalized data via application logic
![[IMAGE]](fig2_18.gif)
If you use application logic, be very sure that the data integrity requirements are well documented and well known to all application developers and to those who must maintain applications.
Note: Using application logic to manage denormalized data is risky. The same logic must be used and maintained in all applications that modify the data.
If 100-percent consistency is not required at all times, you can run a batch job or stored procedure during off hours to reconcile duplicate or derived data.
You can run short, frequent batches or longer, less frequent batches.
Figure 2-19: Using batch reconciliation to maintain data
![[IMAGE]](fig2_19.gif)