When revising one of my popular sessions on scenarios in which indexes no longer help, one of the topics covered was foreign key constraints for data integrity and JOIN eliminiation. During my research, I came across various arguments from developers who reject foreign key constraints for data integrity in databases. This led me to take a closer look at these statements and the disadvantages listed. This is the first article on foreign keys in a series of articles I intend to write to test, confirm, or refute the claims.
Inhaltsverzeichnis
Edgar F. Codd – Rule #10
Integrity constraints specific to a particular relational database must be definable in the relational data sublanguage and storable in the catalog, not in the application programs.
It has been almost 40 years since Edgar F. Codd postulated his 12 rules describing guidelines for a relational database system that still exist today. Why do database developers resist standards and implement integrity rules in applications; it should be the job of RDBMS systems!
Integrity rules (data integrity) are enforced with constraints. Other constraints are e.g.
- NOT NULL
- UNIQUE
- PRIMARY KEY
Arguments against Foreign Key constraints for Data Integrity
I have researched a few opinions against the use of foreign keys on the Internet. I will leave it up to you to evaluate the reasons, but I will make my opinion known.
NOTE: All statements are from people who – in general – are not against foreign key relationships but give reasons why they should not be used!
These arguments seem far-fetched to me personally. Yes, the checks take time because the check accesses the master table. How big is the time difference if the check is carried out in the database and if the application checks the existence beforehand?
Classic arguments against foreign key constraints, aren’t they. I find the reasoning „strange“ because the authors says that they can delete „problematic“ rows or can do anything else with the data without concerns. The authors didn’t clearly fully grasped the concept of foreign key constraints. The fact is that in the relational model developed according to the rules of Edgar F. Codd, data integrity is important and there can be NO problematic rows!
- there can be no orders that have been assigned to an unknown customer.
- Customers cannot be deleted from a table if orders, invoices, or other dependend rows on other relations exist…
A foreign key relationship ensures that only customers that exist in the [customers] table can be selected for an order! So how can „problematic“ data rows arise when the database system is tasked with preventing them!
Debunking Myth about slow DML operations with Foreign Key constraints
With a few examples from my demo database ERP_DEMO, which I use in my workshops, I would like to examine the concerns about foreign keys and prove the opposite.
Time – an important criterion
It is stated that the time component prevents the implementation of foreign key constraints. When data is inserted into the detail table, data is changed or data is deleted, the RDBMS System checks the foreign keys to ensure data integrity. With three tests (each test ran 5 times!) I checked the time difference with and without foreign keys. The pictures showes the test results with the avg run time!
In the scenario, 10,000 rows are inserted simultaneously by 20 threads. Both the average time and the total time, separated by runtime and CPU time, are logged. The Stored Procedure is straight forward and pushes row by row into the table. The tests have been made with SQLQueryStress.
CREATE OR ALTER PROCEDURE dbo.InsertNewOrder
AS
BEGIN
SET NOCOUNT ON;
SET XACT_ABORT ON;
INSERT INTO dbo.demo_orders
SELECT TOP (1) * FROM dbo.orders;
END
GO
No Foreign Key Constraints – no check
The total runtime for entering 200,000 records (20 threads with 10,000 rows) is 15.5 seconds (median). Since no further checks had to be performed, the records could be entered directly. |
No Foreign Key constraints but check by application
If the data integrity is not to be ensured by the RDBMS system, the application must take on this task. That’s how fair it should be.
The stored procedure is extended to check whether the customer to be entered exists.
CREATE OR ALTER PROCEDURE dbo.InsertNewOrder
AS
BEGIN
SET NOCOUNT ON;
SET XACT_ABORT ON;
INSERT INTO dbo.demo_orders
SELECT TOP (1) o.*
FROM dbo.orders AS o INNER JOIN dbo.customers AS c
ON (o.o_custkey = c.c_custkey);
END
GO
The integrity test is kept primitive because I am only concerned with the temporal impact. I enter a data record and cover its validity for the customer with a JOIN. This means that the customer data record must exist.
The total runtime for entering 200,000 records (20 threads with 10,000 rows) is 18 seconds (median). The query uses an INNER JOIN to make sure that only a record is used where a valid customer exists!. |
Foreign Key Constraints for Data Integrity – BuiltIn
ALTER TABLE dbo.demo_orders
ADD CONSTRAINT fk_demo_orders_customers
FOREIGN KEY (o_custkey) REFERENCES dbo.customers (c_custkey);
GO
The total runtime for entering 200,000 records (20 threads with 10,000 rows) is 12 seconds (median). The check for the existing customer is done by a LEFT SEMI JOIN (e.g. EXISTS ()). |
evaluation of the results
The result surprised me. When I started this blog post, I was sure that inserting data without any further checks would win by a long way.
I was all the more surprised that, despite checking the value to be entered, the implemented foreign key clearly won the race.
test description | total runtime | avg runtime/iteration | avg CPU/iteration |
Native inserts without checks | ~15 sec. | 0,0010 sec. | 0,0010 sec. |
Native inserts with checks by app | ~18 sec. | 0,0014 sec. | 0,0007 sec. |
Inserts with Foreign Key | ~11 sec. | 0,0009 sec. | 0,0008 sec. |
Summary
Foreign keys are designed to comply with the rules for relational databases postulated by Edgar F. Codd. Their purpose is not to annoy developers or make their lives difficult.
One argument against using foreign keys is the time delay when entering/updating/deleting data. A foreign key has hardly any measurable influence on the entry of data. In my example, they were faster than other techniques.
Thank you for reading!