One of my favorite interview questions is to hand the candidate a marker and ask them to write out how to find all duplicates in a table. It should be straightforward, and it quickly weeds out anyone who struggles with SQL; even for candidates who don’t struggle, it’s a good way to gauge where they’re at. But it doesn’t stop there!
My thinking has evolved a little since I first posted this, and I’d like to hone the focus of this post on a clearer target: Look For Ridiculous instead.
The typical successful answer involves grouping by the business key and keeping the groups with a count greater than one, which I’ll show in the first example below; that is the go-to correct response to this question. But then I throw them a curve ball: “Great! Show me another way.” Then I ask for another, and another, and another… With great power comes great responsibility, and being the interviewer is nefariously fun!

So, let’s take a look at all the ways I’ve come up with to find duplicates in a table. Here’s the test data we’ll be working with, which contains 2 sets of duplicates:
```sql
CREATE TABLE FACT_ORDERS (
    CUSTOMER_KEY   INTEGER,
    ORDER_DATE_KEY INTEGER,
    ORDER_AMOUNT   DECIMAL(15,2),
    ETL_LOAD_DATE  DATE
);

INSERT INTO FACT_ORDERS SELECT 1, 20160901, 812.19,  CAST('2016-09-17' AS DATE);
INSERT INTO FACT_ORDERS SELECT 4, 20160901, 5169.23, CAST('2016-09-17' AS DATE);
INSERT INTO FACT_ORDERS SELECT 3, 20160901, 5169.23, CAST('2016-09-17' AS DATE);
INSERT INTO FACT_ORDERS SELECT 4, 20160902, 5169.23, CAST('2016-09-17' AS DATE);
INSERT INTO FACT_ORDERS SELECT 3, 20160901, 123.46,  CAST('2016-09-17' AS DATE);
INSERT INTO FACT_ORDERS SELECT 2, 20160901, 5169.23, CAST('2016-09-17' AS DATE);
INSERT INTO FACT_ORDERS SELECT 3, 20160901, 123.46,  CAST('2016-09-17' AS DATE);
INSERT INTO FACT_ORDERS SELECT 1, 20160902, 812.19,  CAST('2016-09-17' AS DATE);
INSERT INTO FACT_ORDERS SELECT 1, 20160901, 5169.23, CAST('2016-09-17' AS DATE);
INSERT INTO FACT_ORDERS SELECT 2, 20160901, 5169.23, CAST('2016-09-17' AS DATE);
INSERT INTO FACT_ORDERS SELECT 2, 20160901, 5169.23, CAST('2016-09-17' AS DATE);
```
The goal is to find all the duplicates in the table and return only 2 rows of data, i.e., the distinct rows that appear more than once, even though 5 physical rows are involved in the duplication (three copies of one row and two copies of another).
All of these methods should be ANSI standard and run on SQL Server, Teradata, Oracle, PostgreSQL, and so on. There are a few exceptions, like the HASHBYTES example, where you may have to modify the code a bit to work with your RDBMS.
Example 1 – Finding duplicates using GROUP BY HAVING
```sql
SELECT *
FROM FACT_ORDERS
GROUP BY 1, 2, 3, 4
HAVING COUNT(1) > 1;

-- Query Result Set --
/************************************************************\
|CUSTOMER_KEY | ORDER_DATE_KEY | ORDER_AMOUNT | ETL_LOAD_DATE|
|           2 |       20160901 |      5169.23 |    2016-09-17|
|           3 |       20160901 |       123.46 |    2016-09-17|
\************************************************************/
```
This is the correct answer, and it’s the method you should use when answering the question of how to find all duplicates in a table. But we’re interested in judging the candidate’s skill and advanced knowledge of SQL, so we’ll build on this answer in the following examples. Just understand that when querying data, this is the 100% correct, performant way to find all duplicates in a table, whether you care about every column or just a specific business key.
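If you want to verify this locally, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for a real warehouse (the table and rows are copied from the test data above; I spell out the column names in GROUP BY instead of ordinals, since ordinal grouping isn’t portable everywhere):

```python
import sqlite3

# Sample data from above, loaded into an in-memory SQLite database.
ROWS = [
    (1, 20160901, 812.19, "2016-09-17"), (4, 20160901, 5169.23, "2016-09-17"),
    (3, 20160901, 5169.23, "2016-09-17"), (4, 20160902, 5169.23, "2016-09-17"),
    (3, 20160901, 123.46, "2016-09-17"), (2, 20160901, 5169.23, "2016-09-17"),
    (3, 20160901, 123.46, "2016-09-17"), (1, 20160902, 812.19, "2016-09-17"),
    (1, 20160901, 5169.23, "2016-09-17"), (2, 20160901, 5169.23, "2016-09-17"),
    (2, 20160901, 5169.23, "2016-09-17"),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FACT_ORDERS (CUSTOMER_KEY INT, ORDER_DATE_KEY INT,"
             " ORDER_AMOUNT REAL, ETL_LOAD_DATE TEXT)")
conn.executemany("INSERT INTO FACT_ORDERS VALUES (?, ?, ?, ?)", ROWS)

# Group by every column and keep only the groups that occur more than once.
dupes = conn.execute("""
    SELECT CUSTOMER_KEY, ORDER_DATE_KEY, ORDER_AMOUNT, ETL_LOAD_DATE
    FROM FACT_ORDERS
    GROUP BY CUSTOMER_KEY, ORDER_DATE_KEY, ORDER_AMOUNT, ETL_LOAD_DATE
    HAVING COUNT(*) > 1
    ORDER BY CUSTOMER_KEY
""").fetchall()
print(dupes)  # the two duplicated rows
```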
Example 2 – Finding duplicates using ROW_NUMBER window function
```sql
SELECT CUSTOMER_KEY
      ,ORDER_DATE_KEY
      ,ORDER_AMOUNT
      ,ETL_LOAD_DATE
FROM (
    SELECT F.*
          ,ROW_NUMBER() OVER (
               PARTITION BY CUSTOMER_KEY, ORDER_DATE_KEY, ORDER_AMOUNT, ETL_LOAD_DATE
               ORDER BY CUSTOMER_KEY
           ) AS DUPE
    FROM FACT_ORDERS F
) FOO
WHERE DUPE = 2;

-- Query Result Set --
/************************************************************\
|CUSTOMER_KEY | ORDER_DATE_KEY | ORDER_AMOUNT | ETL_LOAD_DATE|
|           2 |       20160901 |      5169.23 |    2016-09-17|
|           3 |       20160901 |       123.46 |    2016-09-17|
\************************************************************/
```
This method uses the ROW_NUMBER window function to find all duplicates in a table. It’s a great way to see how well a candidate handles PARTITION BY within a window function. It isn’t an ideal way of actually finding duplicates, since it requires a nested sub-query and makes the code unnecessarily messy. However, it is a great tool during an interview!
Example 3 – Finding duplicates using SUM window function
```sql
SELECT CUSTOMER_KEY
      ,ORDER_DATE_KEY
      ,ORDER_AMOUNT
      ,ETL_LOAD_DATE
FROM (
    SELECT *
          ,SUM(1) OVER (
               PARTITION BY CUSTOMER_KEY, ORDER_DATE_KEY, ORDER_AMOUNT, ETL_LOAD_DATE
               ORDER BY DUPE
               ROWS UNBOUNDED PRECEDING
           ) AS REAL_DUPE
    FROM (
        SELECT *
              ,SUM(1) OVER (
                   PARTITION BY CUSTOMER_KEY, ORDER_DATE_KEY, ORDER_AMOUNT, ETL_LOAD_DATE
                   ORDER BY CUSTOMER_KEY
                   ROWS UNBOUNDED PRECEDING
               ) AS DUPE
        FROM FACT_ORDERS
    ) FOO
    WHERE DUPE > 1
) BAR
WHERE REAL_DUPE = 1;

-- Query Result Set --
/************************************************************\
|CUSTOMER_KEY | ORDER_DATE_KEY | ORDER_AMOUNT | ETL_LOAD_DATE|
|           2 |       20160901 |      5169.23 |    2016-09-17|
|           3 |       20160901 |       123.46 |    2016-09-17|
\************************************************************/
```
This method is almost identical to the ROW_NUMBER method above, but uses the SUM window function instead. The key takeaway here is the use of ROWS UNBOUNDED PRECEDING to turn a running sum into a ROW_NUMBER-like counter. Again, this is not a preferred way to find duplicates, but it does validate a candidate’s command of different window functions. Oftentimes I’ll even say “Show me how to find all duplicates in a table by using the SUM window function.” This is a tough one.
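Because the running SUM(1) frame behaves exactly like ROW_NUMBER, the two-level query can also be collapsed into a single filter on DUPE = 2, which fires exactly once per duplicated group. Here is a quick sqlite3 sketch of that simplification (my shortcut, not the query above; SQLite window functions require version 3.25+):

```python
import sqlite3

# Same sample data as above, in an in-memory SQLite database.
ROWS = [
    (1, 20160901, 812.19, "2016-09-17"), (4, 20160901, 5169.23, "2016-09-17"),
    (3, 20160901, 5169.23, "2016-09-17"), (4, 20160902, 5169.23, "2016-09-17"),
    (3, 20160901, 123.46, "2016-09-17"), (2, 20160901, 5169.23, "2016-09-17"),
    (3, 20160901, 123.46, "2016-09-17"), (1, 20160902, 812.19, "2016-09-17"),
    (1, 20160901, 5169.23, "2016-09-17"), (2, 20160901, 5169.23, "2016-09-17"),
    (2, 20160901, 5169.23, "2016-09-17"),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FACT_ORDERS (CUSTOMER_KEY INT, ORDER_DATE_KEY INT,"
             " ORDER_AMOUNT REAL, ETL_LOAD_DATE TEXT)")
conn.executemany("INSERT INTO FACT_ORDERS VALUES (?, ?, ?, ?)", ROWS)

# SUM(1) over a ROWS UNBOUNDED PRECEDING frame is a running count
# (1, 2, 3, ...) within each partition -- ROW_NUMBER in disguise.
# DUPE = 2 then returns exactly one row per group with >= 2 identical rows.
dupes = conn.execute("""
    SELECT CUSTOMER_KEY, ORDER_DATE_KEY, ORDER_AMOUNT, ETL_LOAD_DATE
    FROM (
        SELECT *,
               SUM(1) OVER (
                   PARTITION BY CUSTOMER_KEY, ORDER_DATE_KEY,
                                ORDER_AMOUNT, ETL_LOAD_DATE
                   ORDER BY CUSTOMER_KEY
                   ROWS UNBOUNDED PRECEDING
               ) AS DUPE
        FROM FACT_ORDERS
    ) AS FOO
    WHERE DUPE = 2
    ORDER BY CUSTOMER_KEY
""").fetchall()
```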
Example 4 – Finding duplicates using CTE and HASHBYTES
```sql
WITH CTE AS (
    SELECT *
          ,HASHBYTES('SHA1',
               CAST(CUSTOMER_KEY   AS VARCHAR(20)) +
               CAST(ORDER_DATE_KEY AS VARCHAR(20)) +
               CAST(ORDER_AMOUNT   AS VARCHAR(20)) +
               CAST(ETL_LOAD_DATE  AS VARCHAR(20))
           ) AS HASHBYTE
    FROM FACT_ORDERS
)
SELECT DISTINCT CUSTOMER_KEY, ORDER_DATE_KEY, ORDER_AMOUNT, ETL_LOAD_DATE
FROM (
    SELECT HASHBYTE
    FROM CTE
    GROUP BY HASHBYTE
    HAVING COUNT(HASHBYTE) > 1
) AS FOO
INNER JOIN CTE
    ON FOO.HASHBYTE = CTE.HASHBYTE;

-- Query Result Set --
/************************************************************\
|CUSTOMER_KEY | ORDER_DATE_KEY | ORDER_AMOUNT | ETL_LOAD_DATE|
|           2 |       20160901 |      5169.23 |    2016-09-17|
|           3 |       20160901 |       123.46 |    2016-09-17|
\************************************************************/
```
Now we’re getting ridiculous! The HASHBYTES is overkill, I know, but it does two useful things: it cleans up the code, and on a large table it provides for a fast join. This method requires knowledge of a CTE; otherwise, you’d have to nest the same logic twice, which nobody wants to see. (One caveat: concatenating the columns without a delimiter means two different rows can produce the same string, so in practice you’d want a separator between the casts.) The use of DISTINCT is really what makes this query ridiculous, but when it comes to extrapolating a person’s SQL skills, it’s a great way to do so.
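HASHBYTES itself is SQL Server only, but the same idea can be sketched portably by hashing rows client-side, for example with Python’s hashlib against the sample table. The “|” delimiter between columns is my addition (the original concatenation omits one) to keep column boundaries unambiguous:

```python
import hashlib
import sqlite3
from collections import Counter

# Same sample data as above, in an in-memory SQLite database.
ROWS = [
    (1, 20160901, 812.19, "2016-09-17"), (4, 20160901, 5169.23, "2016-09-17"),
    (3, 20160901, 5169.23, "2016-09-17"), (4, 20160902, 5169.23, "2016-09-17"),
    (3, 20160901, 123.46, "2016-09-17"), (2, 20160901, 5169.23, "2016-09-17"),
    (3, 20160901, 123.46, "2016-09-17"), (1, 20160902, 812.19, "2016-09-17"),
    (1, 20160901, 5169.23, "2016-09-17"), (2, 20160901, 5169.23, "2016-09-17"),
    (2, 20160901, 5169.23, "2016-09-17"),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FACT_ORDERS (CUSTOMER_KEY INT, ORDER_DATE_KEY INT,"
             " ORDER_AMOUNT REAL, ETL_LOAD_DATE TEXT)")
conn.executemany("INSERT INTO FACT_ORDERS VALUES (?, ?, ?, ?)", ROWS)

def row_hash(row):
    # SHA-1 over a delimited rendering of the row; the delimiter prevents
    # e.g. ('1', '23') and ('12', '3') from hashing to the same string.
    return hashlib.sha1("|".join(str(col) for col in row).encode()).hexdigest()

rows = [tuple(r) for r in conn.execute("SELECT * FROM FACT_ORDERS")]
counts = Counter(row_hash(r) for r in rows)

# Distinct rows whose hash occurs more than once.
dupes = sorted({r for r in rows if counts[row_hash(r)] > 1})
```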
Do you have another way?
Can you come up with additional ways to find all the dupes in a table? I’d love to hear more examples that explore even greater variation in SQL.
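To get the ball rolling, here is one more that isn’t in the list above: a correlated subquery that counts each row’s exact matches, shown as a sqlite3 sketch against the same sample data:

```python
import sqlite3

# Same sample data as above, in an in-memory SQLite database.
ROWS = [
    (1, 20160901, 812.19, "2016-09-17"), (4, 20160901, 5169.23, "2016-09-17"),
    (3, 20160901, 5169.23, "2016-09-17"), (4, 20160902, 5169.23, "2016-09-17"),
    (3, 20160901, 123.46, "2016-09-17"), (2, 20160901, 5169.23, "2016-09-17"),
    (3, 20160901, 123.46, "2016-09-17"), (1, 20160902, 812.19, "2016-09-17"),
    (1, 20160901, 5169.23, "2016-09-17"), (2, 20160901, 5169.23, "2016-09-17"),
    (2, 20160901, 5169.23, "2016-09-17"),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FACT_ORDERS (CUSTOMER_KEY INT, ORDER_DATE_KEY INT,"
             " ORDER_AMOUNT REAL, ETL_LOAD_DATE TEXT)")
conn.executemany("INSERT INTO FACT_ORDERS VALUES (?, ?, ?, ?)", ROWS)

# For each row, count the rows in the table that match it on every column;
# DISTINCT collapses the survivors down to one row per duplicated group.
dupes = conn.execute("""
    SELECT DISTINCT F1.*
    FROM FACT_ORDERS F1
    WHERE (SELECT COUNT(*)
           FROM FACT_ORDERS F2
           WHERE F2.CUSTOMER_KEY   = F1.CUSTOMER_KEY
             AND F2.ORDER_DATE_KEY = F1.ORDER_DATE_KEY
             AND F2.ORDER_AMOUNT   = F1.ORDER_AMOUNT
             AND F2.ETL_LOAD_DATE  = F1.ETL_LOAD_DATE) > 1
    ORDER BY F1.CUSTOMER_KEY
""").fetchall()
```

It’s an O(n²) scan without an index, so it’s strictly interview material, not something to run on a real fact table.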
Leave a comment below!