Duplicates in a database take up unnecessary space and can cause errors on a website or application. In a previous article, SQL queries to find duplicates were presented. This article presents a query to remove duplicates while keeping one row.
WARNING : before starting to execute the delete request, you must remember to make a backup (cf. a backup) of the table concerned. In the worst case, it will be possible to re-install it.
Reminder: query to find duplicates
Before you want to remove any duplicates, you must first check if there are any. The following query is to be used by replacing fields 1 to 3 with the fields of your choice which must be used as a repository to find the duplicate rows.
SELECT COUNT(*) AS nbr_doublon, champ1, champ2, champ3 FROM table GROUP BY champ1, champ2, champ3 HAVING COUNT(*) > 1
Be careful, depending on the type of duplicated lines, you may not need to do a search using the fields that serve as timestamps (example: a DATETIME field).
Query to remove duplicates
If the previous query returned results considered to be duplicates, then it’s time to move on to the next step to prepare for the deletion. The query below is used to remove the extra rows that have the same data in the “field1” to “field3” columns (to be customized).
DELETE FROM table LEFT OUTER JOIN ( SELECT MIN(id) as id, champ1, champ2, champ3 FROM table GROUP BY champ1, champ2, champ3 ) as t1 ON table.id = t1.id WHERE t1.id IS NULL
Warning, there are several things to know about this query:
- The “id” field is to be personalized according to the name of the column of your own table.
- The query will keep the row with the smallest id because the MIN () function is used. It is possible to keep the query with the largest id by using the MAX () function instead.
- This query assumes that no field is NULL.
Solution n ° 2
DELETE t1 FROM table AS t1, table AS t2 WHERE t1.id > t2.id AND t1.column1 = t2.column1 AND t1.column2 = t2.column2 AND t1.column3 = t2.column3
You have to think about checking that everything went well after deleting all these lines. It is possible to check that there are no more duplicates. If there are still any, now is the time to start over and correct the query. If, on the other hand, too many rows have been deleted, the backup must be re-installed so as not to lose rows unnecessarily.
To note : there are other ways to remove duplicates. Feel free to use another technique if you are not comfortable with it.