Skip to content

Method to remove duplicates in an SQL database

sql user duplicate delete - Method to remove duplicates in an SQL database

Duplicates in a database take up unnecessary space and can cause errors on a website or application. In a previous article, SQL queries to find duplicates were presented. This article presents a query to remove duplicates while keeping one row.

WARNING : before starting to execute the delete request, you must remember to make a backup (cf. a backup) of the table concerned. In the worst case, it will be possible to re-install it.

Remove duplicates from a table

Reminder: query to find duplicates

Before you want to remove any duplicates, you must first check if there are any. The following query is to be used by replacing fields 1 to 3 with the fields of your choice which must be used as a repository to find the duplicate rows.

SELECT   COUNT(*) AS nbr_doublon, champ1, champ2, champ3
FROM     table
GROUP BY champ1, champ2, champ3
HAVING   COUNT(*) > 1

Be careful, depending on the type of duplicated lines, you may not need to do a search using the fields that serve as timestamps (example: a DATETIME field).

Query to remove duplicates

If the previous query returned results considered to be duplicates, then it’s time to move on to the next step to prepare for the deletion. The query below is used to remove the extra rows that have the same data in the “field1” to “field3” columns (to be customized).

DELETE FROM table
LEFT OUTER JOIN (
        SELECT MIN(id) as id, champ1, champ2, champ3
        FROM table
        GROUP BY champ1, champ2, champ3
    ) as t1 
    ON table.id = t1.id
WHERE t1.id IS NULL

Warning, there are several things to know about this query:

  • The “id” field is to be personalized according to the name of the column of your own table.
  • The query will keep the row with the smallest id because the MIN () function is used. It is possible to keep the query with the largest id by using the MAX () function instead.
  • This query assumes that no field is NULL.

Solution n ° 2

DELETE t1 
FROM table AS t1, table AS t2
WHERE t1.id > t2.id
AND t1.column1 = t2.column1
AND t1.column2 = t2.column2
AND t1.column3 = t2.column3

Check

You have to think about checking that everything went well after deleting all these lines. It is possible to check that there are no more duplicates. If there are still any, now is the time to start over and correct the query. If, on the other hand, too many rows have been deleted, the backup must be re-installed so as not to lose rows unnecessarily.

To note : there are other ways to remove duplicates. Feel free to use another technique if you are not comfortable with it.

window.___gcfg = {lang: 'fr'}; (function() { var po = document.createElement('script'); po.type="text/javascript"; po.async = true; po.src="https://apis.google.com/js/plusone.js"; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(po, s); })();

var _gaq = _gaq || []; _gaq.push(['_setAccount', 'UA-1599467-30']); _gaq.push(['_trackPageview']); _gaq.push(['_trackPageLoadTime']); (function() { var ga = document.createElement('script'); ga.type="text/javascript"; ga.async = true; ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); })();