Detecting Duplicate XML Data in SQL Server

Had the interesting problem today of trying to detect duplicate XML values in a SQL Server table (stored in a column of the XML data type).

I’m not the first to cover this issue; there are existing solutions out there (including the StackOverflow answer and Matt Woodward post mentioned below).

However, my “requirements”, such as they were, differed a little from the existing solutions:

  1. Only compare a subset of the XML value (a timestamp header would still differ on XML values we considered duplicates)
  2. Only apply the duplicate checking to some of the rows
  3. Keep the first row of a set of duplicates
  4. Support large XML values
  5. Develop this solution ASAP (this was needed in a critical support situation)

A single application had populated the table, so fortunately I could assume all the XML had been processed consistently.

My Solution

Now in the heat of the moment I went with a hash-based solution, as proposed on StackOverflow. On reflection, though, I don’t see a real reason (beyond perhaps performance) I couldn’t have followed Matt Woodward’s method and performed a simple string conversion and comparison (well, a comparison of substrings, given the requirement to ignore the header). Indeed, that’s the improved solution I list below.

However, having gone the hashing direction, the HASHBYTES function seemed a good place to start.

HASHBYTES (Transact-SQL)

Returns the MD2, MD4, MD5, SHA, SHA1, or SHA2 hash of its input.

Unfortunately, while this is a decent enough function for hashing, it led me on a bit of a wild goose chase: I couldn’t work out why casting my XML column to varchar seemed to produce an error. It turned out the cast itself was fine; the error came from passing the cast value straight into HASHBYTES.

Much like in a past post, I had missed something pretty crucial on the MSDN page: HASHBYTES has an 8000 character input limit!

HASHBYTES Remarks
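
A minimal repro of what I’d run into (a sketch, assuming SQL Server 2014 or earlier; the 8000 byte input limit was lifted in SQL Server 2016):

DECLARE @big varchar(max) = REPLICATE(CONVERT(varchar(max), 'x'), 9000);
SELECT LEN(@big); --// 9000, so building a long varchar(max) is not the problem
SELECT HASHBYTES('md5', @big); --// this is the call that actually throws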

At least this time the implications were glaringly obvious, and I had no one to blame but myself. In the process I at least learned a bit more about XML support in SQL Server, and found some pretty cool resources, including this series of SQL XML Workshops from Jacob Sebastian that I came across while searching StackOverflow.

It turned out that converting the XML to varchar was as easy as I’d first tried:

CONVERT(varchar(max), XmlColumn)
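
For example, a quick spot check (using the same placeholder table and column names as the queries below):

SELECT TOP 10 Id, CONVERT(varchar(max), XmlColumn) AS XmlText --// eyeball the converted XML
FROM OurTable;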

Once I got past this problem the query I ran with was pretty easy to put together.

First, I would only hash content from after the header, which I could identify by the character index of either the header’s closing tag or the content’s start tag (found using CHARINDEX). I could also have used SQL Server’s XML query functions (sketched below), but I was short on time and this seemed suitable for a throwaway task.

CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn))
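
For completeness, the XML-methods route I skipped would look something like this (a sketch only; it assumes the content sits in a <Content> element under the document root):

SELECT Id,
  CONVERT(varchar(max), XmlColumn.query('/*/Content')) AS ContentFragment --// just the content, header ignored
FROM OurTable;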

Then I could use SUBSTRING to grab the next 8000 characters (to stay within the HASHBYTES limit) from this point:

SUBSTRING(
 CONVERT(varchar(max), XmlColumn), --// Data
 CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn)), --// Start Index
 8000) --// Length

This ultimately gave me the following query, which suited our purposes:

--// WARNING: THIS CODE HAS A MAJOR ISSUE (SEE BELOW)
UPDATE OurTable 
SET IsDuplicate = 1
WHERE 
  OtherFilters = 'SomeValue' --// Other filters, so only checking certain rows
  AND Id NOT IN ( --// ID is either unique or matches the first 'duplicate'
    SELECT MIN(Id) FROM OurTable --// Pick the lowest ID for a given hash
    WHERE OtherFilters = 'SomeValue' --// Same as the OtherFilters above!
    GROUP BY --// Group by hash (i.e. duplicates grouped)
      HASHBYTES('md5', --// Hash Function 
       SUBSTRING( --// Hash Data (the substring of 8000 chars after the header)
         CONVERT(varchar(max), XmlColumn), --// Substring Data
         CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn)), --// Index
         8000))) --// Length
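
If you run something like this, a cheap sanity check beforehand is to see how many duplicate sets the hash actually finds (same hashing expression, just grouped and counted):

SELECT COUNT(*) AS GroupSize --// one row per set of duplicates
FROM OurTable
WHERE OtherFilters = 'SomeValue'
GROUP BY
  HASHBYTES('md5',
   SUBSTRING(
     CONVERT(varchar(max), XmlColumn),
     CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn)),
     8000))
HAVING COUNT(*) > 1 --// only groups with more than one row are duplicate sets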

Review

There is a major issue with this query! Can you spot it?

The problem is that it assumes the first 8000 characters of the XML (well, after the trimmed header) are enough for unique identification. While this may be a valid assumption (as it was in my case), if it’s wrong you run the risk of falsely labeling some XML a duplicate, which could be disastrous.
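
One quick way to test the assumption is to count the rows whose content actually runs past the 8000 character hash window (a sanity-check sketch):

SELECT COUNT(*) AS RowsExceedingHashWindow --// if this is 0, the truncated hash covers everything
FROM OurTable
WHERE OtherFilters = 'SomeValue'
  AND LEN(CONVERT(varchar(max), XmlColumn))
      - CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn)) >= 8000 --// more than 8000 characters after the content tag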

At the time I considered resolving this by also grouping on a hash of the next 8000 characters, repeating as necessary to increase coverage (one of my less bright ideas).
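
For the curious, that idea amounts to grouping on more than one hash window; roughly like this (a sketch only, not a recommendation):

SELECT MIN(Id) --// first row of each duplicate set
FROM OurTable
WHERE OtherFilters = 'SomeValue'
GROUP BY
  HASHBYTES('md5', SUBSTRING(CONVERT(varchar(max), XmlColumn), --// characters 1-8000 after the header
    CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn)), 8000)),
  HASHBYTES('md5', SUBSTRING(CONVERT(varchar(max), XmlColumn), --// characters 8001-16000 (empty if the XML is shorter)
    CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn)) + 8000, 8000))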

In retrospect, simple string comparison was not only sufficient but a better (if less performant) solution; the goose chase of believing that converting large XML to varchar had issues had pushed the idea out of my mind. It also has the added benefit of avoiding (admittedly unlikely) MD5 collisions.

However, if performance is a concern (or depending on the number of rows and the size of the XML values involved), hashing may still be a good fallback option, perhaps even with my terrible ‘concatenated hashes’ approach. Alternatively, I would probably look to point LINQPad at the database and write some fairly basic code to perform the duplicate detection, which would have additional advantages if the XML had been processed differently with regard to whitespace or self-closing tags.

Improved Solution

Assuming performance isn’t a serious bottleneck, something like the following works (and is what I’ve used subsequently):

UPDATE OurTable 
SET IsDuplicate = 1 
WHERE 
  OtherFilters = 'SomeValue' --// Other filters, so only checking certain rows
  AND Id NOT IN ( --// ID is either unique or matches the first 'duplicate' 
    SELECT MIN(Id) FROM OurTable --// Pick the lowest ID for a given hash 
    WHERE OtherFilters = 'SomeValue' --// Same as the OtherFilters above!
    GROUP BY --// Group by content (i.e. duplicates grouped) 
      SUBSTRING( 
        CONVERT(varchar(max), XmlColumn), --// Substring Data 
        CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn)), --// Start Index 
        LEN(CONVERT(varchar(max), XmlColumn)))) --// Length, excess is handled

Since originally publishing this I’ve done a little more research into the issues with hashing; should you go down that route, there’s a great article by Thomas Kejser on SQL hash functions and performance. For one, it makes sense to use something like SHA1 instead of the MD5 in my example, as it reduces the likelihood of collisions without a noticeable performance difference.
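
That swap is a one-liner against the hashed query above; for example (SHA2_256 is also available from SQL Server 2012 onward, though the 8000 byte input limit still applies before SQL Server 2016):

SELECT Id,
  HASHBYTES('sha1', --// drop-in replacement for the 'md5' used earlier
   SUBSTRING(
     CONVERT(varchar(max), XmlColumn),
     CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn)),
     8000)) AS ContentHash
FROM OurTable;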

I have the feeling there’s plenty of room for other optimisations and approaches here, so if anyone has run into this problem before or has ideas on what to do next time, I’m curious to hear more!