Had the interesting problem today of trying to detect duplicate XML values in a SQL Server table (stored in a column of type `xml`).
I’m not the first to cover this issue:
- Matt Woodward has covered the simplest case of duplicate checks
- There’s a StackOverflow question covering it
However, my “requirements”, such as they were, differed a little from the existing solutions:
- Only compare a subset of the XML value (a timestamp header would still differ on XML values we considered duplicates)
- Only apply the duplicate checking to some of the rows
- Keep the first row of a set of duplicates
- Support large XML values
- Develop this solution ASAP (this was needed in a critical support situation)
A single application had populated the table, so fortunately I could assume all XML was equally processed.
Now, in the heat of the moment I went with a hashed solution, as StackOverflow proposed. On reflection, however, I don’t see a real reason (beyond perhaps performance) I couldn’t have followed Matt Woodward’s method and performed a simple string conversion and comparison (well, a comparison of substrings, given the requirement to ignore the header). Indeed, that’s the improved solution I list below.
However, given the hashing direction, the `HASHBYTES` function seemed a good start:

> Returns the MD2, MD4, MD5, SHA, SHA1, or SHA2 hash of its input.
Unfortunately, while this is a decent enough function for hashing, it led me on a bit of a wild goose chase as I tried to work out why I couldn’t seem to cast my XML field to `varchar` without getting an error. It turned out the cast itself was fine; the problem was that I was passing the cast value straight to `HASHBYTES`, which was in turn throwing the error.
At least this time the implications were glaringly obvious, and I had no one to blame but myself. In the process I at least learned a bit more about XML support in SQL Server, and found some pretty cool resources, including a series of SQL XML Workshops from Jacob Sebastian that I found while searching on StackOverflow.
It turned out that conversion of the XML was as easy as I’d first tried:
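The conversion itself is just a straight `CONVERT` to `varchar(max)` (reconstructed here from the fragments used in the queries that follow):

```sql
--// Casting the xml column to varchar(max) works fine on its own --
--// it was only feeding the result straight to HASHBYTES that errored.
SELECT CONVERT(varchar(max), XmlColumn)
FROM OurTable;
```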
Once I got past this problem the query I ran with was pretty easy to put together.
First, I would only hash content from after the header, which I could identify by the character index of either the header closing tag or the content start tag (found using `CHARINDEX`). I could also have used SQL Server’s XML query functions, but I was short on time and this seemed suitable for a throwaway task.
```sql
CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn))
```
Then I could use `SUBSTRING` to grab the next 8000 characters from this point (to stay within the 8000-byte input limit `HASHBYTES` has on older SQL Server versions):
```sql
SUBSTRING(
    CONVERT(varchar(max), XmlColumn),                          --// Data
    CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn)),  --// Start Index
    8000)                                                      --// Length
```
This ultimately gave me the following query, which suited our purposes:
```sql
--// WARNING: THIS CODE HAS A MAJOR ISSUE (SEE BELOW)
UPDATE OurTable
SET IsDuplicate = 1
WHERE OtherFilters = 'SomeValue'      --// Other filters, so only checking certain rows
AND Id NOT IN (                       --// ID is either unique or matches the first 'duplicate'
    SELECT MIN(Id)                    --// Pick the lowest ID for a given hash
    FROM OurTable
    WHERE OtherFilters = 'SomeValue'  --// Same as the OtherFilters above!
    GROUP BY                          --// Group by hash (i.e. duplicates grouped)
        HASHBYTES('md5',              --// Hash Function
            SUBSTRING(                --// Hash Data (the substring of 8000 chars after the header)
                CONVERT(varchar(max), XmlColumn),                           --// Substring Data
                CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn)),   --// Index
                8000)))                                                     --// Length
```
There is a major issue with this query! Can you spot it?
The problem is that it assumes the first 8000 characters of the XML (well, after the trimmed header) are enough for unique identification. While this may be a valid assumption (as it was in my case), if it’s wrong you run the risk of falsely labeling some XML a duplicate, which could be disastrous.
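One way to validate that assumption up front (a hypothetical check, not part of the original fix) is to compare the number of distinct full values against the number of distinct truncated hashes; if the counts differ, 8000 characters isn’t enough:

```sql
--// If FullValues > TruncatedHashes, the first 8000 characters after
--// the header are NOT enough to tell the XML values apart.
SELECT
    COUNT(DISTINCT SUBSTRING(
        CONVERT(varchar(max), XmlColumn),
        CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn)),
        LEN(CONVERT(varchar(max), XmlColumn)))) AS FullValues,
    COUNT(DISTINCT HASHBYTES('md5', SUBSTRING(
        CONVERT(varchar(max), XmlColumn),
        CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn)),
        8000))) AS TruncatedHashes
FROM OurTable
WHERE OtherFilters = 'SomeValue';
```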
At the time I considered resolving this by appending the hashes of the next 8000 characters, possibly repeating if necessary to increase the character coverage (one of my less bright ideas).
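For what it’s worth, that ‘concatenated hashes’ idea would have looked something like the following (a sketch only; as noted, not an approach I’d recommend):

```sql
--// Concatenate the hashes of two successive 8000-character windows
--// (varbinary values concatenate with +); extend with further windows
--// to cover longer values.
HASHBYTES('md5', SUBSTRING(
    CONVERT(varchar(max), XmlColumn),
    CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn)),
    8000))
+ HASHBYTES('md5', SUBSTRING(
    CONVERT(varchar(max), XmlColumn),
    CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn)) + 8000,
    8000))
```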
In retrospect, simple string comparison was not only sufficient but a better (if less performant) solution; the goose chase of believing that conversion of large values to `varchar` had issues had put the idea out of my mind. It also has the added benefit of avoiding (admittedly unlikely) MD5 collisions.
However, if performance is a concern (or depending on the number of rows and the size of the XML values involved), hashing may still be a good fallback option (perhaps even with my terrible ‘concatenated hashes’ approach). Alternatively, I would probably look to point LINQPad at the database and write some fairly basic code to perform the duplicate detection, which would have additional advantages if the XML had been processed in different ways with regard to whitespace and self-closing tags.
Assuming performance isn’t a serious bottleneck, something like the following works (and is what I’ve used subsequently):
```sql
UPDATE OurTable
SET IsDuplicate = 1
WHERE OtherFilters = 'SomeValue'      --// Other filters, so only checking certain rows
AND Id NOT IN (                       --// ID is either unique or matches the first 'duplicate'
    SELECT MIN(Id)                    --// Pick the lowest ID for a given hash
    FROM OurTable
    WHERE OtherFilters = 'SomeValue'  --// Same as the OtherFilters above!
    GROUP BY                          --// Group by content (i.e. duplicates grouped)
        SUBSTRING(
            CONVERT(varchar(max), XmlColumn),                           --// Substring Data
            CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn)),   --// Start Index
            LEN(CONVERT(varchar(max), XmlColumn))))                     --// Length, excess is handled
```
Since originally publishing this I did a little more research into the issues with hashing, and should you go down this route there’s a great article by Thomas Kejser on SQL hash functions and performance. For one, it makes sense to use something like `SHA1` instead of the `MD5` in my example, as it reduces the likelihood of collision issues without a noticeable performance difference.
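The swap is just a change to the algorithm argument of `HASHBYTES`:

```sql
--// The same hash expression as in the query above, with SHA1 in place of MD5
HASHBYTES('SHA1',
    SUBSTRING(
        CONVERT(varchar(max), XmlColumn),
        CHARINDEX('<Content>', CONVERT(varchar(max), XmlColumn)),
        8000))
```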
I have the feeling there’s much room for other optimisations and approaches here, so if anyone has run into this problem before, or has ideas on what to do next time, I’m curious to hear more!