Can Your eDiscovery Software Spot Duplicate Emails?

23 February 2021 by Ross | Tags: ediscovery, deduplication, email

Takeaway: eDiscovery applications compare ‘hash’ values to spot duplicate documents, but ‘metadata matching’ is much better for catching duplicate emails. The trick is to find eDiscovery software that understands the difference.

Spotting and removing duplicate files is an essential part of eDiscovery.

Say your client’s head of HR – Jennifer – emails a policy-change document to everyone in the organization. You’ll find the same document on her computer, the computers of 50 other employees, and attached to emails sitting in inboxes, servers and backup drives. So, when you collect the data for eDiscovery, you’ll have 50+ copies of that same email to review.

That’s clearly a waste of time. But it gets worse. For example, if you mark one copy of the email as ‘privileged’ but miss some of the other copies, you’ll risk leaking sensitive information in the final production. And it’s not just wasted time. There’s wasted money, too. If that email has a 1 MB attachment, its 50+ copies will take up 50+ MB of unnecessary storage space that you’ll be paying for. And that’s just for one instance of a duplicate file.

This is where ‘duplicate detection’ comes in. eDiscovery applications with this feature will pull up all copies of that email for you to delete, move, or label, so you review it once and handle every copy consistently.

For most files, eDiscovery applications use ‘hashing’ to spot and flag duplicates. Hashing is the process of giving a file a unique identifying number – a digital ‘fingerprint’ of sorts.

All your eDiscovery files are digital. At the most basic level, they’re numbers – a series of zeroes and ones. And we can compare numbers. Hashing is a technique where your software takes in this digital data and uses an algorithm to assign it a number called a ‘hash value’. It happens lightning fast and this number is so specific to its data that it can be considered the file’s fingerprint. You can give a hash value to anything. For example, a phrase can have a hash value (‘Mary had a little lamb’ gets the hash value e946adb45d4299def2071880d30136d4). But your software can also create hash values for a whole file, a group of files or even an entire hard drive. Once a file has a hash value, you can compare it to the hashes of other files, and if the ‘fingerprints’ match, you know you’ve got a duplicate. (Learn more about hashing.)
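To make the fingerprint comparison concrete, here is a minimal sketch in Python. It assumes MD5 as the hashing algorithm (the 32-character value quoted above has the length of an MD5 digest, but the article does not say which algorithm or text encoding produced it), and the helper name `hash_bytes` is purely illustrative.

```python
import hashlib


def hash_bytes(data: bytes) -> str:
    """Return the MD5 digest of some raw bytes as a hex string.

    MD5 is an assumption here; any hashing algorithm works the same way
    for the purpose of spotting exact duplicates.
    """
    return hashlib.md5(data).hexdigest()


original = b"Mary had a little lamb"
exact_copy = b"Mary had a little lamb"
near_copy = b"Mary had a little lamb."  # one extra character

print(hash_bytes(original))                            # the 'fingerprint'
print(hash_bytes(original) == hash_bytes(exact_copy))  # True
print(hash_bytes(original) == hash_bytes(near_copy))   # False
```

Identical bytes always produce the same hash, and a single added character produces a different one. That strictness is what makes hashing good at catching exact copies.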

Unfortunately, hashing isn’t ideal for emails because even the smallest change will alter hash values. And emails change a lot as they move back-and-forth between users.

Once you’ve created the final version of a document (a PDF, perhaps?), it rarely gets altered. But if you email it to someone, the email that carries it does get altered. And all the alterations – however tiny – change the email’s hash value. For example:

  • Email ‘timestamps’ change depending on when they were sent and received. So, the email in your ‘sent’ folder and the email in the recipient’s inbox might both have the same content, but their timestamps differ by milliseconds.
  • Email headers change depending on which server they go through.
  • Email formatting changes depending on the application (official email applications vs. private ones like Gmail, Yahoo!, AOL, etc.) and device (PC, tablet, or cellphone).
  • Email attachments can change hash values too, since they are part of the data that gets hashed. So, if a recipient removes the attachment from your email (when they’re archiving it, for example), it gets a different hash value. And often, applications handle attachments differently by default. For example, cellphones sometimes extract images and icons as attachments, whereas an enterprise email system won’t.
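The effect of these small changes can be sketched with Python’s standard `email` library. The addresses, subject, and message ID below are made up for illustration, and hashing the entire raw message with MD5 is an assumption about how a naive tool might work, not a description of any particular eDiscovery product.

```python
import hashlib
from email.message import EmailMessage


def build_email(date_header: str) -> bytes:
    """Build the same hypothetical email, varying only the Date header."""
    msg = EmailMessage()
    msg["From"] = "jennifer@example.com"
    msg["To"] = "staff@example.com"
    msg["Subject"] = "Policy change"
    msg["Message-ID"] = "<policy-2021@example.com>"
    msg["Date"] = date_header
    msg.set_content("Please read the new policy.")
    return msg.as_bytes()


# Same sender, recipients, subject and body; timestamps one second apart.
sent_copy = build_email("Tue, 23 Feb 2021 09:00:00 +0000")
received_copy = build_email("Tue, 23 Feb 2021 09:00:01 +0000")

sent_hash = hashlib.md5(sent_copy).hexdigest()
received_hash = hashlib.md5(received_copy).hexdigest()

print(sent_hash == received_hash)  # False
```

A reviewer would call these two messages the same email, but a hash of the whole message treats them as unrelated.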

So, instead of trying to match hash values, matching ‘metadata’ is much more useful to detect duplicate emails.

When you create a document on your computer, the application you’re using (e.g., Microsoft Word) records a bunch of information about it. Things like who created it, when they created it, when it was last opened, etc. This ‘data about data’ (i.e., metadata) is a digital footprint (as opposed to a hash fingerprint) that tracks the document’s history. All files have metadata embedded in them, but you won’t see it unless you know where to look. For example, email metadata includes things like who created the email and when they sent it, whom it was sent to, when they received it, and whether they read it. Conveniently, you can see this metadata even if you weren’t the one sending or receiving the email. So, rather than matching digital fingerprints that change easily (i.e., hash values), we’d rather match digital footprints that don’t change (i.e., some metadata fields).

To be fair, hashing does take metadata into account. But not the right kind of metadata.

Here again, we see the difference between hashing for documents and hashing for emails. With documents, hashing leaves out important contextual information like a document’s name, where it’s stored, and the dates attached to it. In contrast, hashing includes a lot of the contextual information about emails, and this information is likely to change – even if the email’s content doesn’t.
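The document side of this contrast can be illustrated with a short sketch. It assumes a content-only MD5 hash and uses two made-up file names; the point is only that the name and location of a file are not part of the bytes being hashed.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Hash only the file's contents; its name and folder are not included."""
    return hashlib.md5(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as folder:
    # Two hypothetical files with different names but identical contents.
    first = Path(folder) / "policy_final.pdf"
    second = Path(folder) / "copy_of_policy.pdf"
    first.write_bytes(b"same policy text")
    second.write_bytes(b"same policy text")
    same_content = file_hash(first) == file_hash(second)

print(same_content)  # True
```

For a loose document, that contextual information stays outside the hash. For an email, the equivalent details (dates, routing headers) travel inside the message itself, so they end up in the hash.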

So, what metadata is ideal to detect duplicate emails? The best eDiscovery applications use the ‘message ID’ and ‘subject’ metadata fields.

These are the fields that stay the same regardless of the inevitable tiny, irrelevant email changes we looked at earlier.

  1. Message ID: This is an identifier that your email application generates for each email you send. Every copy of that email carries the same ID, no matter who receives it or when. Only when the email is modified or revised does it get a new message ID.
  2. Conversation subject: Email applications thread conversations together using the ‘subject’ line from the original email message. This ‘subject’ metadata field is a reliable indicator of whether an email is a copy from an existing conversation or part of a whole new one.
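Matching on these two fields might look something like the sketch below. It is a simplified illustration, not how GoldFynch or any other product necessarily implements it: the function names, the whitespace and case normalization of the subject, and the choice to skip emails with no message ID are all assumptions.

```python
from collections import defaultdict
from email import message_from_string
from typing import Dict, List, Optional, Tuple


def dedup_key(raw_email: str) -> Optional[Tuple[str, str]]:
    """Return (message ID, normalized subject), or None if there is no ID."""
    msg = message_from_string(raw_email)
    message_id = (msg.get("Message-ID") or "").strip()
    if not message_id:
        return None  # assumption: without an ID, never call it a duplicate
    subject = " ".join((msg.get("Subject") or "").split()).casefold()
    return message_id, subject


def group_duplicates(raw_emails: List[str]) -> List[List[int]]:
    """Group the positions of emails that share the same metadata key."""
    groups: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for index, raw in enumerate(raw_emails):
        key = dedup_key(raw)
        if key is not None:
            groups[key].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]


sent = (
    "Message-ID: <abc@example.com>\n"
    "Subject: Policy change\n"
    "Date: Tue, 23 Feb 2021 09:00:00 +0000\n\nBody"
)
received = (
    "Received: from mail.example.com\n"
    "Message-ID: <abc@example.com>\n"
    "Subject: Policy change\n"
    "Date: Tue, 23 Feb 2021 09:00:01 +0000\n\nBody"
)
other = "Message-ID: <xyz@example.com>\nSubject: Lunch\n\nBody"

print(group_duplicates([sent, received, other]))  # [[0, 1]]
```

The sent and received copies differ in their timestamps and routing headers, so their hashes would differ, but they share a message ID and subject and are grouped together.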

Reliable eDiscovery software will have to understand these intricacies of detecting duplicate emails. And that’s why we developed GoldFynch.

It’s an easy-to-use eDiscovery service that’s perfect for small- and midsize law firms and companies.

  • It costs just $27 a month for a 3 GB case: That’s significantly less than most comparable software. With GoldFynch, you know what you’re paying for exactly – its pricing is simple and readily available on the website.
  • It’s easy to budget for. GoldFynch charges only for storage (processing is free). So, choose from a range of plans (3 GB to 150+ GB) and know up front how much you’ll be paying. It takes just a few clicks to move from one plan to another, and billing is prorated – so you’ll pay only for the time you spend on any given plan. With legacy software, pricing is much less predictable.
  • It’s simple to use. Many eDiscovery applications take hours to master. GoldFynch takes minutes. It handles a lot of complex processing in the background, but what you see is minimal and intuitive. Just drag-and-drop your files into GoldFynch and you’re good to go. Plus, it’s designed, developed, and run by the same team. So you get prompt and reliable tech support.
  • It keeps you flexible. To build a defensible case, you need to be able to add and delete files freely. Many applications charge to process each file you upload, so you’ll be reluctant to let your case organically shrink and grow. And this stifles you. With GoldFynch, you get unlimited processing for free. So, on a 3 GB plan, you could add and delete 5 GB of data at no extra cost – as long as there’s only 3 GB in your case at any point. And if you do cross 3 GB, your plan upgrades automatically and you’ll be charged for only the time spent on each plan. That’s the beauty of prorated pricing.
  • Access it from anywhere. And 24/7. All your files are backed up and secure in the Cloud.

Want to learn more about GoldFynch?