Understanding and Managing Different Types of Duplicates in eDiscovery

21 February 2025 by ROSS eDiscovery deduplication

Takeaway: Deduplication is a game-changer in eDiscovery. It’s a huge time-saver, and removing duplicate files ensures you’re not repeatedly reviewing the same thing. But what is deduplication exactly? And what kinds of duplicates pop up during eDiscovery? Let’s break it down. We’ll walk through the types of duplicates you might encounter, how to spot them, what makes them different, and how to tackle deduplication so you can work smarter—not harder.

Before we look at the different kinds of duplicates, let’s understand what deduplication is.

Deduplication is the process of finding and removing duplicate files in your dataset. In eDiscovery, it helps reduce the number of documents you need to review, saving a ton of time and money. Plus, it keeps you from reviewing the same thing twice.

However, not all duplicates are the same. Understanding the different types of duplicates is key to choosing the right approach.

1. Exact duplicates (file-level duplicates)

These are the easiest to understand—they’re identical files: duplicate content, metadata, everything. You might end up with these when people copy, forward, or download the same document multiple times.

  • How to spot them: Exact duplicates have the same digital fingerprint, called a hash value, which is computed from the file's contents with an algorithm such as MD5 or SHA-1. If two files produce the same hash, they’re treated as exact duplicates.
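To make the idea of a digital fingerprint concrete, here is a minimal Python sketch that hashes file contents and groups the names that share a hash. The file names, the sample bytes, and the helper names `fingerprint` and `find_exact_duplicates` are illustrative assumptions, not part of any particular eDiscovery platform.

```python
import hashlib
from collections import defaultdict


def fingerprint(data: bytes, algorithm: str = "sha1") -> str:
    """Return the hex digest of the raw bytes using the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def find_exact_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents hash to the same value."""
    by_hash = defaultdict(list)
    for name, data in files.items():
        by_hash[fingerprint(data)].append(name)
    # Only groups with two or more members are duplicates.
    return [names for names in by_hash.values() if len(names) > 1]


# Hypothetical sample data: the same document downloaded twice, plus a revision.
sample = {
    "contract.docx": b"Final contract text",
    "contract (1).docx": b"Final contract text",
    "contract_v2.docx": b"Final contract text, revised",
}
duplicates = find_exact_duplicates(sample)
print(duplicates)
```

Note that the hash is computed from the bytes alone, so two copies with different file names still match, while a single changed character produces a different hash.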

2. Near duplicates (document-level duplicates)

These files are similar but not identical. Someone may have made a small edit, added a comment, or saved the document in a different format.

  • How to spot them: Near duplicates are found using content comparison tools that give you a similarity score (e.g., 90% match).
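One common way to produce a similarity score is to compare overlapping word sequences ("shingles") between two texts. The sketch below uses Jaccard similarity over three-word shingles. This is an assumed technique chosen for illustration; commercial content comparison tools may score similarity quite differently.

```python
def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    set_a, set_b = shingles(a), shingles(b)
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical example: one word inserted into an otherwise identical sentence.
original = "The parties agree to the terms set out in this agreement"
edited = "The parties agree to the revised terms set out in this agreement"
score = similarity(original, edited)
print(f"{score:.0%} match")
```

Short texts are penalized heavily by a single edit under this measure, so a threshold that works for long contracts may be too strict for one-line emails.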

3. Email duplicates

Email is notorious for creating duplicates. When you forward a message, reply to it, or CC a group, suddenly, a dozen versions of the same email are floating around.

  • How to spot them: Look at metadata like message IDs, subjects, and timestamps. Advanced tools also use email threading to group related emails.
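As a rough illustration of metadata-based matching, the sketch below uses Python's standard `email` module to read the Message-ID header and keep only the first copy of each message. The sample messages and the function name `dedupe_by_message_id` are hypothetical, and real collections usually need extra checks (for example, on the body and attachments) before anything is suppressed.

```python
from email import message_from_string


def dedupe_by_message_id(raw_emails: list[str]) -> list[str]:
    """Keep the first copy of each email, keyed by its Message-ID header."""
    seen = set()
    unique = []
    for raw in raw_emails:
        message = message_from_string(raw)
        message_id = (message.get("Message-ID") or "").strip()
        if message_id and message_id in seen:
            continue  # redundant copy of a message already kept
        if message_id:
            seen.add(message_id)
        # Emails with no Message-ID cannot be matched this way, so keep them.
        unique.append(raw)
    return unique


# Hypothetical copies of one email collected from two mailboxes, plus a reply.
alice_copy = ("Message-ID: <abc123@example.com>\nReceived: by mx1.example.com\n"
              "Subject: Q3 report\n\nSee attached.")
bob_copy = ("Message-ID: <abc123@example.com>\nReceived: by mx2.example.com\n"
            "Subject: Q3 report\n\nSee attached.")
reply = "Message-ID: <def456@example.com>\nSubject: RE: Q3 report\n\nThanks."

unique_emails = dedupe_by_message_id([alice_copy, bob_copy, reply])
print(len(unique_emails))
```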

4. Embedded duplicates (attachments and inline files)

Sometimes duplicates come from attachments or embedded files—like when the same document is attached to multiple emails or included in different reports.

  • How to spot them: These are identified by checking parent-child relationships in your eDiscovery platform.

Now that we know the different types of duplicates, let’s see how we can deduplicate each of them.

1. Hash-based deduplication (exact duplicates)

  • Run a hash analysis (e.g., MD5 or SHA-1).
  • Choose global deduplication (across custodians) or custodian-level deduplication (within individuals).
  • Check that the hash matches before removing anything.

Most eDiscovery software platforms have built-in functionality for hash-based deduplication. Learn how you can use GoldFynch to find exact duplicates.
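The difference between global and custodian-level deduplication comes down to what counts as "already seen." The sketch below shows one way to express that choice. The document dictionaries with `custodian`, `name`, and `data` keys are an assumed structure for illustration, not a platform format.

```python
import hashlib


def deduplicate(documents, scope="global"):
    """Return (kept, suppressed) lists of documents.

    documents: list of dicts with hypothetical keys "custodian", "name", "data".
    scope: "global" compares across all custodians; "custodian" compares
    only within each custodian's own files.
    """
    if scope not in ("global", "custodian"):
        raise ValueError("scope must be 'global' or 'custodian'")
    seen = set()
    kept, suppressed = [], []
    for doc in documents:
        digest = hashlib.sha1(doc["data"]).hexdigest()
        # Under custodian scope, the same hash under another custodian is new.
        key = digest if scope == "global" else (doc["custodian"], digest)
        if key in seen:
            suppressed.append(doc)  # set aside, not deleted
        else:
            seen.add(key)
            kept.append(doc)
    return kept, suppressed


docs = [
    {"custodian": "alice", "name": "memo.pdf", "data": b"memo text"},
    {"custodian": "bob", "name": "memo.pdf", "data": b"memo text"},
    {"custodian": "alice", "name": "memo copy.pdf", "data": b"memo text"},
    {"custodian": "bob", "name": "notes.txt", "data": b"meeting notes"},
]
global_kept, global_suppressed = deduplicate(docs, scope="global")
custodian_kept, custodian_suppressed = deduplicate(docs, scope="custodian")
print(len(global_kept), len(custodian_kept))
```

Suppressed documents are returned rather than discarded, so the record of which custodians held a copy remains available.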

2. Near duplicate detection

  • Use content comparison tools.
  • Set a similarity threshold (e.g., 90%) to group similar docs.
  • Review the group together and focus on the most relevant version.
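The thresholding step can be sketched with Python's standard `difflib` module. This is a simple greedy grouping under assumed inputs: each document is compared with the first member of every existing group and joins the first group that meets the threshold. Production tools typically use more scalable methods, so treat this as an illustration of the idea only.

```python
from difflib import SequenceMatcher


def group_near_duplicates(docs: dict[str, str],
                          threshold: float = 0.9) -> list[list[str]]:
    """Greedily group documents by similarity to each group's first member."""
    groups: list[list[str]] = []
    for name, text in docs.items():
        for group in groups:
            pivot_text = docs[group[0]]
            # ratio() returns a similarity between 0.0 and 1.0.
            if SequenceMatcher(None, pivot_text, text).ratio() >= threshold:
                group.append(name)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([name])
    return groups


# Hypothetical documents: two drafts that differ by one word, and an unrelated file.
docs = {
    "draft_v1": "The parties agree to the terms set out in this agreement.",
    "draft_v2": "The parties agree to the terms set forth in this agreement.",
    "lunch_menu": "Lunch menu for Friday: soup and sandwiches.",
}
groups = group_near_duplicates(docs, threshold=0.9)
print(groups)
```

Lowering the threshold widens each group, which reduces review volume but raises the chance that a meaningful edit is overlooked.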

Note: Soon, you will be able to detect near-duplicates using GoldFynch.

3. Email threading and metadata deduplication

  • Use email threading tools to link related emails.
  • Look for similarities in message ID, email body hash, and metadata.
  • Suppress redundant versions, but keep unique attachments.
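A simplified view of email threading: replies usually carry In-Reply-To and References headers that point back to earlier messages, so emails can be grouped by the oldest message they reference. The sketch below assumes those headers are present and well formed, which is not always the case in real collections. Dedicated threading tools handle missing headers and other edge cases.

```python
from collections import defaultdict
from email import message_from_string


def thread_root(message) -> str:
    """Return the Message-ID of the thread's first email, as far as headers show."""
    references = (message.get("References") or "").split()
    if references:
        return references[0]  # by convention, the oldest ancestor is listed first
    in_reply_to = (message.get("In-Reply-To") or "").strip()
    return in_reply_to or (message.get("Message-ID") or "").strip()


def group_threads(raw_emails: list[str]) -> dict[str, list[str]]:
    """Map each thread root to the subjects of the emails in that thread."""
    threads = defaultdict(list)
    for raw in raw_emails:
        message = message_from_string(raw)
        threads[thread_root(message)].append(message.get("Subject", ""))
    return dict(threads)


# Hypothetical messages: a three-email thread and one unrelated email.
raw_emails = [
    "Message-ID: <a1@example.com>\nSubject: Budget\n\nDraft attached.",
    "Message-ID: <b2@example.com>\nIn-Reply-To: <a1@example.com>\n"
    "References: <a1@example.com>\nSubject: RE: Budget\n\nLooks good.",
    "Message-ID: <c3@example.com>\nIn-Reply-To: <b2@example.com>\n"
    "References: <a1@example.com> <b2@example.com>\nSubject: RE: Budget\n\nAgreed.",
    "Message-ID: <z9@example.com>\nSubject: Lunch\n\nNoon?",
]
threads = group_threads(raw_emails)
print(threads)
```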

Find out how you can use GoldFynch’s deduplication system to find email duplicates.

4. Family and attachment deduplication

  • Keep parent-child relationships intact.
  • Deduplicate identical attachments, even if they show up in different emails.
  • Use your eDiscovery platform’s visual cues to track duplicates.
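One way to deduplicate attachments without breaking families is to index each attachment by its hash while recording every parent email that carries it. The `Email` and `Attachment` classes below are hypothetical structures for illustration. An eDiscovery platform tracks these relationships in its own way.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Attachment:
    name: str
    data: bytes


@dataclass
class Email:
    subject: str
    attachments: list[Attachment] = field(default_factory=list)


def index_attachments(emails: list[Email]) -> dict[str, list[tuple[str, str]]]:
    """Map each attachment hash to the (parent subject, attachment name) pairs carrying it."""
    index: dict[str, list[tuple[str, str]]] = {}
    for email in emails:
        for attachment in email.attachments:
            digest = hashlib.sha1(attachment.data).hexdigest()
            index.setdefault(digest, []).append((email.subject, attachment.name))
    return index


# Hypothetical family data: the same report attached to two different emails.
report = b"Q3 report contents"
emails = [
    Email("Q3 report", [Attachment("report.pdf", report)]),
    Email("FW: Q3 report", [Attachment("report.pdf", report),
                            Attachment("notes.txt", b"cover notes")]),
]
index = index_attachments(emails)
shared = [parents for parents in index.values() if len(parents) > 1]
print(shared)
```

Each unique attachment needs to be reviewed only once, while the list of parents shows every email it traveled with.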

Based on what we have covered, let’s summarize the key differences between the duplicate types.

| Duplicate type | Common cause | Identification process | Deduplication approach |
| --- | --- | --- | --- |
| Exact duplicates | File copying or downloading | Hash value | Hash-based deduplication |
| Near duplicates | Minor edits or format changes | Similarity score | Content comparison, group review |
| Email duplicates | Forwarding, replies, or CCs | Metadata or Message-ID | Email threading or metadata analysis |
| Embedded duplicates | Attachments, embedded files | Parent-child link | Family deduplication |

Now that we know about the different kinds of duplicates and how to handle them, let’s examine some of the best practices for successful deduplication.

  1. Start early: Plan your deduplication approach at the outset to avoid getting buried in data later.
  2. Keep a safety net: Always save at least one copy of each document to maintain a defensible process.
  3. Choose the proper scope: Decide if you need global deduplication across all custodians or just within each custodian’s data.
  4. Use the right tools: Pick an eDiscovery platform with strong deduplication, email threading, and near duplicate detection.
  5. Train your team: Make sure your team understands the different types of duplicates and how to handle them.

Deduplication is one of those behind-the-scenes tasks that can make a huge difference in your eDiscovery project.

Knowing what deduplication is, understanding the different types of duplicates in eDiscovery, and knowing how to handle them can save your legal team time, effort, and money. By implementing effective deduplication strategies tailored to exact duplicates, near duplicates, emails, and embedded files, you can streamline your review process while ensuring the integrity and completeness of your data. When done right, deduplication turns data chaos into organized clarity, empowering your team to focus on what truly matters—building a strong legal case.

Looking for eDiscovery software that makes deduplication a breeze? Try GoldFynch

GoldFynch is crafted to be comprehensive, precise, and user-friendly, so legal professionals can conduct their investigations with confidence. It’s an easy-to-use eDiscovery service that’s perfect for small and midsize law firms and companies. You can sign up for a free trial in seconds, with no credit card required.

  • It costs just $27 a month for a 3 GB case: That is significantly less than most comparable software. With GoldFynch, you know what you’re paying for exactly – its pricing is simple and readily available on the website.
  • It’s easy to budget for. GoldFynch charges only for storage (processing is free). So, choose from a range of plans (3 GB to 150+ GB) and know upfront how much you’ll be paying. It takes just a few clicks to move from one plan to another, and billing is prorated – so you’ll pay only for the time you spend on any given plan. With legacy software, pricing is much less predictable.
  • It’s simple to use. Many eDiscovery applications take hours to master. GoldFynch takes minutes. It handles a lot of complex processing in the background, but what you see is minimal and intuitive. Just drag-and-drop your files into GoldFynch and you’re good to go. Plus, it’s designed, developed, and run by the same team. So you get prompt and reliable tech support.
  • It keeps you flexible. To build a defensible case, you need to be able to add and delete files freely. Many applications charge to process each file you upload, so you’ll be reluctant to let your case organically shrink and grow. And this stifles you. With GoldFynch, you get unlimited processing for free. So, on a 3 GB plan, you could add and delete 5 GB of data at no extra cost – as long as there’s only 3 GB in your case at any point. And if you do cross 3 GB, your plan upgrades automatically and you’ll be charged for only the time spent on each plan. That’s the beauty of prorated pricing.
  • Access it from anywhere. And 24/7. All your files are backed up and secure in the Cloud.
