What is Deduplication, and Why is it Important for eDiscovery? Learn with GoldFynch!

23 April 2024 by Ross eDiscovery deduplication

Takeaway: In eDiscovery, where vast quantities of data are analyzed for legal purposes, efficiency and accuracy are top priorities. Deduplication is a key tool that helps legal professionals work with such large volumes of information, making the data more manageable and the review more effective.

Let’s dig deeper into the uses of deduplication in the eDiscovery workflow, and get an overview of reviewing the results of a deduplication operation in GoldFynch.

For a functional guide to deduplication in GoldFynch, see: finding duplicate files in your GoldFynch case using deduplication.

What is deduplication?

Deduplication is the process of identifying and eliminating duplicate documents within a dataset. In eDiscovery, where datasets can be massive and include a lot of redundant information, deduplication plays a crucial role in streamlining the review process. By removing (or “culling”) duplicates, legal teams can focus on unique files and content, saving time and resources.

Benefits of Deduplication

Some of the ways deduplication can help in the eDiscovery process include:

  1. Cost Reduction: Deduplication reduces the volume of data that needs to be processed and reviewed, leading to significant cost savings in storage and review expenses.
  2. Time Savings: Removing duplicates from your review accelerates the review process, allowing legal teams to focus on relevant information promptly.
  3. Increased Accuracy: By eliminating duplicate documents from your review, deduplication ensures that you can focus on unique content and avoid having multiple partially-reviewed versions of documents. This helps reduce the risk of inconsistencies, contradictions, and review metadata being lost.
  4. Enhanced Review Efficiency: With a dataset that contains only unique documents, reviewers can efficiently identify key documents and make informed decisions.

Deduplication in GoldFynch

Deduplication in GoldFynch detects duplicates using either the MD5 hash value of your files or, for emails, the Message-ID. Files that are not exact matches under the selected deduplication strategy (MD5 hash based, Message-ID based, etc.) will not be detected as duplicates. An MD5 file hash serves as a digital “signature” for a file, and even the slightest change to the file’s data (visible or binary) will change the hash.
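To make the “signature” idea concrete, here is a minimal Python sketch using the standard hashlib module. It illustrates how MD5 hashes behave in general and is not GoldFynch’s own code; the sample contents and the helper name md5_hex are made up for the example.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the given bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


original = b"Quarterly report, final version.\n"
exact_copy = bytes(original)                        # byte-for-byte identical
near_copy = b"Quarterly report, final version. \n"  # one extra space

# Identical bytes always give the same hash, so these count as duplicates.
same_hash = md5_hex(original) == md5_hex(exact_copy)

# A single changed byte gives a different hash, so these do not.
different_hash = md5_hex(original) != md5_hex(near_copy)

print(same_hash, different_hash)
```

This is why two files that look the same on screen may still not be flagged: any difference in the underlying bytes, such as embedded metadata, produces a different hash.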

It’s also worth noting that deduplication is done at the root family level, since it is typically neither desirable nor permitted to exclude or remove duplicate attachments that belong to non-duplicate parent files. So GoldFynch will not mark attachment files as duplicates (even if the file hashes are the same) unless the parent files are also duplicates.
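The family-level rule can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not GoldFynch’s actual implementation: the Doc class and mark_duplicates function are hypothetical, each attachment is assumed to point directly at its root file, a file with no parent is assumed to carry its own ID as its parent ID, and root files are compared by their own hash only.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    parent_id: str  # equals doc_id for a root (parentless) file
    md5: str


def mark_duplicates(docs: list[Doc]) -> dict[str, bool]:
    """Map each doc_id to True if it would be treated as a duplicate."""
    seen_root_hashes: set[str] = set()
    is_dupe: dict[str, bool] = {}

    # Pass 1: compare root files only, keeping the first of each hash.
    for doc in docs:
        if doc.doc_id == doc.parent_id:
            is_dupe[doc.doc_id] = doc.md5 in seen_root_hashes
            seen_root_hashes.add(doc.md5)

    # Pass 2: an attachment follows its parent, whatever its own hash is.
    for doc in docs:
        if doc.doc_id != doc.parent_id:
            is_dupe[doc.doc_id] = is_dupe.get(doc.parent_id, False)

    return is_dupe


docs = [
    Doc("1", "1", "aaa"), Doc("2", "1", "fff"),  # email with an attachment
    Doc("3", "3", "aaa"), Doc("4", "3", "fff"),  # exact copy of that email
    Doc("5", "5", "bbb"), Doc("6", "5", "fff"),  # different email, same attachment
]
result = mark_duplicates(docs)
```

In this made-up data, files 3 and 4 are marked as duplicates because their parent duplicates file 1. File 6 shares its hash with attachments 2 and 4 but is not marked, because its parent (file 5) is unique.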

To summarize: if you run a deduplication session using the hash-based strategy and there are no detected duplicates, then the duplicate-looking files are either attachments to non-duplicate parent files or the files aren’t, in fact, exact duplicates.

More information on why attachments are not detected in a deduplication session can be found here.

Reviewing GoldFynch’s Deduplication Results

Accessing deduplication reports is fast and simple with GoldFynch. Here’s how:

  1. Access the Deduplication Report: In GoldFynch, navigate to the “deduplication” screen to access the deduplication report for your case. More information on generating this report (whether the session is already applied or waiting to be applied) can be found here.
  2. Review the Duplicate Report: The report lists the duplicate items identified, each containing documents that are exact matches of one another. Review the report to gain an understanding of the duplicate documents.

Components of the GoldFynch Deduplication Results

The GoldFynch duplicate report contains the following information:

  • APP Link - This is a direct link to the document in your GoldFynch case (only accessible if you are logged into an account that has access to your case)
  • APP ID - GoldFynch’s internal ID which is used to track each individual file that is uploaded
  • APP Parent ID - This is the ID of the Parent document. If there is no parent then it is the same as the APP ID
  • Keep? - When the value is TRUE it indicates that the file is primary, and FALSE indicates that the file is a duplicate
  • File Name - File name of the document
  • Pathname - Path of the document in GoldFynch
  • Tags - All tags attached to the document will be listed

If the files are emails, the following fields will be populated with the available metadata:

  • Subject
  • From
  • To
  • Cc
  • Bcc
  • Sent
  • Message ID

Note: If the source does not have metadata, these fields will be blank, even if the files are emails.
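If you have the report as a CSV file, a short script can separate primary files from duplicates using the Keep? field. The sketch below is only an illustration under assumptions: this article does not specify the report’s file format, so the CSV layout, the exact column headers, and the sample rows are all hypothetical, as is the helper name split_by_keep.

```python
import csv
import io

# Made-up sample rows; a real report would be read from a file instead.
SAMPLE = """APP ID,APP Parent ID,Keep?,File Name
101,101,TRUE,contract.pdf
102,102,FALSE,contract (copy).pdf
103,103,TRUE,memo.docx
"""


def split_by_keep(report_file):
    """Return (primary_rows, duplicate_rows) based on the Keep? column."""
    primaries, duplicates = [], []
    for row in csv.DictReader(report_file):
        if row["Keep?"].strip().upper() == "TRUE":
            primaries.append(row)
        else:
            duplicates.append(row)
    return primaries, duplicates


primaries, duplicates = split_by_keep(io.StringIO(SAMPLE))
print(len(primaries), "primary,", len(duplicates), "duplicate")
```

To run this against a real file, you would pass an open file object (for example, open(path, newline="")) in place of the in-memory sample, and adjust the column names to match your report.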

What next?

Once the deduplication session is applied, the system does not automatically delete the duplicate items. Instead, it marks them with the system tag DUPE.

As part of the typical workflow, once these files have been marked with the system tag, we suggest you create a review set for your case, which will automatically exclude any system-marked duplicates. Other benefits of conducting your review using review sets are detailed here.

Deduplication is a powerful tool in your eDiscovery arsenal, and it can help significantly in reducing costs, saving time, increasing accuracy, and enhancing review efficiency. Once you’re able to leverage it effectively, it’ll be a big step towards navigating the complexities of culling data and managing documents for your eDiscovery cases, letting you focus on pertinent information so you can build a stronger case.

If you want software that makes deduplication a breeze, try GoldFynch for free! Its deduplication feature is designed to be thorough, accurate, and intuitive, so legal professionals can work with confidence on their investigations and litigation endeavors.
