What Is 'Hashing' in eDiscovery? And How Can It Help Cut Costs?

20 December 2020 by Ross

Takeaway: Hashing is the process of giving a file a unique identification number. This ‘hash value’ is a digital fingerprint your eDiscovery software uses to compare files and spot duplicates. And since every case has a large number of duplicates, hashing can help you cull files and save money on storage space.

eDiscovery storage is surprisingly affordable, but you’ll still want to cull data at some point.

The best eDiscovery applications don’t store data on your computer. Instead, they upload it to the Cloud – a global network of computer servers owned by software giants like Amazon and Google. Cloud computing is everywhere: if you use Dropbox, Google Drive, or any Apple product, you’re already in the Cloud. And the more people use the Cloud, the lower the cost for each of them. So, you can now get storage for as little as $1 per GB. But too many unnecessary files will clutter your workflow, so at some point, you’ll need to start culling them.

Duplicate files are the low-hanging fruit you’ll start with when culling. They waste time, complicate your review, and increase costs.

Say your client’s head of HR – Donna – emails a policy-change document to everyone in the organization. You’ll find the same document on her computer, the computers of other employees, and attached to emails sitting in inboxes, servers and backup drives. This is a time-waster (you could end up reviewing 20 copies of the same document), but it also complicates your review. For example, you might mark one copy as ‘privileged’ but miss another copy. Or you might mistakenly give the copies different tags – which becomes a problem when you ‘produce’ these files. By some estimates, up to 40% of your case files will be duplicates. So it’s worth getting rid of them at some point. And this is where hashing comes in.

Hashing is a smart technique your eDiscovery software uses to spot duplicate files. It’s the process of giving a file a unique identifying number – a ‘digital fingerprint’ of sorts.

All your eDiscovery data is digital. At the most basic level, it’s just numbers – a series of zeroes and ones. And numbers can be compared. Hashing is a technique where your software takes in this digital data and uses an algorithm to assign it a number called a ‘hash value’. It happens lightning fast and this number is so specific to its data that it can be considered the file’s fingerprint. You can give a hash value to anything. For example, a phrase can have a hash value (‘Mary had a little lamb’ gets the MD5 hash value e946adb45d4299def2071880d30136d4). But your software can also create hash values for a whole file, a group of files or even an entire hard drive.
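To make this concrete, here is a minimal sketch of hashing a phrase and a file using Python’s standard `hashlib` module. The function names `md5_of_text` and `md5_of_file` are made up for this illustration and are not part of any eDiscovery product. Note that the exact hash depends on the exact bytes being hashed (encoding, punctuation, a trailing newline), so the value printed may not match one computed from slightly different input.

```python
import hashlib


def md5_of_text(text: str) -> str:
    """Return the MD5 hash of a string as 32 hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MB chunks so large files never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


phrase_hash = md5_of_text("Mary had a little lamb")
print(phrase_hash)
```

The same `md5_of_file` helper would work on a disk image as well as a single document, since it only ever reads raw bytes.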

Once a file has a hash value, you can compare it to the hash values of other files. And if these two ‘fingerprints’ match, you know you’ve got the same file.
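As a rough sketch of how that comparison might work, the Python snippet below hashes every file under a folder and groups the files whose hash values match. The names `file_hash` and `find_duplicates` are hypothetical, and this is only an illustration of the idea, not how any particular eDiscovery application is implemented.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_hash(path: Path) -> str:
    """MD5 fingerprint of a file's contents (reads the whole file at once)."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under `root` by hash and keep groups with more than one file."""
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_hash(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_duplicates(Path("case_files"))` on a hypothetical `case_files` folder would return one list per set of identical files, regardless of their file names or locations.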

What is an MD5 hash value though? And how is MD5 hashing different from SHA-1? Well, there are different types of hash algorithms (MD5, SHA-1, SHA-256, etc.), but they all do the same thing. They give you a file fingerprint you can use to compare to other files. An MD5 hash, for example, is a 128-bit number, usually written as 32 hexadecimal characters. That allows for roughly 3.4 × 10^38 possible values, which is far more than mere trillions! So, hash values obviously get very specific. If you delete a single comma in a document, its hash value changes. And not by just a bit – it ends up looking completely different. An important point about hashes, though, is that the conversion is one-way. So, you can give a file a hash value but you can’t convert that hash value back into a file. You can only compare it to other hash values.
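Here is a small Python sketch of both points: the algorithms differ mainly in the length of the fingerprint they produce, and removing one comma changes the fingerprint entirely. The sample sentence and the helper name `hex_digest` are invented for this example.

```python
import hashlib


def hex_digest(algorithm: str, data: bytes) -> str:
    """Hash `data` with the named algorithm and return the hex string."""
    return hashlib.new(algorithm, data).hexdigest()


original = b"Dear team, the policy has changed."
edited = b"Dear team the policy has changed."  # one comma removed

for algorithm in ("md5", "sha1", "sha256"):
    before = hex_digest(algorithm, original)
    after = hex_digest(algorithm, edited)
    print(f"{algorithm}: {len(before)} hex characters, match={before == after}")
    print("  before:", before)
    print("  after: ", after)
```

Running it shows 32, 40 and 64 hexadecimal characters for MD5, SHA-1 and SHA-256 respectively, with no match between the original and edited text for any of them.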

Hashing is great for ‘deduplication’ – i.e., the eDiscovery process of spotting duplicate files in your case. But make sure your software offers custom deduplication options.

Here are the two most useful options:

1. Choosing what to deduplicate

The best eDiscovery software will offer:

Whole-Case deduplication: All duplicate files in the case are found
Whole-Case vs. Folder deduplication: Compares a single folder against the entire case (i.e. “do any of the files in this folder exist in the case”)
Folder A vs. Folder B deduplication: Compares one folder against another (i.e. “Are there any duplicates in Folder A for each item in Folder B”)
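The three scopes above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each folder is modelled as a dictionary of file name to file contents, `rest_of_case` means the case files outside the folder being checked, and all function names are hypothetical rather than taken from any real product.

```python
import hashlib
from collections import defaultdict


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def whole_case(case: dict[str, bytes]) -> list[list[str]]:
    """Whole-case: every group of identical files anywhere in the case."""
    groups = defaultdict(list)
    for name, data in case.items():
        groups[md5_hex(data)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def folder_vs_case(folder: dict[str, bytes], rest_of_case: dict[str, bytes]) -> list[str]:
    """Folder vs. case: which files in this folder already exist in the case?"""
    known = {md5_hex(data) for data in rest_of_case.values()}
    return sorted(name for name, data in folder.items() if md5_hex(data) in known)


def folder_vs_folder(folder_a: dict[str, bytes], folder_b: dict[str, bytes]) -> dict[str, list[str]]:
    """Folder A vs. Folder B: for each item in B, list its duplicates in A."""
    by_hash = defaultdict(list)
    for name, data in folder_a.items():
        by_hash[md5_hex(data)].append(name)
    return {
        name: sorted(by_hash[md5_hex(data)])
        for name, data in folder_b.items()
        if md5_hex(data) in by_hash
    }


# Made-up sample data: one document appears in both folders.
folder_a = {"a/policy.docx": b"policy", "a/memo.txt": b"memo"}
folder_b = {"b/policy-copy.docx": b"policy", "b/notes.txt": b"notes"}
case = {**folder_a, **folder_b}

print(whole_case(case))
print(folder_vs_case(folder_b, folder_a))
print(folder_vs_folder(folder_a, folder_b))
```

In every scope the comparison itself is the same hash lookup. Only the set of files being compared changes.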

2. Choosing how to deduplicate

For most files, it’s best to compare hash values to spot duplicates. But for emails, it’s often useful to be able to compare other things like message IDs, subject lines, or timestamps. So, make sure your software offers you these options, too.
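To see why emails benefit from this, consider two copies of the same message that picked up different headers along the way: their file hashes differ even though the message is the same. The Python sketch below compares emails by Message-ID instead, falling back to subject line and date when no ID is present. The fallback rule and the function names are assumptions made for this example. Real eDiscovery tools may combine fields differently.

```python
from email import message_from_string


def email_key(raw: str) -> tuple:
    """Build a comparison key: Message-ID if present, else subject plus date."""
    msg = message_from_string(raw)
    message_id = msg["Message-ID"]
    if message_id:
        return ("id", message_id.strip())
    subject = (msg["Subject"] or "").strip().lower()
    date = (msg["Date"] or "").strip()
    return ("fallback", subject, date)


def dedupe_emails(raw_emails: list[str]) -> list[str]:
    """Keep the first copy of each email, judged by its comparison key."""
    seen = set()
    unique = []
    for raw in raw_emails:
        key = email_key(raw)
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique


# Same message stored in two mailboxes; only a hypothetical X-Folder header differs.
inbox_copy = "Message-ID: <abc@example.com>\nSubject: Policy change\nX-Folder: Inbox\n\nSee attached."
archive_copy = "Message-ID: <abc@example.com>\nSubject: Policy change\nX-Folder: Archive\n\nSee attached."

print(len(dedupe_emails([inbox_copy, archive_copy])))
```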

Note: Hashing also helps verify that case data hasn’t been tampered with.

Hashing is useful for more than just deduplication. It helps with file security, too. Since it’s a digital fingerprint, the hash value of a file stays the same as long as the file isn’t tampered with. So, if a digital forensics expert double-checks a file’s hash value and notices it has changed, you know that something happened to the file while it was being collected, processed, or reviewed.
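A minimal Python sketch of that check might look like this: record each file’s hash at collection time, then recompute the hashes later and flag any file whose value no longer matches. The function names and the idea of keeping the recorded hashes in a simple dictionary are assumptions for illustration. SHA-256 is used here, but the same approach works with any of the algorithms mentioned earlier.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """SHA-256 fingerprint of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_hashes(paths: list[Path]) -> dict[str, str]:
    """Take a fingerprint of each file, e.g. at collection time."""
    return {str(path): sha256_of_file(path) for path in paths}


def changed_files(recorded: dict[str, str]) -> list[str]:
    """Return files that are missing or whose current hash differs from the record."""
    changed = []
    for name, old_hash in recorded.items():
        path = Path(name)
        if not path.is_file() or sha256_of_file(path) != old_hash:
            changed.append(name)
    return changed
```

An empty list from `changed_files` means every file still matches the fingerprint taken earlier.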

Want eDiscovery software that reliably catches duplicate files? Try GoldFynch.

It’s an easy-to-use eDiscovery service that’s perfect for small- and midsize law firms and companies.

It costs just $27 a month for a 3 GB case. That’s significantly less than most comparable software. With GoldFynch, you know exactly what you’re paying for – its pricing is simple and readily available on the website.
It’s easy to budget for. GoldFynch charges only for storage (processing is free). So, choose from a range of plans (3 GB to 150+ GB) and know up front how much you’ll be paying. It takes just a few clicks to move from one plan to another, and billing is prorated – so you’ll pay only for the time you spend on any given plan. With legacy software, pricing is much less predictable.
It takes just minutes to get going. GoldFynch runs in the Cloud, so you use it through your web browser (Google Chrome recommended). No installation. No sales calls or emails. Plus, you get a free trial case (0.5 GB of data and processing cap of 1 GB), without adding a credit card.
It’s simple to use. Many eDiscovery applications take hours to master. GoldFynch takes minutes. It handles a lot of complex processing in the background, but what you see is minimal and intuitive. Just drag-and-drop your files into GoldFynch and you’re good to go. Plus, it’s designed, developed, and run by the same team. So you get prompt and reliable tech support.
It keeps you flexible. To build a defensible case, you need to be able to add and delete files freely. Many applications charge to process each file you upload, so you’ll be reluctant to let your case organically shrink and grow. And this stifles you. With GoldFynch, you get unlimited processing for free. So, on a 3 GB plan, you could add and delete 5 GB of data at no extra cost – as long as there’s only 3 GB in your case at any point. And if you do cross 3 GB, your plan upgrades automatically and you’ll be charged for only the time spent on each plan. That’s the beauty of prorated pricing.
Access it from anywhere. And 24/7. All your files are backed up and secure in the Cloud.
