What Is eDiscovery Data Compression? And How Does It Work?

13 July 2021 by UV eDiscovery compression

Takeaway: Compressing data will save you space and speed up file transfers. There are two types of compression: lossy (where ignorable data is deleted) and lossless (where a file is re-encoded more compactly, with nothing thrown away). But with eDiscovery, the trick is to find software that decompresses your data without damaging it.

Most eDiscovery cases will have files where data has been ‘compressed.’

Compressing data involves modifying or restructuring it so that it takes up less space. And this streamlines eDiscovery by saving you money (you pay for less Cloud storage) and time (compressed files upload/download much faster). Of course, it also means that your eDiscovery software has to be able to decompress (or ‘unpack’) these files. But we’ll get to that in a bit.

To compress a file, you’ll first create an ‘archive.’ But these archives often do more than just compress data.

Archives are ‘container’ files into which you can put your data so that it’s easy to store and share. There are more than 250 different archive types, but the most common are ZIPs and RARs. These sorts of archives are handy when you have folders within folders because they keep this folder hierarchy intact. But they do other things, too.
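To make the ‘folders within folders’ point concrete, here’s a minimal sketch using Python’s built-in zipfile module. The file names and contents are made up for illustration, and the archive is built in memory so nothing is written to disk.

```python
import io
import zipfile

# Hypothetical case files, keyed by their path inside the archive.
files = {
    "case-042/emails/2021/inbox.txt": "Meeting moved to Friday.",
    "case-042/contracts/lease.txt": "Lease agreement, draft 3.",
    "case-042/notes.txt": "Review the lease first.",
}

# Build the ZIP in memory so the sketch leaves nothing on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for path, text in files.items():
        archive.writestr(path, text)

# Reopen the archive and list its contents: the nested paths survive.
with zipfile.ZipFile(buffer) as archive:
    stored_paths = archive.namelist()

for path in stored_paths:
    print(path)
```

Each entry keeps its full path, which is how the folder hierarchy can be rebuilt when the archive is unpacked.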

1. Archives embed additional information to add context to your data.

The following information needs to be embedded into each archive so that your eDiscovery software can make sense of it.

  • File system data. Raw data is a big chunk of information with no clues about where one file ends and another begins. A ‘file system’ helps with this. It’s a set of logic rules your compression software uses to process raw data and decide how to store and retrieve it. An archive will have this file system data embedded in it so that your eDiscovery software will know how to unpack your files properly.
  • Metadata. Think of metadata as a digital footprint that tracks the history of a document. It’s made up of all the data ‘stamped’ onto a file by the software that created it. For example, when you create a Word file, Microsoft Word records a bunch of information about it – like who created it, when they created it, when it was last opened, etc. There’s tons of this sort of metadata attached to each file in your case, but you won’t be able to see it unless you know where to look. Learn more about metadata.
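As a rough illustration of the information an archive carries alongside the data itself, the sketch below reads back what a ZIP records about one entry. It uses Python’s built-in zipfile module, and the file name, timestamp, and text are hypothetical. Note that this shows only the archive-level details (name, timestamp, sizes), not the metadata an application like Word stores inside a document.

```python
import io
import zipfile

# Hypothetical document text, repeated so there is something to compress.
text = "Please review the attached draft. " * 50

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    # A hypothetical entry with an explicit 'last modified' timestamp.
    entry = zipfile.ZipInfo("memo.txt", date_time=(2021, 7, 13, 9, 30, 0))
    entry.compress_type = zipfile.ZIP_DEFLATED
    archive.writestr(entry, text)

# Read back what the archive recorded about the entry.
with zipfile.ZipFile(buffer) as archive:
    info = archive.getinfo("memo.txt")

print("name:           ", info.filename)
print("last modified:  ", info.date_time)
print("original size:  ", info.file_size, "bytes")
print("compressed size:", info.compress_size, "bytes")
```

The archive stores both the original and compressed sizes, so software unpacking it knows how much data to expect for each file.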

2. Archives give your data an ID to make sure it stays intact.

To make sure its files don’t get altered (mistakenly or on purpose), an archive uses something called a ‘checksum.’ This is a string of letters and numbers that identifies a file, much like a digital fingerprint. Your software recalculates this checksum later on and compares it with the stored value to confirm that the archive’s files haven’t been altered. A checksum is very specific (its value changes if you alter even a single character in a document), and is created using special algorithms like MD5, SHA-1, SHA-256, etc. (An MD5 checksum is always 32 hexadecimal characters long. For example, the MD5 checksum of an empty file is d41d8cd98f00b204e9800998ecf8427e.)
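Here’s a small sketch of how sensitive a checksum is, using Python’s built-in hashlib module. The checksum helper and the two sample sentences are our own illustrations, not part of any particular archive format.

```python
import hashlib


def checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of `data` using the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


# Two hypothetical documents that differ by a single character.
original = b"The parties agree to a term of 12 months."
altered = b"The parties agree to a term of 13 months."

print(checksum(original, "md5"))
print(checksum(altered, "md5"))
print(checksum(original) == checksum(altered))  # False
```

Changing ‘12’ to ‘13’ produces a completely different checksum, which is how a mismatch flags that a file was altered.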

3. Archives often encrypt data to keep it secure.

Some archives can encrypt their files, too. With ‘symmetric’ encryption, the same secret key (usually derived from a password) is used to both encrypt and decrypt the archive, so everyone who needs to open it must be given that password. This is what password-protected ZIP and RAR archives typically use. With ‘asymmetric’ encryption, there’s a pair of mathematically linked keys: a ‘public’ key to encrypt the data and a ‘private’ key to decrypt it. The public key can be shared openly, while the private key is kept secret. This way, anyone can encrypt data using the public key, but only someone trusted with the private key can decrypt it.

So, how does compression work? Well, it depends on the type of compression we’re talking about.

Electronic data is fundamentally a string of 0s and 1s. And the longer the string, the more space it takes up. Compression is a way of shortening this string, but there are two ways of doing this.

1. ‘Lossy’ compression nips away at useful but ignorable data.

Sometimes, we don’t need every single bit of data in a file. For example, you can remove the sounds in an MP3 that fall outside the range of human hearing without being able to hear a difference. And removing those sounds will shrink the MP3 significantly, without changing how we experience it. This is the principle behind ‘lossy’ compression, where an algorithm deletes data that isn’t strictly necessary. And lossy compression is powerful: it can shrink a file to a third of its original size, or even less. But you have to be okay losing a bit of quality, because the deleted data can’t be recovered. For example, JPEG images use lossy compression to save space. But with heavy compression, you’ll notice them becoming less sharp, losing color depth, and developing jagged edges. Most of the time, though, this tradeoff is worth it.
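The toy sketch below illustrates the idea of discarding detail, not how MP3 or JPEG actually work (real codecs use far more sophisticated perceptual models). It assumes a made-up ‘signal’ of 2,000 one-byte samples and a hypothetical quantize helper that rounds each sample down to a coarser value before both versions go through the same lossless compressor.

```python
import math
import zlib

# A hypothetical 'signal': 2,000 samples of a slow wave plus a faint
# high-frequency ripple, each stored as one byte (0-255).
samples = bytes(
    int(127.5 + 120 * math.sin(i / 40) + 4 * math.sin(i * 2.9))
    for i in range(2000)
)


def quantize(data: bytes, step: int) -> bytes:
    """Lossy step: round every sample down to a multiple of `step`."""
    return bytes((value // step) * step for value in data)


coarse = quantize(samples, step=16)

# Both versions then go through the same lossless compressor (zlib).
full_size = len(zlib.compress(samples))
coarse_size = len(zlib.compress(coarse))

print("compressed size, full detail:", full_size, "bytes")
print("compressed size, quantized:  ", coarse_size, "bytes")
print("identical to the original?   ", coarse == samples)
```

The quantized version compresses to fewer bytes, but the fine detail it threw away can’t be brought back from the compressed file.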

2. ‘Lossless’ compression restructures a file rather than deleting data.

Lossy compression might work with media files. But you’d ruin something like a spreadsheet or an EXE file if you tried to delete data – however insignificant it might seem. That’s where ‘lossless’ compression comes in. Lossless formats like PNG (for images), FLAC (for audio), and ZIP use more innovative ways to shrink file sizes. For example, imagine rewriting the string of letters ‘WWWBBWW’ as ‘3W2B2W.’ Instead of writing out each letter, you’re noting down the number of times it appears in a row. Now imagine a black-and-white image on your screen, where each pixel is either white (W) or black (B). (So you could transcribe a photograph into a grid of just W’s and B’s.) Here, wherever the image has long runs of the same color, you could use the same ‘rewriting’ principle to shrink this long list of W’s and B’s to less than half its original size! Lossless compression techniques like ‘run-length encoding’ (one of many techniques) are built on these sorts of strategies.
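The ‘WWWBBWW’ example can be sketched in a few lines of Python. This is a minimal run-length encoder and decoder of our own, assuming the input contains no digits (otherwise the counts and the data would get mixed up). It isn’t the algorithm ZIP itself uses.

```python
import re
from itertools import groupby


def rle_encode(text: str) -> str:
    """Replace each run of a repeated character with '<count><char>'."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))


def rle_decode(encoded: str) -> str:
    """Rebuild the original text from '<count><char>' pairs."""
    pairs = re.findall(r"(\d+)(\D)", encoded)
    return "".join(char * int(count) for count, char in pairs)


print(rle_encode("WWWBBWW"))  # 3W2B2W

# A hypothetical row of 80 pixels: white, then a black stripe, then white.
row = "W" * 40 + "B" * 12 + "W" * 28
packed = rle_encode(row)
print(packed, len(row), len(packed))  # 40W12B28W 80 9
print(rle_decode(packed) == row)  # True
```

Decoding gives back exactly the original row, which is what makes the technique lossless.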

And this is why your eDiscovery software matters. Because it needs to be able to decompress archive files without damaging them.

Your software needs to recognize the format of a compressed file and use the appropriate algorithm to ‘unpack’ it. And if it doesn’t do this right, it can destroy both the file and its metadata.
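As a simplified sketch of that ‘recognize, then unpack’ step, the code below checks a file’s leading ‘magic’ bytes to tell a ZIP from a gzip file, verifies a ZIP’s stored checksums, and only then extracts the contents. The detect_format and unpack helpers are hypothetical names of our own, and real eDiscovery software handles many more formats and edge cases than this.

```python
import gzip
import io
import zipfile

# Leading 'magic' bytes for two common formats.
ZIP_MAGIC = b"PK\x03\x04"
GZIP_MAGIC = b"\x1f\x8b"


def detect_format(data: bytes) -> str:
    """Guess the compression format from the first few bytes."""
    if data.startswith(ZIP_MAGIC):
        return "zip"
    if data.startswith(GZIP_MAGIC):
        return "gzip"
    return "unknown"


def unpack(data: bytes) -> dict:
    """Decompress `data` into {name: contents}; raise if damaged or unrecognized."""
    kind = detect_format(data)
    if kind == "zip":
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            bad = archive.testzip()  # re-checks each entry's CRC-32
            if bad is not None:
                raise ValueError(f"corrupt entry: {bad}")
            return {name: archive.read(name) for name in archive.namelist()}
    if kind == "gzip":
        return {"(gzip payload)": gzip.decompress(data)}
    raise ValueError("unrecognized archive format")


# Build one small sample of each format in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("memo.txt", "Draft 3 attached.")
zip_bytes = buffer.getvalue()
gzip_bytes = gzip.compress(b"Draft 3 attached.")

print(detect_format(zip_bytes), detect_format(gzip_bytes))
print(unpack(zip_bytes))
```

Refusing to guess when the format is unknown, and stopping when a checksum fails, is safer than extracting data that might be silently damaged.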

GoldFynch is eDiscovery software that knows how to decompress files properly. But it does more than just that.

At GoldFynch, we’ve tailored our eDiscovery service for small and midsize law firms like yours. And this means more than just having the right technology. It means designing software that’s simple, reliable, and affordable. Here’s what’s unusual about GoldFynch:

  • It costs just $10 a month for a 1 GB case: That’s significantly less than most comparable software. With GoldFynch, you know what you’re paying for exactly – its pricing is simple and readily available on the website.
  • It’s easy to budget for. GoldFynch charges only for storage (processing is free). So, choose from a range of plans (1 GB to 150+ GB) and know up-front how much you’ll be paying. You can upload and cull as much data as you want, as long as you stay below your storage limit. And even if you do cross the limit, you can upgrade your plan with just a few clicks. Also, billing is prorated – so you’ll pay only for the time you spend on any given plan. With legacy software, pricing is much less predictable.
  • It takes just minutes to get going. GoldFynch runs in the Cloud, so you use it through your web browser (Google Chrome recommended). No installation. No sales calls or emails. Plus, you get a free trial case (0.5 GB of data and a processing cap of 1 GB), without adding a credit card.
  • It’s simple to use. Many eDiscovery applications take hours to master. GoldFynch takes minutes. It handles a lot of complex processing in the background, but what you see is minimal and intuitive. Just drag-and-drop your files into GoldFynch, and you’re good to go. Plus, you get prompt and reliable tech support.
  • Access it from anywhere, and 24/7. All your files are backed up and secure in the Cloud.

Want to find out more about GoldFynch?