How Do eDiscovery Search Engines 'Read' Files?

09 February 2022 | by Ross | eDiscovery, Searching

Takeaway: eDiscovery search engines read your files by ‘extracting’ data. This extraction splits up a document’s contents and slots them into a behind-the-scenes database that your software can then skim to find the keywords, names, dates, etc., you’re looking for. Data extraction sets the stage for your entire eDiscovery review, so you’ll want software that does it quickly and smoothly.

eDiscovery deals with electronic data – not paper documents. And this changes your entire workflow.

Reviewing paper documents meant leafing through each page in turn, underlining important sentences, and sticking post-its on key pages. Now, we could technically do the same thing with electronically stored information (ESI), but there’s usually too much of it to cover. Digital data accumulates so easily that we’re often dealing with hundreds of gigabytes of emails, PDFs, spreadsheets, and more. It’s impractical to go through all that information manually – especially when search engines can do it faster. Besides, reading only a document’s body content means overlooking valuable metadata.

That’s why ‘data extraction’ is so essential. It’s a way of processing digital data so that your eDiscovery software can help you review it.

In earlier posts, we looked at how eDiscovery applications process the data you upload, and how data gets normalized. But ‘data extraction’ lies at the heart of all this. It’s the process eDiscovery applications use to decode your data and plug it into a database. And it’s this database that helps search engines find the keywords, names, dates, etc., that you’re looking for.

But your software has to find the data first, before extracting it. And that’s tough, considering data is essentially a string of continually changing zeroes and ones.

Electronic files are layered with text, images, metadata, formatting/configuration instructions, and more. But computers see all of this in terms of binary strings, so your eDiscovery software has to find these binary strings first. And remember, these strings continually grow and shrink as you edit documents, and they change their location on your hard drive, too!

So, your software needs help finding your data. And it does this using either APIs or document filters.

An API (application programming interface) is like a messenger between two pieces of software. It lets them stay separate and independent while still sharing information. So, if your eDiscovery data is in, for example, Dropbox, then Dropbox’s API will guide your software. However, if there aren’t any APIs around to help, your software will need to use its second option: document filters. These are data extraction templates that do an API’s job – except that they’re stored in your eDiscovery software. Document filters are useful but not ideal, because your software has to create, run, and update them, whereas APIs handle these things themselves. In fact, creating document filters is so tiresome that software developers often save time by borrowing existing commercial or open-source filters.
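The two routes above can be sketched in a few lines of Python. Everything here – the function names, the filter table, the extension-based lookup – is invented for illustration; real eDiscovery software ships hundreds of filters and full API clients.

```python
import os

def fetch_via_api(source: str, path: str) -> bytes:
    """Route 1 (hypothetical): ask the hosting service's API for the file."""
    # A real implementation would call, e.g., Dropbox's API client here.
    raise NotImplementedError(f"no API client configured for {source}")

# Route 2: local document filters, keyed by file extension.
DOCUMENT_FILTERS = {
    ".txt": lambda raw: raw.decode("utf-8", errors="replace"),
    # A real product would also ship filters for PDF, DOCX, XLSX, ...
}

def extract_text(filename: str, raw: bytes) -> str:
    """Fall back to a document filter when no API is available."""
    ext = os.path.splitext(filename)[1].lower()
    doc_filter = DOCUMENT_FILTERS.get(ext)
    if doc_filter is None:
        raise ValueError(f"no document filter for {ext!r}")
    return doc_filter(raw)
```

The point of the dispatch: when an API exists, the service does the decoding work; otherwise, the software falls back to a filter it must maintain itself.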

After finding the data, your software needs to decode it using a process called ‘recursion.’

Remember how we said that electronic files are layered with text, images, metadata, formatting/configuration instructions, and more? Well, after finding your data, your software has to start taking apart all these layers. It first figures out the data’s file type (ASCII text? JPEG? PDF?). It then explores the data, identifying each sub-element (e.g., is this chunk of data a date/time value? An embedded database? An image?). And finally, your software decodes these elements so that they’re understandable. ‘Recursion’ describes repeating this 3-step loop, one cycle after another, until there’s nothing left to explore in the file.
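Here’s a toy illustration of that recursive loop in Python. The nested “email” structure and the type checks are stand-ins – real files are binary, and real decoders handle hundreds of element types – but the shape of the logic is the same: identify each element, decode it, and recurse into anything that contains more elements.

```python
def decode(element, out=None):
    """Recursively walk a nested structure, decoding each sub-element."""
    if out is None:
        out = []
    if isinstance(element, dict):       # container (e.g., an email, a PDF)
        for child in element.values():
            decode(child, out)          # recurse into each sub-element
    elif isinstance(element, bytes):    # raw text chunk: decode it
        out.append(element.decode("utf-8", errors="replace"))
    else:                               # already-understandable value
        out.append(str(element))
    return out

# A pretend email with a nested attachment:
email = {
    "subject": b"Q3 forecast",
    "body": b"See attachment.",
    "attachment": {"title": b"forecast.xlsx", "rows": 42},
}
```

Calling `decode(email)` keeps cycling – container, sub-element, decode – until every layer has been unpacked.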

After decoding your data, your software plugs it into a database.

Think of databases as spreadsheets with cells arranged in rows and columns. By plugging the pieces of decoded data into the different cells, your software can retrieve them in seconds. That’s why databases are the key to quick searches – they prevent your software from having to ‘re-read’ your documents each time you give it a search command.

Some of the split-up data may have elements that need to stay connected, and your software uses unitization to protect these connections. For example, archive files (RAR, ZIP, etc.) need to stay connected to their contents (e.g., a compressed group of PowerPoint presentations). Similarly, file elements need to stay connected to their metadata. And documents need to stay connected to their source custodian.
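A minimal sketch of that behind-the-scenes database, using SQLite (Python’s built-in database engine). The schema and column names are invented for illustration; the `parent_id` column is one common way to keep unitized items – an archive and its contents, a document and its custodian – connected.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE docs (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES docs(id),  -- unitization: link to parent
    custodian TEXT,
    filename  TEXT,
    body      TEXT)""")

con.executemany(
    "INSERT INTO docs VALUES (?, ?, ?, ?, ?)",
    [(1, None, "jsmith", "decks.zip",  ""),                     # the archive
     (2, 1,    "jsmith", "pitch.pptx", "Q3 revenue forecast"),  # its child
     (3, None, "adoe",   "memo.txt",   "forecast meeting moved")])

# A keyword search skims the table -- it never re-reads the original files.
hits = con.execute(
    "SELECT filename FROM docs WHERE body LIKE ?", ("%forecast%",)).fetchall()
```

Because `pitch.pptx` carries `parent_id = 1`, the software always knows it came out of `decks.zip` – and the `custodian` column keeps each row tied to its source.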

Finally, your software must track any errors that surface during this multi-stage process.

Data extraction gets quite tricky, so your software has to track and report the inevitable errors that pop up. For example, some files may be in rare formats that can’t be opened. Others may be encrypted or corrupted and can’t be read. And some might need optical character recognition (OCR) before their contents can be extracted. Error tracking matters because you don’t want to randomly stumble upon (and have to fix) errors right before a deadline. Worse still, you don’t want your software to attempt a sub-standard ‘quick fix’ that changes or leaves out data.
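The error-tracking idea can be sketched as follows. The markers used to simulate “encrypted” and “corrupted” files are invented for the example; the point is that every failure gets logged against its file instead of being silently skipped or patched over.

```python
def try_extract(filename: str, raw: bytes, errors: list):
    """Attempt extraction; on failure, record the error instead of guessing."""
    try:
        if raw.startswith(b"%ENCRYPTED"):          # pretend encryption marker
            raise PermissionError("file is encrypted")
        if b"\x00" in raw:                         # pretend corruption check
            raise ValueError("binary/corrupted content")
        return raw.decode("utf-8")
    except Exception as exc:
        # Log the failure for review -- no silent skips, no risky quick fixes.
        errors.append({"file": filename, "error": str(exc)})
        return None

errors = []
text = try_extract("report.txt", b"All clear.", errors)
try_extract("secret.pdf", b"%ENCRYPTED...", errors)
```

After a run, the `errors` list is the report you’d review well before any deadline: one entry per problem file, with the reason extraction failed.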

Data extraction might be complicated, but you still want it to be fast and smooth. And that’s something newer eDiscovery applications offer.

At GoldFynch, we designed our eDiscovery software to offer fast and reliable data extraction. But it also has the essential eDiscovery tools you’ll need, at an affordable price. Here’s more about GoldFynch that might interest you:

  • It costs just $10 a month for a 1 GB case: That’s significantly less than most comparable software. With GoldFynch, you know exactly what you’re paying for – its pricing is simple and readily available on the website.
  • It’s easy to budget for. GoldFynch charges only for storage (processing files is free). So, choose from a range of plans (1 GB to 150+ GB) and know up-front how much you’ll be paying. You can upload and cull as much data as you want, as long as you stay below your storage limit. And even if you do cross the limit, you can upgrade your plan with just a few clicks. Also, billing is prorated – so you’ll pay only for the time you spend on any given plan. With legacy software, pricing is much less predictable.
  • It takes just minutes to get going. GoldFynch runs in the Cloud, so you use it through your web browser (Google Chrome recommended). No installation. No sales calls or emails. Plus, you get a free trial case (0.5 GB of data and a processing cap of 1 GB), without adding a credit card.
  • It’s simple to use. Many eDiscovery applications take hours to master. GoldFynch takes minutes. It handles a lot of complex processing in the background, but what you see is minimal and intuitive. Just drag-and-drop your files into GoldFynch, and you’re good to go. Plus, you get prompt and reliable tech support.
  • Access it from anywhere, and 24/7. All your files are backed up and secure in the Cloud.

Want to find out more about GoldFynch?