Can Optical Character Recognition (OCR) Affect eDiscovery?

20 September 2022 by UV eDiscovery OCR

Takeaway: OCR can speed up eDiscovery by rapidly converting scanned documents, image-only PDFs, etc., into machine-readable text. But there’s a catch: Your OCR software needs to be accurate. So, ensure it’s a well-tested tool, and ideally, built into your eDiscovery software.

Optical character recognition (OCR) is a way of converting paper documents into machine-readable text.

OCR (optical character recognition – also known as text recognition) is a way of extracting text from paper documents, scanned pages, handwritten notes, photographs/images, image-only PDFs, and more. We say ‘extract’ because, to a computer, a scanned page is more like a photograph than a document. So, OCR software recognizes and isolates letter shapes on a page, ‘translating’ them into machine-readable text. Most often, OCR consists of just software, but sometimes it includes hardware like optical scanners or specialized circuit boards. And the software can range from simple algorithms to advanced intelligent character recognition that uses artificial intelligence (AI).

Obviously, this technology comes with a bunch of useful benefits.

OCR revolutionizes document reviews by making scanned text ‘searchable.’ So, you’ll be able to use a search engine to sift through gigabytes of data that would have taken days to review manually. Equally important, it makes all your paper documents accessible online to anyone with a link. So you can keep old, archived contracts relevant – updating/revising them, backing them up in the Cloud, and even adding a digital signature to authenticate and protect them. Plus, a roomful of paper files can now fit on a tiny USB stick, so you’ll be saving space. These advantages make OCR ideal for tasks like data entry (digitizing checks, passports, receipts, etc.), data extraction (isolating names and dates from insurance documents, for example), book scanning, automatic license plate recognition, handwriting conversion in real-time (i.e., ‘pen computing’), and more.

It’s this wide range of benefits that popularized OCR in the 70s.

Before OCR became mainstream, the only way to digitize documents was to retype them manually. But in 1974, Ray Kurzweil (founder of Kurzweil Computer Products, Inc.) created an omni-font OCR product that could recognize a range of fonts while making minimal errors. Initially, he used this technology in text-to-speech software for people with visual impairments. But OCR is so versatile that by the 90s, it was being used to digitize vintage newspapers. And today, Google uses Cloud Vision OCR to power smartphone document scanners.

So, how does OCR work? Well, it’s a seamless three-stage process.

Scanned documents are essentially a collection of dots (i.e., pixels) arranged in specific patterns. So, OCR applications have to decode the dot patterns – creating digital versions of the original letters, words, and sentences. Here’s how it works.

Step 1. Pre-processing: Boosting the image quality.

First, the software converts the document into black and white, making the text stand out as easily identifiable dark shapes on a white background. Next, the software enhances the image using techniques like de-skewing (rotating the image to straighten tilted text), normalizing (adjusting its aspect ratio and scale), despeckling (removing stray spots), and zoning (identifying paragraphs, columns, etc.).
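To make the pre-processing step concrete, here’s a minimal Python sketch of two of these techniques: converting to black and white, and despeckling. It’s a toy illustration, not how any particular OCR product works. The function names (`binarize`, `despeckle`), the fixed threshold of 128, and the tiny 5x5 ‘scan’ are all assumptions made up for this example. Real tools usually pick thresholds adaptively and work on far larger images.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0 = black, 255 = white) to 1 (ink) or 0 (background)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def despeckle(bitmap):
    """Clear any ink pixel that has no ink among its (up to 8) neighbours."""
    height, width = len(bitmap), len(bitmap[0])
    cleaned = [row[:] for row in bitmap]  # copy, so we test against the original
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1:
                continue
            has_neighbour = any(
                bitmap[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0
    return cleaned


# A made-up 5x5 grayscale "scan": one vertical stroke, plus a stray dark
# speck in the top-right corner.
scan = [
    [250, 30, 240, 245, 20],
    [245, 25, 250, 240, 250],
    [240, 35, 245, 250, 245],
    [250, 20, 240, 245, 250],
    [245, 30, 250, 240, 245],
]

clean = despeckle(binarize(scan))
for row in clean:
    print("".join("#" if px else "." for px in row))
```

After both passes, only the vertical stroke is left as ink. The isolated speck is gone because none of its neighbours are dark.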

Step 2. Character recognition: Spotting letters, numbers, and other characters.

The software analyzes the dark portions of the image, trying to recognize letters, numbers, and other characters. It does this in one of two ways. The first option is to match the dark-pixel patterns against a pre-existing library of character shapes. And the second option is to analyze the shapes of all the lines on a page, figuring out the characters this way. (Here, an algorithm assesses the angles, curves, and intersections of each line/stroke, using this information to decode the characters.)
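The first option (matching pixel patterns against a library) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the three-letter `TEMPLATES` library, the 3x5 glyph size, and the `match_score` and `recognize` helpers are all hypothetical. Production OCR engines use much larger libraries and more robust comparisons.

```python
# Hypothetical 3x5 pixel templates: '#' is ink, '.' is background.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def match_score(glyph, template):
    """Return the fraction of pixels on which the glyph and template agree."""
    total = sum(len(row) for row in template)
    same = sum(
        g == t
        for g_row, t_row in zip(glyph, template)
        for g, t in zip(g_row, t_row)
    )
    return same / total


def recognize(glyph):
    """Return (best_character, score) for the closest template."""
    scored = ((ch, match_score(glyph, tpl)) for ch, tpl in TEMPLATES.items())
    return max(scored, key=lambda pair: pair[1])


# A scanned "T" with one ink pixel missing from its stem.
noisy_t = ["###", ".#.", ".#.", "...", ".#."]

best_char, score = recognize(noisy_t)
print(best_char, round(score, 2))
```

Even with a pixel missing, the damaged glyph still agrees with the ‘T’ template on 14 of its 15 pixels, so ‘T’ wins. That tolerance is also why a badly degraded scan can tip the match toward the wrong character.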

Step 3. Post-processing: Tidying up and correcting errors.

In this final step, the software reviews its work, looking for errors. For instance, it might compare the extracted words with a pre-installed word library, flagging (possibly incorrect) words that don’t match any library entries. It might also catch errors by comparing its guesses with a library of commonly co-occurring words. (For instance, it’ll catch the mistake in the phrase ‘return in investment’ when it sees that those words co-occur more commonly as ‘return on investment.’) Finally, the software runs the extracted text through pre-installed grammar rules, correcting faulty extractions in a last round of error correction.
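Here’s a rough Python sketch of the first two checks: the word-library lookup and the co-occurrence comparison. Everything in it is assumed for illustration. `WORD_LIST` and `PHRASE_COUNTS` are tiny made-up stand-ins for the large libraries a real tool would ship with, and the helper names are hypothetical. The only library call is `difflib.get_close_matches` from Python’s standard library.

```python
import difflib

# Hypothetical mini word library (a real one would hold many thousands of words).
WORD_LIST = {"return", "on", "in", "investment", "merger", "the"}

# Hypothetical counts of how often three-word phrases appear in reference text.
PHRASE_COUNTS = {
    ("return", "on", "investment"): 120,
    ("return", "in", "investment"): 1,
}


def flag_unknown(words):
    """Return the words that do not appear in the word library."""
    return [w for w in words if w.lower() not in WORD_LIST]


def suggest(word):
    """Return the closest library word, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word.lower(), sorted(WORD_LIST), n=1, cutoff=0.8)
    return matches[0] if matches else None


def fix_middle_word(left, middle, right):
    """Pick the middle word that co-occurs most often with its two neighbours."""
    candidates = {
        mid: count
        for (first, mid, last), count in PHRASE_COUNTS.items()
        if first == left and last == right
    }
    if not candidates:
        return middle  # no evidence either way, so keep the original guess
    return max(candidates, key=candidates.get)


print(flag_unknown("return on investmemt".split()))
print(suggest("investmemt"))
print(fix_middle_word("return", "in", "investment"))
```

Notice that ‘in’ is a perfectly valid word, so the library lookup alone can’t catch ‘return in investment.’ It takes the co-occurrence check to spot that one.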

Even with all this, some OCR applications still make errors. And this is a problem for processes like eDiscovery.

Modern eDiscovery applications come loaded with valuable tools like a powerful search engine to find niche files, a tagging tool to add context to those files, and a production wizard to help redact sensitive information and prepare your case for external review. But none of these features can help if there are errors in a document’s OCRed text. For instance, say you’re running a search for the keyword phrase ‘Anderson merger.’ Even if a handwritten memo you scanned has this phrase, your search engine might overlook it if there was an OCR error (e.g., the software misread it as ‘Andersun merger’). So, reliable OCR software is critical for eDiscovery.
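A short Python sketch shows why that single wrong letter matters. The `exact_hit` and `fuzzy_hit` functions below are hypothetical helpers written for this example, and the 0.85 similarity cutoff is an arbitrary assumption. This isn’t a description of how GoldFynch or any other eDiscovery search engine works. It only illustrates that a plain keyword match misses the misread phrase, while a tolerant comparison (here, `difflib.SequenceMatcher` from Python’s standard library) would still find it.

```python
import difflib


def exact_hit(text, phrase):
    """Plain case-insensitive keyword search."""
    return phrase.lower() in text.lower()


def fuzzy_hit(text, phrase, cutoff=0.85):
    """Compare the phrase against every same-length run of words in the text."""
    words = text.lower().split()
    target = phrase.lower()
    size = len(target.split())
    for start in range(len(words) - size + 1):
        window = " ".join(words[start:start + size])
        if difflib.SequenceMatcher(None, window, target).ratio() >= cutoff:
            return True
    return False


# OCR output in which 'Anderson' was misread as 'Andersun'.
ocr_text = "Notes on the Andersun merger timeline"

print(exact_hit(ocr_text, "Anderson merger"))
print(fuzzy_hit(ocr_text, "Anderson merger"))
```

The exact search comes back empty, and the tolerant one finds the phrase. But a tolerant search also risks pulling in near-miss phrases you didn’t want, which is one more reason to get the OCR right in the first place.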

Here’s where Cloud eDiscovery services can help.

Cloud eDiscovery services like GoldFynch can automatically convert scanned documents, PDFs, etc., into machine-readable text. And they do this using a reliable, inbuilt OCR tool designed for high-value documents. GoldFynch offers this tool free with all its paid plans, and even with your free 512 MB starter case. But GoldFynch does more than offer dependable OCR. It’s a complete eDiscovery suite with the following features:

  • It costs just $25 a month for a 3 GB case. That’s significantly less than most comparable software. With GoldFynch, you know exactly what you’re paying for: its pricing is simple and readily available on the website.
  • It’s easy to budget for. GoldFynch charges only for storage (processing files is free). So, choose from a range of plans (3 GB to 150+ GB) and know up-front how much you’ll be paying. You can upload and cull as much data as you want, as long as you stay below your storage limit. And even if you do cross the limit, you can upgrade your plan with just a few clicks. Also, billing is prorated – so you’ll pay only for the time you spend on any given plan. With legacy software, pricing is much less predictable.
  • It takes just minutes to get going. GoldFynch runs in the Cloud, so you use it through your web browser (Google Chrome recommended). No installation. No sales calls or emails. Plus, you get a free trial case (0.5 GB of data and a processing cap of 1 GB) without adding a credit card.
  • It’s simple to use. Many eDiscovery applications take hours to master. GoldFynch takes minutes. It handles a lot of complex processing in the background, but what you see is minimal and intuitive. Just drag-and-drop your files into GoldFynch, and you’re good to go. Plus, you get prompt and reliable tech support.
  • Access it from anywhere, and 24/7. All your files are backed up and secure in the Cloud.
