What is Tokenization? And How Does It Power eDiscovery Searches?

18 February 2022 | by UV | Tags: eDiscovery, searching, tokenization

Takeaway: Tokenization is the process eDiscovery applications use to break up free-flowing sentences and paragraphs into discrete units (or ‘tokens’) that slot into a database. And it’s one of many processes that make eDiscovery searches possible.

eDiscovery searches pull data from behind-the-scenes databases. But how do you fit whole sentences and paragraphs into a database?

When you run an eDiscovery search, your software isn’t actually ‘reading’ each paragraph of thousands of case files. Rather, it’s matching your search terms (keywords, keyword phrases, etc.) against a behind-the-scenes database. (These databases are sort of like massive spreadsheets, with rows and columns of cells, each cell holding a keyword or an important bit of file information.) Breaking up entire paragraphs into concise database units is time-consuming up front, but time-saving later on. However, it brings up a tricky question: How do you split free-flowing paragraphs and sentences into database units?
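
To make that concrete, here’s a minimal sketch of the idea behind such a database: an ‘inverted index’ that maps each token to the files containing it. It’s purely illustrative (in Python, with made-up documents), not how GoldFynch or any particular product actually stores its data:

```python
# A toy 'behind-the-scenes database': an inverted index mapping each
# token to the set of documents it appears in.
from collections import defaultdict

docs = {
    "doc1": "The merger closed in March",
    "doc2": "March emails about the merger",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():  # naive tokenization, for now
        index[token].add(doc_id)

# A search is now a quick lookup, not a read-through of every file.
print(index["merger"])  # both documents: {'doc1', 'doc2'}
print(index["march"])   # both documents: {'doc1', 'doc2'}
```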

Here’s where tokenization comes in. It’s a way of teaching a computer where and how to break up sentences.

We humans build language around words and sentences, using grammar as a guide. Computers, by contrast, manipulate binary strings (i.e., long sequences of zeros and ones), using algorithms as a guide. So, to help eDiscovery applications split grammatically complex sentences into chunks for a database, we give them algorithmic rules. The resulting chunks are called ‘tokens,’ and the rules are called ‘tokenization’ rules. So, think of tokenization as a bridge between organic, human thinking and mathematical computer-speak. It tells your software how to spot and deal with things like letters, numbers, words, punctuation, capitalization, diacritical marks (e.g., the accent on the last ‘e’ of café), and more.
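
To make this concrete, here’s a toy tokenizer in Python. It’s a hedged sketch (not any real product’s rules) that applies three such algorithmic rules: fold diacritical marks, ignore capitalization, and split on anything that isn’t a letter or digit:

```python
import re
import unicodedata

def tokenize(text: str) -> list[str]:
    """Apply simple, explicit rules to turn a sentence into tokens."""
    # Rule 1: fold diacritics, so 'café' and 'cafe' become the same token.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Rule 2: ignore capitalization.
    text = text.lower()
    # Rule 3: a token is a run of letters or digits; everything else splits tokens.
    return re.findall(r"[a-z0-9]+", text)

print(tokenize("Meet me at the Café at 9."))
# ['meet', 'me', 'at', 'the', 'cafe', 'at', '9']
```

Real tokenizers layer on many more rules than this, as the next section shows.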

Tokenization gets pretty complicated, though, because all languages regularly break their own rules.

Splitting a block of text into words should be pretty easy, right? Just look for the blank spaces between words! Or check for basic punctuation marks? Well, it’s not that simple. First, looking for blank spaces only works for languages like English, since scripts like Chinese, Japanese, and Thai don’t mark word boundaries with spaces. But even with English, things get messy quickly. For example, if a full stop always signals the end of a sentence, then what should your software do about the full stops in ‘Mr.’ or ‘Ms.’? Similarly, do the hyphens in ‘state-of-the-art’ signify the same thing as the hyphen separating two parts of a sentence? (E.g., “He ran as fast as he could – but that’s not saying much.”)
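
You can see the full-stop problem in action with a naive ‘split on full stops’ rule:

```python
# Why 'split on full stops' is too naive a sentence-splitting rule.
text = "Mr. Smith bought a state-of-the-art scanner. It was expensive."

sentences = text.split(". ")
print(sentences)
# ['Mr', 'Smith bought a state-of-the-art scanner', 'It was expensive.']
# The full stop in 'Mr.' wrongly ends the first 'sentence'.
```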

To solve these challenges, programmers come up with creative workarounds.

There’s a creative fix for our hyphen example above. Programmers let search engines choose one of these options: (1) treat hyphens as searchable text, (2) treat them as spaces, (3) ignore them, or (4) do all three. Each option has its advantages depending on the situation. For example, treating hyphens as spaces will catch all three of these variants: ‘big-data-based,’ ‘big data-based,’ and ‘big-data based.’ But the option to ignore hyphens would help you catch the typo ‘bigdata.’
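
As a sketch, here’s what those options might look like when applied to a query term (the function name is made up for illustration):

```python
# Expand a hyphenated query term using the three treatments described above.
def hyphen_variants(term: str) -> set[str]:
    return {
        term,                    # (1) hyphens as searchable text: 'big-data'
        term.replace("-", " "),  # (2) hyphens as spaces: 'big data'
        term.replace("-", ""),   # (3) hyphens ignored: 'bigdata'
    }

print(hyphen_variants("big-data"))
# {'big-data', 'big data', 'bigdata'} -- option (4) is simply searching all three
```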

As part of these workarounds, eDiscovery applications keep an ‘exclusions’ list.

Search engines use exclusions to avoid unnecessary complications. For example, they’ll exclude characters like ‘?@%’ from searches and treat them as spaces instead. Similarly, they’ll ignore a bunch of common stop words (like ‘are,’ ‘for,’ ‘in,’ ‘just,’ ‘but,’ and ‘also’) that we use for grammar, but which don’t add any searchable meaning to sentences. Most search engines also ignore acronyms, initialisms, and two-letter words. And all search engines ignore single letters.
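
Here’s a rough sketch of how such exclusions might be applied when tokens are indexed. The character and stop-word lists below are illustrative; real applications ship their own, and they vary:

```python
import re

# Illustrative exclusion lists -- not any real product's defaults.
STOP_WORDS = {"are", "for", "in", "just", "but", "also"}
EXCLUDED_CHARS = "?@%"

def index_tokens(text: str) -> list[str]:
    # Treat excluded characters as spaces...
    for ch in EXCLUDED_CHARS:
        text = text.replace(ch, " ")
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # ...then drop stop words and single letters.
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]

print(index_tokens("Pricing for Q3 is in the attached file?"))
# ['pricing', 'q3', 'is', 'the', 'attached', 'file']
```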

These technicalities might seem trivial, but they do make a difference.

eDiscovery applications are pretty low-maintenance these days, but you’ll still want to know how they work. For example, stop-word lists vary between applications, so what if one of your search keywords happens to be on your software’s stop-word list? The search will return no hits, and you might wrongly conclude that the keyword isn’t anywhere in your data. (E.g., what happens if you run a search for ‘Yahoo! Mail’ when ‘!’ is on the exclusion list?)
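
One conceptual safeguard is to check your search terms against the exclusion rules before trusting a zero-hit result. Again, the lists below are illustrative:

```python
# Sanity-check a query against illustrative exclusion rules before searching.
STOP_WORDS = {"are", "for", "in", "just", "but", "also"}
EXCLUDED_CHARS = "?@%!"

def check_query(query: str) -> None:
    for term in query.lower().split():
        stripped = "".join(ch for ch in term if ch not in EXCLUDED_CHARS)
        if term in STOP_WORDS:
            print(f"warning: '{term}' is a stop word and is never indexed")
        elif stripped != term:
            print(f"warning: '{term}' will be searched as '{stripped}'")

check_query("Yahoo! Mail")
# warning: 'yahoo!' will be searched as 'yahoo'
```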

Similarly, there are a bunch of other eDiscovery software processes you’ll want to know about, too.

When you upload files into your eDiscovery software, it has to process your data via a series of steps. And tokenization is just one of these steps. For reference, here are some of the others.

  1. Finding your data. Your case-related files will usually be mixed up with unrelated data on a hard drive. So, your software first has to find them, often using an API (application programming interface). This is like a messenger between your eDiscovery software and the platform your data sits on (e.g., Dropbox). APIs let these separate pieces of software communicate while staying independent. Learn more about data extraction.
  2. Figuring out file types. Once it’s found your data, your software needs to separate it into component files – documents, emails, images, etc. And to open them, it needs to figure out their file types (e.g., ‘DOCX’ for Microsoft Word files, ‘PDF’ for Adobe Reader files, etc.). A common trick is to inspect a file’s contents rather than trust its name; there’s a short sketch of this after the list below. Learn more about identifying file types.
  3. Decoding file data. After opening a file, your software needs to take apart its layers of text, images, metadata, formatting instructions, and more. It then needs to explore each bit of data and figure out what it is (e.g., is it a date/time value, an embedded database, an image, etc.?). And then, it needs to decode these data elements into a format it can work with.
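
Many applications handle step 2 by checking ‘magic numbers’: the signature bytes at the very start of a file, which are more trustworthy than the file’s extension. Here’s a minimal sketch in Python (the mapping covers just a few common formats and is illustrative, not any product’s actual detection logic):

```python
# Sniff a file's real type from its leading 'magic' bytes.
MAGIC_NUMBERS = {
    b"%PDF": "PDF document",
    b"PK\x03\x04": "ZIP container (DOCX, XLSX, PPTX, ...)",
    b"\x89PNG": "PNG image",
}

def sniff_type(path: str) -> str:
    with open(path, "rb") as f:
        header = f.read(8)  # the first few bytes are enough for these formats
    for magic, file_type in MAGIC_NUMBERS.items():
        if header.startswith(magic):
            return file_type
    return "unknown"

# e.g., sniff_type("evidence/report.docx") -> 'ZIP container (DOCX, XLSX, PPTX, ...)'
# (DOCX files really are ZIP containers under the hood.)
```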

Thankfully, many newer eDiscovery applications have streamlined processes like tokenization.

At GoldFynch, we designed our eDiscovery software to offer fast and reliable data extraction. So, you don’t have to worry about the errors we’ve discussed in this post. But GoldFynch also has an essential eDiscovery toolkit and other features that might interest you:

  • It costs just $27 a month for a 3 GB case: That’s significantly less than most comparable software. With GoldFynch, you know exactly what you’re paying for: its pricing is simple and readily available on the website.
  • It’s easy to budget for. GoldFynch charges only for storage (processing files is free). So, choose from a range of plans (3 GB to 150+ GB) and know up-front how much you’ll be paying. You can upload and cull as much data as you want, as long as you stay below your storage limit. And even if you do cross the limit, you can upgrade your plan with just a few clicks. Also, billing is prorated – so you’ll pay only for the time you spend on any given plan. With legacy software, pricing is much less predictable.
  • It takes just minutes to get going. GoldFynch runs in the Cloud, so you use it through your web browser (Google Chrome recommended). No installation. No sales calls or emails. Plus, you get a free trial case (0.5 GB of data and a processing cap of 1 GB), without adding a credit card.
  • It’s simple to use. Many eDiscovery applications take hours to master. GoldFynch takes minutes. It handles a lot of complex processing in the background, but what you see is minimal and intuitive. Just drag-and-drop your files into GoldFynch, and you’re good to go. Plus, you get prompt and reliable tech support.
  • Access it from anywhere, 24/7. All your files are backed up and secure in the Cloud.

Want to find out more about GoldFynch?