Takeaway: Your eDiscovery software has to identify the various ‘file types’ in your case before it can process and prepare them for review. And it does this by analyzing a file’s metadata, file extension, and/or its MIME type. But there’s a lot that can go wrong during all this, and so it’s worth finding eDiscovery software that knows how to process your files properly.

The best eDiscovery applications give you a range of valuable tools to use on your case data.

Electronic discovery (eDiscovery) is so much more efficient than older ‘paper’ discovery because you now have the power of a computer to help you review files. For example, your eDiscovery software’s search engine can search your files for keywords and keyword combinations. So you can give it a complex command like, “Find the emails Dennis Nedry sent Sally Grant that mention the Pfizer meeting.” And once you’ve found these emails, you’ll be able to ‘tag’ them (tags are like virtual sticky notes) and pull them up later with a single click.

But to use these tools, your software first has to identify the various ‘file types’ in your case.

Each of your files is structured and coded differently, and so they get processed differently, too. For example, most files carry ‘metadata’ (i.e., bonus information that adds context to a file) outlining when the file was created, who created it, who last opened it, etc. But where and how the metadata is stored will vary depending on whether you’re looking at a Word file, an Excel spreadsheet, an email, etc. And there’s more to deal with than metadata. For example, archives like RARs and ZIPs have to be decompressed, email messages have to be isolated from their message threads, their attachments have to be identified and processed, and more. Getting a file type wrong means your software will process the file incorrectly, and it’ll likely garble or lose that file’s text and metadata.

So how does your eDiscovery software figure out file types? Well, with a Windows operating system, it’ll look at file extensions and metadata.

Each file in your case is, in its most basic form, a sequence of bits and bytes. I.e., it’s binary data. And most binary file formats begin with a block of metadata called a ‘header.’ Embedded in that header is a ‘signature’ or ‘magic number’ (e.g., 50 4B 03 04 for ZIP-based files, which include modern Word DOCX files, 25 50 44 46 for Adobe PDFs, etc.) which identifies what type of file this binary data has been built into. Note that a signature can be shared: DOCX, XLSX, and PPTX files are all ZIP containers, so they all start with the same four bytes, and your software has to look inside the container to tell them apart. In addition to these magic numbers, your files will usually also have file extensions: DOCX for Word documents, XLSX for Excel sheets, PST for Outlook data files, and so on. Your eDiscovery software figures out file types on a Windows system by looking at both a file’s magic number and its file extension.
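To make the magic number idea concrete, here’s a minimal Python sketch of checking a file’s leading bytes against a signature table and then comparing the result with the file extension. The table, the function names (`sniff_signature`, `extension_matches`), and the list of expected extensions are all hypothetical and deliberately tiny. Real eDiscovery software will use a far larger signature database and its own logic.

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical, deliberately tiny signature table: leading bytes -> format family.
SIGNATURES = {
    bytes.fromhex("504B0304"): "zip",  # ZIP container; DOCX, XLSX, and PPTX start this way too
    bytes.fromhex("25504446"): "pdf",  # the ASCII characters "%PDF"
}

# Hypothetical mapping of each family to the extensions we would expect to see.
EXPECTED_EXTENSIONS = {
    "zip": {".zip", ".docx", ".xlsx", ".pptx"},
    "pdf": {".pdf"},
}


def sniff_signature(data: bytes) -> Optional[str]:
    """Return the format family whose magic number starts the data, or None."""
    for magic, family in SIGNATURES.items():
        if data.startswith(magic):
            return family
    return None


def extension_matches(filename: str, data: bytes) -> bool:
    """True only if the signature is known and the extension is plausible for it."""
    family = sniff_signature(data)
    if family is None:
        return False
    extension = PurePath(filename).suffix.lower()
    return extension in EXPECTED_EXTENSIONS[family]


if __name__ == "__main__":
    print(sniff_signature(b"%PDF-1.7\n"))                     # family for a PDF header
    print(extension_matches("memo.pdf", b"PK\x03\x04rest"))   # ZIP bytes behind a .pdf name
```

A mismatch like the last one (ZIP bytes behind a .pdf name) is exactly the kind of file you’d want your software to flag rather than trust the extension.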

If you’re not using Windows, then your software will likely use MIME (Multipurpose Internet Mail Extensions) to identify file types.

File extensions aren’t unique to Windows. Linux and macOS use them too. But extensions are easy to change or strip, so on these systems your software will often lean on something called MIME type detection instead. MIME (Multipurpose Internet Mail Extensions) was initially created so that email could carry more than plain text, such as attachments, images, audio, and video. Over time, its ‘type/subtype’ labels (e.g., application/pdf or image/png) came to be used well beyond email to describe text, audio, video, images, and application files in general. So, your software will often rely on MIME types (rather than file extensions alone) to identify file types on a non-Windows operating system. And it’ll interpret those types using the registry of media types maintained by the Internet Assigned Numbers Authority (IANA) as a guide. (Note: IANA coordinates internet resources such as IP address allocation, the DNS root zone, and protocol registries like this one.)
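Here’s a small Python sketch of what a MIME type looks like in practice. It uses the standard library’s `mimetypes` module, which (as an important caveat) only guesses from the file name and doesn’t read the file’s contents, so treat it as an illustration of the ‘type/subtype’ label rather than of how any particular eDiscovery tool detects types. The helper names `guess_mime` and `split_mime` and the fallback value are assumptions made for this example.

```python
import mimetypes
from typing import Tuple

# Generic label commonly used for "some bytes, type unknown".
FALLBACK_MIME = "application/octet-stream"


def guess_mime(filename: str) -> str:
    """Guess a MIME type from the file name alone (no content inspection)."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime or FALLBACK_MIME


def split_mime(mime: str) -> Tuple[str, str]:
    """Split 'type/subtype' into its two parts."""
    top_level, _, subtype = mime.partition("/")
    return top_level, subtype


if __name__ == "__main__":
    for name in ("report.pdf", "photo.png", "mystery"):
        mime = guess_mime(name)
        print(name, "->", mime, split_mime(mime))
```

A tool that wants to be robust against renamed files would pair a lookup like this with a check of the file’s actual bytes.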

So, your software has to navigate a complex world of binary headers, file extensions, and MIME. And this journey takes it through a tree-like file hierarchy.

Identifying file types helps your software process files, and it helps in classifying them, too. First, your software will figure out the basic file type – document, image, video, etc. Then, it’ll find out the subtype – for example, a document could be a Word file, a PowerPoint file, an Apple Keynote file, and more. And sometimes, it’ll divide these subtypes into even more specific subtypes. Think of it as a tree with a broad trunk that then leads off into increasingly smaller branches. And this tree-like classification becomes very useful later on when you’re tagging, searching, and culling your eDiscovery data.
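The trunk-and-branches idea can be sketched in a few lines of Python. This example assumes each file already has a MIME type and uses the two halves of that label as the trunk and the branch. That’s a simplification: a real eDiscovery tool will likely have its own categories (like ‘document’) and more levels. The function name `build_type_tree` and the sample file names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def build_type_tree(mime_by_file: Dict[str, str]) -> Dict[str, Dict[str, List[str]]]:
    """Group file names into a two-level tree: top-level type -> subtype -> files."""
    tree: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    # Sort by file name so each branch lists its files in a stable order.
    for filename, mime in sorted(mime_by_file.items()):
        top_level, _, subtype = mime.partition("/")
        tree[top_level][subtype].append(filename)
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {top: dict(branches) for top, branches in tree.items()}


if __name__ == "__main__":
    sample = {
        "contract.pdf": "application/pdf",
        "site-photo.png": "image/png",
        "scan.jpg": "image/jpeg",
        "diagram.png": "image/png",
    }
    for top, branches in build_type_tree(sample).items():
        print(top)
        for subtype, files in branches.items():
            print("  ", subtype, files)
```

With a tree like this in place, culling ‘all images’ or searching ‘only PDFs’ becomes a matter of picking a branch.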

The problem is that not all eDiscovery applications identify file types reliably, and a wrong guess can ruin your case data.

You’ll ideally want your software to flag a file it can’t identify and then move on to the next one. But some eDiscovery applications might try to extract whatever text and metadata they can salvage. And thinking that they’ve done their job, they might not inform you of the initial problem. This can have serious consequences downstream when you realize (too late) that a bunch of poorly processed files slipped through unnoticed.
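The ‘flag it and move on’ behavior described above might look something like this minimal Python sketch. Everything here is hypothetical (the `triage` function, the `ProcessingReport` class, and the two-entry signature table). The point is only that unidentified files end up on a visible list instead of being quietly processed.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical, deliberately tiny signature table for illustration.
KNOWN_SIGNATURES = {
    b"PK\x03\x04": "zip-based",
    b"%PDF": "pdf",
}


@dataclass
class ProcessingReport:
    identified: Dict[str, str] = field(default_factory=dict)  # file name -> detected type
    flagged: List[str] = field(default_factory=list)          # files needing human review


def triage(files: Dict[str, bytes]) -> ProcessingReport:
    """Identify what we can; flag everything else instead of guessing."""
    report = ProcessingReport()
    for name, data in files.items():
        for magic, label in KNOWN_SIGNATURES.items():
            if data.startswith(magic):
                report.identified[name] = label
                break
        else:
            # No signature matched: record the file and move on to the next one.
            report.flagged.append(name)
    return report


if __name__ == "__main__":
    report = triage({
        "a.pdf": b"%PDF-1.4",
        "b.bin": b"\x00\x01\x02",
        "c.docx": b"PK\x03\x04xx",
    })
    print("identified:", report.identified)
    print("flagged for review:", report.flagged)
```

The design choice that matters is the explicit `flagged` list: it gives you something to review before the files move downstream.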

Here’s where finding trustworthy technology can help. The right software can streamline eDiscovery as a whole – beyond just the file identification challenge.

