Here's How eDiscovery Software Identifies File Types

29 July 2021 by Ross eDiscovery

Takeaway: Your eDiscovery software has to identify the various ‘file types’ in your case before it can process and prepare them for review. And it does this by analyzing a file’s metadata, file extension, and/or its MIME type. But there’s a lot that can go wrong during all this, and so it’s worth finding eDiscovery software that knows how to process your files properly.

The best eDiscovery applications give you a range of valuable tools to use on your case data.

Electronic discovery (eDiscovery) is so much more efficient than older ‘paper’ discovery because you now have the power of a computer to help you review files. For example, your eDiscovery software’s search engine can search your files for keywords and keyword combinations. So you can give it a complex command like, “Find the emails Dennis Nedry sent Sally Grant that mention the Pfizer meeting.” And once you’ve found these emails, you’ll be able to ‘tag’ them (tags are like virtual sticky notes) and pull them up later with a single click.

But to use these tools, your software first has to identify the various ‘file types’ in your case.

Each of your files is structured and coded differently, and so they get processed differently, too. For example, most files carry ‘metadata’ (i.e., bonus information that adds context to a file) outlining details like when the file was created, who created it, when it was last modified, etc. But where and how the metadata is stored will vary depending on whether you’re looking at a Word file, an Excel spreadsheet, an email, etc. And there’s more to deal with than metadata. For example, archives like RARs and ZIPs have to be decompressed, email messages have to be isolated from their message threads, their attachments have to be identified and processed, and more. If your software gets a file type wrong, it will process the file incorrectly, and the extracted text and metadata may come out garbled or incomplete.
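To make the archive example concrete, here is a minimal Python sketch of the ‘decompress first’ step for a ZIP file, using only the standard library. The function name `expand_zip` is a hypothetical helper for illustration, and a real processing pipeline would also need to handle nested archives, passwords, and other archive formats such as RAR.

```python
import io
import zipfile


def expand_zip(data: bytes) -> dict:
    """Return {member name: member bytes} for each file inside a ZIP archive.

    Hypothetical helper for illustration. Raises zipfile.BadZipFile if the
    bytes are not a valid ZIP archive.
    """
    members = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # folders have no content of their own to process
            members[info.filename] = archive.read(info)
    return members
```

Each extracted member would then go back through file type identification on its own, since an archive can hold any mix of documents, emails, and images.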

So how does your eDiscovery software figure out file types? Well, with a Windows operating system, it’ll look at file extensions and metadata.

Each file in your case is, in its most basic form, a sequence of bytes. I.e., it’s binary data. And many file formats begin with a block of metadata called a ‘header.’ Embedded in the header is a ‘signature’ or ‘magic number’ (e.g., 50 4B 03 04 for ZIP-based files, which include Word’s DOCX format, 25 50 44 46 for Adobe PDFs, etc.) which identifies what type of file these bytes have been built into. Note that a signature can be shared by a whole family of formats: DOCX, XLSX, and plain ZIP archives all start with the same four bytes, so the software has to look further inside the file to tell them apart. In addition to these magic numbers, each of your files will usually also have a specific file extension – DOCX for Word documents, XLSX for Excel sheets, PST for Outlook data files, and so on. Your eDiscovery software figures out file types on a Windows system by looking at both a file’s magic number and its file extension.
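The signature-plus-extension check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the signature table is a tiny sample, the ‘family’ labels and the helper names `sniff_signature` and `check_file` are hypothetical, and it is not a description of how any particular eDiscovery product works. Keep in mind that 50 4B 03 04 is the ZIP signature, which DOCX and XLSX files share because they are ZIP containers.

```python
from pathlib import Path
from typing import Optional, Tuple

# A few well-known signatures (magic numbers). Illustrative, not exhaustive.
SIGNATURES = {
    b"PK\x03\x04": "zip-container",  # 50 4B 03 04: ZIP, plus DOCX/XLSX/PPTX
    b"%PDF": "pdf",                  # 25 50 44 46
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "ole2",  # legacy DOC/XLS/MSG
}

# Extensions we would expect for each family (hypothetical mapping).
EXPECTED_EXTENSIONS = {
    "zip-container": {".zip", ".docx", ".xlsx", ".pptx"},
    "pdf": {".pdf"},
    "ole2": {".doc", ".xls", ".ppt", ".msg"},
}


def sniff_signature(header: bytes) -> Optional[str]:
    """Return the signature family for the file's first bytes, or None."""
    for magic, family in SIGNATURES.items():
        if header.startswith(magic):
            return family
    return None


def check_file(name: str, header: bytes) -> Tuple[Optional[str], bool]:
    """Return (family, extension_matches) for a file name and its header."""
    family = sniff_signature(header)
    extension = Path(name).suffix.lower()
    matches = family is not None and extension in EXPECTED_EXTENSIONS[family]
    return family, matches
```

A file named `photo.jpg` whose bytes start with `%PDF` would come back as `("pdf", False)`, which is exactly the kind of mismatch worth surfacing to a reviewer.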

If you’re not using Windows, then your software will likely use MIME (Multipurpose Internet Mail Extensions) to identify file types.

File extensions aren’t unique to Windows, but Windows relies on them heavily to decide how a file should be handled. So, if you’re using some other operating system like Linux or macOS, then your software will likely use something called MIME type detection. MIME was initially created to help format and send email messages. But over time its type labels (e.g., application/pdf or image/png) came to describe text, audio, video, images, and application files, too. So, your software will use MIME types (rather than relying on file extensions alone) to identify file types on a non-Windows operating system. And it’ll interpret them using the media type registry maintained by the Internet Assigned Numbers Authority (IANA). (Note: IANA coordinates internet-wide identifiers such as IP address allocations, the DNS root zone, and protocol parameters, which include the registered media types.)
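As a rough sketch of what MIME type detection can look like, the Python below checks a file’s leading bytes against a few signatures and returns an IANA-registered media type. The signature list and the function name `detect_mime` are assumptions made for illustration. The fallback uses the standard library’s `mimetypes.guess_type`, which guesses from the file name only, so it is a last resort rather than real content inspection.

```python
import mimetypes

# Leading bytes mapped to IANA-registered media types (small sample).
CONTENT_SIGNATURES = [
    (b"%PDF", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),  # also matches DOCX/XLSX containers
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def detect_mime(name: str, header: bytes) -> str:
    """Return a MIME type from content first, then from the file name."""
    for magic, mime in CONTENT_SIGNATURES:
        if header.startswith(magic):
            return mime
    # No signature matched: fall back to a name-based guess.
    guessed, _encoding = mimetypes.guess_type(name)
    # application/octet-stream is the generic "unknown binary data" type.
    return guessed or "application/octet-stream"
```

Checking content before the name means a PDF that has been renamed to `scan.dat` is still reported as `application/pdf`.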

So, your software has to navigate a complex world of binary headers, file extensions, and MIME. And this journey takes it through a tree-like file hierarchy.

Identifying file types will help your software process files, but it helps in classifying them, too. First, your software will figure out the basic file type – document, image, video, etc. Then, it’ll find out the subtype – for example, a document could be a Word file, a PowerPoint file, a Mac Keynote file, and more. And sometimes, it’ll divide these subtypes into even more specific subtypes. Think of it as a tree with a broad trunk that then leads off into increasingly smaller branches. And this tree-like classification becomes very useful later on when you’re tagging, searching, and culling your eDiscovery data.
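Here is one way to picture that tree in code. In this minimal Python sketch, each identified MIME type maps to a path running from the broad category down to the specific subtype, and the tree keeps a count at every level. The `CLASSIFICATION` table, its category names, and the `build_tree` helper are hypothetical and exist only to illustrate the idea.

```python
# Hypothetical paths from broad category (trunk) to subtype (branch).
CLASSIFICATION = {
    "application/pdf": ("document", "pdf"),
    "application/msword": ("document", "word", "doc"),
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
        ("document", "word", "docx"),
    "image/png": ("image", "png"),
    "video/mp4": ("video", "mp4"),
}


def build_tree(mime_types):
    """Build a nested {name: {"count": n, "children": {...}}} tree."""
    tree = {}
    for mime in mime_types:
        # Anything not in the table lands under a single catch-all branch.
        path = CLASSIFICATION.get(mime, ("unidentified",))
        level = tree
        for part in path:
            entry = level.setdefault(part, {"count": 0, "children": {}})
            entry["count"] += 1
            level = entry["children"]
    return tree
```

With counts stored at every level, a reviewer could cull at the trunk (“all video”) or at a branch (“only DOCX files”) using the same structure.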

The problem is that not all eDiscovery applications use the right algorithms, and this can ruin your case data.

You’ll ideally want your software to flag a file it can’t identify and then move on to the next one. But some eDiscovery applications might try to extract whatever text and metadata they can salvage. And thinking that they’ve done their job, they might not inform you of the initial problem. Obviously, this will have serious consequences downstream when you realize (too late) that a number of poorly processed files slipped through without you knowing.
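The ‘flag it and move on’ behavior can be sketched as a simple processing loop in Python. This is an assumption-laden illustration, not any vendor’s actual pipeline: the `identify` function recognizes only two signatures, and the names `ProcessingReport` and `process_batch` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ProcessingReport:
    processed: List[Tuple[str, str]] = field(default_factory=list)  # (name, type)
    flagged: List[Tuple[str, str]] = field(default_factory=list)    # (name, reason)


def identify(header: bytes) -> Optional[str]:
    """Toy identifier: recognizes only PDF and ZIP-based files."""
    if header.startswith(b"%PDF"):
        return "pdf"
    if header.startswith(b"PK\x03\x04"):
        return "zip-container"
    return None


def process_batch(files: Dict[str, bytes]) -> ProcessingReport:
    """Process identifiable files and record the rest instead of guessing."""
    report = ProcessingReport()
    for name, data in files.items():
        file_type = identify(data[:8])
        if file_type is None:
            # Do not salvage partial text silently: record the problem
            # so a person can decide what to do with the file.
            report.flagged.append((name, "unrecognized file signature"))
            continue
        report.processed.append((name, file_type))
    return report
```

The important design choice is that every input file ends up in exactly one of the two lists, so nothing can drop out of the review without leaving a trace.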

Here’s where finding trustworthy technology can help. The right software can streamline eDiscovery as a whole – beyond just the file identification challenge.

At GoldFynch, we’ve tailored our eDiscovery service for small and midsize law firms. So, we’ve prioritized essential tasks like identifying file types, processing them correctly, and keeping you updated. But there’s more about GoldFynch that might interest you.

  • It costs just $10 a month for a 1 GB case. That’s significantly less than most comparable software. With GoldFynch, you know what you’re paying for exactly – its pricing is simple and readily available on the website.
  • It’s easy to budget for. GoldFynch charges only for storage (processing is free). So, choose from a range of plans (1 GB to 150+ GB) and know up-front how much you’ll be paying. You can upload and cull as much data as you want, as long as you stay below your storage limit. And even if you do cross the limit, you can upgrade your plan with just a few clicks. Also, billing is prorated – so you’ll pay only for the time you spend on any given plan. With legacy software, pricing is much less predictable.
  • It takes just minutes to get going. GoldFynch runs in the Cloud, so you use it through your web browser (Google Chrome recommended). No installation. No sales calls or emails. Plus, you get a free trial case (0.5 GB of data and a processing cap of 1 GB), without adding a credit card.
  • It’s simple to use. Many eDiscovery applications take hours to master. GoldFynch takes minutes. It handles a lot of complex processing in the background, but what you see is minimal and intuitive. Just drag-and-drop your files into GoldFynch, and you’re good to go. Plus, you get prompt and reliable tech support.
  • Access it from anywhere, and 24/7. All your files are backed up and secure in the Cloud.

Want to find out more about GoldFynch?