Big Data

01 March 2016 by Anith Mathai

You’ve heard the term over and over, but until now you’ve been crossing your fingers that it doesn’t apply to you, right? Well, we hate to be the bearers of bad news, but big data relates to us all. If you’ve ever made a purchase with a credit card, signed up for a newsletter, voted, answered a survey, or received an email, you can assume that somewhere out there you are but a data point in an endless ocean of information.

But how does Big Data relate to the legal industry? How does one manage big data in the context of a legal practice? And why would anyone want to?

To answer these questions, we need to understand a little more about data in general.

Data is just another word for information. So, it would stand to reason that Big Data is just a lot of information.

This is a pretty simple concept, but something you may not be aware of is that data can be categorized in several different ways, and those categories help us understand how best to work with it. The first classification we’ll briefly summarize is the very broad distinction between Quantitative and Qualitative Data:

  • Qualitative Data: There are a lot of emails. This email is a follow up to a meeting. The follow up message is overall positive feedback.

  • Notice there are no quantities used to describe the emails. All the information we have about the emails is based on their qualities. This information is subjective because “a lot” of emails could mean 20 emails or it could mean 20,000. Overall positive feedback could be read as neutral feedback or negative feedback by someone else.

  • Quantitative Data: There are 14 emails. 1 email arrived at 3:25 PM CST. This email rates the meeting a 3 out of 5.

  • Notice quantities or numbers are used to describe the emails. All the information we have about the emails is based on values that have been assigned to them, whether they represent a quantity, a time, or a position on a rating scale. None of this information is subjective. 3:25 PM CST is a universally recognized time that does not mean 3:26 PM CST or 3:24 PM CST.

  • Quantitative data is easier to work with because it’s black and white, right or wrong, 14 emails or 15 emails – there is no in between.

Next, there’s the question of whether what you’re dealing with is Structured or Unstructured.

Structured Data is probably what we see when we imagine big data: rows and rows of neat little columns filled with values or codes in spreadsheet form.

If it can be split up into neat, clean categories, you’re probably dealing with structured data. Using the email example, those categories would be the “metadata”: Sender, Recipient, Time Sent, Time Received, Number of characters, etc. The fields that make up these categories may be quantitative or qualitative. The imaginary spreadsheet representing these emails would have narrow columns with 1 entry in each cell. This imaginary spreadsheet would be very easy to manage, search and organize.
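To make the imaginary spreadsheet concrete, here is a minimal sketch in Python. The field names, addresses, and values are invented for illustration and are not taken from any real system. Because every "cell" holds exactly one value, filtering and sorting come down to simple field comparisons.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class EmailMetadata:
    """One row of the imaginary spreadsheet: one value per column."""
    sender: str
    recipient: str
    time_sent: datetime
    num_characters: int


# Hypothetical sample rows, made up for this example.
emails = [
    EmailMetadata("alice@example.com", "bob@example.com", datetime(2016, 3, 1, 15, 25), 840),
    EmailMetadata("carol@example.com", "bob@example.com", datetime(2016, 3, 1, 9, 10), 212),
    EmailMetadata("alice@example.com", "dave@example.com", datetime(2016, 2, 28, 17, 45), 1530),
]


def from_sender(rows, sender):
    """Return the rows whose sender column matches exactly."""
    return [row for row in rows if row.sender == sender]


def sorted_by_time(rows):
    """Return the rows ordered from earliest to latest send time."""
    return sorted(rows, key=lambda row: row.time_sent)


alice_emails = from_sender(emails, "alice@example.com")
earliest = sorted_by_time(emails)[0]
```

Each question ("who sent it?", "which came first?") maps onto a single column, which is what makes structured data easy to manage, search and organize.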

Unstructured Data relates more to the content of the emails in our example: Subject Lines, Topic, etc. When an online poll has questions that ask you to rate something on a scale of 1-5, that rating is structured data. When that same online poll requires you to provide a paragraph explaining why you gave a certain rating, that’s unstructured. A more thorough definition of Unstructured Data is outside the scope of this post. Suffice it to say, Unstructured Data is more difficult to manage because it’s generally written for the benefit of humans, not computers. It’s also the majority of the data used in the legal industry.

Emails, phone call transcripts, social media posts, voicemail messages, texts, contracts and diagrams all fall into the category of “Unstructured”. The imaginary “spreadsheet” used to collect Unstructured Data would be very difficult to manage, search and organize. The columns would be wide, with more than one value in each cell, and some cells may have whole paragraphs of text.
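As a rough illustration of why those wide cells are harder to work with, the sketch below searches free text in Python. The messages and the helper name `find_mentions` are hypothetical. With no column to compare against, the code has to scan every message in full and make allowances for differences such as capitalization.

```python
import re

# Hypothetical free-text "cells": each one holds a whole message.
messages = [
    "Thanks for meeting today. Overall the feedback on the draft contract was positive.",
    "Voicemail transcript: please call me back about the deposition schedule.",
    "Attaching the signed CONTRACT and the site diagram for your review.",
]


def find_mentions(texts, term):
    """Return the positions of the texts that mention `term` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [i for i, text in enumerate(texts) if pattern.search(text)]


contract_hits = find_mentions(messages, "contract")
```

Even this simple search only finds the literal word. It cannot tell whether the feedback was positive or what the message is about, and that kind of judgment is what makes unstructured data difficult to manage.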

So how does one go about managing unstructured data?

There are two options. The first and most common is to not manage it at all, and instead to manually read through all the unstructured data each time you need information from it. This approach is common because it avoids up-front costs. The downside is that it stops being cost-effective once your work depends on a lot of unstructured data. Most of the costs of going without a management system relate to time: the time required to search through this kind of material, to find what you’re looking for on short notice, and to produce documentation to the opposing counsel.

The second option is to make an up-front investment in a tool designed to assist in managing your Unstructured Data.

This tool could be an employee devoted to reading through, summarizing and annotating your data. It could be a spreadsheet used to capture and search data, or it could be software made specifically for managing unstructured data. These software products are typically marketed as data management software, document review software, or e-discovery software.
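To give a feel for the up-front investment idea, here is a toy sketch of a keyword index in Python. It is not how any particular data management, document review, or e-discovery product works. The document IDs, sample text, and function names are all assumptions made for the example. The documents are read once to build the index, and later lookups no longer require rereading everything.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Read every document once and record which documents contain each word."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical documents, made up for this example.
documents = {
    "email-001": "Follow up to our meeting about the lease agreement.",
    "text-002": "Running late, will call about the lease tomorrow.",
    "memo-003": "Summary of the settlement agreement terms.",
}

index = build_index(documents)
lease_agreement_docs = search(index, "lease agreement")
```

The cost of reading every document is paid once, when the index is built. After that, each search is a quick lookup, which is the trade-off the second option describes.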

So this leaves us with the final question: Why would someone bother to manage their data?

The driving force behind the adoption of data management systems is the fact that our lives are taking place, more often than not, in the digital world, where everything we say and do is recorded. This means that even relatively simple litigation requires more discovery or document review than ever before. To keep up with all that information, it only makes sense to turn to technology for help. Without it, even the simplest tasks can take far longer than they did before the rise of social media, email, texting, etc.