Dark Data – Big Data’s evil twin?

As we exited 2013 and entered 2014, the usual high volume of new year IT predictions was made. Twitter and the blogosphere were filled with interesting IT predictions from various experts across the globe. As usual, some of last year’s trends and buzzwords carry over and some new ones emerge. One buzzword which has recently caught my attention is Dark Data. The term Dark Data (to my knowledge) began to surface in mid-2012, but has been used more over the last year. Andrew White from Gartner published a blog post in July 2012, which is probably the earliest published article referencing Dark Data. The link to that post is below in the references section. In a nutshell, Dark Data is unstructured, untagged and untapped data which resides in data repositories within an organization’s data center. In some debatable cases, the data has yet to be collected!

This is how Gartner defines Dark Data - http://www.gartner.com/it-glossary/dark-data

Gartner defines dark data as the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing). Similar to dark matter in physics, dark data often comprises most organizations’ universe of information assets. Thus, organizations often retain dark data for compliance purposes only. Storing and securing data typically incurs more expense (and sometimes greater risk) than value.

Dark Data Characterized in 3 Ways

  1. Data which has been collected and exists within the organization in a format which can be processed, but is not leveraged by the business
  2. Data which has been collected and resides in your data center, but is too costly to process and analyze
  3. Data which does not exist yet. It still needs to be collected.

Is Uncollected Data Dark Data?

I have read several blogs which also mention that data not captured by the enterprise can be considered Dark Data. This raises a big debate. How do you know what you don’t know? If you are not collecting it, do you care about it? Does it matter to your business? It also raises a huge question about how much data you should be managing, what types of data, and what retention requirements come with it. We hear on a daily basis how the amount of data consumed continues to grow exponentially. I am certain storage vendors will have a field day in 2014 using Dark Data to sell more storage.

Examples of Dark Data in the Data Center

  • Log files residing on servers
  • Emails
  • Social media
  • Video files
  • Audio files

Typically, Dark Data is complex to analyze and is stored in locations where analysis is difficult. Unstructured data is typically poorly managed. A data center can hold petabytes of unstructured data that an organization has accumulated over time. Dark Data can also include data objects that have not been captured by the enterprise, or data outside the organization. Turning this data into meaningful “business intelligence” (BI) can become a difficult task.
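To make the log file example a bit more concrete, here is a minimal Python sketch of pulling a simple summary out of raw log lines. The log format, the `LOG_LINE` pattern, the `summarize_levels` function and the sample lines are all assumptions made up for illustration; real server logs vary widely and would need their own parsing rules.

```python
import re
from collections import Counter

# Hypothetical log format (an assumption): "<date> <time> <LEVEL> <message>"
LOG_LINE = re.compile(r"^(\S+) (\S+) (DEBUG|INFO|WARN|ERROR) (.*)$")


def summarize_levels(lines):
    """Count log lines per severity level, skipping lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            counts[match.group(3)] += 1
    return counts


# Made-up sample lines standing in for a log file sitting unused on a server.
sample = [
    "2014-01-06 09:15:02 INFO user login ok",
    "2014-01-06 09:15:40 ERROR payment gateway timeout",
    "2014-01-06 09:16:11 ERROR payment gateway timeout",
    "not a structured line",
]

print(summarize_levels(sample))
```

Even a count like this turns a pile of unread log lines into something a business could look at, though real BI work on Dark Data involves far more than this.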

Is Big Data like Dark Data?

Because of the “intelligence” or rich information derived from Big Data technologies and processes, we can be tempted to say that Dark Data may not necessarily be the same as Big Data. Over the years, Big Data has become synonymous with structured data and, more importantly, a way to benefit your business as a whole. While reading about Dark Data, the question I ask myself is whether the value is in the Dark Data we don’t know about, or rather in the mechanisms, processes and procedures to effectively ignore the Dark Data in and “outside” your data center. If a company learns to ignore the noise and stick to a Big Data strategy which is effective and gets the job done, this may be a more effective way to tackle your Dark Data issues, or essentially rid yourself of Dark Data altogether, while at the same time saving your business lots of money. Sorry Storage Vendors, I know it’s not what you want to hear, but this allows me to welcome good conversation on Dark Data :-). Vendors like HP have launched their HP Vertica Crane & Flex platform to deal with Dark Data. SAP created the Data Geek challenge featuring their Lumira solution to battle Dark Data as well.

Where does your company stand in the Dark Data battle? Does ignorance of the whole thing make you better off?

Here is a fun Dark Data Video from SAP Lumira I found on YouTube when researching this blog.

Some sites used in researching this blog

Andrew White’s Dark Data Blog from July 11, 2012