Wednesday, December 30, 2015

Volume is Only One of the Four "V"s of Big Data, Especially for the Right Data

One widely accepted definition of Big Data is that it entails four “V”s: volume, velocity, variety, and veracity. In other words, Big Data is defined by there being a great deal of it (volume), arriving rapidly and continuously (velocity), taking many different forms and types (variety), and originating from trustworthy sources (veracity). Some people, however, seem to focus on one of the Vs above all others, namely volume. I suppose that is not surprising, given that the adjective in the term Big Data describes size.

However, as others and I have written over the years, many aspects of data are just as important as its quantity. More troubling, I have heard many people imply in their statements about data science that you cannot do real data science without massive amounts of data, which in turn require massive amounts of storage capacity and computing power (and cost a great deal of money).

Make no mistake, we do need to consider the volume aspects of data when discussing data science. But we must not lose sight of what we hope to accomplish with the data, which one writer refers to as the fifth V of Big Data, namely value [1]. Sometimes value emanates from harnessing the size of a data set, but at other times veracity or variety takes on more importance.

I have written about the importance of value as well, noting that correlations found in large amounts of data can be meaningless, and that data scientists must also understand basic research principles, such as causality. So yes, let us prepare for a future where we leverage Big Data to improve health, biomedicine, and other important societal needs, but we also need to remember that we do not always need massive amounts of data, especially data whose veracity we may not know, to derive value. Perhaps akin to the “rights” of clinical decision support [2], the best data science is more about having access to the right data, in the right amount, at the right time.

References

1. Marr, B (2015). Why only one of the 5 Vs of big data really matters. IBM Big Data & Analytics Hub. http://www.ibmbigdatahub.com/blog/why-only-one-5-vs-big-data-really-matters.
2. Osheroff, JA, Teich, JM, et al. (2012). Improving Outcomes with Clinical Decision Support: An Implementer's Guide, Second Edition. Chicago, IL: Healthcare Information and Management Systems Society.
