Big Data & all the Hadoopla
Hadoop, a name now used almost synonymously with Big Data, comes from the word a small child used for a toy elephant. For many data architects and the new world of data scientists, technical toy versions of Hadoop and its variants abound, but what exactly are we analyzing? Whether the analysis is tied to a business outcome is the million-dollar-plus question, given that all the IT majors and cloud providers have Big Data environments ready for enterprises to roll their data into for analytics.
Variety, Volume and Velocity of data are touted as the driving forces behind Big Data innovation in global enterprises. Unstructured data, comprising content from varied channels such as social media along with smart-device and machine data, brings a large variety of data sets. It also brings heavy volume and velocity, as peer social groups across the globe share everything from product purchasing preferences to design requirements for their needs. Of course, one should not forget the fourth V, Value, as our fellow Big Data bloggers from Aditi mention in their Big Data landscape blog: http://blog.aditi.com/data/big-data-landscape/
The Hadoop mothership, Apache, provides an open source foundational technology: a file system architecture (HDFS) and a set of tools to handle large data sets with high Variety, Volume and Velocity. The names of some tools, such as Pig (a language for analyzing Big Data) and Hive (a data warehouse!), may amaze those who grew up in the era of C++, Java or scripting languages like Perl, PHP and Python. As with Java and other open ecosystems, this foundational technology from Apache is then commercialized through several cloud variants that make it OEM-able, so that large or niche IT product and solution vendors can build their layer cake of solutions, with many decorations, and make Big Data solution building easier for enterprises. This may also mean complex skills and increasing complexity in data integration, quality and management, and all the baggage that comes along with LARGE sets of data.
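To make the processing model behind these tools a little more concrete, here is a minimal sketch of the map, shuffle and reduce phases that Hadoop-style jobs are built around, written as plain single-process Python. It does not use Hadoop, HDFS, Pig or Hive; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are assumptions made up purely for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three phases in sequence over a list of documents (hypothetical helper)."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    # Illustrative input only; real jobs would read from a distributed file system.
    sample = ["big data big value", "data in motion"]
    print(word_count(sample))
```

On a real cluster the map and reduce steps run in parallel across many machines and the data lives in a distributed file system; the sketch only shows the shape of the computation.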
Hortonworks, started by some of the Apache Hadoop project colleagues and other enthusiasts, built a Big Data platform on which other vendors can build their cloud and Big Data ecosystems. The Hortonworks sandbox gives architects and developers a quick view of, and entry into, the platform. The Teradata and Hortonworks partnership offers the best of Hadoop and Teradata to large data warehouse clients. Apart from Teradata, the cloud hosting provider Rackspace works with Hortonworks to provide a Big Data platform in the world of open ecosystems. Microsoft embraced Hadoop in a Windows flavor through the Hortonworks-Microsoft partnership. Hortonworks is also available on open operating systems like Linux. The Microsoft Azure HDInsight cloud service uses the Hortonworks platform, supplemented with the Microsoft suite of tools, including System Center.
Another cloud Big Data variant is Cloudera, leveraged by IT cloud and data vendors like HP, Informatica, Oracle and Teradata, all well known to data warehouse analysts, technologists and implementers. The partner ecosystem for Cloudera spans from building reference architectures to building connectors that include and extend already robust data warehouse products, tools and appliances.
MapR is another Big Data solution platform leveraged by many cloud and IT vendors. MapR replaces HDFS, the foundational file system, with its own proprietary implementation altered for high performance. Amazon AWS Elastic MapReduce (EMR) offers MapR for Apache Hadoop based Big Data environments. Google also provides MapR Big Data environments, apart from its assortment of Bigtable and other Big Data solutions. Apart from Amazon/AWS and Google, Cisco's Big Data Hadoop platform leverages MapR. A MapR study offering guidance on Hadoop distributions illustrates the importance of knowing the variants and which features to compare.
Apart from the above, another breed of open source large-data analytics platforms, which several applications leverage for NoSQL or columnar storage, includes Apache HBase (a Hadoop-based database), Apache Cassandra, Couchbase, MongoDB and others.
Open PaaS and cloud ecosystems have also started embracing Big Data, with OpenStack joining hands with Hortonworks.
All the best practices of data architecture, methodology, analysis, governance, monitoring and management still apply to this new breed of analytics platform in the ever-agile, business-ready and business-friendly world of business technology. If an organization walked away from the relational database era having skipped these essential prerequisites before charting its data warehouse journey, the same failures will appear much quicker in its quest to mine GOLD from Big Data projects started in the silos of its lines of business (LoBs). There is no magic bullet: any emerging technology that wields a lot of methods, tools and processes also comes with its own governance, risk and compliance (GRC) umbrella, which must plug into existing GRC frameworks. As data scientists discover new frontiers of data analytics, enterprise data management, governance and management leadership need to wake up to ground realities before embarking on their next data warehouse frontier journey!
The Open Group keynote on Big Data with Ford presents some of the features and use cases that are possible with Big Data analytics. The Open Group also has a Big Data project in which members are currently participating. If you need more information on the Big Data project at The Open Group, please reach out to Kapil Bakshi or Sundar Ramanathan.