BIG DATA: A BIG PROBLEM…

Siddhi Dhamale
3 min read · Sep 17, 2020

Before we actually begin with what big data is, I have one question for you: have you ever created multiple Gmail, Facebook, Instagram, or for that matter any social media accounts? The obvious answer should be yes. This is the point where a single user starts contributing to a large amount of data, and the thing is, some of it might be completely useless, yet it just keeps on occupying space. Alas, as users it is not our concern to think about how to handle this data or what measures one should take to optimize its storage.

Here comes the role of data scientists: to look for optimized solutions for storing the enormous data contributed by millions of users like you, in a secure and efficient way. Early on, it was hard to manage the huge amounts of incoming data, which slowed down web search results and made pages load at a lower pace.

To tackle this problem, Doug Cutting and Mike Cafarella started Nutch, an open-source web search engine. Later the Nutch project was divided: the web crawler portion remained as Nutch, and the distributed computing and processing portion became Hadoop (named after Cutting's son's toy elephant). In 2008, Yahoo released Hadoop as an open-source project. Today, Hadoop's framework and ecosystem of technologies are managed and maintained by the non-profit Apache Software Foundation (ASF), a global community of software developers and contributors.

Since then, Hadoop has played a prominent role in managing big data problems such as volume and velocity. It does so by means of a distributed storage methodology, forming a cluster with a master-slave topology in which the slave nodes contribute their storage to the master node. This solved the cost issue for many organizations, freeing them from buying huge amounts of storage on a single machine.
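The distributed storage idea described above can be sketched in a few lines. The following is a toy simulation (not the real Hadoop API; all names, block sizes, and node labels here are illustrative) of how an HDFS-style master splits a file into fixed-size blocks and replicates each block across slave nodes, so no single machine has to hold the whole file:

```python
# Toy simulation of HDFS-style block storage: a master (NameNode) splits
# a file into fixed-size blocks and assigns each block to several slave
# nodes (DataNodes), so storage is spread across the cluster.
# Illustrative values only: real HDFS defaults are 128 MB blocks and
# a replication factor of 3.

BLOCK_SIZE = 4          # bytes per block (toy value)
REPLICATION = 2         # copies kept of each block (toy value)
DATANODES = ["dn1", "dn2", "dn3"]

def store_file(data: bytes) -> dict:
    """Return a block map: block index -> (block bytes, list of nodes)."""
    block_map = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for idx, block in enumerate(blocks):
        # round-robin placement: each replica lands on a different node
        nodes = [DATANODES[(idx + r) % len(DATANODES)] for r in range(REPLICATION)]
        block_map[idx] = (block, nodes)
    return block_map

def read_file(block_map: dict) -> bytes:
    """Reassemble the original file from its blocks, in order."""
    return b"".join(block_map[i][0] for i in sorted(block_map))

block_map = store_file(b"hello big data")
print(read_file(block_map))  # b'hello big data'
```

The point of the sketch: losing one node costs only some replicas, not the file, and adding nodes adds both storage and read bandwidth, which is exactly the cost advantage the paragraph above describes.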

With time, better products arrived that provide improved ways to store and manage data, such as Google BigQuery, Cloudera, Databricks, and Microsoft SQL Server. These let new developers who aren't familiar with Java and Hadoop's MapReduce technology achieve the same results with simple languages like SQL. But Hadoop has its own merits and has greatly benefited many companies, such as:
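To see why the SQL-style tools above feel simpler, it helps to look at what MapReduce actually does. Below is a minimal word-count sketch written in Python for readability (a real Hadoop job would implement Mapper and Reducer classes in Java): a map phase emits (word, 1) pairs, a shuffle groups them by key, and a reduce phase sums the counts. The SQL equivalent would be a one-line GROUP BY:

```python
from collections import defaultdict

# Minimal word count in the MapReduce style. Illustrative sketch only;
# it mimics the map -> shuffle -> reduce flow of a Hadoop job.

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data keeps growing"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"], counts["data"])  # 2 2
```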

1. Facebook → Social site. 8 cores and 12 TB of storage. Used as a source for reporting and machine learning.

2. Twitter → Social site. Has used Hadoop since 2010 to store and process tweets and log files, using the LZO compression technique, as it is fast and also helps free the CPU for other tasks.

3. LinkedIn → Social site. 4,100 nodes with 2×4 and 2×6 cores and 6×2 TB SATA storage. LinkedIn's data flows through Hadoop clusters: user activity, server metrics, images, and transaction logs stored in HDFS are used by data analysts for business analytics like discovering people you may know.

4. Yahoo! → Online portal. 4,500 nodes with 1 TB storage and 16 GB RAM. Used for scaling tests.

5. Alibaba → E-commerce. Processes business data on a 15-node cluster; analyzes data for its vertical search engine.

6. Cloudspace → IT developer. Specializes in designing and building web applications.

7. FOX Audience Network → News TV channel. 30 to 70 machine clusters. Used for log analysis and machine learning.

8. Adobe → Publishing and editing software. 30 nodes running HDFS and 5 to 14 nodes running HBase. Used for everything from social services to structured data storage.

9. Infosys → IT consulting. Cluster sized per client requirements. Client projects in finance, telecom, and retail.

10. Cognizant → IT consulting. Cluster sized per client requirements. Client projects in finance, telecom, and retail.

11. Accenture → IT consulting. Cluster sized per client requirements; a 20-node cluster with 53.3 TB of storage. Client projects in finance, telecom, and retail; processes customer and operations logs.

12. Ning → Social network platform. 8 cores and 16 GB RAM. Used for reporting and analytics.

13. Rackspace → Web hosting service. 30-node cluster with 4–8 GB RAM and 1.5 TB of storage per node. Indexes logs from its email hosting system for search.

14. Rakuten → E-commerce. 69-node cluster. Used to analyze logs and mine data.

15. Powerset / Microsoft → Natural language search. Used for data storage.

16. Sling Media → Television service provider. 10-node cluster. Runs algorithms on a range of raw data.

17. Spotify → Digital music platform. 690-node cluster with 38 TB RAM and 28 PB of storage. Used for content generation and data aggregation.

I hope this gave you an idea of big data and the products used to handle it. Thank you!
