
We have data (not a lot at this point) that we want to transform/aggregate/pivot up the wazoo.

I had a look around the web, and all the answers to the questions I'm asking point to Hadoop as scalable, cheap to run (no SQL Server machine and license), fast (if you have a lot of data), and programmable (not little boxes that you drag around).

There is just one problem that I keep coming up against, namely 'Use Hadoop if you have more than 10GB of data'.

We don't even have 1GB of data (at this stage); is Hadoop still viable?

My other option is SSIS. We do use SSIS for some of our current ETL, but we don't have the resources for it, and putting a SQL Server in the cloud is just going to cost too much; don't even get me started on the cost and configuration of scaling it.

thanks

Pintac
  • 1GB is not big data. It's actually below average. 10GB isn't either. Data warehouse benchmarks start at 100GB for small DWs these days. A single 4-year-old laptop could easily handle the 10GB load. In fact, you could put all the data in memory, e.g. with SQL Server 2014 or 2016. As for cheap and fast, it's fast *only* if you have a cluster, in which case it's *not* cheap. – Panagiotis Kanavos May 31 '16 at 09:45
  • "You don't have big data unless you are adding 5TB/year." That's a quote from a recent conference. A relevant blog post [is this](https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html). Even Excel can summarize millions of data rows using columnstores; it just won't *display* all of them. – Panagiotis Kanavos May 31 '16 at 09:50
  • @Pintac: refer to this article: https://www-01.ibm.com/software/in/data/bigdata/ and http://stackoverflow.com/questions/32538650/hadoop-comparison-to-rdbms/32546933#32546933 – Ravindra babu May 31 '16 at 15:08

2 Answers


Your current data volume seems too low to justify moving into Hadoop. Enter the Hadoop ecosystem only if you are dealing with a huge volume of data (TBs/year) and you expect the data volume to grow exponentially down the line.

Let me explain why I suggest against Hadoop for such a low volume of data. By default, Hadoop stores your files in 128MB blocks, and when processing it also takes 128MB chunks at a time to process in parallel. If your business requirement involves heavy CPU-intensive processing, you can decrease the input chunk size from 128MB. But by decreasing the amount of data each parallel task processes, you end up increasing the number of I/O seeks (low-level block storage) and the number of tasks to manage. In the end you may spend more resources on managing the tasks than on the actual work. Hence, avoid distributed computing as a solution for your (low) data volume.
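For illustration only, here is a minimal sketch of how the input split size can be lowered on a Hadoop Streaming job; the jar path, input/output paths, and script names are placeholders, not anything from the question:

```sh
# Sketch: shrink the per-mapper input split from the 128MB default to 32MB
# so CPU-heavy work is spread over more map tasks. Generic options (-D, -files)
# must come before the streaming options. Paths and scripts are illustrative.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.input.fileinputformat.split.maxsize=33554432 \
  -files mapper.js,reducer.js \
  -input /data/raw \
  -output /data/aggregated \
  -mapper "node mapper.js" \
  -reducer "node reducer.js"
```

The trade-off described above applies here: more, smaller splits means more map tasks to schedule and more seeks, which is exactly the overhead that dominates on a tiny dataset.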

Makubex

As @Makubex has suggested, don't use Hadoop.

SSIS is a good option, as it handles the data in memory, so it performs data aggregations, data type conversions, merging, etc. at a much faster rate than writing to disk using temporary tables in stored procedures.

Hadoop is meant for large amounts of data; I would suggest it only for data in the terabytes. It would be way slower than SSIS (which runs in memory) for small datasets.

Refer: When to use T-SQL or SSIS for ETL

Ani Menon
  • Hi, I went an alternative way. Since I know we will be getting much more data in the not-too-distant future, I went the Hadoop Node.js streaming way, but instead of using Hadoop I am just calling the Node.js script directly; then, the day the data is huge, I will just slot Hadoop in. The only problem is I have to write the Node.js scripts the Hadoop way (see the sketch below). Thanks – Pintac Jun 14 '16 at 08:01
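For reference, a minimal sketch of what writing such a Node.js script "the Hadoop way" can look like: Hadoop Streaming talks to mappers and reducers over stdin/stdout, one record per line, with key and value separated by a tab, so a mapper written this way runs standalone today and can be handed to hadoop-streaming later. The file name and CSV column layout below are hypothetical.

```js
// mapper.js - Hadoop-Streaming-compatible mapper sketch.
// Reads lines from stdin and emits "key<TAB>value" pairs on stdout.
const readline = require('readline');

const rl = readline.createInterface({ input: process.stdin, terminal: false });

rl.on('line', (line) => {
  // Hypothetical CSV layout: "region,amount" - adjust to the real data.
  const [region, amount] = line.split(',');
  if (region && amount) {
    process.stdout.write(`${region}\t${amount}\n`);
  }
});
```

Standalone it can be driven with something like `cat data.csv | node mapper.js | sort | node reducer.js`; under Hadoop Streaming the framework provides the input, the shuffle/sort, and collects the output instead.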