I'm trying to practice some data mining algorithms using Hadoop. Can I do this with HDFS alone, or do I need to use sub-projects like Hive/HBase/Pig?
10 Answers
I've found a university site with some exercises and solutions for MapReduce that build only on Hadoop:
http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/index.html
Additionally, there are courses from Yahoo and Google:
http://developer.yahoo.com/hadoop/tutorial/
http://code.google.com/edu/parallel/index.html
All these courses work on plain Hadoop, to answer your question.
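To see what "plain Hadoop" means in practice, here is a minimal sketch of the classic word-count MapReduce job those tutorials build on, using only the standard Hadoop MapReduce API (no Hive/HBase/Pig involved; input and output paths come from the command line):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this into a jar and launch it with something like `hadoop jar wordcount.jar WordCount <input> <output>`, then grow it from there (bigger inputs, different map/reduce logic, a small cluster).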

+1 for Yahoo. I'd take the simple Yahoo tutorials and expand on them: make the input files MUCH bigger, change the map/reduce functions, go from a single instance to a small cluster, and continually expand on what you have done previously. – Ralph Willgoss Jul 20 '10 at 07:08
Start with plain MapReduce at the beginner level. You can try Pig/Hive/HBase at the next level.
You will not be able to appreciate Pig/Hive/HBase until you have struggled enough with plain MapReduce.

I'm trying to practice some data mining algorithms using hadoop.
Use Apache Mahout, which runs on top of Hadoop: http://mahout.apache.org/
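As an illustration, a small user-based collaborative filtering recommender with Mahout's Taste API could look roughly like this. It is only a sketch: the file ratings.csv, the neighborhood size of 10, and the user/item IDs are made-up placeholders, and the classic Mahout 0.x Taste packages are assumed.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
    public static void main(String[] args) throws Exception {
        // ratings.csv is a placeholder: lines of "userID,itemID,preference"
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend 3 items for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}
```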
Can I do this with HDFS alone, or do I need to use the sub-projects like hive/hbase/pig?
HDFS is Hadoop's file system; it stands for Hadoop Distributed File System. No matter which tool in the Hadoop stack you use, it has to process data sitting in this distributed environment, so you can't do anything with HDFS alone. You need one of the computation techniques/tools such as MapReduce, Pig, Hive, etc.
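To make that concrete: with HDFS alone you can store and read files, but any mining logic is yours to write. A minimal sketch of reading a file straight out of HDFS with Hadoop's FileSystem Java API (the namenode URI and file path are placeholders):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode address and file path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Whatever algorithm runs here is entirely up to you; HDFS only hands you bytes
                System.out.println(line);
            }
        }
    }
}
```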
Hope this helps!

I would also recommend the UMD site. However, it looks like you are completely new to Hadoop, so I would recommend the book "Hadoop: The Definitive Guide" by Tom White. It's a bit dated (written for the 0.18 version rather than the latest 0.20+). Read it, do the examples, and you should be in a better place to judge how to structure your project.

Hadoop is a tool for distributed/parallel data processing. Mahout is a data mining/machine learning framework that can work in standalone mode as well as in a Hadoop distributed environment. The decision to use it standalone or with Hadoop boils down to the size of the historical data that needs to be mined: if the data size is on the order of terabytes and petabytes, you typically use Mahout with Hadoop.
Mahout supports three categories of machine learning algorithms: recommendation, clustering and classification. The Mahout in Action book by Manning does a very good job of explaining this. Weka is another similar open source project. All of these come under a category called machine learning frameworks.
Refer to the blog which talks about a use case of how Mahout and the Hadoop Distributed File System work together. As a precursor to this, there is also a blog on the component architecture of how each of these tools fits together for a data mining problem in the Hadoop/Mahout ecosystem.
You can use R, Spark and Hadoop together as a complete open source solution (a minimal Spark sketch follows the list below):
- R: a statistical language which provides many libraries out of the box.
- Spark: a framework for data processing that is faster than MapReduce and ships with machine learning algorithms.
- Hadoop: data storage which is scalable and robust, based on commodity hardware.
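As an illustration of the Spark piece, here is a rough Java sketch that runs k-means from Spark's MLlib over points stored in HDFS. The HDFS path, the number of clusters and the iteration count are made-up placeholders, and the older RDD-based MLlib API is assumed.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansOnHdfs {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("KMeansOnHdfs");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Placeholder path: one comma-separated numeric vector per line
        JavaRDD<String> lines = sc.textFile("hdfs:///data/points.csv");
        JavaRDD<Vector> points = lines.map(line -> {
            String[] parts = line.split(",");
            double[] values = new double[parts.length];
            for (int i = 0; i < parts.length; i++) {
                values[i] = Double.parseDouble(parts[i]);
            }
            return Vectors.dense(values);
        });
        points.cache();

        // k = 3 clusters, 20 iterations; both arbitrary for the sketch
        KMeansModel model = KMeans.train(points.rdd(), 3, 20);
        for (Vector center : model.clusterCenters()) {
            System.out.println("Cluster center: " + center);
        }

        sc.stop();
    }
}
```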

You could also use Mahout http://mahout.apache.org/
It is a machine-learning and data-mining library that can be used on top of Hadoop.
In general, Mahout currently supports (taken from the Mahout site):
- Collaborative Filtering
- User and Item based recommenders
- K-Means, Fuzzy K-Means clustering
- Mean Shift clustering
- Dirichlet process clustering
- Latent Dirichlet Allocation
- Singular value decomposition
- Parallel Frequent Pattern mining
- Complementary Naive Bayes classifier
- Random forest decision tree based classifier

You have to use different tools in the Hadoop ecosystem depending on their strengths.
Hive and HBase are good for handling structured data.
Sqoop is used to import structured data from a traditional RDBMS such as Oracle, SQL Server, etc.
Flume is used for ingesting unstructured data (e.g. streaming logs).
You can use a Content Management System to handle unstructured and semi-structured data at the terabyte or petabyte scale. If you are storing unstructured data, I prefer to keep the content itself in the CMS and the metadata in a NoSQL database like HBase (e.g. image ID, MD5SUM of the image), as sketched below.
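A rough sketch of what storing such image metadata could look like with the HBase Java client; the table name "images", the column family "meta", and the row contents are made-up placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ImageMetadataSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("images"))) {

            // Row key: the image id; column family "meta" holds the checksum and a CMS pointer
            Put put = new Put(Bytes.toBytes("img-000123"));
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("md5sum"),
                          Bytes.toBytes("9e107d9d372bb6826bd81d3542a419d6"));
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("cms_path"),
                          Bytes.toBytes("/cms/images/000123.jpg"));
            table.put(put);
        }
    }
}
```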
To process big data streams, you can use Pig.
Spark is a fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Have a look at structured data and unstructured data handling in Hadoop.
Have a look at the complete Hadoop ecosystem and this SE question.

It depends on your application. You need to understand the purpose of Hive, Pig and HBase, and then you can figure out where exactly they fit in your application. Each was created for a specific reason that you need to understand; a simple Google search will get you the results.

HDFS is a distributed storage system to dump your data for further analytics.
Hive/Pig/MR/Spark/Scala etc. are tools for analyzing the data; you actually write your algorithms in one of these. You can't achieve 100% with only Pig/Hive/HBase: you should know how to write MapReduce algorithms and how to bring them into Hive/Pig, for example as user-defined functions (sketched below).
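As one hedged illustration of plugging custom Java code into Hive, here is a minimal user-defined function. The class name and its (trivial) behavior are made up; it assumes the classic org.apache.hadoop.hive.ql.exec.UDF base class.

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A trivial Hive UDF: lower-cases a string column.
// Build it into a jar, ADD JAR it in Hive, then register it with:
//   CREATE TEMPORARY FUNCTION my_lower AS 'LowerCaseUDF';
public final class LowerCaseUDF extends UDF {
    public Text evaluate(final Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toLowerCase());
    }
}
```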
ETL tools:
- Pig: a scripting language.
- Hive: an SQL-like query language for structured data.
- HBase: for unstructured data; you can achieve real-time data analysis.
- Spark: while MapReduce operates in steps, Spark operates on the whole data set in one fell swoop.
- Sqoop: import/export data from an RDBMS.
- Flume: import streaming data into Hadoop.
- Mahout: a machine learning algorithm tool.
"Hadoop: The Definitive Guide" is a good place for beginners to start.
