I'm trying to practice some data mining algorithms using Hadoop. Can I do this with HDFS alone, or do I need to use sub-projects like Hive/HBase/Pig?
10 Answers
I've found a university site with some exercises and solutions for MapReduce that build only on Hadoop:
http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/index.html
Additionally, there are courses from Yahoo and Google:
http://developer.yahoo.com/hadoop/tutorial/
http://code.google.com/edu/parallel/index.html
All these courses work on plain Hadoop, to answer your question.
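To see what "plain Hadoop" means in practice, here is a minimal sketch of the classic word-count MapReduce job those tutorials build on, using only the standard Hadoop MapReduce API (no Hive/HBase/Pig involved; input and output paths come from the command line):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this into a jar and launch it with something like `hadoop jar wordcount.jar WordCount <input> <output>`, then grow it from there (bigger inputs, different map/reduce logic, a small cluster).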

+1 for Yahoo. I'd take the simple Yahoo tutorials and expand on them: make the input files MUCH bigger, change the map/reduce functions, go from a single instance to a small cluster, and continually expand on what you have done previously. – Ralph Willgoss Jul 20 '10 at 07:08
Start with plain MapReduce at the beginner level. You can try Pig/Hive/HBase at the next level.
You will not be able to appreciate Pig/Hive/HBase until you have struggled enough with plain MapReduce.

I'm trying to practice some data mining algorithms using hadoop.
Use Apache Mahout, which runs on top of Hadoop: http://mahout.apache.org/
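As an illustration, a small user-based collaborative filtering recommender with Mahout's Taste API could look roughly like this. It is only a sketch: the file ratings.csv, the neighborhood size of 10, and the user/item IDs are made-up placeholders, and the classic Mahout 0.x Taste packages are assumed.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
    public static void main(String[] args) throws Exception {
        // ratings.csv is a placeholder: lines of "userID,itemID,preference"
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend 3 items for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}
```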
Can I do this with HDFS alone, or do I need to use the sub-projects like hive/hbase/pig?
HDFS is Hadoop's file system; it stands for Hadoop Distributed File System. No matter which tool in the Hadoop stack you use, it has to process data sitting in this distributed environment, so you can't do anything with HDFS alone. You need one of the computation techniques/tools such as MapReduce, Pig, Hive, etc.
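To make that concrete: with HDFS alone you can store and read files, but any mining logic is yours to write. A minimal sketch of reading a file straight out of HDFS with Hadoop's FileSystem Java API (the namenode URI and file path are placeholders):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode address and file path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Whatever algorithm runs here is entirely up to you; HDFS only hands you bytes
                System.out.println(line);
            }
        }
    }
}
```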
Hope this helps!

I would also recommend the UMD site. However, it looks like you are completely new to Hadoop, so I would recommend the book "Hadoop: The Definitive Guide" by Tom White. It's a bit dated (written for the 0.18 version rather than the latest 0.20+). Read it, do the examples, and you should be in a better place to judge how to structure your project.

Hadoop is a tool for distributed/parallel data processing. Mahout is a data mining/machine learning framework that can work in standalone mode as well as in a Hadoop distributed environment. The decision to use it standalone or with Hadoop boils down to the size of the historical data that needs to be mined: if the data size is on the order of terabytes and petabytes, you typically use Mahout with Hadoop.
Mahout supports three categories of machine learning algorithms: recommendation, clustering and classification. The Mahout in Action book by Manning does a very good job of explaining this. Weka is another similar open source project. All of these come under a category called machine learning frameworks.
Refer to the blog which talks about a use case of how Mahout and the Hadoop Distributed File System work together. As a precursor to this, there is also a blog on the component architecture of how each of these tools fits together for a data mining problem in the Hadoop/Mahout ecosystem.
You can use R, Spark and Hadoop together as a complete open source solution (a minimal Spark sketch follows the list below):
- R: a statistical language which provides many libraries out of the box.
- Spark: a framework for data processing that is faster than MapReduce and ships with machine learning algorithms.
- Hadoop: data storage which is scalable and robust, based on commodity hardware.
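As an illustration of the Spark piece, here is a rough Java sketch that runs k-means from Spark's MLlib over points stored in HDFS. The HDFS path, the number of clusters and the iteration count are made-up placeholders, and the older RDD-based MLlib API is assumed.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansOnHdfs {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("KMeansOnHdfs");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Placeholder path: one comma-separated numeric vector per line
        JavaRDD<String> lines = sc.textFile("hdfs:///data/points.csv");
        JavaRDD<Vector> points = lines.map(line -> {
            String[] parts = line.split(",");
            double[] values = new double[parts.length];
            for (int i = 0; i < parts.length; i++) {
                values[i] = Double.parseDouble(parts[i]);
            }
            return Vectors.dense(values);
        });
        points.cache();

        // k = 3 clusters, 20 iterations; both arbitrary for the sketch
        KMeansModel model = KMeans.train(points.rdd(), 3, 20);
        for (Vector center : model.clusterCenters()) {
            System.out.println("Cluster center: " + center);
        }

        sc.stop();
    }
}
```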

You could also use Mahout http://mahout.apache.org/
It is a machine-learning and data-mining library that can be used on top of Hadoop.
In general, Mahout currently supports (taken from the Mahout site):
- Collaborative Filtering
- User and Item based recommenders
- K-Means, Fuzzy K-Means clustering
- Mean Shift clustering
- Dirichlet process clustering
- Latent Dirichlet Allocation
- Singular value decomposition
- Parallel Frequent Pattern mining
- Complementary Naive Bayes classifier
- Random forest decision tree based classifier

You have to use different tools in the Hadoop ecosystem depending on their strengths.
Hive and HBase are good for handling structured data.
Sqoop is used to import structured data from a traditional RDBMS such as Oracle, SQL Server, etc.
Flume is used for ingesting unstructured data (e.g. streaming logs).
You can use a Content Management System to handle unstructured and semi-structured data at the terabyte or petabyte scale. If you are storing unstructured data, I prefer to keep the content itself in the CMS and the metadata in a NoSQL database like HBase (e.g. image ID, MD5SUM of the image), as sketched below.
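A rough sketch of what storing such image metadata could look like with the HBase Java client; the table name "images", the column family "meta", and the row contents are made-up placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ImageMetadataSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("images"))) {

            // Row key: the image id; column family "meta" holds the checksum and a CMS pointer
            Put put = new Put(Bytes.toBytes("img-000123"));
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("md5sum"),
                          Bytes.toBytes("9e107d9d372bb6826bd81d3542a419d6"));
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("cms_path"),
                          Bytes.toBytes("/cms/images/000123.jpg"));
            table.put(put);
        }
    }
}
```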
To process big data streams, you can use Pig.
Spark is a fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Have a look at structured data and unstructured data handling in Hadoop.
Have a look at the complete Hadoop ecosystem and this SE question.

It depends on your application. You need to understand the purpose of Hive, Pig and HBase, and then you can figure out where exactly they fit in your application. Each was created for a specific reason that you need to understand; a simple Google search will get you the results.

HDFS is a distributed storage system to dump your data for further analytics.
Hive/Pig/MR/Spark/Scala etc. are tools for analyzing the data; you actually write your algorithms in one of these. You can't achieve 100% with only Pig/Hive/HBase: you should know how to write MapReduce algorithms and how to bring them into Hive/Pig, for example as user-defined functions (sketched below).
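As one hedged illustration of plugging custom Java code into Hive, here is a minimal user-defined function. The class name and its (trivial) behavior are made up; it assumes the classic org.apache.hadoop.hive.ql.exec.UDF base class.

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A trivial Hive UDF: lower-cases a string column.
// Build it into a jar, ADD JAR it in Hive, then register it with:
//   CREATE TEMPORARY FUNCTION my_lower AS 'LowerCaseUDF';
public final class LowerCaseUDF extends UDF {
    public Text evaluate(final Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toLowerCase());
    }
}
```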
ETL tools:
- Pig: a scripting language.
- Hive: an SQL-like query language for structured data.
- HBase: for unstructured data; you can achieve real-time data analysis.
- Spark: while MapReduce operates in steps, Spark operates on the whole data set in one fell swoop.
- Sqoop: import/export data from an RDBMS.
- Flume: import streaming data into Hadoop.
- Mahout: a machine learning algorithm tool.
"Hadoop: The Definitive Guide" is a good place for beginners to start.
