Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

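The map-reduce model mentioned above can be illustrated with a minimal, single-process sketch of a word count. This is plain Python rather than the Hadoop API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical; a real job would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle/sort step: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

The shuffle step is what guarantees that every value for a given key reaches the same reduce call, which is why the reducer can simply sum what it is handed.
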
A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode accepts requests from client applications and keeps track of which DataNodes hold each piece of a file. There are typically many DataNodes; they store the data and share the processing work between them. Together they provide a shared file system, the Hadoop Distributed File System (HDFS).
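
To illustrate the division of labor between the NameNode and the DataNodes, here is a toy model, not Hadoop code. It assumes a 128 MiB block size and a replication factor of 3 (common HDFS defaults), and it places replicas round-robin, which is a simplification: real HDFS placement is rack-aware. The class name `ToyNameNode` and its methods are hypothetical.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MiB
REPLICATION = 3                 # assumed replication factor


@dataclass
class ToyNameNode:
    """Keeps only metadata: which DataNodes hold each block of each file."""
    data_nodes: list
    block_map: dict = field(default_factory=dict)
    _next: int = 0  # round-robin cursor (simplified placement policy)

    def add_file(self, path, size_bytes):
        """Split a file into blocks and record replica locations per block."""
        num_blocks = -(-size_bytes // BLOCK_SIZE)  # ceiling division
        replicas = min(REPLICATION, len(self.data_nodes))
        placements = []
        for _ in range(num_blocks):
            nodes = [
                self.data_nodes[(self._next + i) % len(self.data_nodes)]
                for i in range(replicas)
            ]
            self._next = (self._next + 1) % len(self.data_nodes)
            placements.append(nodes)
        self.block_map[path] = placements
        return placements

    def locate(self, path):
        """Answer a client's question: where are the blocks of this file?"""
        return self.block_map[path]


if __name__ == "__main__":
    name_node = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
    name_node.add_file("/logs/a.log", 300 * 1024 * 1024)
    for index, nodes in enumerate(name_node.locate("/logs/a.log")):
        print(f"block {index}: {nodes}")
```

Note that the toy NameNode never touches file contents; it only answers where blocks live, and clients would then read the blocks from the DataNodes directly.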

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.

Since version 0.23, Hadoop has included a standalone resource manager: YARN.

This resource manager makes it easier to use other modules alongside the MapReduce engine, such as:

Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
Avro, a data serialization system based on JSON schemas.
Cassandra, a replicated, fault-tolerant, decentralized and scalable database system.
Chukwa, a data collection system for managing large distributed systems.
Cascading, a software abstraction layer for Apache Hadoop that mainly targets Java developers. The framework was developed to reduce the amount of boilerplate code that MapReduce programmers have to write in Java.
Flink, a fast and reliable large-scale data processing engine.
Giraph, an iterative graph processing framework built on top of Apache Hadoop.
HBase, a scalable, distributed database that supports structured data storage for large tables.
Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
Pig, a platform and programming language for authoring parallelizable jobs.
Spark, a fast and general engine for large-scale data processing.
Storm, a system for real-time and stream processing.
Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44316 questions

318 votes, 24 answers

Hadoop "Unable to load native-hadoop library for your platform" warning

I'm currently configuring Hadoop on a server running CentOS. When I run start-dfs.sh or stop-dfs.sh, I get the following error: WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where…

java linux hadoop hadoop2 java.library.path

asked Nov 13 '13 at 01:53 by Olshansky (reputation 5,904)

258 votes, 19 answers

Difference between Pig and Hive? Why have both?

My background - 4 weeks old in the Hadoop world. Dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM. Have read Google's paper on Map-Reduce and GFS (PDF link). I understand that- Pig's language Pig Latin is a shift from(suits the way…

hadoop hive apache-pig

asked Jul 28 '10 at 18:42 by Arnkrishn (reputation 29,828)

251 votes, 9 answers

Apache Spark: The number of cores vs. the number of executors

I'm trying to understand the relationship of the number of cores and the number of executors when running a Spark job on YARN. The test environment is as follows: Number of data nodes: 3 Data node machine spec: CPU: Core i7-4790 (# of cores: 4, #…

hadoop apache-spark hadoop-yarn

asked Jul 08 '14 at 00:46 by zeodtr (reputation 10,645)

204 votes, 5 answers

What are the pros and cons of parquet format compared to other formats?

Characteristics of Apache Parquet are : Self-describing Columnar format Language-independent In comparison to Avro, Sequence Files, RC File etc. I want an overview of the formats. I have already read : How Impala Works with Hadoop File Formats ,…

file hadoop hdfs avro parquet

asked Apr 24 '16 at 10:59 by Ani Menon (reputation 27,209)

202 votes, 17 answers

When to use Hadoop, HBase, Hive and Pig?

What are the benefits of using either Hadoop or HBase or Hive? From my understanding, HBase avoids using map-reduce and has column-oriented storage on top of HDFS. Hive is a SQL-like interface for Hadoop and HBase. I would also like to know how…

hadoop hbase hive apache-pig

asked Dec 17 '12 at 09:33 by Khalefa (reputation 2,294)

185 votes, 16 answers

How to turn off INFO logging in Spark?

I installed Spark using the AWS EC2 guide and I can launch the program fine using the bin/pyspark script to get to the spark prompt and can also do the Quick Start guide successfully. However, I cannot for the life of me figure out how to stop all…

python scala apache-spark hadoop pyspark

asked Aug 07 '14 at 22:48 by horatio1701d (reputation 8,809)

175 votes, 9 answers

How to copy file from HDFS to the local file system

How do I copy a file from HDFS to the local file system? There is no physical location of a file under the file, not even a directory. How can I move them to my local machine for further validation? I tried through WinSCP.

hadoop copy hdfs

asked Jul 24 '13 at 15:03 by Surya (reputation 3,408)

165 votes, 14 answers

Spark - load CSV file as DataFrame?

I would like to read a CSV in spark and convert it as DataFrame and store it in HDFS with df.registerTempTable("table_name") I have tried: scala> val df = sqlContext.load("hdfs:///csv/file/dir/file.csv") Error which I…

scala apache-spark hadoop apache-spark-sql hdfs

asked Apr 17 '15 at 16:10 by Donbeo (reputation 17,067)

155 votes, 9 answers

What is the difference between partitioning and bucketing a table in Hive ?

I know both are performed on a column in the table, but how is each operation different?

hadoop hive

asked Oct 02 '13 at 02:09 by NishM (reputation 1,706)

150 votes, 5 answers

Difference between HBase and Hadoop/HDFS

This is kind of a naive question, but I am new to the NoSQL paradigm and don't know much about it. So if somebody can help me clearly understand the difference between HBase and Hadoop, or give some pointers which might help me understand the…

hadoop nosql hbase hdfs difference

asked Jun 05 '13 at 00:49 by Dhaval Shah (reputation 1,515)

142 votes, 8 answers

What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

In Map Reduce programming the reduce phase has shuffling, sorting and reduce as its sub-parts. Sorting is a costly affair. What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

sorting hadoop mapreduce hdfs shuffle

asked Mar 03 '14 at 08:10 by Nithin (reputation 9,661)

139 votes, 9 answers

Name node is in safe mode. Not able to leave

root# bin/hadoop fs -mkdir t mkdir: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/root/t. Name node is in safe mode. not able to create anything in hdfs I did root# bin/hadoop fs -safemode leave But…

hadoop hdfs

asked Apr 04 '13 at 05:34 by USB (reputation 6,019)

130 votes, 6 answers

Avro vs. Parquet

I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand Parquet is efficient for column-based queries and Avro for full scans or when we need all the columns' data! Before I proceed and choose one of the file…

hadoop avro parquet

asked Mar 10 '15 at 06:19 by Abhishek (reputation 6,912)

128 votes, 31 answers

connect to host localhost port 22: Connection refused

While installing hadoop in my local machine , i got following error ssh -vvv localhost OpenSSH_5.5p1, OpenSSL 1.0.0e-fips 6 Sep 2011 debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug2: ssh_connect:…

linux hadoop ssh

asked Jun 27 '13 at 06:10 by Surya (reputation 3,408)

127 votes, 14 answers

Chaining multiple MapReduce jobs in Hadoop

In many real-life situations where you apply MapReduce, the final algorithms end up being several MapReduce steps. i.e. Map1 , Reduce1 , Map2 , Reduce2 , and so on. So you have the output from the last reduce that is needed as the input for the next…

hadoop mapreduce

asked Mar 23 '10 at 11:55 by Niels Basjes (reputation 10,424)
