Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has a Name Node: there is at least one, but usually more than one for redundancy. The Name Node accepts the requests coming in from client applications to do some processing, and it then uses Data Nodes, typically many of them, which share the processing work between them. They can do that because they all have access to a shared file system, typically referred to as the Hadoop Distributed File System, or HDFS.

Description of Hadoop cluster
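
To make the map-reduce model mentioned above more concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop at all: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions invented for this illustration. It only mirrors the three conceptual steps: map each record to key/value pairs, group the values by key, then reduce each group.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical sample input, standing in for lines of a file in HDFS.
    sample = ["Hadoop stores data in HDFS", "YARN schedules Hadoop jobs"]
    print(word_count(sample))
```

In a real cluster the map and reduce calls would run on different nodes and the grouping step would move data over the network; this sketch keeps everything in one process so the data flow is easy to follow.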

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
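
In practice, Hadoop chooses the filesystem for a path from its URI scheme, and falls back to the configured default (`fs.defaultFS`) when the path has no scheme. The sketch below is only an illustration of that idea in plain Python: the `storage_backend` helper, the small scheme table, and every host, bucket and path name are assumptions made up for the example, not part of any Hadoop API.

```python
from urllib.parse import urlparse

# Illustrative table of a few URI schemes used with Hadoop.
# The descriptions are informal and the list is not exhaustive.
KNOWN_SCHEMES = {
    "hdfs": "Hadoop Distributed File System (HDFS)",
    "file": "platform-specific local filesystem",
    "s3a": "Amazon S3 blob store",
    "abfs": "Azure storage",
}


def storage_backend(uri):
    """Describe which kind of storage a Hadoop-style path points at."""
    scheme = urlparse(uri).scheme
    if not scheme:
        # No scheme: the cluster's default filesystem setting decides.
        return "default filesystem (from cluster configuration)"
    return KNOWN_SCHEMES.get(scheme, f"unrecognized scheme: {scheme}")


if __name__ == "__main__":
    # Hypothetical paths; none of these hosts or buckets are real.
    for path in [
        "hdfs://namenode.example.com:8020/data/input.csv",
        "file:///tmp/input.csv",
        "s3a://my-bucket/logs/2020/",
        "/user/root/t",
    ]:
        print(path, "->", storage_backend(path))
```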

Since version 0.23, Hadoop has included a standalone resource manager: YARN.

This resource manager makes it easier to use other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop that mainly targets Java developers. The framework was developed to reduce the effort of writing boilerplate code for MapReduce programmers with Java skills.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform/programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44316 questions
318 votes, 24 answers

Hadoop "Unable to load native-hadoop library for your platform" warning

I'm currently configuring Hadoop on a server running CentOS. When I run start-dfs.sh or stop-dfs.sh, I get the following error: WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where…
Olshansky (5,904 reputation; 8 gold, 32 silver, 47 bronze badges)

258 votes, 19 answers

Difference between Pig and Hive? Why have both?

My background: 4 weeks old in the Hadoop world. Dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM. Have read Google's paper on Map-Reduce and GFS (PDF link). I understand that Pig's language, Pig Latin, is a shift from (suits the way…
Arnkrishn (29,828 reputation; 40 gold, 114 silver, 128 bronze badges)

251 votes, 9 answers

Apache Spark: The number of cores vs. the number of executors

I'm trying to understand the relationship of the number of cores and the number of executors when running a Spark job on YARN. The test environment is as follows: Number of data nodes: 3 Data node machine spec: CPU: Core i7-4790 (# of cores: 4, #…
zeodtr (10,645 reputation; 14 gold, 43 silver, 60 bronze badges)

204 votes, 5 answers

What are the pros and cons of parquet format compared to other formats?

Characteristics of Apache Parquet are: self-describing, columnar format, language-independent. In comparison to Avro, Sequence Files, RC File etc., I want an overview of the formats. I have already read: How Impala Works with Hadoop File Formats,…
Ani Menon (27,209 reputation; 16 gold, 105 silver, 126 bronze badges)

202 votes, 17 answers

When to use Hadoop, HBase, Hive and Pig?

What are the benefits of using either Hadoop or HBase or Hive? From my understanding, HBase avoids using map-reduce and has column-oriented storage on top of HDFS. Hive is a SQL-like interface for Hadoop and HBase. I would also like to know how…
Khalefa (2,294 reputation; 3 gold, 14 silver, 12 bronze badges)

185 votes, 16 answers

How to turn off INFO logging in Spark?

I installed Spark using the AWS EC2 guide and I can launch the program fine using the bin/pyspark script to get to the Spark prompt and can also do the Quick Start guide successfully. However, I cannot for the life of me figure out how to stop all…
horatio1701d (8,809 reputation; 14 gold, 48 silver, 77 bronze badges)

175 votes, 9 answers

How to copy file from HDFS to the local file system

How do I copy a file from HDFS to the local file system? There is no physical location for the file, not even a directory. How can I move the files to my local machine for further validation? I tried through WinSCP.
Surya (3,408 reputation; 5 gold, 27 silver, 35 bronze badges)

165 votes, 14 answers

Spark - load CSV file as DataFrame?

I would like to read a CSV in spark and convert it as DataFrame and store it in HDFS with df.registerTempTable("table_name") I have tried: scala> val df = sqlContext.load("hdfs:///csv/file/dir/file.csv") Error which I…
Donbeo (17,067 reputation; 37 gold, 114 silver, 188 bronze badges)

155 votes, 9 answers

What is the difference between partitioning and bucketing a table in Hive?

I know both are performed on a column in the table, but how is each operation different?
NishM (1,706 reputation; 2 gold, 15 silver, 26 bronze badges)

150 votes, 5 answers

Difference between HBase and Hadoop/HDFS

This is kind of a naive question, but I am new to the NoSQL paradigm and don't know much about it. So if somebody can help me clearly understand the difference between HBase and Hadoop, or give some pointers which might help me understand the…
Dhaval Shah (1,515 reputation; 2 gold, 10 silver, 5 bronze badges)

142 votes, 8 answers

What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

In Map Reduce programming the reduce phase has shuffling, sorting and reduce as its sub-parts. Sorting is a costly affair. What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?
Nithin (9,661 reputation; 14 gold, 44 silver, 67 bronze badges)

139 votes, 9 answers

Name node is in safe mode. Not able to leave

root# bin/hadoop fs -mkdir t mkdir: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/root/t. Name node is in safe mode. not able to create anything in hdfs I did root# bin/hadoop fs -safemode leave But…
USB (6,019 reputation; 15 gold, 62 silver, 93 bronze badges)

130 votes, 6 answers

Avro vs. Parquet

I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand Parquet is efficient for column-based queries and Avro for full scans or when we need all the columns' data! Before I proceed and choose one of the file…
Abhishek (6,912 reputation; 14 gold, 59 silver, 85 bronze badges)

128 votes, 31 answers

connect to host localhost port 22: Connection refused

While installing Hadoop on my local machine, I got the following error: ssh -vvv localhost OpenSSH_5.5p1, OpenSSL 1.0.0e-fips 6 Sep 2011 debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug2: ssh_connect:…
Surya (3,408 reputation; 5 gold, 27 silver, 35 bronze badges)

127 votes, 14 answers

Chaining multiple MapReduce jobs in Hadoop

In many real-life situations where you apply MapReduce, the final algorithms end up being several MapReduce steps, i.e. Map1, Reduce1, Map2, Reduce2, and so on. So you have the output from the last reduce that is needed as the input for the next…
Niels Basjes (10,424 reputation; 9 gold, 50 silver, 66 bronze badges)