Questions tagged [hdfs]

Hadoop Distributed File System (HDFS) is the default file storage system used by Apache Hadoop. HDFS creates multiple replicas of data blocks and distributes them across data nodes throughout a cluster to enable reliable computation over huge amounts of data on commodity hardware.

Apache Hadoop Wiki HDFS

HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data. Clients contact the NameNode for file metadata or file modifications and perform actual file I/O directly with the DataNodes.

The following are some of the salient features that could be of interest to many users.

  • Hadoop, including HDFS, is well suited for distributed storage and distributed processing using commodity hardware. It is fault tolerant, scalable, and extremely simple to expand. MapReduce, well known for its simplicity and applicability to a large set of distributed applications, is an integral part of Hadoop.
  • HDFS is highly configurable with a default configuration well suited for many installations. Most of the time, configuration needs to be tuned only for very large clusters.
  • Hadoop is written in Java and is supported on all major platforms.
  • Hadoop supports shell-like commands to interact with HDFS directly.
  • The NameNode and DataNodes have built-in web servers that make it easy to check the current status of the cluster.
  • New features and improvements are regularly implemented in HDFS.
    The following is a subset of useful features in HDFS:
  • File permissions and authentication.
  • Rack awareness: to take a node's physical location into account while scheduling tasks and allocating storage.
  • Safemode: an administrative mode for maintenance.
  • fsck: a utility to diagnose the health of the file system and to find missing files or blocks.
  • Rebalancer: a tool to balance the cluster when the data is unevenly distributed among DataNodes.
  • Upgrade and rollback: after a software upgrade, it is possible to roll back to the HDFS state before the upgrade in case of unexpected problems.
  • Secondary NameNode (deprecated): performs periodic checkpoints of the namespace and helps keep the size of the file containing the log of HDFS modifications within certain limits at the NameNode. Replaced by the Checkpoint node.
  • Checkpoint node: performs periodic checkpoints of the namespace and helps minimize the size of the log stored at the NameNode containing changes to the HDFS. Replaces the role previously filled by the Secondary NameNode. NameNode allows multiple Checkpoint nodes simultaneously, as long as there are no Backup nodes registered with the system.
  • Backup node: An extension to the Checkpoint node. In addition to checkpointing it also receives a stream of edits from the NameNode and maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode namespace state. Only one Backup node may be registered with the NameNode at once.
  • HDFS Federation: In order to scale the name service horizontally, federation uses multiple independent NameNodes/namespaces.
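
Several of the features listed above (shell-like commands, Safemode, fsck, and the rebalancer) are reached through the `hdfs` command-line tool. The following is a minimal sketch rather than an official recipe. The `run` helper, the `DRY_RUN` switch, the function name `hdfs_tour`, the path `/user/example`, and the threshold value `10` are all assumptions made for illustration. With `DRY_RUN=1` the script only prints each command, so it can be tried without a cluster.

```shell
#!/usr/bin/env bash
# Dry-run helper (an assumption of this sketch, not part of Hadoop):
# prints the command unless DRY_RUN=0, in which case it is executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Walk through the tools named in the feature list for one path.
hdfs_tour() {
  local path="${1:-/}"               # hypothetical default path
  run hdfs dfs -ls "$path"           # shell-like listing of an HDFS directory
  run hdfs dfsadmin -safemode get    # report whether the NameNode is in safe mode
  run hdfs fsck "$path"              # report on file system health under the path
  run hdfs balancer -threshold 10    # rebalance DataNodes; 10 is an illustrative value
}

hdfs_tour /user/example
```

On a real cluster, `dfsadmin` and `balancer` typically need HDFS administrator privileges, so it is worth running them separately rather than as one script.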
8294 questions
204
votes
5 answers

What are the pros and cons of parquet format compared to other formats?

Characteristics of Apache Parquet are: self-describing, columnar format, language-independent. In comparison to Avro, Sequence Files, RC File etc., I want an overview of the formats. I have already read: How Impala Works with Hadoop File Formats, …
Ani Menon
  • 27,209
  • 16
  • 105
  • 126
175
votes
9 answers

How to copy file from HDFS to the local file system

How do I copy a file from HDFS to the local file system? There is no physical location of a file under the file, not even a directory. How can I move them to my local machine for further validation? I tried through WinSCP.
Surya
  • 3,408
  • 5
  • 27
  • 35
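
For the question above, one common approach is `hdfs dfs -get` (with `-copyToLocal` as a closely related alternative). This is a hedged sketch: the `run` helper, the `DRY_RUN` switch, the function name `hdfs_to_local`, and both paths are hypothetical and exist only so the example can be run without a cluster.

```shell
#!/usr/bin/env bash
# Dry-run helper (an assumption of this sketch): prints the command unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Copy one HDFS path to the local file system.
hdfs_to_local() {
  local src="$1" dest="$2"
  run hdfs dfs -get "$src" "$dest"
}

hdfs_to_local /user/example/report.csv /tmp/report.csv   # hypothetical paths
```

The command has to run on a machine with a configured Hadoop client, since HDFS files are not visible as ordinary files to tools such as WinSCP.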
165
votes
14 answers

Spark - load CSV file as DataFrame?

I would like to read a CSV in spark and convert it as DataFrame and store it in HDFS with df.registerTempTable("table_name") I have tried: scala> val df = sqlContext.load("hdfs:///csv/file/dir/file.csv") Error which I…
Donbeo
  • 17,067
  • 37
  • 114
  • 188
150
votes
5 answers

Difference between HBase and Hadoop/HDFS

This is kind of a naive question, but I am new to the NoSQL paradigm and don't know much about it. So if somebody can help me clearly understand the difference between HBase and Hadoop, or give some pointers which might help me understand the…
Dhaval Shah
  • 1,515
  • 2
  • 10
  • 5
142
votes
8 answers

What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

In Map Reduce programming the reduce phase has shuffling, sorting and reduce as its sub-parts. Sorting is a costly affair. What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?
Nithin
  • 9,661
  • 14
  • 44
  • 67
139
votes
9 answers

Name node is in safe mode. Not able to leave

root# bin/hadoop fs -mkdir t
mkdir: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/root/t. Name node is in safe mode.
Not able to create anything in HDFS. I did
root# bin/hadoop fs -safemode leave
But…
USB
  • 6,019
  • 15
  • 62
  • 93
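
Safe mode is controlled through the `dfsadmin` subcommand rather than `fs`, which may explain why the attempt in the excerpt above had no effect. A minimal sketch follows. The `run` helper, the `DRY_RUN` switch, and the function name `leave_safemode` are assumptions for illustration only.

```shell
#!/usr/bin/env bash
# Dry-run helper (an assumption of this sketch): prints the command unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Check the current state, then ask the NameNode to leave safe mode.
leave_safemode() {
  run hdfs dfsadmin -safemode get
  run hdfs dfsadmin -safemode leave
}

leave_safemode
```

Leaving safe mode normally requires HDFS superuser privileges. On older installations the same subcommand is reached as `hadoop dfsadmin -safemode leave`.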
126
votes
12 answers

The way to check a HDFS directory's size?

I know du -sh in common Linux filesystems. But how to do that with HDFS?
Cheng
  • 4,816
  • 4
  • 41
  • 44
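
The closest HDFS counterpart to `du -sh` is `hdfs dfs -du` with the `-s` (summary) and `-h` (human-readable) options. In this sketch the `run` helper, the `DRY_RUN` switch, the function name `hdfs_dir_size`, and the path are hypothetical.

```shell
#!/usr/bin/env bash
# Dry-run helper (an assumption of this sketch): prints the command unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Print the total size of one HDFS directory in human-readable units.
hdfs_dir_size() {
  run hdfs dfs -du -s -h "$1"
}

hdfs_dir_size /user/example/logs   # hypothetical path
```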
126
votes
8 answers

what's the difference between "hadoop fs" shell commands and "hdfs dfs" shell commands?

Are they supposed to be equal? But why do the "hadoop fs" commands show the HDFS files while the "hdfs dfs" commands show the local files? Here is the Hadoop version information: Hadoop 2.0.0-mr1-cdh4.2.1 Subversion …
Charlie Lin
  • 1,613
  • 2
  • 13
  • 16
125
votes
6 answers

How does Hadoop process records split across block boundaries?

According to Hadoop: The Definitive Guide: The logical records that FileInputFormats define do not usually fit neatly into HDFS blocks. For example, a TextInputFormat’s logical records are lines, which will cross HDFS boundaries more often than…
Praveen Sripati
  • 32,799
  • 16
  • 80
  • 117
90
votes
7 answers

hadoop copy a local file system folder to HDFS

I need to copy a folder from the local file system to HDFS. I could not find any example of moving a folder (including all its subfolders) to HDFS.
$ hadoop fs -copyFromLocal /home/ubuntu/Source-Folder-To-Copy HDFS-URI
Tariq
  • 2,274
  • 4
  • 24
  • 40
77
votes
12 answers

Where does Hive store files in HDFS?

I'd like to know how to find the mapping between Hive tables and the actual HDFS files (or rather, directories) that they represent. I need to access the table files directly. Where does Hive store its files in HDFS?
Yuval
  • 7,987
  • 12
  • 40
  • 54
72
votes
3 answers

Differences between Amazon S3 and S3n in Hadoop

When I connected my Hadoop cluster to Amazon storage and downloaded files to HDFS, I found s3:// did not work. When looking for some help on the Internet I found I can use S3n. When I used S3n it worked. I do not understand the differences between…
user1355361
70
votes
5 answers

Why is there no 'hadoop fs -head' shell command?

A fast method for inspecting files on HDFS is to use tail:
~$ hadoop fs -tail /path/to/file
This displays the last kilobyte of data in the file, which is extremely helpful. However, the opposite command head does not appear to be part of the shell…
bbengfort
  • 5,254
  • 4
  • 44
  • 57
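
A common workaround for the missing `head` is to pipe `-cat` into the local `head`. The sketch below assumes a hypothetical function name `hdfs_head`, a `DRY_RUN` switch that prints the pipeline instead of running it, and an illustrative default of 10 lines.

```shell
#!/usr/bin/env bash
# DRY_RUN=1 (the default in this sketch, an assumption) prints the pipeline
# instead of running it, so no cluster is needed to try the example.
DRY_RUN="${DRY_RUN:-1}"

# Show the first N lines of an HDFS file by piping -cat into the local head.
hdfs_head() {
  local path="$1" lines="${2:-10}"
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ hdfs dfs -cat $path | head -n $lines"
  else
    hdfs dfs -cat "$path" | head -n "$lines"
  fi
}

hdfs_head /path/to/file 20
```

Because `head` closes the pipe once it has enough lines, the `-cat` side usually stops early instead of streaming the whole file, though it may print a broken-pipe message when it does.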
69
votes
10 answers

Write to multiple outputs by key Spark - one Spark job

How can you write to multiple outputs dependent on the key using Spark in a single job? Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job. E.g. sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c"))) .writeAsMultiple(prefix,…
samthebest
  • 30,803
  • 25
  • 102
  • 142
67
votes
4 answers

How to fix corrupt HDFS FIles

How does someone fix a HDFS that's corrupt? I looked on the Apache/Hadoop website and it said its fsck command, which doesn't fix it. Hopefully someone who has run into this problem before can tell me how to fix this. Unlike a traditional fsck…
Classified
  • 5,759
  • 18
  • 68
  • 99
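
As the excerpt above notes, HDFS `fsck` reports problems rather than repairing them, so a typical first step is to find out which files are affected. The sketch below is only a starting point. The `run` helper, the `DRY_RUN` switch, the function name `find_corrupt`, and the path `/user/example` are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Dry-run helper (an assumption of this sketch): prints the command unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# List corrupt file blocks under a path, then print per-file block details.
find_corrupt() {
  local path="${1:-/}"
  run hdfs fsck "$path" -list-corruptfileblocks
  run hdfs fsck "$path" -files -blocks -locations
}

find_corrupt /user/example

# Destructive follow-ups, deliberately left commented out:
#   hdfs fsck /user/example -move     # moves corrupt files to /lost+found
#   hdfs fsck /user/example -delete   # deletes corrupt files
```

If the data can be restored from another source, re-uploading the affected files is usually preferable to `-delete`, which discards them.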