Questions tagged [hdfs]

Hadoop Distributed File System (HDFS) is the default file storage system used by Apache Hadoop. HDFS creates multiple replicas of data blocks and distributes them on data nodes throughout a cluster to enable reliable computation over huge amounts of data on commodity hardware.

Apache Hadoop Wiki HDFS

HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data. Clients contact the NameNode for file metadata or file modifications and perform actual file I/O directly with the DataNodes.
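
The split between metadata and data described above can be illustrated with a toy in-memory model. This is a minimal sketch, not Hadoop's implementation: the names `ToyNameNode`, `ToyDataNode`, `write_file` and `read_file`, the tiny block size, and the round-robin replica placement are all assumptions made for illustration (real HDFS uses much larger blocks and a rack-aware placement policy).

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataNode:
    """Stores block contents only; it knows nothing about file names."""
    name: str
    blocks: dict = field(default_factory=dict)  # block id -> bytes


class ToyNameNode:
    """Holds metadata only: which blocks form a file and where replicas live."""

    def __init__(self, datanodes, block_size=4, replication=3):
        self.datanodes = datanodes
        self.block_size = block_size
        # Never ask for more replicas than there are nodes.
        self.replication = min(replication, len(datanodes))
        self.files = {}      # path -> ordered list of block ids
        self.locations = {}  # block id -> list of DataNode names
        self._next_block = 0

    def allocate(self, path, length):
        """Reserve block ids and pick replica targets for a new file."""
        n_blocks = -(-length // self.block_size)  # ceiling division
        plan = []
        for _ in range(n_blocks):
            block_id = self._next_block
            self._next_block += 1
            # Hypothetical round-robin placement over distinct nodes.
            targets = [
                self.datanodes[(block_id + i) % len(self.datanodes)].name
                for i in range(self.replication)
            ]
            self.locations[block_id] = targets
            plan.append((block_id, targets))
        self.files[path] = [block_id for block_id, _ in plan]
        return plan

    def lookup(self, path):
        """Return (block id, replica holders) pairs in file order."""
        return [(b, self.locations[b]) for b in self.files[path]]


def write_file(namenode, nodes_by_name, path, data):
    """Ask the NameNode for a plan, then send bytes straight to DataNodes."""
    plan = namenode.allocate(path, len(data))
    size = namenode.block_size
    for i, (block_id, targets) in enumerate(plan):
        chunk = data[i * size:(i + 1) * size]
        for target in targets:
            nodes_by_name[target].blocks[block_id] = chunk


def read_file(namenode, nodes_by_name, path, dead=()):
    """Fetch each block from the first replica holder that is still alive."""
    out = b""
    for block_id, holders in namenode.lookup(path):
        live = [h for h in holders if h not in dead]
        out += nodes_by_name[live[0]].blocks[block_id]
    return out
```

In this model the NameNode never touches file contents, and a read still succeeds when one DataNode is unavailable because every block has other replicas.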

The following are some of the salient features that could be of interest to many users.

Hadoop, including HDFS, is well suited for distributed storage and distributed processing using commodity hardware. It is fault tolerant, scalable, and extremely simple to expand. MapReduce, well known for its simplicity and applicability for a large set of distributed applications, is an integral part of Hadoop.
HDFS is highly configurable with a default configuration well suited for many installations. Most of the time, configuration needs to be tuned only for very large clusters.
Hadoop is written in Java and is supported on all major platforms.
Hadoop supports shell-like commands to interact with HDFS directly.
The NameNode and DataNodes have built-in web servers that make it easy to check the current status of the cluster.
New features and improvements are regularly implemented in HDFS.
The following is a subset of useful features in HDFS:
File permissions and authentication.
Rack awareness: to take a node's physical location into account while scheduling tasks and allocating storage.
Safemode: an administrative mode for maintenance.
fsck: a utility to diagnose the health of the file system and to find missing files or blocks.
Rebalancer: a tool to balance the cluster when the data is unevenly distributed among DataNodes.
Upgrade and rollback: after a software upgrade, it is possible to roll back to the state HDFS was in before the upgrade in case of unexpected problems.
Secondary NameNode (deprecated): performs periodic checkpoints of the namespace and helps keep the size of the file containing the log of HDFS modifications within certain limits at the NameNode. Replaced by the Checkpoint node.
Checkpoint node: performs periodic checkpoints of the namespace and helps minimize the size of the log stored at the NameNode containing changes to the HDFS. Replaces the role previously filled by the Secondary NameNode. NameNode allows multiple Checkpoint nodes simultaneously, as long as there are no Backup nodes registered with the system.
Backup node: An extension to the Checkpoint node. In addition to checkpointing it also receives a stream of edits from the NameNode and maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode namespace state. Only one Backup node may be registered with the NameNode at once.
HDFS Federation: in order to scale the name service horizontally, federation uses multiple independent NameNodes/namespaces.
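
The checkpointing idea behind the Checkpoint and Backup nodes in the list above can be sketched as follows. This is a toy model under stated assumptions: the namespace is a plain dictionary, the edit log is a list of tuples, and the operation names `mkdir`, `create` and `delete` are hypothetical stand-ins, not the real image or edit-log formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged change to an in-memory namespace (hypothetical ops)."""
    op, path = edit
    if op == "mkdir":
        namespace[path] = "dir"
    elif op == "create":
        namespace[path] = "file"
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown edit: {op}")


def recover(image, edit_log):
    """Rebuild the namespace: load the saved image, then replay the log."""
    namespace = dict(image)
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace


def checkpoint(image, edit_log):
    """Fold the log into a new image so the log can start again from empty."""
    return recover(image, edit_log), []
```

A checkpoint leaves the recovered namespace unchanged while shrinking the log that has to be replayed, which is the reason the log size stays bounded at the NameNode.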

8294 questions

204

votes

5 answers

What are the pros and cons of parquet format compared to other formats?

Characteristics of Apache Parquet are: self-describing, columnar format, language-independent. In comparison to Avro, Sequence Files, RC File etc. I want an overview of the formats. I have already read: How Impala Works with Hadoop File Formats,…

asked Apr 24 '16 at 10:59

Ani Menon

27,209
16
105
126

175

votes

9 answers

How to copy file from HDFS to the local file system

How can I copy a file from HDFS to the local file system? There is no physical location for the file, not even a directory. How can I move the files to my local machine for further validation? I tried using WinSCP.

hadoop copy hdfs

asked Jul 24 '13 at 15:03

Surya

3,408
5
27
35

165

votes

14 answers

Spark - load CSV file as DataFrame?

I would like to read a CSV in Spark, convert it to a DataFrame and store it in HDFS with df.registerTempTable("table_name"). I have tried: scala> val df = sqlContext.load("hdfs:///csv/file/dir/file.csv") Error which I…

scala apache-spark hadoop apache-spark-sql hdfs

asked Apr 17 '15 at 16:10

Donbeo

17,067
37
114
188

150

votes

5 answers

Difference between HBase and Hadoop/HDFS

This is kind of a naive question, but I am new to the NoSQL paradigm and don't know much about it. So if somebody can help me clearly understand the difference between HBase and Hadoop, or give some pointers which might help me understand the…

hadoop nosql hbase hdfs difference

asked Jun 05 '13 at 00:49

Dhaval Shah

1,515
2
10
5

142

votes

8 answers

What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

In Map Reduce programming the reduce phase has shuffling, sorting and reduce as its sub-parts. Sorting is a costly affair. What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

sorting hadoop mapreduce hdfs shuffle

asked Mar 03 '14 at 08:10

Nithin

9,661
14
44
67

139

votes

9 answers

Name node is in safe mode. Not able to leave

root# bin/hadoop fs -mkdir t mkdir: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/root/t. Name node is in safe mode. not able to create anything in hdfs I did root# bin/hadoop fs -safemode leave But…

hadoop hdfs

asked Apr 04 '13 at 05:34

USB

6,019
15
62
93

126

votes

12 answers

The way to check a HDFS directory's size?

I know du -sh in common Linux filesystems. But how to do that with HDFS?

hadoop command-line directory hdfs

asked Jun 28 '11 at 09:07

Cheng

4,816
4
41
44

126

votes

8 answers

what's the difference between "hadoop fs" shell commands and "hdfs dfs" shell commands?

Are they supposed to be equal? If so, why do the "hadoop fs" commands show the HDFS files while the "hdfs dfs" commands show the local files? Here is the Hadoop version information: Hadoop 2.0.0-mr1-cdh4.2.1 Subversion …

hadoop hdfs

asked Aug 09 '13 at 08:37

Charlie Lin

1,613
2
13
16

125

votes

6 answers

How does Hadoop process records split across block boundaries?

According to the Hadoop - The Definitive Guide The logical records that FileInputFormats define do not usually fit neatly into HDFS blocks. For example, a TextInputFormat’s logical records are lines, which will cross HDFS boundaries more often than…

hadoop split mapreduce hdfs

asked Jan 12 '13 at 07:10

Praveen Sripati

32,799
16
80
117

votes

7 answers

hadoop copy a local file system folder to HDFS

I need to copy a folder from local file system to HDFS. I could not find any example of moving a folder(including its all subfolders) to HDFS $ hadoop fs -copyFromLocal /home/ubuntu/Source-Folder-To-Copy HDFS-URI

hadoop hdfs

asked Jan 29 '15 at 11:05

Tariq

2,274
4
24
40

votes

12 answers

Where does Hive store files in HDFS?

I'd like to know how to find the mapping between Hive tables and the actual HDFS files (or rather, directories) that they represent. I need to access the table files directly. Where does Hive store its files in HDFS?

hadoop hive hdfs

asked Feb 20 '11 at 16:43

Yuval

7,987
12
40
54

votes

3 answers

Differences between Amazon S3 and S3n in Hadoop

When I connected my Hadoop cluster to Amazon storage and downloaded files to HDFS, I found s3:// did not work. When looking for some help on the Internet I found I can use S3n. When I used S3n it worked. I do not understand the differences between…

hadoop amazon-s3 hdfs

asked May 13 '12 at 05:04

user1355361

votes

5 answers

Why is there no 'hadoop fs -head' shell command?

A fast method for inspecting files on HDFS is to use tail: ~$ hadoop fs -tail /path/to/file This displays the last kilobyte of data in the file, which is extremely helpful. However, the opposite command head does not appear to be part of the shell…

hadoop hdfs

asked Nov 04 '13 at 22:05

bbengfort

5,254
4
44
57

votes

10 answers

Write to multiple outputs by key Spark - one Spark job

How can you write to multiple outputs dependent on the key using Spark in a single Job. Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job E.g. sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c"))) .writeAsMultiple(prefix,…

scala hadoop output hdfs apache-spark

asked Jun 02 '14 at 12:54

samthebest

30,803
25
102
142

votes

4 answers

How to fix corrupt HDFS FIles

How does someone fix a HDFS that's corrupt? I looked on the Apache/Hadoop website and it said its fsck command, which doesn't fix it. Hopefully someone who has run into this problem before can tell me how to fix this. Unlike a traditional fsck…

hadoop hdfs

asked Oct 06 '13 at 03:17

Classified

5,759
18
68
99
