
I am new to Hadoop and HDFS, and it confuses me why HDFS is not preferred for applications that require low latency. In a big-data scenario, the data is spread across many commodity machines, so shouldn't accessing it be faster?

Nidhi

3 Answers


Hadoop is fundamentally a batch-processing system, designed to store and analyze structured, semi-structured and unstructured data.

Hadoop's map/reduce framework is relatively slow because it is designed to handle data of widely varying formats and structures, in very large volumes.

We should not say that HDFS itself is slow, since the HBase NoSQL database and MPP engines such as Impala and HAWQ sit on top of HDFS. These data sources respond faster because they do not use MapReduce execution for data retrieval and processing.
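For example, a point lookup against an HBase table stored on HDFS is served directly by the region server holding the row, with no job launched at all. A minimal sketch (the table name `users`, column family `info` and row key are made up for the example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePointLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) { // hypothetical table
            // A Get is answered by the region server holding this row key;
            // no MapReduce job is started, so the lookup returns quickly.
            Get get = new Get(Bytes.toBytes("user#42")); // hypothetical row key
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}
```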

The slowness comes from the nature of map/reduce-based execution: it produces lots of intermediate data, exchanges a great deal of data between nodes, and therefore incurs heavy disk I/O latency. Furthermore, it has to persist much of that data to disk to synchronize between phases and to support job recovery after failures. There is also no way in MapReduce to cache all or a subset of the data in memory.
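To make those phases concrete, here is a sketch of the classic MapReduce word-count job (input and output paths are placeholders). The map output is buffered and spilled to local disk, shuffled across the network to the reducers, and the final result is persisted to HDFS before any follow-up job can read it:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emits (word, 1) pairs; this intermediate output is spilled
    // to each mapper's local disk and later shuffled over the network.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sums the counts per word; its output is written back
    // to HDFS, so a downstream job has to read it from disk again.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output persisted to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```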

Apache Spark is yet another batch-processing system, but it is considerably faster than Hadoop MapReduce because it caches much of the input data in memory as RDDs and keeps intermediate data in memory as well, writing to disk only on completion or when required.
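For comparison, a minimal Spark sketch using the Java API (the HDFS path is hypothetical), where a cached RDD lets several passes over the same data avoid re-reading it from HDFS or persisting intermediate results to disk:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-example");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read once from HDFS and keep the partitions in executor memory.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/logs/*.log").cache(); // hypothetical path

            // Both actions reuse the cached partitions instead of re-reading
            // the input or writing intermediate results to disk.
            long errors = lines.filter(l -> l.contains("ERROR")).count();
            long warnings = lines.filter(l -> l.contains("WARN")).count();

            System.out.println("errors=" + errors + ", warnings=" + warnings);
        }
    }
}
```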

suresiva

There is also the fact that HDFS, as a filesystem, is optimized for big chunks of data. For instance, a single block is usually 64-128 MB instead of the more usual 0.5-4 KB. So even small operations incur a significant delay when reading from or writing to disk. Add to that its distributed nature and you get significant overhead (indirection, synchronization, replication, etc.) compared to a traditional filesystem.
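You can see that block size for yourself with a small sketch against the FileSystem API (the file path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/data/example.txt"); // hypothetical file
            FileStatus status = fs.getFileStatus(path);
            // Typically 128 MB (dfs.blocksize), versus the few-KB blocks of a local filesystem.
            System.out.println("File block size:    " + status.getBlockSize() + " bytes");
            System.out.println("Default block size: " + fs.getDefaultBlockSize(path) + " bytes");
        }
    }
}
```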

This is from the point of view of HDFS, which I read to be your main question. Hadoop as a data-processing framework has its own set of trade-offs and inefficiencies (better explained in @hserus' answer), but it basically aims for the same niche: reliable bulk processing.

Daniel Langdon

Low-latency or real-time applications usually require specific data: they need to serve quickly a small amount of data that an end user or application is waiting for.

HDFS is designed for storing large volumes of data in a distributed environment that provides fault tolerance and high availability. The actual location of the data is known only to the NameNode, and the data is placed almost randomly across the DataNodes. Moreover, data files are split into smaller blocks of fixed size. So data cannot be served to real-time applications quickly, because of the network latency, the distribution of the data, and the need to filter out the specific data requested. On the other hand, this helps when running MapReduce or other data-intensive jobs, because the executable program is shipped to the machines that hold the data locally (the data-locality principle).
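A minimal sketch (hypothetical file path) of how a client asks the NameNode where a file's blocks live; this is the same metadata the framework uses to schedule tasks on the nodes that already hold the data:

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/data/example.txt"); // hypothetical file
            FileStatus status = fs.getFileStatus(path);
            // The NameNode reports which DataNodes hold each block;
            // MapReduce uses this to run map tasks on those same nodes.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + " length " + block.getLength()
                        + " hosts " + Arrays.toString(block.getHosts()));
            }
        }
    }
}
```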

YoungHobbit