
Difference between MapReduce, Hive, and Pig

Pig: it's a data flow language. It can work on any kind of data and is mostly used to convert semi-structured or unstructured data into structured data, so that it can then be used in Hive for advanced analytics (windowing functions, etc.).

Hive: works on structured data and provides a SQL-like query language.

I know that, at the back end, both Pig and Hive use MapReduce.

I know MapReduce can be a good tool for programmers, and Hive or Pig for SQL people.

I just want to know whether there are specific use cases where we would go for Hive, Pig, or MapReduce.

Basically, how do we decide that we should use Pig here, Hive there, or that we must use MapReduce?

user3484461
    Duplicate of http://stackoverflow.com/questions/17950248/pig-vs-hive-vs-native-map-reduce/17964271#17964271 – alexeipab Oct 30 '14 at 20:25

3 Answers


MapReduce: has better performance than Pig or Hive, but requires more development time.

Pig: less development time, but poorer performance than MapReduce.

Hive: a SQL-like language with some good features, such as partitioning and bucketing, that improve read performance. Also, Hive enforces schema on read.
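To make the partitioning point concrete, here is a minimal Python sketch of why a partitioned layout speeds up reads. It only assumes that each partition lives in its own `key=value` directory, which is how Hive lays out partitioned tables; the file paths, the `dt` column, and the helper names `partition_values` and `prune` are made up for illustration and are not Hive APIs.

```python
# Hypothetical file listing for a table partitioned by a `dt` column.
# Each partition lives in its own `key=value` directory.
FILES = [
    "/warehouse/logs/dt=2014-10-29/part-00000",
    "/warehouse/logs/dt=2014-10-30/part-00000",
    "/warehouse/logs/dt=2014-10-30/part-00001",
    "/warehouse/logs/dt=2014-10-31/part-00000",
]


def partition_values(path):
    """Extract {key: value} pairs from the `key=value` segments of a path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def prune(files, **wanted):
    """Keep only the files whose partition values match every requested key."""
    return [
        f for f in files
        if all(partition_values(f).get(k) == v for k, v in wanted.items())
    ]


# A filter on the partition column only has to touch the matching directory.
selected = prune(FILES, dt="2014-10-30")
```

A query that filters on the partition column can skip the other directories entirely, which is where the read-performance gain comes from.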

darkknight444
  • How would MapReduce have better performance than Pig or Hive? For instance, if you have to join data, writing MapReduce is inefficient because you don't have many options. Suppose you have two big tables to join that can't fit in memory: how would you do the join in MapReduce? – user3484461 Oct 30 '14 at 04:07
  • It's not entirely true that MapReduce performs better than Pig or Hive. But we can say that MapReduce is lower level than Pig and Hive, so it allows more flexibility but takes more code and time to write. – Luís Bianchin Mar 02 '15 at 20:55
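On the join question raised in the comments: one common approach in plain MapReduce is a reduce-side join, where the mapper tags each record with its source table and emits it under the join key, and the reducer pairs up the records that share a key. The sketch below simulates the three phases in ordinary Python. The `users` and `orders` tables and all function names are hypothetical, and the in-memory `shuffle` stands in for the framework's distributed sort and shuffle, which is the part that lets real jobs join tables too large for memory.

```python
from collections import defaultdict
from itertools import product


def map_phase(users, orders):
    """Emit (join_key, (source_tag, record)) pairs from both inputs."""
    for user_id, name in users:
        yield user_id, ("U", name)
    for order_id, user_id, amount in orders:
        yield user_id, ("O", (order_id, amount))


def shuffle(pairs):
    """Group values by key; a stand-in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Pair every user record with every order record for one key."""
    names = [v for tag, v in values if tag == "U"]
    orders = [v for tag, v in values if tag == "O"]
    for name, (order_id, amount) in product(names, orders):
        yield key, name, order_id, amount


def reduce_side_join(users, orders):
    """Run the simulated phases and return the inner-join rows."""
    groups = shuffle(map_phase(users, orders))
    rows = []
    for key in sorted(groups):
        rows.extend(reduce_phase(key, groups[key]))
    return rows


# Hypothetical sample data: (user_id, name) and (order_id, user_id, amount).
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(100, 1, 9.5), (101, 1, 3.0), (102, 2, 7.25), (103, 4, 1.0)]
joined = reduce_side_join(users, orders)
```

Keys that appear in only one table (user 3 and the order for user 4 here) produce no output, so this behaves like an inner join. It is also roughly the amount of code that a single `JOIN` statement replaces in Pig or Hive, which is the development-time trade-off discussed in this thread.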

Pig is used to reshape unstructured or semi-structured data. Let's say you have a timestamp in your data that does not match Hive's timestamp format. You can convert it with a Pig UDF and reformat your data. This is just one example; you can do many more things with Pig.
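As a rough illustration of the timestamp case, here is the kind of conversion function such a UDF might wrap, written as plain Python. The input layout `dd/MM/yyyy HH:mm:ss` is an assumption made up for this example, and `to_hive_timestamp` is a hypothetical name. The target layout `yyyy-MM-dd HH:mm:ss` is the text form Hive accepts for timestamps. Registering the function with Pig is not shown.

```python
from datetime import datetime

# Assumed layout of the raw data, e.g. "30/10/2014 04:07:00".
INPUT_FORMAT = "%d/%m/%Y %H:%M:%S"
# Text layout that Hive accepts for timestamp values.
HIVE_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_hive_timestamp(raw):
    """Return the timestamp in Hive's layout, or None if it cannot be parsed."""
    try:
        return datetime.strptime(raw, INPUT_FORMAT).strftime(HIVE_FORMAT)
    except (TypeError, ValueError):
        # Malformed or missing values become nulls instead of failing the job.
        return None


converted = to_hive_timestamp("30/10/2014 04:07:00")
```

Returning `None` for bad input is a design choice for this sketch: one malformed row then ends up as a null rather than aborting the whole load.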

Hive is basically used for structured data. It may not work well with unstructured data. It also takes more time to execute, because each query is converted into MapReduce jobs. I suggest using Impala, which is much faster than Hive.

Amaresh

Pig is a data flow language. This means that you cannot use if statements or loops. If you need to do a lot of repetition, it would be preferable to learn MapReduce.

You can get around this by embedding Pig in a Python script, but this would take even longer, since it would have to load all the jar files on every iteration of the loop.

Basically, it boils down to how much time you spend prototyping versus how much production work you have. If you are a data scientist or an analyst, most of your work is new projects that require a lot of prototyping. This means that you care about getting results fast, so you would prefer Pig or Hive. If you are on a development team, you want to build robust code based on an agreed-upon methodology that does not need to be tested, and then you would prefer MapReduce.

There are companies like Cloudera that provide a package of Pig, Hive, and other Hadoop tools, so you wouldn't have to choose between them.

Michal