
I have an ASCII input file for my project and I use a Pig script to run MapReduce over it. In this script I extract specific character intervals using SUBSTRING. If I instead extract the character intervals in Java and then use the resulting jar from another Pig script that reduces my data, will my program run faster?

nhahtdh
erbileren

1 Answer


It all depends on how you implement the character-interval split in your map method. If you know your data, the substring step can perhaps be optimized. Check this thread:

charAt() or substring? Which is faster?

Also, in general, adding jars to a Hadoop cluster adds some overhead for the file transfer and for internal setup (classloaders, unpacking, etc.), but in this case the jar is small enough that the overhead should be negligible. So, in short, moving the mapper logic into your own Java code should not add significant overhead, and it can speed up the map phase if the code Pig generates is not optimal and your Java code is well suited to your strings.
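As a rough illustration of what the Java side could look like, here is a minimal sketch of interval extraction with `String.substring`. The class name `FixedWidthFields`, the method `extract`, and the sample intervals are hypothetical; the question does not show the real layout. It assumes each interval is a `[start, end)` pair with `start <= end`, and it is plain Java: to call it from Pig you would still have to wrap it in a UDF (a class extending `org.apache.pig.EvalFunc`) and `REGISTER` the jar.

```java
import java.util.ArrayList;
import java.util.List;

public class FixedWidthFields {
    // Hypothetical helper: returns the piece of `line` covered by each
    // [start, end) interval. Offsets past the end of the line are clamped,
    // so short lines yield shorter (possibly empty) fields instead of throwing.
    public static List<String> extract(String line, int[][] intervals) {
        List<String> fields = new ArrayList<>(intervals.length);
        for (int[] interval : intervals) {
            int start = Math.min(interval[0], line.length());
            int end = Math.min(interval[1], line.length());
            fields.add(line.substring(start, end));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Assumed layout for illustration only: 4-char year, 2-char month, 2-char day.
        int[][] intervals = {{0, 4}, {4, 6}, {6, 8}};
        System.out.println(extract("20121210", intervals)); // prints [2012, 12, 10]
    }
}
```

This does the same work as a chain of SUBSTRING calls in the Pig script, so on its own it is unlikely to change the running time much.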

Community
Nikola Yovchev
  • I don't need any optimization of the substring step, so I think that if I do the same work in Java the speed will be almost the same. Am I right? – erbileren Dec 10 '12 at 16:29
  • I am not sure whether there is an easy way to inspect the instructions Pig generates. Keep in mind that if they are not optimal, Java might in some cases perform better thanks to bytecode optimization, but it can also be slower because of object-creation overhead and garbage collection. I wouldn't expect the difference to be big, but don't take my word for it: I am not as familiar with Pig internals as I am with JVM internals. I would benchmark both approaches on a small map input file and see whether it makes any difference. – Nikola Yovchev Dec 10 '12 at 18:57
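The Java half of such a comparison can be timed in isolation with a small harness like the sketch below. The class name `SubstringTiming`, the sample line, the offsets, and the iteration count are all assumptions made for illustration. It only compares `substring` against a `charAt` loop inside one JVM; it says nothing about the code Pig generates, so the real test is still to run both Pig scripts on the same small input and compare job times.

```java
public class SubstringTiming {
    // Extracts [start, end) with substring, repeated `iterations` times.
    // The lengths are summed so the JIT cannot discard the work as dead code.
    public static long viaSubstring(String line, int start, int end, int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += line.substring(start, end).length();
        }
        return total;
    }

    // Same extraction, built character by character with charAt.
    public static long viaCharAt(String line, int start, int end, int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            StringBuilder sb = new StringBuilder(end - start);
            for (int j = start; j < end; j++) {
                sb.append(line.charAt(j));
            }
            total += sb.toString().length();
        }
        return total;
    }

    public static void main(String[] args) {
        String line = "0123456789ABCDEFGHIJ"; // assumed sample record
        int iterations = 1_000_000;

        // Warm-up pass so both methods are JIT-compiled before timing.
        viaSubstring(line, 5, 15, iterations);
        viaCharAt(line, 5, 15, iterations);

        long t0 = System.nanoTime();
        long a = viaSubstring(line, 5, 15, iterations);
        long t1 = System.nanoTime();
        long b = viaCharAt(line, 5, 15, iterations);
        long t2 = System.nanoTime();

        System.out.printf("substring: %d ms (checksum %d)%n", (t1 - t0) / 1_000_000, a);
        System.out.printf("charAt:    %d ms (checksum %d)%n", (t2 - t1) / 1_000_000, b);
    }
}
```

Timings from a loop like this are only indicative; a harness such as JMH gives more trustworthy numbers if the difference turns out to matter.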