
How to read a .doc file using Apache Pig Latin with MapReduce


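-- TextLoader is a load function, so it belongs in the load statement.
A = load './pig/test.docx' using TextLoader() as (line:chararray);

-- TOKENIZE splits each line into words; flatten emits one word per row.
B = foreach A generate flatten(TOKENIZE(line)) as word;

C = group B by word;

D = foreach C generate COUNT(B), group;

store D into './wordcountone';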


Prasanna
    If you are really just interested in doing things like word counting and don't need all the extra markup inherent in Word files, the best solution is almost certainly going to be finding a piece of software to convert them to plaintext files for you. – reo katoa Aug 26 '13 at 13:11

1 Answer


You would need to create a custom load function for your Pig script. Start with simple .doc or .docx parsing in Java; some examples are available here: "How read Doc or Docx file in java?", and I'm sure you will find more on Google.
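As a rough illustration of the parsing step, here is a minimal sketch that pulls the text out of a .docx file using only the JDK. It relies on the fact that a .docx is a ZIP archive whose main body lives in `word/document.xml`. The class name `DocxText` and the method `extractText` are made up for this example. It ignores tabs, line breaks, headers, footnotes and text boxes, and it does not handle the older binary .doc format, for which a library such as Apache POI is the usual choice.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not part of Pig or any library.
class DocxText {
    // Namespace of WordprocessingML elements such as w:p and w:t.
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the body text of a .docx stream, one line per paragraph. */
    static String extractText(InputStream docx)
            throws IOException, ParserConfigurationException, SAXException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    // readAllBytes stops at the end of the current entry.
                    return paragraphs(zip.readAllBytes());
                }
            }
        }
        throw new IOException("word/document.xml not found; not a .docx file?");
    }

    private static String paragraphs(byte[] xml)
            throws IOException, ParserConfigurationException, SAXException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the input file is untrusted.
        factory.setFeature(
            "http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml));

        List<String> lines = new ArrayList<>();
        NodeList paras = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paras.getLength(); i++) {
            // Concatenate every w:t text run inside this paragraph.
            NodeList runs = ((Element) paras.item(i))
                .getElementsByTagNameNS(W_NS, "t");
            StringBuilder line = new StringBuilder();
            for (int j = 0; j < runs.getLength(); j++) {
                line.append(runs.item(j).getTextContent());
            }
            lines.add(line.toString());
        }
        return String.join("\n", lines);
    }
}
```

Calling `DocxText.extractText(new FileInputStream("test.docx"))` would return the document text as a single string. Because the method takes an `InputStream`, the same logic could be reused inside a custom loader, or run beforehand to write a plain-text file that Pig's built-in `TextLoader` can read.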

Once you know how to get your data out of the Word document, you need to implement your Pig load function.

A step-by-step example of a custom Pig loader can be found here.

Kris