
I want to write a custom load UDF in Pig to load files from a directory structure.

The directory structure is like an email directory. It has a root directory called maildir. Inside it there is one sub-directory per mail holder, and inside every mail holder's directory are several sub-directories such as inbox, sent, trash, etc.

e.g. maildir/mailholdername1/inbox/1.txt, maildir/mailholdername2/sent/1.txt

I want to read only the inbox files from all the mailholdername sub-directories.

I am not able to understand:

  1. what should be passed to the load UDF as a parameter
  2. how the entire directory structure should be traversed so that only the respective inbox files are read.

I want to process one file at a time, perform some data extraction, and load the result as one record. Hence, if there are 10 files, I should get a relation with 10 records. After that, I want to perform further operations on the data extracted from these inbox files.

akjoshi
Shrey Shivam
  • Can you show what you have done so far? – rsp Dec 21 '12 at 10:36
  • Actually, I have done this in core Java, but reading and processing such huge text files (about 3 GB) is very time-consuming in Java, so I switched to Pig. Now I am not able to do even the first step. The data is completely unstructured: each file is like a normal email text file. We cannot load the files directly because there is no schema, so I am not able to move further. – Shrey Shivam Dec 22 '12 at 18:46
  • Hi Shrey, did you get an answer? – Kshitij Kulshrestha Feb 11 '15 at 12:08

1 Answer

Because you have a defined folder structure that doesn't have variable depth, I think it's as simple as passing the following pattern as your input path:

A = LOAD 'maildir/*/inbox/1.txt' USING PigStorage('\t') AS (f1,f2,f3);

You probably don't need to create your own UDF for this. The built-in PigStorage loader should be able to handle the files, assuming they are in some delimited format (the above example assumes 3 fields, tab delimited).

If there are multiple txt files in each inbox, use *.txt rather than 1.txt. Finally, if the maildir root directory is not in your user's home directory, you should use the absolute path to the folder, say /data/maildir/*/inbox/*.txt
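To illustrate which paths a pattern like this selects, here is a minimal sketch in plain Java using the JDK's own glob matcher. It is only an illustration: Pig resolves the pattern through Hadoop's glob implementation, not through `java.nio`, and the sketch assumes the two agree that `*` matches within a single path component. The class and method names are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper: checks relative paths against the inbox glob.
// Uses the JDK glob matcher as a stand-in for Hadoop's globbing (assumption).
class InboxGlobSketch {
    // '*' matches any characters within one path component, so the first '*'
    // stands for exactly one mail holder directory.
    static final PathMatcher INBOX =
            FileSystems.getDefault().getPathMatcher("glob:maildir/*/inbox/*.txt");

    static boolean isInboxFile(String relativePath) {
        return INBOX.matches(Paths.get(relativePath));
    }

    public static void main(String[] args) {
        // The two example paths from the question.
        System.out.println(isInboxFile("maildir/mailholdername1/inbox/1.txt"));
        System.out.println(isInboxFile("maildir/mailholdername2/sent/1.txt"));
    }
}
```

Because the first `*` covers only one directory level, a single pattern picks up every mail holder's inbox while skipping sent, trash, and any deeper nesting.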

Chris White
  • Hi Chris, thanks for the reply. The problem is that I want to process the inbox files of all the mail holders in one go. Suppose there are 200 mail holders (sub-directories under the root directory maildir); I don't want to write a LOAD for each individual inbox sub-directory. I want to specify the root directory and load only the inbox files of all sub-directories. I don't know how to do this, and I'm not sure whether a UDF is the solution or how to write one... – Shrey Shivam Dec 22 '12 at 15:12
  • You don't need to write a load statement for each mailbox; the above 'glob' will ensure that all mailboxes are loaded – Chris White Dec 22 '12 at 15:28
  • OK, yes, this will load all the files that I require. Now the actual problem is that these text files are emails. I need to extract certain attributes (and their values) such as TO, FROM, SUBJECT, DATE, etc. from these mails. Each text file should generate one record, so if there are 100 mails across all inboxes, my relation A (as in the load statement) should contain 100 records. How do I do this processing? The LOAD statement processes structured data (AS (f1,f2), etc.), but here the data is unstructured, so I will first have to extract the attributes I require with some processing. Any idea how? – Shrey Shivam Dec 22 '12 at 18:40
  • In the case where you're processing unstructured text, you'll most probably need to write a `LoadFunc` (unless you can find one already written), and an InputFormat that knows how to process the file format – Chris White Dec 22 '12 at 19:33
  • http://stackoverflow.com/questions/10924922/example-and-more-explanation-about-loadfunc – Chris White Dec 22 '12 at 19:35
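The extraction step discussed in these comments (one email file in, one record out) can be sketched in plain Java. This is not a Pig `LoadFunc`; it is only the per-file parsing logic that a custom loader's `getNext()` might delegate to once it has the whole file's text. It assumes each file starts with `Name: value` header lines followed by a blank line, which the thread does not confirm. The class name, method name, and sample values in the usage note are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: turns the text of one email file into one record.
class MailHeaderSketch {
    // The attributes named in the comment thread, lower-cased for matching.
    private static final List<String> WANTED =
            Arrays.asList("from", "to", "subject", "date");

    static Map<String, String> extractHeaders(String mailText) {
        Map<String, String> record = new LinkedHashMap<>();
        String current = null; // wanted header that a continuation line extends
        for (String line : mailText.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // assumption: the first blank line ends the header block
            }
            if (line.startsWith(" ") || line.startsWith("\t")) {
                // Folded (continuation) line: append it to the previous header.
                if (current != null) {
                    record.put(current, record.get(current) + " " + line.trim());
                }
                continue;
            }
            int colon = line.indexOf(':');
            current = null;
            if (colon <= 0) {
                continue; // not a "Name: value" line
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            // Keep only the first occurrence of each wanted header.
            if (WANTED.contains(name) && !record.containsKey(name)) {
                record.put(name, line.substring(colon + 1).trim());
                current = name;
            }
        }
        return record;
    }
}
```

For a made-up input such as `"From: a@example.com\nSubject: Hello\n\nbody text"`, the returned map holds the `from` and `subject` entries and ignores the body. A loader would then emit the map's values as the fields of a single tuple, giving one record per file.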