I want to write a custom load UDF in Pig for loading files from a directory structure.
The directory structure is like an email directory. It has a root directory called maildir. Inside it there is one sub-directory per mail holder, and inside every mail holder's directory there are several sub-directories such as inbox, sent, trash, etc.
e.g. maildir/mailholdername1/inbox/1.txt, maildir/mailholdername2/sent/1.txt
I want to read only the inbox files from all the mailholdername sub-directories.
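To make the selection concrete, here is a plain-Java sketch of the path filter I have in mind. It uses only java.nio and no Pig or Hadoop classes. The glob maildir/*/inbox/* and the class name InboxGlob are my own illustration. I believe a Hadoop-style glob like this can be given as the load location, but I have not verified that for a custom loader.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper: decides whether a relative path is an inbox file.
class InboxGlob {
    // One wildcard for the mail holder directory, one for the file name.
    // A single '*' does not cross directory boundaries.
    private static final PathMatcher INBOX =
            FileSystems.getDefault().getPathMatcher("glob:maildir/*/inbox/*");

    static boolean isInboxFile(String relativePath) {
        return INBOX.matches(Paths.get(relativePath));
    }

    public static void main(String[] args) {
        // The two example paths from above: one inbox file, one sent file.
        System.out.println(isInboxFile("maildir/mailholdername1/inbox/1.txt"));
        System.out.println(isInboxFile("maildir/mailholdername2/sent/1.txt"));
    }
}
```

Only paths with exactly one directory level between maildir and inbox pass the filter, so sent and trash files are skipped.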
I am not able to understand:
- what should be passed to the load UDF as a parameter
- how the entire directory structure should be traversed so that only the respective inbox files are read.
I want to process one file at a time, perform some data extraction on it and load the result as one record. Hence, if there are 10 inbox files, I should get a relation with 10 records. After that, I want to run further operations on the loaded inbox data.
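The behaviour I am after, one record per inbox file, can be sketched outside Pig like this. It is plain Java on the local file system, not a LoadFunc. The class name InboxRecords and the record layout (holder, file name, file content) are assumptions made for illustration only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: turns every inbox file under a root directory into one record.
class InboxRecords {
    static List<String[]> load(Path root) throws IOException {
        // Matched against paths relative to the root: <holder>/inbox/<file>
        PathMatcher inbox = FileSystems.getDefault().getPathMatcher("glob:*/inbox/*");
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile)
                        .filter(p -> inbox.matches(root.relativize(p)))
                        .sorted()
                        .collect(Collectors.toList());
        }
        List<String[]> records = new ArrayList<>();
        for (Path file : files) {
            // First path element below the root is the mail holder's directory.
            String holder = root.relativize(file).getName(0).toString();
            // The whole file becomes a single field, so one file gives one record.
            String body = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            records.add(new String[] {holder, file.getFileName().toString(), body});
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : "maildir");
        for (String[] r : load(root)) {
            System.out.println(r[0] + "\t" + r[1] + "\t" + r[2].length() + " chars");
        }
    }
}
```

With 10 inbox files under the root, load returns a list of 10 records. In the real loader the data extraction would happen where the file content is read.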