Loading new files using Pig LOAD statement

Question

I want to load data from HDFS into an HBase table using a Pig script.

My HDFS folder structure is as follows:

-rw-r--r--  1 user supergroup   63 2014-05-15 20:28 dataparse/good/goodrec_051520142028
-rw-r--r--  1 user supergroup   72 2014-05-15 20:30 dataparse/good/goodrec_051520142030
-rw-r--r--  1 user supergroup   110 2014-05-15 20:32 dataparse/good/goodrec_051520142032

All of the file names above end with a timestamp.

Below is my Pig script to load from HDFS into HBase:

G = LOAD '/user/user/dataparse/good/' USING PigStorage(',') as (c1:chararray, c2:chararray,c3:chararray,c4:chararray,c5:chararray);
STORE G INTO 'hbase://test' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('t1:name t1:state t1:phone_no t1:gender');

The script works fine, and the data from all 3 files is written to the HBase "test" table.

Suppose that some time later more files with the same structure arrive in HDFS. When I run the Pig script again, it will LOAD every file in the "good" directory, including the files that were already read. How can I load only the new files? Files that were already loaded should not be loaded into my HBase table again.

How can I do this?

Thanks, Sapthashree

Any updates on the above post? – shree11 May 16 '14 at 04:34

score 0 · Answer 1 · edited May 23 '17 at 10:29

I think you have a few options here.

Using globs

Use a shell script to pick up the "new" files, and use the glob feature so that multiple files can be fed into the Pig script. A related use case is here.
If the files have a date and timestamp in the file name, you can use globs directly; look here for inspiration.
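
As a rough illustration of the date-based glob idea, here is a minimal Python sketch (Python is my choice for the helper, not something from the original answer). It assumes the file names follow the goodrec_MMDDYYYYHHMM pattern seen in the question's listing; the function name glob_for_day is hypothetical.

```python
from datetime import datetime

# Assumed naming scheme, taken from the listing in the question:
# goodrec_MMDDYYYYHHMM, e.g. goodrec_051520142028
PREFIX = "goodrec_"


def glob_for_day(directory, day):
    """Return a glob that matches every file stamped with the given day."""
    return directory.rstrip("/") + "/" + PREFIX + day.strftime("%m%d%Y") + "*"


if __name__ == "__main__":
    pattern = glob_for_day("/user/user/dataparse/good", datetime(2014, 5, 15))
    print(pattern)  # /user/user/dataparse/good/goodrec_05152014*
```

The resulting pattern could then be handed to the Pig script through parameter substitution, for example pig -param INPUT='<pattern>' load.pig with LOAD '$INPUT' in the script (load.pig and INPUT are placeholder names). Note that a per-day glob re-reads earlier files from the same day if the job runs more than once a day, so the granularity of the pattern has to match the schedule.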

Using big guns

If globs fail you, bring out the big guns: write a custom load function, put the logic to identify "new files" in it, and you should be good to go. Details here.

answered May 15 '14 at 16:34

Sudarshan

Hi, I went through the link you suggested, where the glob example is explained. But with a glob we provide a pattern to read specific files in the directory. If new files keep arriving in the directory, do I need to change the glob pattern each time? I want a pattern such that each time a new file is added, only the new files are read, not the old ones. How can this be done using a glob or some other alternative? – shree11 May 19 '14 at 04:51
Try using a glob pattern that is based on the system time or the like. If your files are named by date and timestamp and your glob is also based on date and timestamp, it should work. – Sudarshan May 19 '14 at 05:01

score 0 · Answer 2 · edited May 23 '17 at 12:18

You need some scheduling mechanism so that the Pig job runs from time to time. As part of that process, you can process only the files that were not processed earlier by keeping track of the timestamps and file names, or any other field.

For more information, see here: Execute Pig from within Java Application
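
The bookkeeping described above could look roughly like the following Python sketch. Everything here is an assumption for illustration: the manifest is a local JSON file, the directory listing is passed in as a plain list of names (for example, collected from an HDFS listing by the scheduler), and all function names are hypothetical. The brace form {a,b} is the Hadoop glob alternation syntax, which Pig's LOAD accepts.

```python
import json
from pathlib import Path


def load_processed(manifest_path):
    """Read the set of already-loaded file names; empty on the first run."""
    path = Path(manifest_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def save_processed(manifest_path, names):
    """Persist the set of loaded file names for the next run."""
    Path(manifest_path).write_text(json.dumps(sorted(names)))


def select_new(all_names, processed):
    """Return the names that have not been loaded yet, in sorted order."""
    return sorted(set(all_names) - set(processed))


def brace_glob(directory, names):
    """Build a path or {a,b,c} glob covering exactly the given files."""
    if not names:
        return None  # nothing new, so the Pig job can be skipped
    base = directory.rstrip("/") + "/"
    if len(names) == 1:
        return base + names[0]
    return base + "{" + ",".join(names) + "}"


if __name__ == "__main__":
    listing = ["goodrec_051520142028", "goodrec_051520142030",
               "goodrec_051520142032"]
    already_loaded = {"goodrec_051520142028"}
    new_files = select_new(listing, already_loaded)
    print(brace_glob("/user/user/dataparse/good", new_files))
```

The manifest should only be updated with save_processed after the Pig job has finished successfully; otherwise a failed run would mark files as loaded that never reached HBase.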

answered Feb 23 '17 at 07:18

rakeeee