
I have a CSV file with two header rows. I want to remove them. How do I delete the first two rows of a CSV file in Hive or Pig? The first few rows of the file are given below:

YEAR    QUARTER MONTH   DAY_OF_MONTH    DAY_OF_WEEK FL_DATE UNIQUE_CARRIER  AIRLINE_ID  CARRIER TAIL_NUM    FL_NUM  ORIGIN  ORIGIN_CITY_NAME    ORIGIN_STATE_ABR    ORIGIN_STATE_FIPS   ORIGIN_STATE_NM ORIGIN_WAC  DEST    DEST_CITY_NAME  DEST_STATE_ABR  DEST_STATE_FIPS DEST_STATE_NM   DEST_WAC    CRS_DEP_TIME    DEP_TIME    DEP_DELAY   DEP_DELAY_NEW   DEP_DEL15   DEP_DELAY_GROUP DEP_TIME_BLK    TAXI_OUT    WHEELS_OFF  WHEELS_ON   TAXI_IN CRS_ARR_TIME    ARR_TIME    ARR_DELAY   ARR_DELAY_NEW   ARR_DEL15   ARR_DELAY_GROUP ARR_TIME_BLK    CANCELLED   CANCELLATION_CODE   DIVERTED    CRS_ELAPSED_TIME    ACTUAL_ELAPSED_TIME AIR_TIME    FLIGHTS DISTANCE    DISTANCE_GROUP  CARRIER_DELAY   WEATHER_DELAY   NAS_DELAY   SECURITY_DELAY  LATE_AIRCRAFT_DELAY
YEAR    QUARTER MONTH   DAY_OF_MONTH    DAY_OF_WEEK FL_DATE UNIQUE_CARRIER  AIRLINE_ID  CARRIER TAIL_NUM    FL_NUM  ORIGIN  ORIGIN_CITY_NAME    ORIGIN_STATE_ABR    ORIGIN_STATE_FIPS   ORIGIN_STATE_NM ORIGIN_WAC  DEST    DEST_CITY_NAME  DEST_STATE_ABR  DEST_STATE_FIPS DEST_STATE_NM   DEST_WAC    CRS_DEP_TIME    DEP_TIME    DEP_DELAY   DEP_DELAY_NEW   DEP_DEL15   DEP_DELAY_GROUP DEP_TIME_BLK    TAXI_OUT    WHEELS_OFF  WHEELS_ON   TAXI_IN CRS_ARR_TIME    ARR_TIME    ARR_DELAY   ARR_DELAY_NEW   ARR_DEL15   ARR_DELAY_GROUP ARR_TIME_BLK    CANCELLED   CANCELLATION_CODE   DIVERTED    CRS_ELAPSED_TIME    ACTUAL_ELAPSED_TIME AIR_TIME    FLIGHTS DISTANCE    DISTANCE_GROUP  CARRIER_DELAY   WEATHER_DELAY   NAS_DELAY   SECURITY_DELAY  LATE_AIRCRAFT_DELAY
2015    1   1   1   4   2015-01-01  AA  19805   AA  N787AA  1   JFK New York     NY NY  36  New York    22  LAX Los Angeles  CA CA  6   California  91  900 855 -5  0   0   -1  0900-0959   17  912 1230    7   1230    1237    7   7   0   0   1200-1259   0       0   390 402 378 1   2475    10          
2015    1   1   2   5   2015-01-02  AA  19805   AA  N795AA  1   JFK New York     NY NY  36  New York    22  LAX Los Angeles  CA CA  6   California  91  900 850 -10 0   0   -1  0900-0959   15  905 1202    9   1230    1211    -19 0   0   -2  1200-1259   0       0   390 381 357 1   2475    10          
dipayan
  • Can you simply delete all rows that start with "YEAR"? – Prune Sep 17 '15 at 21:01
  • thanks a lot for your answer... can you please provide me the code to do it? Is it going to be in Pig or Hive? – dipayan Sep 17 '15 at 21:05
  • http://stackoverflow.com/questions/17810537/how-to-delete-and-update-a-record-in-hive Search for "PIG delete rows from table" – Prune Sep 17 '15 at 21:19

1 Answer


Try this, and modify it as per your requirements. I have loaded each row as a single line; you can define the columns for each field as well.

a = LOAD 'file.csv' USING TextLoader() AS (line:chararray);
b = FILTER a BY SUBSTRING(line, 0, 4) != 'YEAR';
DUMP b;
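The same header filter can also be sanity-checked outside Pig with plain shell tools. This is a minimal sketch using made-up sample data, not part of the answer's pipeline:

```shell
# Build a tiny sample: two repeated header lines, then one data row.
printf 'YEAR\tQUARTER\nYEAR\tQUARTER\n2015\t1\n' > sample.csv

# grep -v drops every line that starts with YEAR, i.e. both header rows.
grep -v '^YEAR' sample.csv > sample_noheader.csv

cat sample_noheader.csv
```

Only the data row survives, which mirrors what the Pig FILTER above keeps.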

Or using Hive:

CREATE TABLE temp (col1 STRING, col2 INT, ... remaining columns)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
TBLPROPERTIES ("skip.header.line.count"="2");

LOAD DATA INPATH 'file path' INTO TABLE temp;

This will skip the first two rows and load only the remaining records.
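If the two extra rows are always the first two lines of the file, a third option is to strip them positionally before Hive or Pig ever sees the data. A minimal shell sketch (file names here are placeholders, not from the question):

```shell
# Sample file: two header lines followed by one data row.
printf 'YEAR\nYEAR\n2015\n' > flights.csv

# tail -n +3 starts printing at line 3, skipping exactly the first two lines.
tail -n +3 flights.csv > flights_noheader.csv

cat flights_noheader.csv   # prints: 2015
```

Unlike the FILTER approach, this does not depend on the header lines starting with 'YEAR', only on their position.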

  • thanks Harish for your answer... but variable b returns only those rows which start with 'YEAR'. So I wrote it in the manner given below... still it doesn't work... can anyone please help? b = FILTER a by SUBSTRING(line,0,4) != 'YEAR'; As per me, it should give the result set except the rows with 'YEAR'... – dipayan Sep 18 '15 at 15:23
  • Hi Dipayan.. I have updated to filter for Not equal to YEAR. It is working fine for me. Can you please post the code you are trying? – HarishKotha Sep 18 '15 at 15:37
  • 1st.grunt> d = FILTER a by SUBSTRING(line,0,4) != 'YEAR'; 2nd.grunt> store d into '/user/hadoop/abp_final5' using PigStorage (',','-schema'); 3rd.hduser@greycampus:~$ hadoop fs -getmerge /user/hadoop/csvoutput_final ./abp_final_try.csv; These are my codes step by step... – dipayan Sep 18 '15 at 15:42
  • can you please give a dump on a and d and post the sample results of each here? – HarishKotha Sep 18 '15 at 16:03
  • I have added one more way to do the same with hive. Please try that option as well if needed. – HarishKotha Sep 18 '15 at 18:50