2

As part of my project, I have a CSV file with comma-separated values, but a few fields are enclosed in quotes and contain commas themselves. As a result, the data is not loaded correctly. For example, suppose the data is: car, deer, "bear, cat"

In the above example, there should ideally be 3 columns, but it is being treated as 4 columns because of the comma between bear and cat. The field "bear, cat" is not kept together as a single field.

Please suggest whether there is something in Pig to handle such scenarios.

4 Answers

1

You can use the Apache Commons CSV library to handle this.
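
A CSV library gets this right because it tracks whether the parser is currently inside a quoted field. Below is a minimal plain-Java sketch of that idea. It is not the Apache library's API: `QuotedCsvLine` and `split` are hypothetical names made up for illustration, and the sketch assumes one record per line with double quotes as the only quoting character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Pig or any Apache library.
final class QuotedCsvLine {

    // Splits one CSV line on commas that are outside double quotes.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    field.append('"'); // doubled quote inside a quoted field
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote, not kept
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(field.toString()); // field boundary
                field.setLength(0);
            } else {
                field.append(c); // includes commas inside quotes
            }
        }
        fields.add(field.toString()); // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // prints [car, deer, bear,cat]
        System.out.println(split("car,deer,\"bear,cat\""));
    }
}
```

The result has 3 fields, with "bear,cat" kept together. For real data a tested library is the safer choice, since this sketch ignores cases such as quoted fields that span several lines.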

Patrick Chan
0

You can try a regex for this. The question has already been asked on SO, and the answer there explains it very well.
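
The regex from the linked answer is not reproduced here, so the following is only an assumption about the kind of pattern meant: a commonly used lookahead that splits on a comma only when an even number of double quotes follows it, meaning the comma is outside any quoted field. `QuoteAwareRegex` is a hypothetical name, and the sketch assumes the quotes in each line are balanced.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class QuoteAwareRegex {

    // A comma followed by an even number of quotes up to the end of the line.
    private static final String COMMA_OUTSIDE_QUOTES =
            ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // Splits the line but leaves the surrounding quotes on quoted fields.
    static String[] split(String line) {
        return line.split(COMMA_OUTSIDE_QUOTES, -1); // -1 keeps trailing empty fields
    }

    public static void main(String[] args) {
        // prints [car, deer, "bear,cat"]
        System.out.println(Arrays.toString(split("car,deer,\"bear,cat\"")));
    }
}
```

Note that the quotes stay on the third field, so they would need to be stripped in a separate step if they are not wanted.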

Community
Ankur Singhal
0

Can you try this?
input.csv

car,deer,"bear,cat"
car,deer,"bear,cat"

Pig script, output format 1:

A = LOAD 'input.csv' AS line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(\\w+),(\\w+),(.*)$')) AS (col1:chararray,col2:chararray,col3:chararray);
DUMP B;

Output:

(car,deer,"bear,cat")
(car,deer,"bear,cat")

Pig script, output format 2:

A = LOAD 'input.csv' AS line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(\\w+),(\\w+),"(\\w+),(.*)"$')) AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray);
DUMP B;

Output:

(car,deer,bear,cat)
(car,deer,bear,cat)
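
To check the two patterns outside Pig, here is a small Java sketch. It assumes REGEX_EXTRACT_ALL behaves like a full match with `Matcher.matches()`, and `PigRegexCheck` and `extractAll` are hypothetical names, not Pig functions. It also shows a limit of these patterns: `\w+` does not match spaces, so a line whose fields contain spaces will not match at all.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not a Pig API.
final class PigRegexCheck {

    // Returns the captured groups, or null when the whole line does not match.
    static String[] extractAll(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1); // group 0 is the whole match, so skip it
        }
        return groups;
    }

    public static void main(String[] args) {
        String line = "car,deer,\"bear,cat\"";
        // Format 1: three groups, quotes kept on the last one.
        System.out.println(Arrays.toString(extractAll("(\\w+),(\\w+),(.*)$", line)));
        // Format 2: four groups, quotes dropped.
        System.out.println(Arrays.toString(extractAll("(\\w+),(\\w+),\"(\\w+),(.*)\"$", line)));
        // A field with a space does not match \w+, so the result is null.
        System.out.println(Arrays.toString(extractAll("(\\w+),(\\w+),(.*)$", "car,red deer,x")));
    }
}
```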
Sivasakthi Jayaraman
-1
    public class StripQuotes {
        public static void main(String[] args) {
            String str = "c,b,\"c,d\"";
            System.out.println(str);       // prints c,b,"c,d"
            if (str.contains("\"")) {
                // Replace every double quote with a space, then remove all spaces.
                str = str.replaceAll("\"", " ");
                str = str.replaceAll(" ", "");
                System.out.println(str);   // prints c,b,c,d
            }
        }
    }


Az.MaYo