1

In the meantime I wrote an answer to a question that got closed, so I'm rewording and re-asking it.

I have a CSV file with 180 million records and 5 columns, like this:

"c a","L G-3 (8) N (4th G P Q C- 4 R- 1 T H- 15.6 I- W 8.1) (B)","C & P_L",1,0

How can I change it to a 3-column structure like this:

"c a|L G-3 (8) N (4th G P Q C- 4 R- 1 T H- 15.6 I- W 8.1) (B)|C & P_L",1,0

That is, I need to concatenate columns 1, 2 and 3 with | and print the result as one column, leaving the other columns unchanged.

I tried it with regexes:

cat RelatedKW.csv | perl -pe 's/(\|)/\//g'| perl -pe 's/("\s*"|"\s*"\s*\\n$)//g'| perl -pe 's/^,"|,,|"\s*,\s*\"/|/g' | perl -pe 's/\"(\d+),(\d+)\"/ |$1|$2/g' > newRKW4.csv

Is there a better way?

kobame

2 Answers

1

You should generally avoid parsing CSVs with regexes, as Kent Fredric explains in an answer to a similar question:

Not using CPAN is really a recipe for disaster.

Please consider this before trying to write your own CSV implementation. Text::CSV is over a hundred lines of code, including fixed bugs and edge cases, and re-writing this from scratch will just make you learn how awful CSV can be the hard way.

Trying to parse CSVs with regexes is really bad practice, because you need to handle, for example:

  • escaped quotes
  • escaped separator characters
  • fields containing the delimiter

and so on, all of which Text::CSV will handle for you.
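To make the edge cases above concrete, here is a minimal sketch using only core Perl. The sample record is made up for illustration (it is not from the question's data); it has four fields, one containing the delimiter and one containing an escaped quote.

```perl
use strict;
use warnings;

# Made-up record with 4 fields: the first contains a comma,
# the second contains doubled (escaped) quotes.
my $line = '"Smith, John","said ""hi""",1,0';

# Naive approach: split on every comma, ignoring quoting.
my @naive = split /,/, $line;

# Reports 5 pieces instead of the 4 real fields, because the comma
# inside the first quoted field is treated as a separator.
printf "naive split found %d fields\n", scalar @naive;
print "  [$_]\n" for @naive;
```

A real CSV parser would return four fields here and also unescape the doubled quotes.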

Here's a solution that uses Text::CSV_XS, the fast XS implementation behind Text::CSV. I'm not a Perl expert, so the following code may be missing some things, but it is probably better than using regexes:

perl -MText::CSV_XS -E '$csv = Text::CSV_XS->new ({ eol => $/ }); $csv->print(*STDOUT, [join(q{|}, @$row[0..2]), @$row[3..4]]) while ($row = $csv->getline(*STDIN))' < csv

Input:

"c a","L G-3 (8) N (4th G P Q C- 4 R- 1 T H- 15.6 I- W 8.1) (B)","C & P_L",1,0

Output:

"c a|L G-3 (8) N (4th G P Q C- 4 R- 1 T H- 15.6 I- W 8.1) (B)|C & P_L",1,0

Some potential problems: it doesn't handle escaping of the | character if any occur in the input, there is no error handling, and so on. For a more robust solution you would need to write a full Perl script rather than a one-liner.
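As a rough sketch of what such a script might look like, the following reads CSV from STDIN and writes to STDOUT. The helper name `merge_first_three` and the escaping convention (a literal `|` becomes `\|`, a literal `\` becomes `\\`) are assumptions made for illustration; choose whatever convention suits your data. It needs Text::CSV_XS installed from CPAN.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical helper: join the first three fields with "|", keep the rest.
# Assumed convention: escape "|" and "\" inside a field with a backslash,
# so the merged column can still be split unambiguously later.
sub merge_first_three {
    my ($row) = @_;
    die "expected at least 3 fields, got " . scalar(@$row) . "\n" if @$row < 3;
    my @head = map { (my $f = $_ // '') =~ s/([\\|])/\\$1/g; $f } @{$row}[0 .. 2];
    return [ join('|', @head), @{$row}[3 .. $#$row] ];
}

# binary => 1 allows non-ASCII data and embedded newlines in fields;
# auto_diag => 1 makes the parser report problems as warnings.
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1, eol => "\n" })
    or die "Cannot use Text::CSV_XS: " . Text::CSV_XS->error_diag;

while (my $row = $csv->getline(\*STDIN)) {
    $csv->print(\*STDOUT, merge_first_three($row));
}

# getline also returns undef on a parse error, so check we really hit EOF.
$csv->eof or die "CSV parse error near input line $.\n";
```

Saved under a name of your choice (say `merge.pl`, a placeholder), it would be run as `perl merge.pl < RelatedKW.csv > newRKW4.csv`.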

kobame
  • You might be surprised, but regexes are not always the right tool for the job, and CSV parsing is, contrary to popular belief, *not* trivial. So there are two choices: either reinvent your own (broken?) wheel, or use the right tool for the job. – mpapec Feb 24 '15 at 15:28
  • I understand that you're trying to respond directly to the OP of a question that was already closed, but the "dialog" (e.g. "On the other side, I understand you. You're probably not an programmer.") doesn't really make sense here. I've cleaned it up so your answer will actually make sense to other users. Instead of posting a duplicate question, I think you should have edited the original so it's not too broad and then people could have voted to re-open if they thought the question had value for the site. – ThisSuitIsBlackNot Feb 24 '15 at 15:43
  • @ThisSuitIsBlackNot Yeah, I understand, and you're right. Thank you for the edit. (Sorry, I just got a bit upset by some comments and needed to cool down.) Editing the original question and voting to reopen would surely have been the best way. – kobame Feb 24 '15 at 16:15
  • I appreciate your efforts on this topic and the fact that you suggested `Text::CSV` +1 – hek2mgl Feb 25 '15 at 01:03
  • @Сухой27 I'm not surprised. As you can see for yourself, I'm using Text::CSV, so I don't understand your comment. What is _reinvented_ in my (as you put it, broken?) solution? Could you please be more specific? – kobame Feb 26 '15 at 09:43
  • @kobame I was referring to the regex-only solution. – mpapec Feb 26 '15 at 10:32
0

Assuming every line of your data looks exactly like the sample, this should work:

$line =~ s-\",\"-|-g;
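For reference, here is a small sketch of that substitution applied to the sample line from the question, plus a made-up second line showing what the "exactly like the sample" assumption means in practice. It is written with `/` delimiters, which is equivalent to the `-` delimiters above.

```perl
use strict;
use warnings;

# Sample line from the question.
my $line = '"c a","L G-3 (8) N (4th G P Q C- 4 R- 1 T H- 15.6 I- W 8.1) (B)","C & P_L",1,0';
(my $merged = $line) =~ s/","/|/g;
print "$merged\n";

# Made-up line where the fourth column is quoted too: the substitution
# merges it as well, giving "a|b|c|1",0 (two columns instead of three).
my $other = '"a","b","c","1",0';
(my $over_merged = $other) =~ s/","/|/g;
print "$over_merged\n";
```

So the substitution only gives the wanted result when exactly the first three columns are quoted and no field contains the sequence `","`.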
sam