-2

I was trying to solve one problem for a while now but without success. In start it looks like a trivial issue but I have stacked with it...

Anyhow, I need to solve following problem. I have very large CSV file with lines in following format:

NUMBER(9);NUMBER(1);NUMBER(9-10);NUMBER(2);NUMBER(1);...;NUMBER(2);NUMBER(1);STRING;DATE(DD.MM.YYYY);NUMBER(1351)

for example:

517755369;1;0001303717;48;1;63;8;50;2;51;6;53;7;55;3;57;4;59;5;;;;;CALL;07.12.2012;1351

In each line after first tree fields I have 1 to 10 pairs NUMBER(2);NUMBER(1), followed by another three fields STRING;DATE(DD.MM.YYYY);NUMBER(1351).

I need to transform that file in file with following structure:

517755369;1;0001303717;48;1;CALL;07.12.2012;1351
517755369;1;0001303717;63;8;CALL;07.12.2012;1351
517755369;1;0001303717;50;2;CALL;07.12.2012;1351
517755369;1;0001303717;51;6;CALL;07.12.2012;1351
517755369;1;0001303717;53;7;CALL;07.12.2012;1351
517755369;1;0001303717;55;3;CALL;07.12.2012;1351
517755369;1;0001303717;57;4;CALL;07.12.2012;1351
517755369;1;0001303717;59;5;CALL;07.12.2012;1351`

So each line from input file should be transformed to as many lines as original line has NUMBER(2);NUMBER(1) pairs.

Here is a sample of input file:

517760344;2;000601301061;31;1;;;;;;;;;;;;;;;;;;;CALL;07.12.2012;1351
518855369;1;000601303717;48;1;63;8;50;2;51;6;53;7;55;3;57;4;59;5;;;;;CALL;07.12.2012;1351
519775067;1;000601300771;4;2;6;3;19;1;;;;;;;;;;;;;;;CALL;07.12.2012;1351
617773407;1;000603252922;13;1;17;2;27;3;;;;;;;;;;;;;;;CALL;07.12.2012;1351
717764779;1;000601304021;31;1;;;;;;;;;;;;;;;;;;;CALL;07.12.2012;1351`

In general I need some regexp that I can use with sed or awk (or some perl script I can run against input file). The original input file has roughly 1–1.5M records. This task should be finished as quickly as possible (up to 5 minutes for transformation).

Thanks

amon
  • 57,091
  • 2
  • 89
  • 149
d3ky
  • 31
  • 5
  • 4
    A regex seems like the wrong tool for the job here. – Wooble Jan 25 '13 at 18:40
  • 3
    You seem to have forgotten to ask a question. – ikegami Jan 25 '13 at 18:40
  • Question is How to do it? – d3ky Jan 25 '13 at 18:42
  • Have you tried a CSV parser? – nhahtdh Jan 25 '13 at 18:47
  • @nhahtdh, No I have not. This transformation is just one part of the job (the tricky part). Problem is that I 'm limited to tools that I have on my HP-UX box which is quite archaic :-) – d3ky Jan 25 '13 at 18:54
  • 1
    The transformation is the easy part. Just parse the line into an array, splice the elements off start and end, then just loop over the remaining elements and print as needed. – TLP Jan 25 '13 at 18:56
  • Check this: http://stackoverflow.com/questions/3065095/how-do-i-efficiently-parse-a-csv-file-in-perl – nhahtdh Jan 25 '13 at 18:57
  • A good tool for the job is awk, using ';' as field separator (see the man page, look for FS). You don't really need regular expressions for splitting. – naitoon Jan 25 '13 at 18:58
  • 1
    Split at ';', sort, print! –  Jan 25 '13 at 19:06
  • @naitoon, Could you pleas post any hint? I tried with awk (it was my first choice), but I failed. I Couldn't solve situations with variable number of NUMBER(2);NUMBER(1) pairs in different lines... – d3ky Jan 25 '13 at 19:07
  • 2
    Don't use regular expressions. Use the Text::CSV_XS module to take care of it for you. – Andy Lester Jan 25 '13 at 19:10

4 Answers4

2

Perhaps the following will be helpful:

use strict;
use warnings;

while (<>) {
    chomp;
    print +( join ';', ( split /;/ )[ 0 .. 4, -3 .. -1 ] ) . "\n";

}

Output on your data:

517760344;2;000601301061;31;1;CALL;07.12.2012;1351
518855369;1;000601303717;48;1;CALL;07.12.2012;1351
519775067;1;000601300771;4;2;CALL;07.12.2012;1351
617773407;1;000603252922;13;1;CALL;07.12.2012;1351
717764779;1;000601304021;31;1;CALL;07.12.2012;1351

Usage: perl file.csv >out.csv.

It appears that you want the first five fields and the last three. The above splits on ;, rejoins with ; and then prints the modified record.

Kenosis
  • 6,196
  • 1
  • 16
  • 16
  • I want the first tree fields and any pair of numbers which follows in separate line. for example 518855369;1;000601303717;48;1;CALL;07.12.2012;1351 and 518855369;1;000601303717;63;8;CALL;07.12.2012;1351 and so on... – d3ky Jan 25 '13 at 19:33
2

Ideas from @Kenosis, but different interpretation of specs:

use strict;
use warnings;

while (<DATA>) {
    chomp;
    my @fields = split /;/;
    my $f = 3;
    while ($fields[$f]) {
      print join( ';', @fields[0 .. 2, $f, $f + 1, -3 .. -1]), "\n";
      $f += 2;
    }
}

__DATA__
517760344;2;000601301061;31;1;;;;;;;;;;;;;;;;;;;CALL;07.12.2012;1351
518855369;1;000601303717;48;1;63;8;50;2;51;6;53;7;55;3;57;4;59;5;;;;;CALL;07.12.2012;1351
519775067;1;000601300771;4;2;6;3;19;1;;;;;;;;;;;;;;;CALL;07.12.2012;1351
617773407;1;000603252922;13;1;17;2;27;3;;;;;;;;;;;;;;;CALL;07.12.2012;1351
717764779;1;000601304021;31;1;;;;;;;;;;;;;;;;;;;CALL;07.12.2012;1351

output:

perl 14528210.pl
517760344;2;000601301061;31;1;CALL;07.12.2012;1351
518855369;1;000601303717;48;1;CALL;07.12.2012;1351
518855369;1;000601303717;63;8;CALL;07.12.2012;1351
518855369;1;000601303717;50;2;CALL;07.12.2012;1351
518855369;1;000601303717;51;6;CALL;07.12.2012;1351
518855369;1;000601303717;53;7;CALL;07.12.2012;1351
518855369;1;000601303717;55;3;CALL;07.12.2012;1351
518855369;1;000601303717;57;4;CALL;07.12.2012;1351
518855369;1;000601303717;59;5;CALL;07.12.2012;1351
519775067;1;000601300771;4;2;CALL;07.12.2012;1351
519775067;1;000601300771;6;3;CALL;07.12.2012;1351
519775067;1;000601300771;19;1;CALL;07.12.2012;1351
617773407;1;000603252922;13;1;CALL;07.12.2012;1351
617773407;1;000603252922;17;2;CALL;07.12.2012;1351
617773407;1;000603252922;27;3;CALL;07.12.2012;1351
717764779;1;000601304021;31;1;CALL;07.12.2012;1351
Ekkehard.Horner
  • 38,498
  • 2
  • 45
  • 96
1

This might work for you (GNU sed):

sed -r 's/^(([^;]*;){3})(([0-9]+;){2})(([0-9]*;)*)(([^;]*;?){3})$/\1\3\7\n\1\5\7/;Ta;P;:a;D' file
  • s/^(([^;]*;){3})(([0-9]+;){2})(([0-9]*;)*)(([^;]*;?){3})$/\1\3\7\n\1\5\7/ this substitution command constructs two strings. The first is the intended string followed by a newline, the second is the original string less the first pair of numbers. So ^(([^;]*;){3}) represents the first three fields, (([0-9]+;){2}) the first pair of numbers,(([0-9]*;)*) the remaining pairs of numbers and (([^;]*;?){3})$ the last three fields.
  • Ta if the substitution command fails jump to label a
  • P print upto the first newline in the pattern space.
  • :a label a
  • D delete upto and including the first newline and begin the next cycle. Do not read in another line until the pattern space is empty.

So, in essence, the s/.../.../ and the D commands are used to invoke a loop which prints a constructed string until the substitution command fails which then ends the loop. The remaining string is deleted completely and the next line is begun.

potong
  • 55,640
  • 6
  • 51
  • 83
  • Great job, this is excellent solution. I will appreciate if you could explain me this regex. Thanks in advance. – d3ky Jan 25 '13 at 22:22
1

Here's a solution in one-liner form, using the Text::CSV module.

perl -MText::CSV -lwe '$c = Text::CSV->new({
    sep_char=>';',
    eol=>$/
});                  
while($r = $c->getline(*STDIN)) { 
    my @a = splice @$r,0,3;            # remove 3 first elements
    my @c = splice @$r,-3;             # remove 3 last elements
    @$r = grep $_ ne '', @$r;          # remove empty elements
    while(@$r) {                       # while array is not empty
        $c->print(*STDOUT, [@a, splice(@$r,0,2),@c]);    # print all elements
    } }"

Output:

517755369;1;0001303717;48;1;CALL;07.12.2012;1351
517755369;1;0001303717;63;8;CALL;07.12.2012;1351
517755369;1;0001303717;50;2;CALL;07.12.2012;1351
517755369;1;0001303717;51;6;CALL;07.12.2012;1351
517755369;1;0001303717;53;7;CALL;07.12.2012;1351
517755369;1;0001303717;55;3;CALL;07.12.2012;1351
517755369;1;0001303717;57;4;CALL;07.12.2012;1351
517755369;1;0001303717;59;5;CALL;07.12.2012;1351

So basically, like I said in my comment, remove first 3 elements and 3 last elements and store in separate arrays. Remove empty elements. Loop around the remaining elements and print as necessary.

TLP
  • 66,756
  • 10
  • 92
  • 149
  • This is nice piece of code. I already find similar solution (using Text::CSV modul) online but I'm not sure if Text::CSV module is available on our HP-UX box which has, as I mentioned above, quite archaic setup. I will test this also when back to work on Monday and let you know. On my Ubuntu box at home it works fine. – d3ky Jan 26 '13 at 08:26