A Perl script to process a CSV file, aggregating properties spread over multiple records

Question

Sorry for the vague question, I'm struggling to think how to better word it!

I have a CSV file that looks a little like this, only a lot bigger:

The values in the first column are a ID numbers and the second column could be described as a property (for want of a better word...). The ID number 550672 has properties 1,2,3,4. Can anyone point me towards how I can begin solving how to produce strings such as that for all the ID numbers? My ideal output would be a new csv file which looks something like:

550672,1;2;3;4
656372,1;2
766153,1;4

etc.

I am very much a Perl baby (only 3 days old!) so would really appreciate direction rather than an outright solution, I'm determined to learn this stuff even if it takes me the rest of my days! I have tried to investigate it myself as best as I can, although I think I've been encumbered by not really knowing what to really search for. I am able to read in and parse CSV files (I even got so far as removing duplicate values!) but that is really where it drops off for me. Any help would be greatly appreciated!

Look into using some csv modules. `Text::CSV` is a favorite of mine. — squiguy, Sep 16 '12 at 22:43

score 4 · Accepted Answer · answered Sep 16 '12 at 23:37

I think it is best if I offer you a working program rather than a few hints. Hints can only take you so far, and if you take the time to understand this code it will give you a good learning experience

It is best to use Text::CSV whenever you are processing CSV data as all the debugging has already been done for you

use strict;
use warnings;

use Text::CSV;

my $csv = Text::CSV->new;

open my $fh, '<', 'data.txt' or die $!;
my %data;
while (my $line = <$fh>) {
  $csv->parse($line) or die "Invalid data line";
  my ($key, $val) = $csv->fields;
  push @{ $data{$key} }, $val
}

for my $id (sort keys %data) {
  printf "%s,%s\n", $id, join ';', @{ $data{$id} };
}

output

550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3

Thank you for editing my question for the proper formatting, I will make sure in future I follow that! I agree sometimes the best way to learn is by example, this is very kind of you to have taken the time to write this out for me, I look forward to deconstructing it to understand it! Edit: This is very neat, thank you for this. — user1597452, Sep 17 '12 at 07:04

score 3 · Answer 2 · edited May 23 '17 at 12:12

Firstly props for seeking an approach not a solution. As you've probably already found with perl, There Is More Than One Way To Do It.

The approach I would take would be;

use strict;  # will save you big time in the long run

my %ids      # Use a hash table with the id as the key to accumulate the properties
open a file handle on csv or die
while (read another line from the file handle){
  split line into ID and property variable  # google the split function
  append new property to existing properties for this id in the hash table  # If it doesn't exist already, it will be created
}

foreach my $key (keys %ids) {
  deduplicate properties
  print/display/do whatever you need to do with the result
}

This approach means you will need to iterate over the whole set twice (once in memory), so depending on the size of the dataset that may be a problem. A more sophisticated approach would be to use a hashtable of hashtables to do the de duplication in the intial step, but depending on how quickly you want/need to get it working, that may not be worthwhile in the first instance.

Check out this question for a discussion on how to do the deduplication.

Gah! Again I wish I could give you an up arrow! This is very helpful, thank you very much. Great description, I will be going through this thoroughly- really enjoying learning this stuff! — user1597452, Sep 16 '12 at 22:09

Piotr Wadas · Answer 3 · 2012-09-16T22:02:48.460

2

Well, open the file as stdin in perl, assume each row is of two columns, then iterate over all lines using left column as hash identifier, and gathering right column into an array pointed by a hash key. At the end of input file you'll get a hash of arrays, so iterate over it, printing a hash key and assigned array elements separated by ";" or any other sign you wish.

and here you go

dtpwmbp:~ pwadas$ cat input.txt 
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
dtpwmbp:~ pwadas$ cat bb2.pl 
#!/opt/local/bin/perl

my %hash;
while (<>)
{
    chomp;
    my($key, $value) = split /,/;
    push @{$hash{$key}} , $value ;
}

foreach my $key (sort keys %hash)
{
     print $key . "," . join(";", @{$hash{$key}} ) . "\n" ;
}
dtpwmbp:~ pwadas$ cat input.txt | perl -f bb2.pl 
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
dtpwmbp:~ pwadas$

edited Sep 16 '12 at 22:02

answered Sep 16 '12 at 21:47

Piotr Wadas

1,838
1
10
13

Thanks ever so much for the speedy reply, I have been reading about hash and suspected that might come into play. I'm going to get stuck into reading about these elements. If you don't mind, if I get stuck again, can I come back to you? Thanks again ever so much (I would like to click the up arrow to say your answer was useful but unfortunately it seems I require more rep!) – user1597452 Sep 16 '12 at 21:54
Note, that such approach as presented. joins duplicate keys somewhat automatically. One could use some Text::CSV module for that, however someone else could use a one-liner for it :) – Piotr Wadas Sep 16 '12 at 22:05
hashes of arrays, etc, are not as straightforward as you would hope in perl, check out amon's answer from this post, and read the linked docs: http://stackoverflow.com/questions/12450851/grabbing-a-list-from-a-multi-dimensional-hash-in-perl – carillonator Sep 16 '12 at 22:06
I don't think I can type as fast as you can code.... one day... one day... thanks again, there are a lot of new terms in there for me to learn- I'm looking forward to going through it! – user1597452 Sep 16 '12 at 22:11
lol, actually saving and prepending a text to paste with four spaces each line to mark it as code when pasted in here took more time and clicks on the damn MBP than coding ;-) thx :) – Piotr Wadas Sep 16 '12 at 22:12

score 2 · Answer 4 · answered Sep 17 '12 at 09:56

2

perl -F"," -ane 'chomp($F[1]);$X{$F[0]}=$X{$F[0]}.";".$F[1];if(eof){for(keys %X){$X{$_}=~s/;//;print $_.",".$X{$_}."\n"}}'

answered Sep 17 '12 at 09:56

Vijay

65,327
90
227
319

Thank you for the answer, although I've got to admit that at my current state of perl knowledge, understanding one liners such as this is very challenging. I do find it amazing, however, that the solution to my problem can be compressed into something as succinct as this! – user1597452 Sep 17 '12 at 10:32
In fact, I think your answer is helpful as it demonstrates what is possible with such a short amount of code, thank you. – user1597452 Sep 17 '12 at 10:38

score 1 · Answer 5 · answered Sep 16 '12 at 23:13

Another (not perl) way which incidentally is shorter and more elegant:

#!/opt/local/bin/gawk -f

BEGIN {FS=OFS=",";}

NF > 0 { IDs[$1]=IDs[$1] ";" $2; }

END { for (i in IDs) print i, substr(IDs[i], 2); }

The first line (after specifying the interpreter) sets the input FIELD SEPARATOR and the OUTPUT FIELD SEPARATOR to the comma. The second line checks of we have more than zero fields and if you do it makes the ID ($1) number the key and $2 the value. You do this for all lines.

The END statement will print these pairs out in an unspecified order. If you want to sort them you have to option of asorti gnu awk function or connecting the output of this snippet with a pipe to sort -t, -k1n,1n.

Thank you for the suggestion- although I'm not sure my brain is big enough for two programming languages at once! — user1597452, Sep 17 '12 at 07:10

A Perl script to process a CSV file, aggregating properties spread over multiple records

5 Answers5