If the query you show is the only kind of query you want to perform, then this is rather straightforward.
use strict;
use warnings;

my $orders;    # I guess
while ( my $row = <DATA> ) {
    chomp $row;
    my @fields = split /,/, $row;
    push @{ $orders->{ $fields[0] }->{ $fields[1] } }, \@fields;    # or as a hashref, but that's larger
}
print join "\n", @{ $orders->{Cutomer1}->{Product1}->[0] };    # 'Cutomer' matches the typo in the data
__DATA__
Cutomer1,Product1,cat1,many,other,info
Cutomer1,Product2,cat1,many,other,info
Cutomer1,Product2,cat2,many,other,info
Cutomer1,Product3,cat1,many,other,info
Cutomer1,Product3,cat7,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat4,many,other,info
Cutomer3,Product7,cat,many,other,info
You just build an index into a hash reference that is several levels deep. The first level is keyed by customer. It contains another hashref, keyed by product, which holds the list of rows that match that customer/product combination. You can then decide whether you want each row as an array ref or as a hash ref with named keys. I went with an array ref because it consumes less memory.
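If you want to look at the whole structure while developing, you can dump it; a quick sketch using the core Data::Dumper module:

use Data::Dumper;
$Data::Dumper::Sortkeys = 1;    # print hash keys in a stable, sorted order
print Dumper($orders);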
Later you can query it easily. I included that above. Here's the output.
Cutomer1
Product1
cat1
many
other
info
If you don't want to remember indexes but have to code a lot of different queries, you could make variables (or even constants) that represent the magic numbers.
use constant {
    CUSTOMER => 0,
    PRODUCT  => 1,
    CATEGORY => 2,
    MANY     => 3,
    OTHER    => 4,
    INFO     => 5,
};
# build $orders ...
my $res = $orders->{Cutomer1}->{Product2}->[1];    # second row for this customer/product
print "Category: " . $res->[CATEGORY];
The output is:
Category: cat2
To order the result, you can use Perl's sort. If you need to sort by two columns, there are answers on SO that explain how to do that.
for my $res (
    sort { $a->[OTHER] cmp $b->[OTHER] }
    @{ $orders->{Cutomer2}->{Product5} }
) {
    # do stuff with $res ...
}
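If you do need to sort by two columns, the usual trick is to chain two comparisons with ||, so the second one only kicks in when the first one ties; a minimal sketch reusing the constants from above:

# Sort by category first, then by the OTHER column as a tie-breaker.
my @sorted = sort {
    $a->[CATEGORY] cmp $b->[CATEGORY]
        ||
    $a->[OTHER] cmp $b->[OTHER]
} @{ $orders->{Cutomer2}->{Product5} };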
However, you can only search by Customer and Product like this.
If there is more than one type of query, this gets expensive. If you also wanted to group the rows by category only, you would either have to iterate over all of them every single time you look something up, or build a second index. Building that is more work than waiting a few extra seconds for a full scan, so you probably don't want to do it.
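If you did decide the category lookups are frequent enough to justify it, a second index could be filled in the same read loop by pushing the same row references into another hash; a sketch ($by_category is a made-up name):

my $orders;
my $by_category;    # hypothetical second index, keyed by category only
while ( my $row = <DATA> ) {
    chomp $row;
    my @fields = split /,/, $row;
    push @{ $orders->{ $fields[0] }->{ $fields[1] } }, \@fields;
    push @{ $by_category->{ $fields[2] } }, \@fields;    # same row refs, so little extra memory
}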
I was wondering if it okay to store this complete CSV in some complex perl Data structure?
For this specific purpose, absolutely. 20 megabytes is not a lot.
I've created a test file that is 20004881 bytes and 447848 lines with this code, which is not perfect, but gets the job done.
use strict;
use warnings;
use feature 'say';
use File::stat;

open my $fh, '>', 'test.csv' or die $!;
while ( stat('test.csv')->size < 20_000_000 ) {
    my $customer = 'Customer' . int rand 10_000;
    my $product  = 'Product' . int rand 500;
    my $category = 'cat' . int rand 7;
    say $fh join ',', $customer, $product, $category, qw(many other info);
}
Here is an excerpt of the file:
$ head -n 20 test.csv
Customer2339,Product176,cat0,many,other,info
Customer2611,Product330,cat2,many,other,info
Customer1346,Product422,cat4,many,other,info
Customer1586,Product109,cat5,many,other,info
Customer1891,Product96,cat5,many,other,info
Customer5338,Product34,cat6,many,other,info
Customer4325,Product467,cat6,many,other,info
Customer4192,Product239,cat0,many,other,info
Customer6179,Product373,cat2,many,other,info
Customer5180,Product302,cat3,many,other,info
Customer8613,Product218,cat1,many,other,info
Customer5196,Product71,cat5,many,other,info
Customer1663,Product393,cat4,many,other,info
Customer6578,Product336,cat0,many,other,info
Customer7616,Product136,cat4,many,other,info
Customer8804,Product279,cat5,many,other,info
Customer5731,Product339,cat6,many,other,info
Customer6865,Product317,cat2,many,other,info
Customer3278,Product137,cat5,many,other,info
Customer582,Product263,cat6,many,other,info
Now let's run our above program with this input file and look at the memory consumption and some statistics of the size of the data structure.
use strict;
use warnings;
use feature 'say';
use Devel::Size 'total_size';

use constant {
    CUSTOMER => 0,
    PRODUCT  => 1,
    CATEGORY => 2,
    MANY     => 3,
    OTHER    => 4,
    INFO     => 5,
};
open my $fh, '<', 'test.csv' or die $!;

my $orders;
while ( my $row = <$fh> ) {
    chomp $row;
    my @fields = split /,/, $row;
    push @{ $orders->{ $fields[0] }->{ $fields[1] } }, \@fields;
}
say 'total size of $orders: ' . total_size($orders);
Here it is:
total size of $orders: 185470864
So that variable consumes 185 megabytes. That's a lot more than the 20MB of CSV, but we have an easily searchable index. Using htop I figured out that the actual process consumes 287MB. My machine has 16GB of memory, so I don't care about that. And at about 3.6 seconds it's reasonably fast to run this program, but I have an SSD and a newish Core i7 machine.
So it will not eat all your memory as long as you have 500MB to spare. An SQLite approach would likely consume less memory, but you would have to benchmark the speed of both to decide which one is faster.
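Such a benchmark could look roughly like this; a sketch using the core Benchmark module, assuming $orders is already built and $dbh is a DBI handle to the SQLite database described below:

use Benchmark 'cmpthese';

# Compare lookup speed; a negative count means "run each for about 5 CPU seconds".
cmpthese( -5, {
    perl_hash => sub {
        my $rows = $orders->{Customer23}->{Product1};
    },
    sqlite => sub {
        my $rows = $dbh->selectall_arrayref(
            'SELECT * FROM test WHERE customer=? AND product=?',
            undef, 'Customer23', 'Product1',
        );
    },
} );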
I used the method described in this answer to read the file into an SQLite database 1). I needed to add a header line to the file first, but that's trivial.
$ sqlite3 test.db
SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
sqlite> .mode csv test
sqlite> .import test.csv test
Since I couldn't measure this properly, let's say it felt like about 2 seconds. Then I added an index for the specific query.
sqlite> CREATE INDEX foo ON test ( customer, product );
This felt like it took about another second. Now I could query.
sqlite> SELECT * FROM test WHERE customer='Customer23' AND product='Product1';
Customer23,Product1,cat2,many,other,info
The result appeared instantaneously (which is not scientific!). Since we didn't measure how long retrieval from the Perl data structure takes, we cannot compare them, but it feels like it all takes about the same time.
However, the SQLite file size is only 38839296 bytes, which is about 39MB. That's bigger than the CSV file, but not by a lot. The sqlite3 process seems to consume only about 30kB of memory, which I find weird given the index.
In conclusion, the SQLite approach seems to be a bit more convenient and eats less memory. There is nothing wrong with doing this in Perl, and it might be just as fast, but using SQL for this type of query feels more natural, so I would go with that.
If I might be so bold, I would assume you didn't set an index on your table when you tried this in SQLite, and that is what made it take longer. The number of rows we have here is not that much, even for SQLite. Properly indexed, it's a piece of cake.
If you don't actually know what an index does, think about a phone book. It has an index of first letters on the sides of the pages. To find John Doe, you grab D and look from there. Now imagine there was no such thing: you would have to poke around randomly a lot more. Then try to find the person with the phone number 123-555-1234. That's what your database has to do if there is no index.
1) If you want to script this, you can also pipe or read the commands into the sqlite3 utility to create the DB, then use Perl's DBI to do the querying. As an example, sqlite3 foo.db <<<'.tables\ .tables' (where the backslash \ represents a literal linebreak) prints the list of tables twice, so importing like this will work, too.
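The querying side could then look roughly like this; a sketch that assumes DBI and DBD::SQLite are installed and test.db was created as shown above:

use strict;
use warnings;
use DBI;

# Connect to the SQLite file created with the sqlite3 utility above.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=test.db', '', '', { RaiseError => 1 } );

my $rows = $dbh->selectall_arrayref(
    'SELECT * FROM test WHERE customer=? AND product=?',
    undef, 'Customer23', 'Product1',
);
print join( ',', @$_ ), "\n" for @$rows;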