
I am receiving some CSVs from a client. The average size of these CSVs is 20 MB.

The format is:

Cutomer1,Product1,cat1,many,other,info
Cutomer1,Product2,cat1,many,other,info
Cutomer1,Product2,cat2,many,other,info
Cutomer1,Product3,cat1,many,other,info
Cutomer1,Product3,cat7,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat4,many,other,info
Cutomer3,Product7,cat,many,other,info

My current approach: I store all these records temporarily in a table, and then query that table:

where customer='customer1' and product='product1'
where customer='customer1' and product='product2'
where customer='customer2' and product='product1'

Problem: inserting into the DB and then selecting takes too much time. A lot of stuff is happening and it takes 10-12 minutes to process one CSV. I am currently using SQLite and it is quite fast. But I think I'll save some more time if I remove the insertion and selection altogether.

I was wondering if it is okay to store this complete CSV in some complex Perl data structure?

The machine generally has 500MB+ free RAM.

GrSrv
  • How many queries do you need to do? Using SQLite is a good approach here because you can also make it index stuff for you. Building your own index in a Perl data structure is easy if it's always the same query, but you need more memory if you have more than one index. But if you don't index, your searches will be slow. DBI has a CSV driver (a rough sketch follows after these comments), but that might be slow as well. – simbabque Jan 11 '17 at 17:59
  • 10k queries per CSV. A lot of stuff is happening and it takes 10-12 minutes to process one CSV. I am currently using SQLite and it is quite fast. But I think I'll save some more time if I remove the insertion and selection altogether. Never heard of "index in a Perl data structure". Could you give a link? – GrSrv Jan 11 '17 at 18:08
  • You misunderstood me. What you need is called a _search index_. I wanted to say that you need to make your own such search index and put it inside of Perl data structures. See my answer. But the most important question is **How many different queries do you do?**. – simbabque Jan 11 '17 at 18:11
  • For bulk inserts into a database, you normally want to use the ability to insert from a CSV file. But if you get rid of the database afterwards, you'd be better off avoiding the database entirely and just using a hash of hashes. – ikegami Jan 11 '17 at 18:14
  • @ikegami I was going to add that to my answer, but I'll leave it to you. – simbabque Jan 11 '17 at 18:15
  • @simbabque Just one type of query as mentioned in the question, with different values of course. I also use order by in that query. – GrSrv Jan 11 '17 at 18:19
  • @simbabque, SQLite doesn't appear to have an SQL command to do this, but [using the `sqlite3` tool to create the initial database](http://stackoverflow.com/a/1758033/589924) from the CSV should be faster than using Perl. – ikegami Jan 11 '17 at 18:27
  • @ikegami actually the highest-voted answer in the question you linked explains how to let the `sqlite3` tool do it. It's not an SQL command, but it can still be passed to the utility. I've tried it and updated my answer with a comparison of both approaches. – simbabque Jan 12 '17 at 11:41
  • @simbabque, Not sure what your point is, but I didn't link to the top-voted answer because it only lists part of the process, while the linked answer explains the entire process. – ikegami Jan 12 '17 at 15:15
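
For reference, the DBD::CSV route mentioned in the comments could look roughly like the sketch below. This is only a sketch under my own assumptions: the file is called orders.csv, sits in the current directory, and has been given a header line customer,product,category,many,other,info so the driver can pick up the column names.

use strict;
use warnings;
use DBI;

# dbi:CSV: treats CSV files in f_dir as tables; with f_ext => '.csv' the
# table "orders" maps to the file orders.csv (both names are my assumptions)
my $dbh = DBI->connect( 'dbi:CSV:', undef, undef, {
    f_dir      => '.',
    f_ext      => '.csv',
    RaiseError => 1,
} );

my $rows = $dbh->selectall_arrayref(
    'SELECT * FROM orders WHERE customer = ? AND product = ?',
    undef, 'Cutomer1', 'Product1',
);
print join( ',', @$_ ), "\n" for @$rows;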

1 Answer


If the query you show is the only kind of query you want to perform, then this is rather straightforward.

my $orders; # customer => product => [ rows ]; "orders" is a guess at what these rows are
while (my $row = <DATA> ) {
    chomp $row;
    my @fields = split /,/, $row;

    push @{ $orders->{$fields[0]}->{$fields[1]} }, \@fields; # or as a hashref, but that's larger
}

print join "\n", @{ $orders->{Cutomer1}->{Product1}->[0] }; # typo in cuStomer

__DATA__
Cutomer1,Product1,cat1,many,other,info
Cutomer1,Product2,cat1,many,other,info
Cutomer1,Product2,cat2,many,other,info
Cutomer1,Product3,cat1,many,other,info
Cutomer1,Product3,cat7,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat1,many,other,info
Cutomer2,Product5,cat4,many,other,info
Cutomer3,Product7,cat,many,other,info

You just build an index into a hash reference that is several levels deep. The first level is keyed by customer. Each customer holds another hashref keyed by product, and that in turn holds the list of rows matching this customer and product. Then you can decide if you want each row as an array ref, or as a hash ref with named keys. I went with array refs because they consume less memory.

Later you can query it easily. I included that above. Here's the output.

Cutomer1
Product1
cat1
many
other
info

If you don't want to remember indexes but have to code a lot of different queries, you could make variables (or even constants) that represent the magic numbers.

use constant {
    CUSTOMER => 0,
    PRODUCT  => 1,
    CATEGORY => 2,
    MANY     => 3,
    OTHER    => 4,
    INFO     => 5,
};

# build $orders ...

my $res = $orders->{Cutomer1}->{Product2}->[1]; # the second row for this pair

print "Category: " . $res->[CATEGORY];

The output is:

Category: cat2

To order the result, you can use Perl's sort. If you need to sort by two columns, there are answers on SO that explain how to do that; a short sketch also follows the example below.

for my $res ( 
    sort { $a->[OTHER] cmp $b->[OTHER] } 
    @{ $orders->{Cutomer2}->{Product5} }
) {
    # do stuff with $res ...
}
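
Sorting by two columns is just a chained comparison. A minimal sketch, using the constants from above and picking CATEGORY as the primary key and OTHER as the tie-breaker (that choice is mine):

for my $res (
    sort {
           $a->[CATEGORY] cmp $b->[CATEGORY]   # primary sort key
        || $a->[OTHER]    cmp $b->[OTHER]      # tie-breaker
    }
    @{ $orders->{Cutomer2}->{Product5} }
) {
    # do stuff with $res ...
}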

However, you can only search by Customer and Product like this.

If there is more than one type of query, this gets expensive. If you also wanted to look rows up by category only, you would either have to iterate over all of them every single time, or build a second index, as sketched below. Doing that is harder than waiting a few extra seconds, so you probably don't want to do that.
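
Just to illustrate what such a second index would look like (the $by_category name is mine), it can be filled in the same pass that builds $orders:

my ( $orders, $by_category );
while ( my $row = <DATA> ) {
    chomp $row;
    my @fields = split /,/, $row;

    # the same row reference goes into both indexes, so rows are not copied,
    # but every additional index costs extra keys, memory and bookkeeping
    push @{ $orders->{ $fields[0] }->{ $fields[1] } }, \@fields;
    push @{ $by_category->{ $fields[2] } },            \@fields;
}

# a category-only lookup is now cheap as well
my @cat1_rows = @{ $by_category->{cat1} // [] };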


I was wondering if it is okay to store this complete CSV in some complex Perl data structure?

For this specific purpose, absolutely. 20 megabytes is not a lot.

I've created a test file that is 20004881 bytes and 447848 lines with this code, which is not perfect, but gets the job done.

use strict;
use warnings;
use feature 'say';
use File::stat;

open my $fh, '>', 'test.csv' or die $!;
# keep appending random rows until the file on disk reaches 20 MB
while ( stat('test.csv')->size < 20_000_000 ) {
    my $customer = 'Customer' . int rand 10_000;
    my $product  = 'Product' . int rand 500;
    my $category = 'cat' . int rand 7;
    say $fh join ',', $customer, $product, $category, qw(many other info);
}

Here is an excerpt of the file:

$ head -n 20 test.csv
Customer2339,Product176,cat0,many,other,info
Customer2611,Product330,cat2,many,other,info
Customer1346,Product422,cat4,many,other,info
Customer1586,Product109,cat5,many,other,info
Customer1891,Product96,cat5,many,other,info
Customer5338,Product34,cat6,many,other,info
Customer4325,Product467,cat6,many,other,info
Customer4192,Product239,cat0,many,other,info
Customer6179,Product373,cat2,many,other,info
Customer5180,Product302,cat3,many,other,info
Customer8613,Product218,cat1,many,other,info
Customer5196,Product71,cat5,many,other,info
Customer1663,Product393,cat4,many,other,info
Customer6578,Product336,cat0,many,other,info
Customer7616,Product136,cat4,many,other,info
Customer8804,Product279,cat5,many,other,info
Customer5731,Product339,cat6,many,other,info
Customer6865,Product317,cat2,many,other,info
Customer3278,Product137,cat5,many,other,info
Customer582,Product263,cat6,many,other,info

Now let's run a slight variant of the above program with this input file and look at the memory consumption and some statistics of the size of the data structure.

use strict;
use warnings;
use feature 'say';
use Devel::Size 'total_size';

use constant {
    CUSTOMER => 0,
    PRODUCT  => 1,
    CATEGORY => 2,
    MANY     => 3,
    OTHER    => 4,
    INFO     => 5,
};

open my $fh, '<', 'test.csv' or die $!;

my $orders;
while ( my $row = <$fh> ) {
    chomp $row;
    my @fields = split /,/, $row;

    # note: plain assignment keeps only the last row per customer/product pair;
    # use push, as in the first program, to keep all of them
    $orders->{ $fields[0] }->{ $fields[1] } = \@fields;
}

say 'total size of $orders: ' . total_size($orders);

Here it is:

total size of $orders: 185470864

So that variable consumes 185 megabytes. That's a lot more than the 20 MB of CSV, but we have an easily searchable index. Using htop I figured out that the actual process consumes 287 MB. My machine has 16 GB of memory, so I don't care about that. And at about 3.6s this program is reasonably fast to run, but I have an SSD and a newish Core i7 machine.

But it will not eat all your memory if you have 500 MB to spare. An SQLite approach would likely consume less memory, but you have to benchmark the speed of this versus the SQLite approach to decide which one is faster.

I used the method described in this answer to read the file into an SQLite database [1]. I needed to add a header line to the file first, but that's trivial.

$ sqlite3 test.db
SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
sqlite> .mode csv test
sqlite> .import test.csv test

Since I couldn't measure this properly, let's say it felt like about 2 seconds. Then I added an index for the specific query.

sqlite> CREATE INDEX foo ON test ( customer, product );

This felt like it took about another second. Now I could query.

sqlite> SELECT * FROM test WHERE customer='Customer23' AND product='Product1';
Customer23,Product1,cat2,many,other,info

The result appeared instantaneously (which is not scientific!). Since we didn't measure how long retrieval from the Perl data structure takes, we cannot compare them, but it feels like it all takes about the same time.
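
If you want real numbers instead of a feeling, the core Benchmark module can compare the two lookups. A rough sketch, assuming $orders from the program above is still in memory, test.db exists, and using the same customer/product pair as in the SELECT above:

use Benchmark 'cmpthese';
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=test.db', '', '', { RaiseError => 1 } );
my $sth = $dbh->prepare('SELECT * FROM test WHERE customer = ? AND product = ?');

cmpthese( 10_000, {
    perl_hash => sub {
        # $orders is the nested hash built by the program above
        my $rows = $orders->{Customer23}{Product1};
    },
    sqlite => sub {
        $sth->execute( 'Customer23', 'Product1' );
        my $rows = $sth->fetchall_arrayref;
    },
} );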

However, the SQLite file size is only 38839296 bytes, which is about 39 MB. That's bigger than the CSV file, but not by a lot. It seems like the sqlite3 process only consumes about 30kB of memory, which I find weird given the index.

In conclusion, SQLite seems to be a bit more convenient and eats less memory. There is nothing wrong with doing this in Perl, and it might be the same speed, but using SQL for this type of query feels more natural, so I would go with that.

If I might be so bold, I would guess you didn't set an index on your table when you did it in SQLite, and that made it take longer. The number of rows we have here is not that much, even for SQLite. Properly indexed, it's a piece of cake.

If you don't actually know what an index does, think about a phone book. It has an index of first letters on the sides of the pages. To find John Doe, you jump to D and search from there. Now imagine there was no such thing: you would have to poke around randomly a lot more. And then try to find the person with the phone number 123-555-1234. That's what your database does if there is no index.


[1] If you want to script this, you can also pipe or read the commands into the sqlite3 utility to create the DB, then use Perl's DBI to do the querying. As an example, sqlite3 foo.db <<<'.tables\ .tables' (where the backslash \ represents a literal linebreak) prints the list of tables twice, so importing like this will work, too.
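
A minimal sketch of that flow, using the same test.csv/test.db/test names as above: pipe the dot commands into sqlite3 for the bulk import, then query the result through DBI.

use strict;
use warnings;
use DBI;

# hand the import off to the sqlite3 utility; test.csv needs the header line
# mentioned above so .import can name the columns
open my $sqlite, '|-', 'sqlite3', 'test.db' or die "Cannot start sqlite3: $!";
print {$sqlite} ".mode csv\n.import test.csv test\n";
close $sqlite or die "sqlite3 import failed: $?";

# then query the resulting database from Perl as usual
my $dbh  = DBI->connect( 'dbi:SQLite:dbname=test.db', '', '', { RaiseError => 1 } );
my $rows = $dbh->selectall_arrayref(
    'SELECT * FROM test WHERE customer = ? AND product = ?',
    undef, 'Customer23', 'Product1',
);
print join( ',', @$_ ), "\n" for @$rows;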

simbabque
  • Hey, I love your answer but my main question is unanswered: *I was wondering if it is okay to store this complete CSV in some complex Perl data structure?* I have never created a data structure of this size. – GrSrv Jan 12 '17 at 09:28
  • @GrSrv oh. True, I did not answer that. Well as long as your machine doesn't crash it's ok to do anything you want. The question you should ask is if it's useful to do it. ;) I'll add a bit more to the answer. – simbabque Jan 12 '17 at 10:02
  • @GrSrv I'm done updating the answer. There's a benchmark, some musing about viability of the approach and a comparison to SQLite. Have fun. :) – simbabque Jan 12 '17 at 10:48
  • Whoa! That's simply brilliant. Couldn't thank you enough :-). I am using indexes in SQLite. I had to ask this because currently I can process approx 150 CSVs/day and I am looking for areas which could be optimized to gain speed. In your example you directly imported the CSV to SQLite, which isn't an option for me. I have to use a CSV module and insert into SQLite row by row, which takes quite some time. – GrSrv Jan 12 '17 at 12:07
  • 1
  • @GrSrv why is that the case? Do you do that in Perl with DBI? If that makes it faster, there is nothing wrong with shelling out and calling the `sqlite3` tool, I would say. Just document properly why you do that. If you really can't and want DBI, there should be faster ways than row by row. Doing it in blocks of 1000 per `execute`, for example. Or everything at once with a transaction. In any case, put in the index first, so it gets built up at the same time, which should be faster in total than adding it later like I did in my test (see the sketch after these comments). – simbabque Jan 12 '17 at 12:25
  • I use DBI. Actually I have to do some modifications to the data, like changing case etc., which is logically easier if done before inserting into the DB. So I have to do it row by row. Yes, I put in the index first. :) – GrSrv Jan 12 '17 at 13:01
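
To round this off, here is a hedged sketch of what that row-by-row load with cleanup could look like with the index created up front and a single transaction around the inserts. Text::CSV_XS, the file names, and the lc() cleanup step are stand-ins for whatever is actually in use.

use strict;
use warnings;
use DBI;
use Text::CSV_XS;   # stand-in for whichever CSV module is in use

my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1 } );
my $dbh = DBI->connect( 'dbi:SQLite:dbname=import.db', '', '', { RaiseError => 1 } );

# create the index before loading so it is maintained while inserting
$dbh->do('CREATE TABLE IF NOT EXISTS test (customer, product, category, many, other, info)');
$dbh->do('CREATE INDEX IF NOT EXISTS foo ON test (customer, product)');

my $sth = $dbh->prepare('INSERT INTO test VALUES (?, ?, ?, ?, ?, ?)');

open my $fh, '<', 'incoming.csv' or die $!;
$dbh->begin_work;                     # one transaction for the whole file
while ( my $row = $csv->getline($fh) ) {
    $row->[0] = lc $row->[0];         # placeholder for the real per-row cleanup
    $sth->execute(@$row);
}
$dbh->commit;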