26

I'm working on a project that involves parsing a large csv formatted file in Perl and am looking to make things more efficient.

My approach has been to split() the file by lines first, and then split() each line again by commas to get the fields. But this is suboptimal, since at least two passes over the data are required: once to split it into lines, then once more to split each line into fields. This is a very large file, so cutting the processing in half would be a significant improvement to the entire application.
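For illustration, here is a minimal sketch of that two-pass approach (variable names are just placeholders; $data is assumed to already hold the entire file):

# sketch of the current two-pass approach
my @lines = split /\n/, $data;       # pass 1: break the file into lines
for my $line (@lines) {
    my @fields = split /,/, $line;   # pass 2: break each line into fields
    # ... process @fields ...
}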

My question is, what is the most time efficient means of parsing a large CSV file using only built in tools?

Note: each line has a varying number of tokens, so we can't just ignore lines and split by commas only. Also, we can assume fields will contain only alphanumeric ASCII data (no special characters or other tricks). And I don't want to get into parallel processing, although it might work effectively.

edit

It can only involve built-in tools that ship with Perl 5.8. For bureaucratic reasons, I cannot use any third-party modules (even those hosted on CPAN).

another edit

Let's assume that our solution is only allowed to deal with the file data once it is entirely loaded into memory.

yet another edit

I just grasped how stupid this question is. Sorry for wasting your time. Voting to close.

Andy Lester
Mike
  • 5
    Any reason you need only built-in tools? (I'm assuming no admin rights.) Otherwise, try using the `Text::CSV` Perl module. It makes CSV parsing a whole lot easier: http://search.cpan.org/~erangel/Text-CSV/CSV.pm – Vivin Paliath Jun 17 '10 at 19:53
  • 5
    Why read in the whole file and `split()` by lines? If you just open the file and use the `while (<FILE>)` idiom, you can iterate over the lines so that you only need to store one line at a time in memory. – mob Jun 17 '10 at 19:58
  • 1
    @Mike some Perl modules from CPAN do not require any compilation and can be used without admin rights ... if there were one of that kind, would it still be ruled out by your requirements? – Prix Jun 17 '10 at 20:12
  • Have you tried explaining to them that there is code out there that has been tested and in use by thousands of developers and that it's a waste of their money to pay you to make a poor copy of it? Play up the angle that it will save them money to let you do this, and it'll have fewer bugs for free. – Daenyth Jun 17 '10 at 20:14
  • Please believe me when I say there is no chance of convincing management that installing modules is a good idea. I don't want to go into details. And it's not a matter of not having admin rights; I do have admin rights, but I need to follow protocols that I haven't defined and have no hope of changing. – Mike Jun 17 '10 at 20:22
  • 2
    Your assumptions place a huge limiter in what you can do. If you have to read the entire, giant file into memory, and you can't use modules from CPAN, you are already going twice against best practices with Perl. :) I'd fight to get at least one of those restrictions changed. – Robert P Jun 17 '10 at 20:49
  • 7
    I honestly would look around at other jobs, to be frank. No one would expect to hire a carpenter and say "by the way, you're not allowed to use screws, you're only allowed nails". I don't know why people think it flies with programming. – Daenyth Jun 17 '10 at 20:53
  • 1
    "Let's assume that our solution is only allowed to deal with the file data once it is entirely loaded into memory." -- do you want that, or do you want efficient? Because they're exact opposites. – hobbs Jun 17 '10 at 21:34
  • 5
    Regarding your bureaucracy, please read [But I can't use CPAN!](http://www.shadowcat.co.uk/blog/matt-s-trout/but-i-cant-use-cpan/) There are ways around your problem -- at the worst, you can simply download the source you need, and check it into your repository right next to your other code. – Ether Jun 17 '10 at 22:05

6 Answers

48

The right way to do it -- by an order of magnitude -- is to use Text::CSV_XS. It will be much faster and much more robust than anything you're likely to do on your own. If you're determined to use only core functionality, you have a couple of options depending on speed vs robustness.
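For reference, a minimal Text::CSV_XS sketch (a CPAN module, so not usable under the question's constraints, but it shows what you'd be replacing):

use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot use CSV: " . Text::CSV_XS->error_diag();

my @data;
open(my $fh, '<', 'somefile.csv') or die "Can't read file [$!]\n";
while (my $row = $csv->getline($fh)) {
    push @data, $row;   # $row is an array ref of the record's fields
}
close $fh;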

About the fastest you'll get for pure-Perl is to read the file line by line and then naively split the data:

my $file = 'somefile.csv';
my @data;
open(my $fh, '<', $file) or die "Can't read file '$file' [$!]\n";
while (my $line = <$fh>) {
    chomp $line;
    my @fields = split(/,/, $line);  # naive: every comma is treated as a field separator
    push @data, \@fields;
}

This will fail if any fields contain embedded commas. A more robust (but slower) approach would be to use Text::ParseWords. To do that, replace the split with this:

    my @fields = Text::ParseWords::parse_line(',', 0, $line);
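Putting that together, a minimal sketch (Text::ParseWords ships with core Perl, including 5.8, so it satisfies the built-in-only constraint):

use Text::ParseWords;

my $file = 'somefile.csv';
my @data;
open(my $fh, '<', $file) or die "Can't read file '$file' [$!]\n";
while (my $line = <$fh>) {
    chomp $line;
    # parse_line(delimiter, keep, line): keep = 0 strips the quotes
    my @fields = Text::ParseWords::parse_line(',', 0, $line);
    push @data, \@fields;
}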
szabgab
Michael Carman
  • When you say slower approach, do you mean this module has known performance issues or it is just slightly slower? – MikeKulls Jul 02 '15 at 06:36
  • 2
    @MikeKulls: I wouldn't call it a performance issue per se. It's a consequence of doing actual parsing instead of blindly assuming that every comma is a field separator. That said, it isn't "slightly slower." In a simple benchmark a bare `split` was 10-20x faster than `parse_line`. – Michael Carman Jul 02 '15 at 16:57
  • I would guess it is slower also because it is written in Perl, as opposed to C for the `split` function. In theory it should be possible to get performance closer to the `split` function, e.g. maybe 2-3x slower. – MikeKulls Jul 07 '15 at 00:49
20

Here is a version that also respects quotes (e.g. `foo,bar,"baz,quux",123` -> `"foo", "bar", "baz,quux", "123"`).

sub csvsplit {
    my $line = shift;
    my $sep  = (shift or ',');

    return () unless $line;

    my @cells;
    $line =~ s/\r?\n$//;    # strip the trailing line ending

    # A field is either a double-quoted string (capture 1) or a bare
    # run of characters up to the next separator (capture 2).
    my $re = qr/(?:^|$sep)(?:"([^"]*)"|([^$sep]*))/;

    while ($line =~ /$re/g) {
        my $value = defined $1 ? $1 : $2;
        push @cells, (defined $value ? $value : '');
    }

    return @cells;
}

Use it like this:

while(my $line = <FILE>) {
    my @cells = csvsplit($line); # or csvsplit($line, $my_custom_separator)
}
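For example, with the quoted input from above:

my @cells = csvsplit('foo,bar,"baz,quux",123');
# @cells now holds ('foo', 'bar', 'baz,quux', '123')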
jkramer
  • 2
    Don't roll your own CSV parsing routine. It's easy to do wrong and difficult to get right, and it might bite you HARD. Please use Text::CSV as mentioned by other posters. – MichielB Nov 01 '14 at 09:50
  • 3
    I wouldn't say never roll your own. What if you write one that is better than the existing solutions in some way? – MikeKulls Jul 07 '15 at 00:29
9

As other people mentioned, the correct way to do this is with Text::CSV, and either the Text::CSV_XS back end (for FASTEST reading) or Text::CSV_PP back end (if you can't compile the XS module).

If you're allowed to get extra code locally (e.g., your own personal modules) you could take Text::CSV_PP and put it somewhere locally, then access it via the `use lib` workaround:

use lib '/path/to/my/perllib';
use Text::CSV_PP;

Additionally, if there's no alternative to having the entire file read into memory and (I assume) stored in a scalar, you can still read it like a file handle, by opening a handle to the scalar:

my $data = stupid_required_interface_that_reads_the_entire_giant_file();

open my $text_handle, '<', \$data
   or die "Failed to open the handle: $!";

And then read via the Text::CSV interface:

my $csv = Text::CSV->new ( { binary => 1 } )
             or die "Cannot use CSV: ".Text::CSV->error_diag ();
while (my $row = $csv->getline($text_handle)) {
    ...
}

or the sub-optimal split on commas:

while (my $line = <$text_handle>) {
    my @csv = split /,/, $line;
    ... # regular work as before.
}

With this method, the data is only copied a bit at a time out of the scalar.

serenesat
Robert P
  • 8
    And the second most correct way to do this is to create the `Mike::Text::CSV` module, copy the source code from `Text::CSV` into it, and add a disclaimer about how it was "inspired" by the open source Text::CSV module. – mob Jun 17 '10 at 21:40
  • I like it! I like it very much. – Robert P Jun 17 '10 at 21:58
  • @RobertP, what is the name for `$!` at the end of the `open ... or die` clause? – olala Feb 14 '14 at 19:53
  • 1
    @user2157668: It's the "error" variable (aka, the OS error). In numeric context, it's the value you'd get from calling `errno` in C after an OS function. In string context, it's the text description of what that error is. You can find more details about it here: [perldoc perlvar](http://perldoc.perl.org/perlvar.html) – Robert P Feb 14 '14 at 21:13
2

You can do it in one pass if you read the file line by line. There is no need to read the whole thing into memory at once.

# (no error handling here!)
open(my $fh, '<', $filename);
while (my $line = <$fh>) {
    chomp $line;
    my @csv = split /,/, $line;

    # now parse the csv however you want.
}

Not really sure if this is significantly more efficient, though; Perl is pretty fast at string processing.

YOU NEED TO BENCHMARK YOUR IMPORT to see what is causing the slowdown. If, for example, you are doing a DB insertion that takes 85% of the time, this optimization won't help.
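A minimal sketch of such a benchmark with the core Benchmark module ($sample_line is a hypothetical pre-loaded CSV line):

use Benchmark qw(timethese);
use Text::ParseWords;

my $sample_line = 'foo,bar,"baz,quux",123';

# run each candidate for about 5 CPU seconds and report rates
timethese(-5, {
    naive_split => sub { my @f = split /,/, $sample_line },
    parse_words => sub { my @f = parse_line(',', 0, $sample_line) },
});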

Edit

Although this feels like code golf, the general algorithm is to read the whole file, or part of the file, into a buffer.

Iterate byte by byte through the buffer until you find a CSV delimiter or a newline.

  • When you find a delimiter, increment your column count.
  • When you find a newline increment your row count.
  • If you hit the end of your buffer, read more data from the file and repeat.

That's it. But reading a large file into memory is really not the best way; see my original answer above for the normal way this is done.
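For what it's worth, a minimal sketch of that buffer scan (counting only; it assumes $buffer already holds the data and that there are no quoted fields or embedded newlines):

my ($rows, $cols_in_row) = (0, 1);
for my $i (0 .. length($buffer) - 1) {
    my $byte = substr($buffer, $i, 1);
    if ($byte eq ',') {
        $cols_in_row++;       # delimiter: one more column in this row
    }
    elsif ($byte eq "\n") {
        $rows++;              # newline: one more complete row
        $cols_in_row = 1;     # reset the column count for the next row
    }
}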

Byron Whitlock
  • Since Perl 5.8, when the file is in memory (say, in a variable called `$scalar`), you can still use the filehandle iterator on it with `open(FILE,"<",\$scalar)` – mob Jun 17 '10 at 21:43
  • Unfortunately, .CSV files can have embedded newlines inside quotes, in which case this solution will not separate the .CSV records in the correct place. It will work if there are no embedded newlines and no embedded commas within quoted strings though. – V. Wheeler Apr 26 '23 at 15:38
1

Assuming that you have your CSV file loaded into the `$csv` variable and that you do not need the text in this variable after you have successfully parsed it:

my $result = [[]];
# Destructively consume $csv one field at a time: $1 is the field,
# $2 is the terminator (separator, newline, or end of string).
while ($csv =~ s/(.*?)([,\n]|$)//s) {
    push @{$result->[-1]}, $1;         # append the field to the current row
    push @$result, [] if $2 eq "\n";   # a newline starts a new row
    last unless $2;                    # empty terminator: end of string
}

If you need to have $csv untouched:

local $_;
my $result = [[]];
# Non-destructive version: each match yields the field ($1) and,
# if the field ends a row, the newline ($2); undef captures are skipped.
foreach ($csv =~ /(?:(?<=[,\n])|^)(.*?)(?:,|(\n)|$)/gs) {
    next unless defined $_;
    if ($_ eq "\n") {
        push @$result, [];             # newline: start a new row
    }
    else {
        push @{$result->[-1]}, $_;     # append the field to the current row
    }
}
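For example, either version turns the input below into the same structure (the trailing newline produces a final row holding one empty field):

my $csv = "a,b\nc,d\n";
# after running either loop above:
# $result is [ ['a', 'b'], ['c', 'd'], [''] ]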
ZyX
  • Other than padding your lines of code count, in what way is this better than `split`? – mob Jun 17 '10 at 21:37
  • @modrule If you use `split`, you need to use it twice, so the data will be read twice, my solution reads data only once. // But this is true only if the data is already loaded. – ZyX Jun 18 '10 at 11:03
1

Answering within the constraints imposed by the question, you can still cut out the first split by slurping your input file into an array rather than a scalar:

open(my $fh, '<', $input_file_path) or die "Can't open '$input_file_path': $!";
my @all_lines = <$fh>;
for my $line (@all_lines) {
  chomp $line;
  my @fields = split ',', $line;
  process_fields(@fields);
}

And even if you can't install (the pure-Perl version of) Text::CSV, you may be able to get away with pulling up its source code on CPAN and copy/pasting the code into your project...

Dave Sherohman