
I'm increasingly using Python instead of Perl, but I have one problem: whenever I want to process large files (>1GB) line by line, Python seems to take ages for a job that Perl does in a fraction of the time. However, the general opinion on the web seems to be that Python should be at least as fast as Perl for text processing. So my question is: what am I doing wrong?

Example:

Read a file line by line, split the line at every tab and add the second item to a list. My python solution would look something like this:

my_list = []
with open(file_path) as infile:       # file_path holds the path to the input file
    for line in infile:
        ls = line.split("\t")
        my_list.append(ls[1])         # keep the second tab-separated field

The perl code would look like this:

my @list;
open(my $infile, "<", $file_path) or die "Cannot open $file_path: $!";
while (my $line = <$infile>) {
    my @ls = split(/\t/, $line);
    push @list, $ls[1];
}
close($infile);

Is there any way to speed this up?

And to make it clear: I don't want to start the usual "[fill in name of script language A] is sooo much better than [fill in script language B]" thread. I'd like to use Python more, but this is a real problem for my work.

Any suggestions?

Lowry
  • what are you trying to do with `list` once you have it? Python has good support for generators and iterators; you can `yield ls[1]` instead of appending to a global variable. – Greg Nisbet Mar 18 '17 at 06:21
  • What sort of performance differences are we talking about here? – juanpa.arrivillaga Mar 18 '17 at 06:21
  • I ran your code against [the equivalent Perl code](https://gist.github.com/schwern/0e81307f0be0cce37aad9c0ac5fd3045) over a simple CSV file with 10 million lines of "foo\tbar\tbaz\tbiff\n" and the Python code ran twice as fast. Could you show your equivalent Perl code? Also, your code isn't real, `open` is missing arguments. Can you show us the *real* code? Finally, why do you think that bit of code is the performance problem? – Schwern Mar 18 '17 at 06:35
  • @nfnneil it's usually lookup tables or mapping files of gene names. – Lowry Mar 18 '17 at 06:35
  • @GregoryNisbet Would that make the file processing faster? Otherwise it's not helping with my problem – Lowry Mar 18 '17 at 06:37
  • @Lowry ... yes it would. If it's super critical that the first part of your pipeline be as fast as possible you can read lines from a process (e.g. `awk '{print $2}' input_file.txt` or `awk '{print $2}' | sort`). Can you give us some more context on what you're trying to do with the data once you've extracted it from the file? – Greg Nisbet Mar 18 '17 at 06:40
  • @GregoryNisbet Sorry mate, not looking for a way to use bash/awk + python as I can do it cleaner in perl. I'm trying to understand a potential mistake in my python code .... – Lowry Mar 18 '17 at 06:47
  • @Lowry: Then why did you ask *"Is there any way to speed this up"*? – Borodin Mar 18 '17 at 09:18
  • @Borodin In context of the intro I guess one could re-formulate the question to "is there a pythonic way to speed this up" as I considered the possibility that I am still using a perl-related syntax where I should use something else in python – Lowry Mar 18 '17 at 09:39
  • I concur with Schwern: Python is roughly twice as fast. Please create a benchmark that shows 1) your sample data, or how to generate it 2) the actual code for both programs 3) how you measured the times 4) the actual times 5) details about your setup, like the bit about being on NFS that you mentioned in the comments. – ThisSuitIsBlackNot Mar 18 '17 at 14:24
  • Versions? Last I checked, older perls built with usefaststdio could be significantly faster than newer perls that have to support the PerlIO layer. If you are using such an older Perl, your results make more sense. – ysth Mar 19 '17 at 19:31

1 Answer


> Is there any way to speed this up?

Yes, import the CSV into SQLite and process it there. In your case you want `.mode tabs` instead of `.mode csv`.

Using any programming language to manipulate a CSV file is going to be slow. CSV is a data transfer format, not a data storage format; CSV files will always be slow and unwieldy to work with because you're constantly reparsing and reprocessing them.

Importing it into SQLite will put it into a much, much more efficient data format with indexing. It will take about as much time as Python would, but only has to be done once. It can be processed using SQL, which means less code to write, debug, and maintain.
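As a rough sketch of what that one-time import could look like from Python itself, using only the standard library's `sqlite3` and `csv` modules (the file, database, table, and column names below are placeholders, not anything from the question):

import csv
import sqlite3

# Sketch only: load a tab-separated file into SQLite once, then query it
# with SQL. "data.tsv", "genes.db", "mapping", "col1"/"col2" are placeholders.
con = sqlite3.connect("genes.db")
con.execute("CREATE TABLE IF NOT EXISTS mapping (col1 TEXT, col2 TEXT)")

with open("data.tsv", newline="") as infile:
    reader = csv.reader(infile, delimiter="\t")
    # Keep only the first two tab-separated fields of each row.
    con.executemany(
        "INSERT INTO mapping VALUES (?, ?)",
        (row[:2] for row in reader if len(row) >= 2),
    )

# One-time index so later lookups on the second column are cheap.
con.execute("CREATE INDEX IF NOT EXISTS idx_col2 ON mapping (col2)")
con.commit()

# From here on, the data is queried with SQL instead of re-reading the file.
second_items = [row[0] for row in con.execute("SELECT col2 FROM mapping")]
con.close()

The `.mode tabs` setting mentioned above does the same import from the sqlite3 command-line shell, paired with its `.import` command.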

See Sqlite vs CSV file manipulation performance.

Schwern
  • Thanks for the suggestions but given that I have to use the NFS on our compute cluster SQLite is not an option (already did that and lost the db twice thanks to the known "database locked" problem). Additionally, my perl code runs reasonably fast but, as mentioned, I thought there might be a "nifty" way of using just python. Thanks anyways – Lowry Mar 18 '17 at 07:04