I need to process lots of medium to large files (a few hundred MB to GBs) in a linewise manner, so I'm interested in standard D approaches for iterating over lines. The foreach(line; file.byLine())
idiom seems to fit the bill and is pleasantly terse and readable, however performance seems to be less than ideal.
For example, below are two trivial programs in Python and D for iterating over the lines of a file and counting the lines. For a ~470 MB file (~3.6M lines) I get the following timings (best out of 10):
D times:
real 0m19.146s
user 0m18.932s
sys 0m0.190s
Python times (after EDIT 2, see below) :
real 0m0.924s
user 0m0.792s
sys 0m0.129s
Here's the D version, compiled with dmd -O -release -inline -m64
:
import std.stdio;
import std.string;
int main(string[] args)
{
if (args.length < 2) {
return 1;
}
auto infile = File(args[1]);
uint linect = 0;
foreach (line; infile.byLine())
linect += 1;
writeln("There are: ", linect, " lines.");
return 0;
}
And now the corresponding Python version:
import sys
if __name__ == "__main__":
if (len(sys.argv) < 2):
sys.exit()
infile = open(sys.argv[1])
linect = 0
for line in infile:
linect += 1
print "There are %d lines" % linect
EDIT 2: I changed the Python code to use the more idiomatic for line in infile
as suggested in the comments below, leading to an even greater speed-up for the Python version, which is now approaching the speed of the standard wc -l
call to the Unix wc
tool.
Any advice or pointers to what I might be doing wrong in D, that is giving such poor performance?
EDIT: And for comparison, here's a D version that throws the byLine()
idiom out the window and sucks all the data into memory at once, and then splits the data into lines post-hoc. This gives better performance but is still about 2x slower than they Python version.
import std.stdio;
import std.string;
import std.file;
int main(string[] args)
{
if (args.length < 2) {
return 1;
}
auto c = cast(string) read(args[1]);
auto l = splitLines(c);
writeln("There are ", l.length, " lines.");
return 0;
}
The timings for this last version are as follows:
real 0m3.201s
user 0m2.820s
sys 0m0.376s