
I need to process lots of medium to large files (a few hundred MB to a few GB) in a linewise manner, so I'm interested in standard D approaches for iterating over lines. The `foreach(line; file.byLine())` idiom seems to fit the bill and is pleasantly terse and readable; however, performance seems to be less than ideal.

For example, below are two trivial programs in Python and D for iterating over the lines of a file and counting the lines. For a ~470 MB file (~3.6M lines) I get the following timings (best out of 10):

D times:

real    0m19.146s
user    0m18.932s
sys     0m0.190s

Python times (after EDIT 2, see below):

real    0m0.924s
user    0m0.792s
sys     0m0.129s

Here's the D version, compiled with dmd -O -release -inline -m64:

import std.stdio;
import std.string;

int main(string[] args)
{
  if (args.length < 2) {
    return 1;
  }
  auto infile = File(args[1]);
  uint linect = 0;
  foreach (line; infile.byLine())
    linect += 1;
  writeln("There are: ", linect, " lines.");
  return 0;
}

And now the corresponding Python version:

import sys

if __name__ == "__main__":
    if (len(sys.argv) < 2):
        sys.exit()
    infile = open(sys.argv[1])
    linect = 0
    for line in infile:
        linect += 1
    print "There are %d lines" % linect

EDIT 2: I changed the Python code to use the more idiomatic `for line in infile`, as suggested in the comments below, leading to an even greater speed-up for the Python version, which is now approaching the speed of the standard Unix `wc -l` tool.

Any advice or pointers to what I might be doing wrong in D that is giving such poor performance?

EDIT: And for comparison, here's a D version that throws the byLine() idiom out the window and sucks all the data into memory at once, then splits the data into lines post hoc. This gives better performance but is still about 2x slower than the Python version.

import std.stdio;
import std.string;
import std.file;

int main(string[] args)
{
  if (args.length < 2) {
    return 1;
  }
  auto c = cast(string) read(args[1]);
  auto l = splitLines(c);
  writeln("There are ", l.length, " lines.");
  return 0;
}

The timings for this last version are as follows:

real    0m3.201s
user    0m2.820s
sys     0m0.376s
Paul M.
  • Tried with different versions of dmd (2.067.0-b3, 2.066.1, 2.064.2), with roughly the same result. The culprit seems to be `-m64`. Locally, for a 200M file consisting of short-ish lines (no more than 100 characters), the 32-bit version runs a bit faster than Python (1.5 vs. 1.8 seconds), but the 64-bit version takes 6.9 seconds, which is more than 4x worse than for 32 bits. Perhaps a 64-bit codegen inefficiency of some kind, worth reporting as a bug at http://issues.dlang.org. – Gassa Mar 08 '15 at 02:30
  • On a side note, yet another optimization flag is "-noboundscheck" (or its alternative form "-boundscheck=off" supported since 2.066). It completely disables array bounds checking. That said, it does not help much in this case. – Gassa Mar 08 '15 at 02:32
  • When I compile without the "-m64" flag I get slightly worse performance (though I am on a 64-bit machine, OS X 10.10; dmd v2.066) – Paul M. Mar 08 '15 at 02:35
  • It might be important to include a large text file generator so that others can check locally using the exact same setup. Or at least mention the range of line lengths in the file you use. (I was testing on Windows, also 64-bit.) – Gassa Mar 08 '15 at 02:50
  • By the way, I am not sure that compiling a 32-bit executable is the default on OSX, or possible at all. Please try explicit `-m32` for 32-bit mode to find out. – Gassa Mar 08 '15 at 02:55
  • Using the `-m32` flag it fails with a `ld: symbol(s) not found for architecture i386` error. I've gone ahead and opened an issue on the dlang.org website, including a link to the file I was using for testing purposes. See https://issues.dlang.org/show_bug.cgi?id=14256 . Thanks for your help. – Paul M. Mar 08 '15 at 03:01
  • `readlines` reads everything into memory; `list(file)` is a more idiomatic way to do that but in this case you should just do `for line in infile`. Note that if you want to compare only pure IO speeds you should consider a faster iterable-counting method [like given here](http://stackoverflow.com/a/15112059/1763356) - CPython isn't a fast interpreter. – Veedrac Mar 08 '15 at 18:20
  • @Veedrac Good point. I've updated the Python code to the more idiomatic form, with a significant improvement in speed and a decrease in memory usage. – Paul M. Mar 08 '15 at 19:20
  • You might want to try compiling with LDC, as DMD is quite bad at doing optimizations. – Meta Mar 11 '15 at 02:27
  • I did indeed try LDC, and it performed even worse in this instance. I think Gassa's comments / tests get to the core of the issue -- 32-bit vs 64-bit performance of the compilers. – Paul M. Mar 11 '15 at 02:39

6 Answers


EDIT AND TL;DR: This problem has been solved in https://github.com/D-Programming-Language/phobos/pull/3089. The improved File.byLine performance will be available starting with D 2.068.

I tried your code on a text file with 575,247 lines. The Python baseline takes about 0.125 seconds. Here's my code, with timings embedded in the comments for each method. Explanations follow.

import std.algorithm, std.file, std.stdio, std.string;

int main(string[] args)
{
  if (args.length < 2) {
    return 1;
  }
  size_t linect = 0;

  // 0.62 s
  foreach (line; File(args[1]).byLine())
    linect += 1;

  // 0.2 s
  //linect = args[1].readText.count!(c => c == '\n');

  // 0.095 s
  //linect = args[1].readText.representation.count!(c => c == '\n');

  // 0.11 s
  //linect = File(args[1]).byChunk(4096).joiner.count!(c => c == '\n');

  writeln("There are: ", linect, " lines.");
  return 0;
}

I used dmd -O -release -inline for each variant.

The first version (slowest) reads one line at a time. We could and should improve the performance of byLine; currently it's hamstrung by things like mixed use of byLine with other C stdio operations, which is probably overly conservative. If we do away with that, we can easily do prefetching etc.

The second version reads the file in one fell swoop and then uses a standard algorithm to count the lines with a predicate.

The third version acknowledges that there's no need to mind any UTF subtleties; counting bytes works just as well, so it converts the string to its byte-wise representation (at no cost) and then counts the bytes.

The last version (my fave) reads 4KB of data from the file at a time and flattens them lazily using joiner. Then again it counts the bytes.
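
As an aside, for quick local comparisons of variants like these, a minimal timing harness can be built around std.datetime's StopWatch (a sketch; in later releases StopWatch moved to std.datetime.stopwatch):

import std.datetime : StopWatch;
import std.stdio : writeln;

void main()
{
    StopWatch sw;
    sw.start();
    // ... run one of the counting variants above ...
    sw.stop();
    writeln(sw.peek().msecs, " ms");
}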

Andrei Alexandrescu
  • Andrei's answer provides some insights into IO in D, but I agree it doesn't really address the key issue I was struggling with -- how to efficiently iterate through a file in a line-wise manner. In any real application I'd be processing the lines / extracting information etc. The line counting example was primarily to illustrate the slow behavior of line-wise iteration in D. – Paul M. Mar 21 '15 at 22:43
  • @Veedrac: huh, you're right - got caught up in the microbenchmark at hand. I just took a look into things and boy could that code be improved. See https://github.com/D-Programming-Language/phobos/pull/3089. Under the same test conditions, the byLine version now takes 0.136 seconds. – Andrei Alexandrescu Mar 22 '15 at 05:48

I thought I'd do something new today, so I decided to "learn" D. Please note that this is the first D I've written, so I might be completely off.

The first thing I tried was manually buffering:

foreach (chunk; infile.byChunk(100000)) {
    linect += splitLines(cast(string) chunk).length;
}

Note that this is incorrect since it ignores lines crossing boundaries, but fixing that comes later.

This helped a bit, but not nearly enough. It did allow me to test

foreach (chunk; infile.byChunk(100000)) {
    linect += (cast(string) chunk).length;
}

which showed that all the time was in splitLines.

I made a local copy of splitLines. This alone increased speed by a factor of 2! I wasn't expecting this. I'm running with both

dmd -release -inline -O -m64 -boundscheck=on
dmd -release -inline -O -m64 -boundscheck=off

It's about the same either way.

Then I rewrote splitLines to be specialized for the case s[i].sizeof == 1; it now seems to be slower than Python only because it also breaks on paragraph separators.

To finish up, I made a Range and optimized it further, which gets the code close to Python's speed. Considering that Python doesn't break on paragraph separators, and that the code underlying it is written in C, this seems OK. This code may have O(n²) performance on lines longer than 8 kB, but I'm not sure.

import std.range;
import std.stdio;

auto lines(File file, KeepTerminator keepTerm = KeepTerminator.no) {
    struct Result {
        public File.ByChunk chunks;
        public KeepTerminator keepTerm;
        private string nextLine;
        private ubyte[] cache;

        this(File file, KeepTerminator keepTerm) {
            chunks = file.byChunk(8192);
            this.keepTerm = keepTerm;

            if (chunks.empty) {
                nextLine = null;
            }
            else {
                // Initialize cache and run an
                // iteration to set nextLine
                popFront;
            }
        }

        @property bool empty() {
            return nextLine is null;
        }

        @property auto ref front() {
            return nextLine;
        }

        void popFront() {
            size_t i;
            while (true) {
                // Iterate until we run out of cache
                // or we meet a potential end-of-line
                while (
                    i < cache.length &&
                    cache[i] != '\n' &&
                    cache[i] != 0xA8 &&
                    cache[i] != 0xA9
                ) {
                    ++i;
                }

                if (i == cache.length) {
                    // Can't extend; just give the rest
                    if (chunks.empty) {
                        nextLine = cache.length ? cast(string) cache : null;
                        cache = new ubyte[0];
                        return;
                    }

                    // Extend cache
                    cache ~= chunks.front;
                    chunks.popFront;
                    continue;
                }

                // Check for false-positives from the end-of-line heuristic
                if (cache[i] != '\n') {
                    if (i < 2 || cache[i - 2] != 0xE2 || cache[i - 1] != 0x80) {
                        // Not a real U+2028/U+2029 separator; skip this
                        // byte and keep scanning (avoids an infinite loop)
                        ++i;
                        continue;
                    }
                }

                break;
            }

            size_t iEnd = i + 1;
            if (keepTerm == KeepTerminator.no) {
                // E2 80 A8 (U+2028) or E2 80 A9 (U+2029)
                if (cache[i] != '\n') {
                    iEnd -= 3;
                }
                // \r\n
                else if (i >= 1 && cache[i - 1] == '\r') {
                    iEnd -= 2;
                }
                // \n
                else {
                    iEnd -= 1;
                }
            }

            nextLine = cast(string) cache[0 .. iEnd];
            cache = cache[i + 1 .. $];
        }
    }

    return Result(file, keepTerm);
}

int main(string[] args)
{
    if (args.length < 2) {
        return 1;
    }

    auto file = File(args[1]);
    writeln("There are: ", walkLength(lines(file)), " lines.");

    return 0;
}
Veedrac
  • This is the fastest of the D code examples that would actually allow me to do some processing of the lines. The timings on the example input file above are: real 0m1.339s, user 0m1.190s, sys 0m0.144s – Paul M. Mar 21 '15 at 23:10

It is debatable whether counting lines is a good proxy for overall performance in a text-processing application. You are testing the efficiency of Python's C library as much as anything else, and you will get different results once you actually start doing useful things with the data. D has had less time than Python to hone its standard library, and there are fewer people involved. The performance of byLine has been under discussion for a couple of years now, and I think the next release will be faster.

People do seem to find D efficient and productive for text processing of exactly this sort. For example, AdRoll is well-known as a python shop, but their data science guys use D:

http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html

To come back to the question, one is obviously comparing compilers and libraries as much as the language itself. DMD's role is to be the reference compiler, one that compiles lightning fast. So it is great for rapid development and iteration, but if you need speed then you should use LDC or GDC, and if you do use DMD then turn on optimization and turn off bounds checking.
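
For reference, typical invocations might look like the following (a sketch: the file name linecount.d is illustrative, and exact flag sets vary by compiler version):

dmd -O -release -inline -boundscheck=off linecount.d
ldc2 -O3 -release linecount.d
gdc -O3 -frelease -o linecount linecount.d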

On my Arch Linux 64-bit HP ProBook 4530s machine, using the last 1 million lines of the WestburyLab usenet corpus, I get the following:

python2: real 0m0.333s, user 0m0.253s, sys 0m0.013s

pypy (warmed up): real 0m0.286s, user 0m0.250s, sys 0m0.033s

DMD (defaults): real 0m0.468s, user 0m0.460s, sys 0m0.007s

DMD (-O -release -inline -noboundscheck): real 0m0.398s, user 0m0.393s, sys 0m0.003s

GDC (defaults): real 0m0.400s, user 0m0.380s, sys 0m0.017s [I don't know switches for GDC optimization]

LDC (defaults): real 0m0.396s, user 0m0.380s, sys 0m0.013s

LDC (-O5): real 0m0.336s, user 0m0.317s, sys 0m0.017s

In a real application, one would use the built-in profiler to identify hotspots and tweak the code, but I agree that naive D ought to be decent speed and at worst in the same ballpark as Python. And using LDC with optimization, that is indeed what we see.
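
For what it's worth, a minimal run of DMD's built-in profiler looks roughly like this (file names again illustrative); the instrumented binary writes per-function timings to trace.log when it exits:

dmd -profile -O -release linecount.d
./linecount bigfile.txt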

For completeness, I changed your D code to the following. (Some of the imports are not needed - I was playing around).

import std.stdio;
import std.string;
import std.datetime;
import std.range, std.algorithm;
import std.array;

int main(string[] args)
{
  if (args.length < 2) {
    return 1;
  }
  auto t=Clock.currTime();
  auto infile = File(args[1]);
  uint linect = 0;
  foreach (line; infile.byLine)
    linect += 1;
  auto t2=Clock.currTime-t;
  writefln("There are: %s lines and took %s", linect, t2);
  return 0;
}
Cantillon
  • I cannot comment, but the example by Kozzi11 below is indeed faster on my machine, coming in at 0.255s using dmd optimized. Possibly a local machine question. Which version of DMD are you running? Any other information would be helpful. – Cantillon Mar 22 '15 at 13:55

This should be faster than your version, and even faster than the Python version:

module main;

import std.stdio;
import std.file;
import std.array;

void main(string[] args)
{
    auto infile = File(args[1]);
    // readln reuses this buffer, growing it as needed, so there is no
    // per-line allocation once the longest line has been seen
    auto buffer = uninitializedArray!(char[])(100);
    uint linect;
    while(infile.readln(buffer))
    {
        linect += 1;
    }
    writeln("There are: ", linect, " lines.");
}
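
Since the question is ultimately about processing lines rather than just counting them, here is a sketch of where the per-line work would go with this approach; readln returns the number of bytes read (0 at end of file) and reuses the buffer across iterations:

import std.stdio;

void main(string[] args)
{
    auto infile = File(args[1]);
    char[] buffer; // grown by readln as needed, then reused
    size_t linect;
    while (infile.readln(buffer))
    {
        ++linect;
        // buffer now holds the current line (terminator included),
        // so per-line work can happen here with no extra allocation
    }
    writeln("There are: ", linect, " lines.");
}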
Kozzi11
  • In fact, it has the same problem with `-m64` for me when tested locally. Also, it is still slower than Python with longer lines, regardless of 32 or 64 bit. I'll add some test generators and results to https://issues.dlang.org/show_bug.cgi?id=14256. – Gassa Mar 08 '15 at 12:11

tl;dr strings are auto-decoded, which makes splitLines slow.

The current implementation of splitLines decodes the string on the fly, which makes it slow. In the next version of Phobos this will be fixed.

There will be a range that does that for you as well.
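
In the meantime, one way to sidestep auto-decoding for a task like this (a sketch, assuming plain '\n' terminators are all that matter) is to work on the raw bytes:

import std.algorithm : count;
import std.file : read;
import std.stdio : writeln;

void main(string[] args)
{
    // ubyte elements are not auto-decoded, so this is a plain byte scan
    auto bytes = cast(ubyte[]) read(args[1]);
    writeln("lineCount: ", bytes.count(cast(ubyte) '\n'));
}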

In general the D GC is not state of the art; however, D gives you the opportunity to produce less garbage. To get a competitive program, you'll need to avoid useless allocations. The second big thing: for fast code, use gdc or ldc, because the strength of dmd is producing code fast, not producing fast code.

So I didn't time it, but this version should not allocate after the biggest line, because it reuses the buffer and does not decode UTF.

import std.stdio;

void main(string[] args)
{
    auto f = File(args[1]);
    // explicit mention ubyte[], buffer will be reused
    // no UTF decoding, only looks for "\n". See docs.
    int lineCount;
    foreach(ubyte[] line; std.stdio.lines(f))
    {
        lineCount += 1;
    }

    writeln("lineCount: ", lineCount);
}

A version using ranges could look like this, if you require that every line ends with a terminator:

import std.stdio, std.algorithm;

void main(string[] args)
{
    auto f = File(args[1]);

    auto lineCount = f.byChunk(4096) // read file by chunks of page size
        .joiner                      // "concatenate" these chunks
        .count(cast(ubyte) '\n');    // count lines
    writeln("lineCount: ", lineCount);
}

In the next release, the following gets near-optimal performance while breaking on all line-breaking whitespace:

import std.stdio, std.algorithm, std.range, std.string;

void main(string[] args)
{
    auto f = File(args[1]);

    auto lineCount = f.byChunk(4096) // read file by chunks of page size
        .joiner                      // "concatenate" these chunks
        .lineSplitter                // split by line
        .walkLength;                 // count lines
    writeln("lineCount: ", lineCount);
}
Panke
  • Please explain downvotes. This answer looks good to me so the downvote left me a bit perplexed. Consider also that Panke is effectively new, so giving downvotes without explaining is particularly harmful. – Veedrac Mar 20 '15 at 22:19
  • I was hopeful about your first example, as it facilitates line-wise processing, but unfortunately the timings are among the poorest of the examples I've tried. On the same data set I tested the original code on, I got: real 1m1.199s, user 1m0.213s, sys 0m0.618s – Paul M. Mar 21 '15 at 22:49
int main(string[] args)
{
    import std.algorithm : splitter;
    import std.mmfile : MmFile;
    import std.stdio : writeln;

    // Memory-map the file; the OS pages it in lazily as we scan
    scope mmf = new MmFile(args[1]);
    size_t linect;
    foreach (line; splitter(cast(string) mmf[], "\n"))
    {
        ++linect;
    }
    writeln("There are: ", linect, " lines.");
    return 0;
}