6

I have a 100 GB text file, which is a BCP dump from a database. When I try to import it with BULK INSERT, I get a cryptic error on line number 219506324. Before solving this issue I would like to see this line, but alas my favorite method of

import linecache
print linecache.getline(filename, linenumber)

is throwing a MemoryError. Interestingly, the manual says that "This function will never throw an exception." Yet on this large file it throws one as soon as I try to read line number 1, even though I have about 6 GB of free RAM...

I would like to know the most elegant method of getting to that unreachable line. Available tools are Python 2, Python 3 and C# 4 (Visual Studio 2010). Yes, I understand that I can always do something like

var line = 0;
using (var stream = new StreamReader(File.OpenRead(@"s:\source\transactions.dat")))
{
     while (++line < 219506324) stream.ReadLine(); //waste some cycles
     Console.WriteLine(stream.ReadLine());
}

Which would work, but I doubt it's the most elegant way.
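
A rough Python counterpart of the same brute-force scan, just as a sketch, would let itertools.islice do the skipping lazily:

from itertools import islice

# Sketch: islice skips the first 219506323 lines lazily, so only one
# line is ever held in memory; it still reads every byte up to it.
with open(r"s:\source\transactions.dat") as f:
    line = next(islice(f, 219506323, 219506324), None)
print(line)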

EDIT: I'm waiting to close this thread, because the hard drive containing the file is being used right now by another process. I'm going to test both suggested methods and report timings. Thank you all for your suggestions and comments.

The results are in. I implemented Gabe's and Alex's methods to see which one was faster. If I'm doing anything wrong, do tell. I'm going for the 10 millionth line in my 100 GB file, first using the method Gabe suggested and then using the method Alex suggested, which I loosely translated into C#... The only thing I'm adding myself is first reading a 300 MB file into memory, just to clear the HDD cache.

const string file = @"x:\....dat"; // 100 GB file
const string otherFile = @"x:\....dat"; // 300 MB file
const int linenumber = 10000000;

ClearHDDCache(otherFile);
GabeMethod(file, linenumber);  //Gabe's method

ClearHDDCache(otherFile);
AlexMethod(file, linenumber);  //Alex's method

// Results
// Gabe's method: 8290 (ms)
// Alex's method: 13455 (ms)

The implementation of Gabe's method is as follows:

var gabe = new Stopwatch();
gabe.Start();
var data = File.ReadLines(file).ElementAt(linenumber - 1);
gabe.Stop();
Console.WriteLine("Gabe's method: {0} (ms)",  gabe.ElapsedMilliseconds);

While Alex's method is slightly trickier:

var alex = new Stopwatch();
alex.Start();
const int buffersize = 100 * 1024; //bytes
var buffer = new byte[buffersize];
var counter = 0;
using (var filestream = File.OpenRead(file))
{
    while (true) // Cutting corners here...
    {
        filestream.Read(buffer, 0, buffersize); // return value (bytes actually read) is ignored -- cutting corners
        //At this point we could probably launch an async read into the next chunk...
        var linesread = buffer.Count(b => b == 10); // 10 is the ASCII line feed ('\n')
        if (counter + linesread >= linenumber) break;
        counter += linesread;
    }
}
//The downside of this method is that we have to assume that the line fits into the buffer, or do something clever...er
var data = new ASCIIEncoding().GetString(buffer).Split('\n').ElementAt(linenumber - counter - 1);
alex.Stop();
Console.WriteLine("Alex's method: {0} (ms)", alex.ElapsedMilliseconds);

So unless Alex cares to comment I'll mark Gabe's solution as accepted.

alain.janinm
  • 19,951
  • 10
  • 65
  • 112
Gleno
  • 16,621
  • 12
  • 64
  • 85
  • Might I suggest running your giant file through a utility like *nix `split` to break it up into a set of smaller files that would be easier to work with? http://ss64.com/bash/split.html – Amber Aug 28 '10 at 03:55
  • 1
    When you're dealing with 100 GB files, adding an extra pass to split the data apart--creating another 100 GB of data--is a huge performance hit, which should be avoidable. – Glenn Maynard Aug 28 '10 at 05:18
  • 1
    Elegant, shmelegant. Don't forget that the original problem was the database insert issue, not the slickest way to read line 'n' from large data file z.dat. Write the C# four liner, let it run for 10 minutes, get the troublesome line, and move on. Or use `split` to break up the file as @Amber suggested, and see if simply using smaller files does the trick. (My suspicion is that BULK INSERT is trying to do all these inserts in a single transaction, and is blowing up some database before-image file since it has to be able to ROLLBACK. See if you have a BULK INSERT COMMIT EVERY 1000000 option.) – PaulMcG Aug 28 '10 at 08:10
  • Gleno: How many bytes of the file has to be read to get to the requested line? – Gabe Aug 28 '10 at 21:15
  • Gleno: What makes you think that reading in a 300MB file will clear the HDD cache? Even on my machine with 2GB of RAM I doubt that a 300MB file would clear it. – Gabe Aug 29 '10 at 05:21
  • To be honest, I didn't think it through. Now that I do, the line that I'm seeking to is about 4 times my total RAM, which perhaps invalidates the point in the first place. – Gleno Aug 30 '10 at 02:25

5 Answers

8

Here's my elegant version in C#:

Console.Write(File.ReadLines(@"s:\source\transactions.dat").ElementAt(219506323));

or more general:

Console.Write(File.ReadLines(filename).ElementAt(linenumber - 1));

Of course, you may want to show some context before and after the given line:

Console.Write(string.Join("\n",
              File.ReadLines(filename).Skip(linenumber - 5).Take(10)));

or more fluently:

File
.ReadLines(filename)
.Skip(linenumber - 5)
.Take(10)
.AsObservable()
.Do(Console.WriteLine);

BTW, the linecache module does not do anything clever with large files. It just reads the whole thing in, keeping it all in memory. The only exceptions it catches are I/O-related (can't access file, file not found, etc.). Here's the important part of the code:

    fp = open(fullname, 'rU')
    lines = fp.readlines()
    fp.close()

In other words, it's trying to fit the whole 100GB file into 6GB of RAM! What the manual should perhaps say is "This function will never throw an exception if it can't access the file."

Gabe
  • 84,912
  • 12
  • 139
  • 238
  • +1, Console.Write(File.ReadLines(filename).ElementAt(linenumber - 1)); is certainly elegant. Considering all .NET file i/o is buffered, it would be efficient from a disk I/O perspective as well. – VinayC Aug 28 '10 at 03:48
6

Well, memory can run out at any time, asynchronously and unpredictably -- that's why the "never an exception" promise doesn't really apply there (just as in Java, where even though every method must specify which exceptions it can raise, some exceptions are exempted from this rule, since just about any method can unpredictably raise them at any time due to resource scarcity or other system-wide issues).

linecache tries to read the whole file. Your only simple alternative (hopefully you're not in a hurry) is to read one line at a time from the start...:

def readoneline(filepath, linenum):
    if linenum < 0: return ''
    with open(filepath) as f:
        for i, line in enumerate(f):
            if i == linenum: return line
        return ''

Here, linenum is 0-based (if you don't like that, and your Python is 2.6 or better, pass a starting value of 1 to enumerate), and the return value is the empty string for invalid line numbers.
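
For instance, the 1-based variant is just a one-line change (a sketch, 2.6+ only; the name is purely for illustration):

def readoneline_1based(filepath, linenum):
    if linenum < 1: return ''
    with open(filepath) as f:
        # enumerate(f, 1) makes the counter 1-based
        for i, line in enumerate(f, 1):
            if i == linenum: return line
        return ''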

Somewhat faster (and a lot more complicated) is to read, say, 100 MB at a time (in binary mode) into a buffer; count the number of line-ends in the buffer (just a .count('\n') call on the buffer string object); once the running total of line-ends exceeds the linenum you're looking for, find the Nth line-end currently in the buffer (where N is the difference between linenum, here 1-based, and the previous running total of line-ends), read a bit more if the (N+1)-st line-end is not also in the buffer (since that's the point where your line ends), and extract the relevant substring. Not just a couple of lines, once you add the with and the returns for the anomalous cases...;-)
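
Here is a rough, untested sketch of the approach just described (1-based linenum, binary mode, returns the line without its trailing newline; the helper's name and the exact buffering policy are just illustrative choices):

def getline_buffered(filepath, linenum, bufsize=100*1024*1024):
    # Read big binary chunks, counting line-ends, until the chunk
    # containing the end of the requested line arrives; then slice
    # the line out of the accumulated leftover.
    with open(filepath, 'rb') as f:
        seen = 0          # complete lines consumed in earlier chunks
        pending = b''     # partial last line carried over between chunks
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                # EOF: the very last line may lack a trailing newline
                if pending and seen + 1 == linenum:
                    return pending
                return b''
            data = pending + chunk
            n = data.count(b'\n')
            if seen + n >= linenum:
                # the requested line ends somewhere inside `data`
                return data.split(b'\n')[linenum - seen - 1]
            seen += n
            # keep only the bytes after the last line-end seen so far
            pending = data[data.rfind(b'\n') + 1:]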

Edit: since the OP comments doubting that reading by-buffer instead of by-line can make a performance difference, I dug up an old piece of code where I was measuring the two approaches for a somewhat related task -- counting the number of lines with the buffer approach, with a loop on lines, or by reading the whole file into memory at one gulp (with readlines). The target file is kjv.txt, the standard English text of the King James Version of the Bible, one line per verse, ASCII:

$ wc kjv.txt 
  114150  821108 4834378 kjv.txt

Platform is a MacBook Pro laptop, OS X 10.5.8, Intel Core 2 Duo at 2.4 GHz, Python 2.6.5.

The module for the test, readkjv.py:

def byline(fn='kjv.txt'):
    with open(fn) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

def byall(fn='kjv.txt'):
    with open(fn) as f:
        return len(f.readlines())

def bybuf(fn='kjv.txt', BS=100*1024):
    with open(fn, 'rb') as f:
        tot = 0
        while True:
            blk = f.read(BS)
            if not blk: return tot
            tot += blk.count('\n')

if __name__ == '__main__':
    print bybuf()
    print byline()
    print byall()

The prints are just to confirm correctness, of course (and they do ;-).

The measurement, of course after a few dry runs to ensure everybody's benefitting equally from the OS's, disk controller's, and filesystem's read-ahead functionality (if any):

$ py26 -mtimeit -s'import readkjv' 'readkjv.byall()'
10 loops, best of 3: 40.3 msec per loop
$ py26 -mtimeit -s'import readkjv' 'readkjv.byline()'
10 loops, best of 3: 39 msec per loop
$ py26 -mtimeit -s'import readkjv' 'readkjv.bybuf()'
10 loops, best of 3: 25.5 msec per loop

The numbers are quite repeatable. As you see, even on such a tiny file (less than 5 MB!), by-line approaches are slower than buffer-based ones -- just too much wasted effort!

To check scalability, I next used a 4-times-larger file, as follows:

$ cat kjv.txt kjv.txt kjv.txt kjv.txt >k4.txt
$ wc k4.txt
  456600 3284432 19337512 k4.txt
$ py26 -mtimeit -s'import readkjv' 'readkjv.bybuf()'
10 loops, best of 3: 25.4 msec per loop
$ py26 -mtimeit -s'import readkjv' 'readkjv.bybuf("k4.txt")'
10 loops, best of 3: 102 msec per loop

and, as predicted, the by-buffer approach scales just about exactly linearly. Extrapolating (always a risky endeavour, of course;-), a bit less than 200 MB per second seems to be the predictable performance -- call it 6 seconds per GB, or maybe 10 minutes for 100 GB.
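
(For the record, the back-of-the-envelope arithmetic behind that extrapolation, using the k4.txt numbers above:)

# rough throughput implied by the k4.txt timing
size_bytes = 19337512                 # wc -c for k4.txt
seconds = 0.102                       # "102 msec per loop"
mb_per_sec = size_bytes / seconds / (1024 * 1024)
print(mb_per_sec)                     # roughly 180 MB/s
print(100 * 1024 / mb_per_sec / 60)   # minutes for 100 GB: a bit under 10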

Of course what this small program does is just line counting, but (once there is enough I/O to amortize constant overheads;-) a program to read a specific line should have similar performance (even though it needs more processing once it's found "the" buffer of interest, it's a roughly constant amount of processing, for a buffer of a given size -- presumably repeated halving of the buffer to identify a small-enough part of it, then a little bit of effort linear in the size of the multiply-halved "buffer remainder").

Elegant? Not really... but, for speed, pretty hard to beat!-)

Alex Martelli
  • 854,459
  • 170
  • 1,222
  • 1,395
  • Do you really think it would be that much faster to do a low-level line-end count? I'd think that either way would be about as slow as reading in the whole file from disk. – Gabe Aug 28 '10 at 03:43
  • 2
    When something that calls itself a "cache" proclaims that it will never throw an exception, it implies me that it discards old members from its cache when it runs out of memory. Based on the docs for the function, I would *not* expect it to read all requested files into memory and never release them! – Gabe Aug 28 '10 at 03:58
  • @Gabe, I see what you mean wrt the docs, and I recommend you propose the specific patch in Python's docs that would have clarified the issue to you more sharply (see http://www.python.org/dev/patches/ ). Wrt speed, reading a file sequentially 100MB at a time is faster than reading it 100GB at a time unless you have more than 100GB of RAM available (as the latter course then either fails, or thrashes in paging); and it's faster than reading one line at a time as it avoids mucho parsing, object allocation, and so on, that you don't need, _and_ uses a larger buffer. – Alex Martelli Aug 28 '10 at 04:52
  • BTW, also see http://stackoverflow.com/questions/2985725/moving-to-an-arbitrary-position-in-a-file-in-python/2985778#2985778 (based on a slightly different case where lines from the files would need to be read repeatedly, so building an index was clearly useful) -- also, I'm pretty sure I did at least once post to SO several time measurements for this kind of buffering approach vs a line-by-line one, but I can't easily find the SO post where I did so, alas. – Alex Martelli Aug 28 '10 at 05:08
  • Alex: I just meant that since this process is likely to be I/O-bound, you're not likely to get much speedup by optimizing the line counting. – Gabe Aug 28 '10 at 05:08
  • @Gabe, as I said, I've measured -- not on _your_ 100-GB-file, your OS, your filesystem, your specific HW, etc, of course, which is why it's important that you do your own benchmarks. But intuition on performance is often misleading: carefully arranged measurement is a much more reliable guide. – Alex Martelli Aug 28 '10 at 05:13
  • I see someone has submitted a patch a while back to improve linecache handling of large files (http://bugs.python.org/issue1708). If someone wanted to review, update, and test it, that could improve the chances of it getting accepted for Python 3: – Ned Deily Aug 28 '10 at 05:13
  • It should be reading it in reasonably small blocks, eg. 64k at a time on demand, caching line-number to file-offset mappings to allow quick seeking by line number, and adjusting the distance between cached lines to avoid linearly increasing memory usage, so it works on huge files like this. That's certainly what I'd expect linecache to do based on the documentation, but it's not at all what it does. Looking at the source code, it's a very crude library that I'd suggest doesn't belong in the Python library at all. I even see thread-safety issues on casual inspection. – Glenn Maynard Aug 28 '10 at 05:17
  • Glenn: The linechache module was intended for caching source files for printing exception tracebacks, and for this it does quite well. You want it to cache the source code at the time it was parsed by Python, not at the time the exception happened. BTW, that's why this code isn't supposed to throw exceptions -- it's intended to be used by the exception code path. – Gabe Aug 28 '10 at 05:22
  • @Gabe, OK -- edited answer to show explicit buffering's vastly superior performance in a simple, easily repeatable benchmark case. – Alex Martelli Aug 28 '10 at 05:39
  • Alex, you just proved my point. :) Your buffering method is only 60% faster, and *only* if your disk can supply data at a staggering sustained 200MB/s! (1st gen SATA is only 150MB/s.) Since the general case is that you *cannot* saturate your disk interface for several minutes at a time, there's unlikely to be a measurable difference in the running time for the "faster" algorithm. – Gabe Aug 28 '10 at 06:01
  • Apart from comparing algorithm and buffering performance, it would be interesting to compare timing of the Python solution with the OP's C#/.Net solution. – PaulMcG Aug 28 '10 at 08:03
  • 1
    @Alex, Maybe there's a typo, "for i, line in enum(filepath):" -> "for i, line in enumerate(f)". by the way, enumerate (Python 2.6) has a start parameter for 1-based or whatever-based taste :) – sunqiang Aug 28 '10 at 09:01
  • @Gabe: A tangential but important nitpick: CPU use is still CPU use, whether or not it's the bottleneck for that particular operation. I don't want a program tying up a whole CPU core if it only needs 25% of one. Regardless of whether what *it's* doing will happen any faster as a result, it's still wasting resources that could be used elsewhere. – Glenn Maynard Aug 28 '10 at 10:03
  • @Alex Martelli, re "unless you have more than 100GB of RAM available": Actually, it's almost certainly faster to read in smaller chunks--say, 512k at a time--regardless of how much memory you have, because it'll keep the working set in L2 cache. Just as importantly, it means you're doing a very common operation instead of a very unusual one, which means you're going to be hitting much better tested and optimized code paths. – Glenn Maynard Aug 28 '10 at 10:16
  • @Glenn, have you noticed that, while I _measure_, you _blabber_? Get some real-world numbers in lieu of all of this blather, and the discussion will be _bilaterally_ meaningful (read Stroustrup's book on the design history of C++ about how his trusting badly "well-argued blabber" w/o supporting **numbers**, **data**, caused design errors that were revealed too late to fix them -- those who don't know history are doomed to repeat it...). Go data-driven, don't repeat the errors of history. – Alex Martelli Aug 28 '10 at 14:14
  • @Gabe, are you claiming that a 2-plus years old Mac laptop can deliver "staggering" disk performance?! I guess Steve Jobs would be tickled pink to read this - even his most unabashed fanboyz don't usually make _quite_ as extravagant claims wrt the performance of Macs!-) So, **measure** (ideally on a serious server, of course), for goodness sake...! Why am I usually the only one who can be bothered to verify claims with numbers?!-) – Alex Martelli Aug 28 '10 at 14:20
  • Alex: *Clearly* your disk is not capable of exceeding the bandwidth of its interface. The problem is that I'm claiming that scanning a 100GB file is an I/O-bound operation such that micro-optimizing the scanner won't matter, and you come back with a benchmark where you cache the complete file so you're not doing any I/O at all! I'm making an argument relating to disk speed, but your benchmark never hits the disk, so it's irrelevant. – Gabe Aug 28 '10 at 16:26
  • I've tested both methods and it would seem that Gabe's faster. Care to comment? – Gleno Aug 28 '10 at 18:47
  • @Gleno, I have no idea what the C# code is doing "inside", nor of what the exact numbers are (and on what size, platform, &c), so how could I comment? @Gabe, if you add the same delay to two approaches A and B (of which A is faster) to account for I/O delays, A will still be faster -- waste less CPU as @Glenn commented to you, for example -- so why should B ever become preferable? – Alex Martelli Aug 28 '10 at 23:58
  • @Alex Martelli: Thanks for reminding me (and everyone else present) why I don't respond to you; you're a useless, chronic flamer. EOF. – Glenn Maynard Aug 29 '10 at 00:15
  • Alex: If approach B is easier to write, easier to understand, easier to test, and easier to maintain, yet is only a few ms slower, why bother with approach A that might have some difficult corner case (like the line you want straddles multiple buffers). Will those few ms you save on every run ever add up to the extra time you spent coding A? Also, who's to say that the B program will even take longer to finish than the A program? I'd have to see a test on a sufficiently large file (i.e. none of it cached) before I believed that one was faster. – Gabe Aug 29 '10 at 05:16
1

You can try this sed one-liner, sed '42q;d', to fetch line number 42: the d command silently deletes every line before the 42nd, and q prints line 42 and quits without reading any further (so sed '219506324q;d' would fetch the line in question). It's not in Python or C#, but I assume you have sed on your Mac.

Adam Schmideg
  • 10,590
  • 10
  • 53
  • 83
0

Not an elegant, but possibly faster, solution would be to use multiple threads (or Tasks in .NET 4.0) to read and process multiple chunks of the file at the same time.

VinayC
  • 47,395
  • 5
  • 59
  • 72
  • It's unlikely that you can actually read in the data faster than you can process it, so trying to do it in parallel is pointless. – Gabe Aug 28 '10 at 03:38
  • @Gabe, I am actually suggesting to even read the data parallel (possible by seeking to the file position) - depending upon disk system, it may improve performance. – VinayC Aug 28 '10 at 03:42
  • I don't understand what you're suggesting. Since the only way to know what line number a specific byte on is to have looked at every byte prior to that specific one, how is seeking going to help you? – Gabe Aug 28 '10 at 03:48
  • A simple solution outline would be - say 4 threads can read 4 blocks of say 16K size at the same time and start counting line breaks within it. Each thread will update master count at the end (needs thread sync). You need to repeat it till you reach near to desired line number. Then you can run through file sequentially. – VinayC Aug 28 '10 at 04:53
  • Multiple threads to handle linearly streaming a file? That's not only inelegant, it'll be slower, insanely more complicated and bug-prone. If there's any chance of parallel reads making things faster (eg. RAID), your arbitrarily-aligned interleaving of reads isn't going to line up with that. This is trivial streaming, the most basic and most heavily-optimized I/O case; don't turn it into something ridiculously complicated. – Glenn Maynard Aug 28 '10 at 10:00
0

If you expect to need this operation often on the same file, it would make sense to build an index.

You make an index by going through the whole file once and recording the positions of line beginnings, for example in an SQLite database. Then, when you need to get to a specific line, you query the index for it, seek to that position, and read the line.
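
For example, a minimal sketch of the idea in Python (keeping only every 1000th line-start offset in a plain list rather than SQLite; the names and the stride are just illustrative):

def build_line_index(filepath, every=1000):
    # one full pass: record the byte offset of every `every`-th line start
    index = [0]                       # offset of line 1
    with open(filepath, 'rb') as f:
        lineno = 0
        offset = 0
        for line in f:
            lineno += 1
            offset += len(line)
            if lineno % every == 0:
                index.append(offset)  # offset of line (lineno + 1)
    return index

def getline_indexed(filepath, linenum, index, every=1000):
    # seek to the nearest recorded line start, then read forward a little
    with open(filepath, 'rb') as f:
        f.seek(index[(linenum - 1) // every])
        for _ in range((linenum - 1) % every):
            f.readline()
        return f.readline()

Building the index costs one sequential pass over the whole file, but every lookup afterwards is one seek plus at most `every` readline calls.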

Toni Ruža
  • 7,462
  • 2
  • 28
  • 31