
I have a 100GB text file with about 50K rows of varying length.

It is too large to fit in memory, so I currently read it line by line, but that also takes too long. Is there a smarter way to read the file, for example a few rows at a time?

Jimmy Thompson
Roy
  • With a file of that size, I believe the more important question is "What are you doing with the data as you read it?" instead of how to read it. – AKX May 20 '15 at 08:41
  • When you say 'takes too long' you need to look at where the overhead is. You have made the assumption that it is the IO that is slowing things, and you might be right, but without seeing code it is impossible to say. – cdarke May 20 '15 at 08:42
  • Do you have to read it line by line? You could just `read` out the maximum amount you can decently process and then do it. – Noufal Ibrahim May 20 '15 at 08:42
  • @AKX: I'm transforming each line into a sparse vector and then adding it to another numpy vector. – Roy May 20 '15 at 08:43
  • Just checked, using `io.FileIO` instead of `open` gave me a >25 times increase in speed. – bereal May 20 '15 at 08:44
  • So that numpy vector is getting larger and larger. Might that not be where the overhead is? – cdarke May 20 '15 at 08:44
  • @bereal: is that using the same version of python as the OP is using? – cdarke May 20 '15 at 08:45
  • @NoufalIbrahim: I need the lines; I can process a few of them at a time. – Roy May 20 '15 at 08:45
  • First profile your code. Then optimize. – Łukasz Rogalski May 20 '15 at 08:46
  • A great answer to this very question was provided by @abarnert [here](http://stackoverflow.com/a/30294434/364980) – James Mills May 20 '15 at 08:47
  • @cdarke, sorry, I revoke my comment, was measuring it wrong. – bereal May 20 '15 at 08:48
  • By "add it to another vector" you mean vectorized sum or you append it? – Eli Korvigo May 20 '15 at 08:57
  • There is a way to read a few lines at a time. `f.readlines(16384)` will read about 16K and return it as a list of lines (see the sketch after these comments). See the docs for the `readlines` function. This rarely makes a difference, because Python is already buffering the reads anyway, but it's not hard to try it and test to see if it helps. – abarnert May 20 '15 at 09:37
  • Also, which version of Python are you using? And is it all ASCII, mostly ASCII, or neither? For example, if it's all or mostly ASCII and you're using Python 3.2, just upgrading to 3.4 should help. Or, if it's all ASCII and you can't upgrade, opening in binary mode should help. – abarnert May 20 '15 at 09:40
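
A minimal sketch of the `f.readlines(16384)` batching mentioned in the comment above; the loop structure and the `process(lines)` callback are illustrative assumptions, not something from the original post:

with open(filename) as f:
    while True:
        lines = f.readlines(16384)  # read roughly 16K worth of complete lines
        if not lines:                # readlines returns an empty list at EOF
            break
        process(lines)               # hypothetical placeholder for per-batch work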

1 Answer

The basic way to iterate over the lines of a file looks like this:

with open(filename) as f:
    for line in f:
        do_stuff(line)

This keeps only the current line (plus a small read-ahead buffer) in memory, not the whole file. If you want fine-grained control over the buffer size, I suggest you use io.open instead (for example, when your lines are all the same length, this might be useful).
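
For illustration, a sketch of opening the file with an explicit buffer size via `io.open`; the 1 MB value is an arbitrary example, not a recommendation:

import io

# open with an explicit 1 MB read buffer; lines are still yielded one at a time
with io.open(filename, 'r', buffering=1024 * 1024) as f:
    for line in f:
        do_stuff(line)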

If the operation on your data is actually not I/O bound but CPU bound, it might be useful to use multiprocessing:

import multiprocessing

pool = multiprocessing.Pool(8)  # play around for performance

with open(filename) as f:
    pool.map(do_stuff, f)

This does not speed up the actual reading but might improve the performance of processing the lines.
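
Note that `pool.map` consumes the whole iterable before dispatching work, which can be a problem for a file this size. A sketch using `pool.imap` keeps the reading lazy; the chunksize of 1000 and the `collect` callback are illustrative assumptions:

import multiprocessing

pool = multiprocessing.Pool(8)

with open(filename) as f:
    # imap yields results lazily instead of first building a full list of lines;
    # chunksize controls how many lines are handed to a worker at a time
    for result in pool.imap(do_stuff, f, chunksize=1000):
        collect(result)  # hypothetical: combine the per-line results, e.g. sum the vectors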

Constantinius
  • The use of multiprocessing here largely depends on whether the problem is I/O or CPU bound. – James Mills May 20 '15 at 08:49
  • Yes, that's why I mentioned it in my answer. – Constantinius May 20 '15 at 08:50
  • Thanks. But unfortunately my lines aren't of the same length. – Roy May 20 '15 at 08:52
  • @Roy multiprocessing might still be of great help to you. – Tim May 20 '15 at 08:53
  • @Roy how does that contradict the usefulness of this answer? – Łukasz Rogalski May 20 '15 at 08:54
  • @Constantinius does `Pool()` acquire a lock on the file pointer? – Łukasz Rogalski May 20 '15 at 08:55
  • `for line in f:` will not necessarily read exactly one line into memory at a time. – TigerhawkT3 May 20 '15 at 08:56
  • @ŁukaszR.: it doesn't. `Pool.map` takes any form of `iterator` and a `file` happens to be one. So the main process (the one that created the pool) reads the lines, sends them to the sub-processes and collects the results. – Constantinius May 20 '15 at 09:54
  • @TigerhawkT3: Sure, it does: https://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects – Constantinius May 20 '15 at 09:56
  • @Constantinius, no it doesn't. The page you linked doesn't say anything about `for line in f:` reading exactly one line into memory at a time. It just says it's memory-efficient. All it can do is read chunks of bytes and return one line at a time. See [here] for more. – TigerhawkT3 May 20 '15 at 16:51
  • @TigerhawkT3: Your link seems to be missing. It might well be that it reads more than exactly one line. My guess is that the rest will be buffered. Since OP seems to want to read the whole file anyway, that hardly makes a difference, right? – Constantinius May 21 '15 at 06:33
  • Sorry about that. Link [here](http://stackoverflow.com/questions/29133556/does-for-line-in-file-read-entire-file). – TigerhawkT3 May 21 '15 at 07:49