11

I want to read a huge text file that contains a list of lists of integers. Now I'm doing the following:

G = []
with open("test.txt", 'r') as f:
    for line in f:
        G.append(list(map(int,line.split())))

However, it takes about 17 seconds (measured via timeit). Is there any way to reduce this time? Maybe there is a way not to use map.

Steven Rumbalski
Sergey Ivanov

6 Answers

25

numpy has the functions loadtxt and genfromtxt, but neither is particularly fast. One of the fastest text readers available in a widely distributed library is the read_csv function in pandas (http://pandas.pydata.org/). On my computer, reading 5 million lines containing two integers per line takes about 46 seconds with numpy.loadtxt, 26 seconds with numpy.genfromtxt, and a little over 1 second with pandas.read_csv.

Here's the session showing the result. (This is on Linux, Ubuntu 12.04 64 bit. You can't see it here, but after each reading of the file, the disk cache was cleared by running sync; echo 3 > /proc/sys/vm/drop_caches in a separate shell.)

In [1]: import pandas as pd

In [2]: %timeit -n1 -r1 loadtxt('junk.dat')
1 loops, best of 1: 46.4 s per loop

In [3]: %timeit -n1 -r1 genfromtxt('junk.dat')
1 loops, best of 1: 26 s per loop

In [4]: %timeit -n1 -r1 pd.read_csv('junk.dat', sep=' ', header=None)
1 loops, best of 1: 1.12 s per loop
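
If the end result needs to be a plain list of lists of Python ints (as in the question), the DataFrame is easy to convert afterwards. A minimal sketch, assuming a whitespace-separated file of integer columns:

import pandas as pd

df = pd.read_csv('junk.dat', sep=' ', header=None, dtype=int)
G = df.values.tolist()  # list of lists of plain Python ints
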
Warren Weckesser
  • +1, didn't see your answer while I was preparing mine. I just benchmarked the OP's version too, which takes about 16 s on my machine. I also noted that `loadtxt` is slow. I'm not sure why; I would expect it to be faster (and it should be faster than `genfromtxt`). Do you also use numpy 1.7? – bmu Feb 26 '13 at 20:03
  • @bmu: Yes, I used numpy 1.7. – Warren Weckesser Feb 26 '13 at 20:05
  • 2
    I opened a numpy issue: https://github.com/numpy/numpy/issues/3019. I cannot imagine that it is normal for `loadtxt` to be so slow. – bmu Feb 26 '13 at 20:16
  • @BranAlgue: Christoph Gohlke provides a tremendous service to the Python community by preparing and hosting binary builds of NumPy (and many other packages) for Windows. Take a look: http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy – Warren Weckesser Feb 26 '13 at 20:38
  • Hey, @WarrenWeckesser it helped. It read the file but took about a minute to do so, and the numbers come out as floats, which is not right. Unfortunately, there is no pandas for Python 3.3. Maybe reinstall on 3.2? – Sergey Ivanov Feb 27 '13 at 05:13
  • It seems not so obvious how to run pandas even on Python 3.2. After installing numpy and pandas for 3.2 I got errors: `File "C:\Program Files\Python32\lib\site-packages\pandas\__init__.py", line 6, in from . import hashtable, tslib, lib File "tslib.pyx", line 25, in init pandas.tslib (pandas\tslib.c:38141) ImportError: No module named dateutil.parser` – Sergey Ivanov Feb 27 '13 at 08:39
  • A new question about installing pandas, asked either here or on the pydata mailing list (https://groups.google.com/forum/?fromgroups#!forum/pydata), is probably your best bet. – Warren Weckesser Feb 27 '13 at 12:32
5

pandas, which is based on numpy, has a C-based file parser which is very fast:

# generate some integer data (5 M rows, two cols) and write it to file
In [24]: data = np.random.randint(1000, size=(5 * 10**6, 2))

In [25]: np.savetxt('testfile.txt', data, delimiter=' ', fmt='%d')

# your way
In [26]: def your_way(filename):
   ...:     G = []
   ...:     with open(filename, 'r') as f:
   ...:         for line in f:
   ...:             G.append(list(map(int, line.split())))
   ...:     return G        
   ...: 

In [26]: %timeit your_way('testfile.txt')
1 loops, best of 3: 16.2 s per loop

In [27]: %timeit pd.read_csv('testfile.txt', delimiter=' ', dtype=int)
1 loops, best of 3: 1.57 s per loop

So pandas.read_csv takes about one and a half seconds to read your data and is about 10 times faster than your method.

bmu
1

As a general rule of thumb (for just about any language), using read() to read in the entire file is going to be quicker than reading one line at a time. If you're not constrained by memory, read the whole file at once, split the data on newlines, and then iterate over the list of lines.
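
A rough sketch of that approach, using the same parsing as in the question but with a single read() call (assuming the file fits in memory):

with open("test.txt") as f:
    data = f.read()

G = [[int(x) for x in line.split()] for line in data.splitlines()]
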

Bryan Oakley
0

The easiest speedup would be to go for PyPy (http://pypy.org/).

The next step would be to not read the whole file into memory at all (if possible) and instead process it as a stream.
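
A minimal sketch of the streaming idea: a generator yields one parsed row at a time, so the full list never has to exist in memory (assuming the downstream processing can consume rows one by one):

def rows(filename):
    with open(filename) as f:
        for line in f:
            yield [int(x) for x in line.split()]

total = 0
for row in rows("test.txt"):  # example consumer: just sum everything
    total += sum(row)
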

Udo Klein
0

List comprehensions are often faster.

G = [[int(item) for item in line.split()] for line in f]

Beyond that, try PyPy, Cython, and numpy.

forivall
  • `G = [map(int, line.split()) for line in f]` is faster. – Steven Rumbalski Feb 26 '13 at 18:41
  • @StevenRumbalski This line produces map objects: `[<map object at 0x...>, <map object at 0x...>, ...]`. But @forivall's line works. – Sergey Ivanov Feb 26 '13 at 18:54
  • @BranAlgue. Aha! You are using Python 3. So change that to `G = [list(map(int, line.split())) for line in f]`. It is still faster than the nested list comprehension. – Steven Rumbalski Feb 26 '13 at 19:01
  • It's strange, @StevenRumbalski, because your line runs more slowly: `stmt = ''' with open("SCC.txt", 'r') as f: G = [list(map(int, line.split())) for line in f] ''' test1 = timeit.timeit(stmt, number = 1) stmt = ''' with open("SCC.txt", 'r') as f: G = [[int(item) for item in line.split()] for line in f] ''' test2 = timeit.timeit(stmt, number = 1)`. `>>> test1 16.291107619840908 >>> test2 11.386214308615607` – Sergey Ivanov Feb 26 '13 at 19:10
  • It's possible that Python 3 improved the performance of list comprehensions. Old question outlining this: http://stackoverflow.com/questions/1247486/python-list-comprehension-vs-map – forivall Feb 26 '13 at 19:17
  • @forivall: Both of my comments were preceded by performance testing, so I am surprised that Bran's tests turned out otherwise. – Steven Rumbalski Feb 26 '13 at 20:05
  • @BranAlgue. I see now that there are two items per row. In Python 3, the solution using `map` reaches parity with the nested list comp at six items per line and outperforms it at seven items per line. – Steven Rumbalski Feb 26 '13 at 20:15
  • @StevenRumbalski Where did you find this? – Sergey Ivanov Feb 27 '13 at 04:48
  • @Bran Algue: From my own tests. I was curious as to why our tests had differing results so I pinned down the cause. – Steven Rumbalski Feb 27 '13 at 15:01
0

You might also try bringing the data into a database via bulk insert, then processing your records with set operations. Depending on what you have to do, that may be faster, since bulk-insert tooling is optimized for this type of task.
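
A minimal sketch of that idea using sqlite3 from the standard library (the database file, table name, and the assumption of two integers per line are made up for illustration):

import sqlite3

def rows(filename):
    # yield one (int, int) tuple per whitespace-separated line
    with open(filename) as f:
        for line in f:
            yield tuple(int(x) for x in line.split())

con = sqlite3.connect("data.db")
con.execute("CREATE TABLE IF NOT EXISTS pairs (a INTEGER, b INTEGER)")
con.executemany("INSERT INTO pairs VALUES (?, ?)", rows("test.txt"))
con.commit()

# set-style processing can then be pushed into SQL, e.g.
# con.execute("SELECT a, COUNT(*) FROM pairs GROUP BY a")
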

Christopher Mahan