
I want to create a multi-index DataFrame by reading a text file. Is it faster to create the multi-index first and then allocate data to it from the text file using df.loc[[],[]], or to concatenate rows to the DataFrame and set its index at the end? Or is it faster to store the data in a list or dict as it's read from the file, and then create a DataFrame from that? Is there a more pythonic or faster option?

Example text file:

A = 1
 B = 1
  C data
  0 1
  1 2
A = 1
 B = 2
  C data
  1 3
  2 4
A = 2
 B = 1
  C data
  0 5
  2 6

Output DataFrame:

A B C data
1 1 0 1
    1 2
1 2 1 3
    2 4
2 1 0 5
    2 6

Update Jan 18: This is linked to "How to parse complex text files using Python?". I also wrote a blog article explaining to beginners how to parse complex files.

bluprince13

2 Answers


Element-by-element lookups in pandas are an expensive operation, and so is aligning by index. I would read everything into arrays, create a DataFrame of the values, and then set the hierarchical index directly. This is usually much faster if you can avoid append or per-element lookups.

Here is a sample result, assuming you have a 2-D dataset array with everything lumped in:

In [106]: dataset
Out[106]: 
array([[1, 1, 0, 1],
       [1, 1, 1, 2],
       [1, 2, 1, 3],
       [1, 2, 2, 4],
       [2, 1, 0, 5],
       [2, 1, 2, 6]])

In [107]: pd.DataFrame(dataset,columns=['A','B','C', 'data']).set_index(['A', 'B', 'C'])
     ...: 
Out[107]: 
       data
A B C      
1 1 0     1
    1     2
  2 1     3
    2     4
2 1 0     5
    2     6

In [108]: data_values = dataset[:, 3] 
     ...: data_index = pd.MultiIndex.from_arrays( dataset[:,:3].T, names=list('ABC'))
     ...: pd.DataFrame(data_values, columns=['data'], index=data_index)
     ...: 
Out[108]: 
       data
A B C      
1 1 0     1
    1     2
  2 1     3
    2     4
2 1 0     5
    2     6

In [109]: %timeit pd.DataFrame(dataset,columns=['A','B','C', 'data']).set_index(['A', 'B', 'C'])
1000 loops, best of 3: 1.75 ms per loop

In [110]: %%timeit
     ...: data_values = dataset[:, 3] 
     ...: data_index = pd.MultiIndex.from_arrays( dataset[:,:3].T, names=list('ABC'))
     ...: pd.DataFrame(data_values, columns=['data'], index=data_index)
     ...: 
1000 loops, best of 3: 642 µs per loop
clocker

Parsing the text file will be the bulk of your processing overhead.

If speed is the main concern I'd suggest using pickle or shelve to store the DataFrame object in a binary file ready for use.

If you need to use the text file for any reason, a separate module could be written for translating between formats.
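A minimal sketch of that caching idea using pandas' own `to_pickle`/`read_pickle` (the file name `parsed.pkl` is just an illustration):

```python
import pandas as pd

# Build the DataFrame once (e.g. from the parsed text file), then cache it
# in a binary file so later runs can skip the text parsing entirely.
df = pd.DataFrame(
    [[1, 1, 0, 1], [2, 1, 2, 6]],
    columns=['A', 'B', 'C', 'data'],
).set_index(['A', 'B', 'C'])

df.to_pickle('parsed.pkl')             # one-time write after parsing
cached = pd.read_pickle('parsed.pkl')  # fast binary reload on later runs
```

Subsequent runs then load the DataFrame directly from the binary file instead of re-parsing the text.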