
I want to create a multi-index DataFrame by reading a text file. Is it faster to create the multi-index first and then allocate data to it from the text file using df.loc[[],[]], or to concatenate rows to the DataFrame and set its index at the end? Or is it faster to store the data in a list or dict as it's read from the file, and then create a DataFrame from that? Is there a more pythonic or faster option?

Example text file:

A = 1
 B = 1
  C data
  0 1
  1 2
A = 1
 B = 2
  C data
  1 3
  2 4
A = 2
 B = 1
  C data
  0 5
  2 6

Output DataFrame:

A B C data
1 1 0 1
    1 2
1 2 1 3
    2 4
2 1 0 5
    2 6

Update Jan 18: This is linked to "How to parse complex text files using Python?". I also wrote a blog article explaining to beginners how to parse complex files.

bluprince13

2 Answers


Element-by-element lookups in pandas are an expensive operation, and so is aligning by index. I would read everything into arrays, create a DataFrame of the values, and then set the hierarchical index directly. This is usually much faster if you can avoid append or per-element lookups.

Here is a sample result, assuming you have a 2-D dataset array with everything lumped in:

In [106]: dataset
Out[106]: 
array([[1, 1, 0, 1],
       [1, 1, 1, 2],
       [1, 2, 1, 3],
       [1, 2, 2, 4],
       [2, 1, 0, 5],
       [2, 1, 2, 6]])

In [107]: pd.DataFrame(dataset,columns=['A','B','C', 'data']).set_index(['A', 'B', 'C'])
     ...: 
Out[107]: 
       data
A B C      
1 1 0     1
    1     2
  2 1     3
    2     4
2 1 0     5
    2     6

In [108]: data_values = dataset[:, 3] 
     ...: data_index = pd.MultiIndex.from_arrays( dataset[:,:3].T, names=list('ABC'))
     ...: pd.DataFrame(data_values, columns=['data'], index=data_index)
     ...: 
Out[108]: 
       data
A B C      
1 1 0     1
    1     2
  2 1     3
    2     4
2 1 0     5
    2     6

In [109]: %timeit pd.DataFrame(dataset,columns=['A','B','C', 'data']).set_index(['A', 'B', 'C'])
1000 loops, best of 3: 1.75 ms per loop

In [110]: %%timeit
     ...: data_values = dataset[:, 3] 
     ...: data_index = pd.MultiIndex.from_arrays( dataset[:,:3].T, names=list('ABC'))
     ...: pd.DataFrame(data_values, columns=['data'], index=data_index)
     ...: 
1000 loops, best of 3: 642 µs per loop
clocker

Parsing the text file will be the bulk of your processing overhead.

If speed is the main concern I'd suggest using pickle or shelve to store the DataFrame object in a binary file ready for use.

If you need to use the text file for any reason, a separate module could be written for translating between formats.
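A minimal sketch of that caching idea using pandas' own `to_pickle`/`read_pickle` (the file name `parsed.pkl` is just an illustration):

```python
import pandas as pd

# Build the DataFrame once (e.g. from the parsed text file), then cache it
# in a binary file so later runs can skip the text parsing entirely.
df = pd.DataFrame(
    [[1, 1, 0, 1], [2, 1, 2, 6]],
    columns=['A', 'B', 'C', 'data'],
).set_index(['A', 'B', 'C'])

df.to_pickle('parsed.pkl')             # one-time write after parsing
cached = pd.read_pickle('parsed.pkl')  # fast binary reload on later runs
```

Subsequent runs then load the DataFrame directly from the binary file instead of re-parsing the text.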