
I'm trying to find a more optimized way to add data to a pandas dataframe. I already saw other related questions where people suggested first creating lists and then adding the data to pandas (which I have now implemented).

In my current setup I loop through different lists (in the example these are librarynr, books, and sections) and then compute various variables (in the example nrofletters, excitment, and review, which here are mostly preset rather than computed), which I append to lists; at the end I add the lists to the dataframe.

Does anyone know of further optimizations to improve performance on this example code?

Important note: In my final code, the variables are not the same for all rows, but are computed depending on the loop iterators (see the example calculation of excitment).

Example code:

import pandas as pd
import time


books = ['LordOfTheRings','HarryPotter','LoveStory','RandomBook']
sections = ['Introduction','MainPart','Plottwist','SurprisingEnd']

librarynr = list(range(30000))
nrofletters = 3000
excitment = True
review = 'positive'


start_time = time.time()
summarydf = pd.DataFrame()

indexlist = []
nrofletterlist = []
excitmentlist = []
reviewlist = []

for library in librarynr:
    for book in books:
        for section in sections:
            indexlist.append(str(library)+book+section)
            nrofletterlist.append(nrofletters)
            
            #example of variable calculation depending on iterators of loop:
            if (library % 2 == 0) or (book[1] == 'L'):
                excitment = False
            else:
                excitment = True
                
            excitmentlist.append(excitment)
            reviewlist.append(review)
            
summarydf['index'] = indexlist
summarydf['nrofletters'] = nrofletterlist
summarydf['excitment'] = excitmentlist
summarydf['review'] = reviewlist
listtime = time.time() - start_time
print(listtime)
crisprog

2 Answers


Append is very slow; you should produce your DataFrame in one shot.

IIUC, you want the product of all possibilities. You can use itertools.product.

Here is an example with only librarynr = 5. Your setup with librarynr = 300000 would produce 4.8 million rows.

from itertools import product
import pandas as pd

librarynr = 5

# books and sections as defined in the question
idx = map(''.join, product(map(str, range(librarynr)), books, sections))

df = pd.DataFrame([], index=idx)

df[['nrofletters', 'excitment', 'review']] = [3000, True, 'positive']

Output:

>>> print(df.reset_index())

                           index  nrofletters  excitment    review
0    0LordOfTheRingsIntroduction         3000       True  positive
1        0LordOfTheRingsMainPart         3000       True  positive
2       0LordOfTheRingsPlottwist         3000       True  positive
3   0LordOfTheRingsSurprisingEnd         3000       True  positive
4       0HarryPotterIntroduction         3000       True  positive
..                           ...          ...        ...       ...
75       4LoveStorySurprisingEnd         3000       True  positive
76       4RandomBookIntroduction         3000       True  positive
77           4RandomBookMainPart         3000       True  positive
78          4RandomBookPlottwist         3000       True  positive
79      4RandomBookSurprisingEnd         3000       True  positive

[80 rows x 4 columns]
mozway
  • Thanks for the suggestion, but in my real-world scenario the variables are computed on the fly and differ for every row. Therefore I can't use df[['nrofletters', 'excitment', 'review']] = [3000, True, 'positive'], because this only works if all values are the same for each row... I updated the code in the question to make this clearer. – crisprog Dec 25 '21 at 23:45
  • Then I'm afraid there isn't much to optimize. The loop is the bottleneck... especially with 4.8M rows... – mozway Dec 25 '21 at 23:47

CPython loops are very slow due to the interpreter (note that there are faster Python interpreters, like PyPy, which use a JIT compiler). You can use list comprehensions to speed the loops up significantly. Using itertools can help a bit more, as can converting library to a string in the outermost loop. However, the result is still not very fast relative to the work being performed.
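
For example, a comprehension-based version of the question's loop could look roughly like the following (a rough sketch only, reusing the books, sections and librarynr lists from the question):

from itertools import product

# Convert each library number to a string once, up front, instead of once
# per (book, section) combination.
libstrings = [str(library) for library in librarynr]

# itertools.product replaces the three nested loops; the concatenation runs
# inside a single list comprehension.
indexlist = [lib + book + section
             for lib, book, section in product(libstrings, books, sections)]

# Per-row computed column, same condition as in the question:
excitmentlist = [not (library % 2 == 0 or book[1] == 'L')
                 for library, book, section in product(librarynr, books, sections)]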

Another issue comes from the conversion of Python lists to Numpy arrays. Pandas uses Numpy internally, and Numpy converts each string reference to a fixed-size string stored in one big raw memory buffer (rather than an array of reference-counted objects). This means that every string is parsed and copied by Numpy, which is very expensive. The best solution is to write the Numpy array directly using vectorized functions if possible. If that is not possible, you can use Numba; note, however, that Numba barely supports string arrays so far. Another option is Cython. A direct assignment through Pandas can also be very fast for setting all the strings at once (since each string is then only parsed once by Numpy internally).
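
As an illustration, a vectorized sketch for the question's example could look like the following (again assuming the books, sections and librarynr lists from the question; the helper library/book/section columns are just one possible way to structure it):

import numpy as np
import pandas as pd

n_lib, n_book, n_sec = len(librarynr), len(books), len(sections)

# Build the repeated library/book/section columns with Numpy instead of a
# Python loop, then compute every column in a vectorized way.
lib_col = np.repeat(np.arange(n_lib), n_book * n_sec)
book_col = np.tile(np.repeat(books, n_sec), n_lib)
sec_col = np.tile(sections, n_lib * n_book)

df = pd.DataFrame({'library': lib_col, 'book': book_col, 'section': sec_col})
df['index'] = df['library'].astype(str) + df['book'] + df['section']
df['nrofletters'] = 3000
# Vectorized equivalent of the loop's condition on excitment:
df['excitment'] = ~((df['library'] % 2 == 0) | (df['book'].str[1] == 'L'))
df['review'] = 'positive'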

On my machine, about 2/3 of the time is spent in the loop and 1/3 in the Numpy string conversion (a minor part comes from some additional Pandas overheads).

Jérôme Richard
  • But if you had, say, 20 variables that are all calculated within the loop and added to separate lists, a list comprehension would have to be written for every variable, right? I would expect that to be slower, because it has to loop through everything once per variable. – crisprog Dec 26 '21 at 12:27
  • Yes. However, I am not sure it would be slower, because the internal loops used by list comprehensions are significantly faster and the cost of the appends is spread across the multiple implicit loops. Still, Numpy arrays are the way to go, since the lists cause a pretty big overhead due to the conversion anyway. This is why I emphasized Numpy, Numba and Cython (see the sketch below the comments). – Jérôme Richard Dec 26 '21 at 14:02
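
For illustration, a sketch of the single-loop idea discussed above (reusing the question's lists): each iteration produces one tuple per row, and zip(*rows) transposes the tuples into one sequence per column, so every column is still assigned to the DataFrame in one shot.

from itertools import product

# One pass over all combinations; zip(*rows) transposes the row tuples into
# one sequence per column.
rows = ((str(library) + book + section,                  # index
         3000,                                           # nrofletters
         not (library % 2 == 0 or book[1] == 'L'),        # excitment
         'positive')                                      # review
        for library, book, section in product(librarynr, books, sections))

indexlist, nrofletterlist, excitmentlist, reviewlist = map(list, zip(*rows))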