I'm trying to find a faster way to add data to a pandas dataframe. I have already seen related questions where people suggested first building lists and then adding the data to pandas in one step (which I have now implemented).
In my current setup I loop through several lists (in the example these are librarynr, books and sections) and compute various variables (in the example these are not computed but already set: nrofletters, excitment and review). I append these to lists and, at the end, add the lists to the dataframe.
Does anyone know of further optimizations to improve performance on this example code?
Important note: In my final code, the variables are not the same for all rows, but are computed depending on the iterators of the loops (see the example calculation of excitment).
Example code:
import pandas as pd
import time
books = ['LordOfTheRings','HarryPotter','LoveStory','RandomBook']
sections = ['Introduction','MainPart','Plottwist','SurprisingEnd']
librarynr = list(range(30000))
nrofletters = 3000
excitment = True
review = 'positive'
start_time = time.time()
summarydf = pd.DataFrame()
indexlist = []
nrofletterlist = []
excitmentlist = []
reviewlist = []
for library in librarynr:
    for book in books:
        for section in sections:
            indexlist.append(str(library)+book+section)
            nrofletterlist.append(nrofletters)
            #example of variable calculation depending on iterators of loop:
            if (library % 2 == 0) or (book[1] == 'L'):
                excitment = False
            else:
                excitment = True
            excitmentlist.append(excitment)
            reviewlist.append(review)
summarydf['index'] = indexlist
summarydf['nrofletters'] = nrofletterlist
summarydf['excitment'] = excitmentlist
summarydf['review'] = reviewlist
listtime = time.time() - start_time
print(listtime)
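
For comparison, here is a minimal sketch of one possible direction: replacing the nested loops with NumPy array operations and building the dataframe in a single constructor call. It assumes that the per-row calculations can be expressed as array operations on the loop variables, which may not hold for every calculation in the final code. The function name build_summary is hypothetical, and the sketch has not been benchmarked against the loop version.

```python
import numpy as np
import pandas as pd

books = ['LordOfTheRings', 'HarryPotter', 'LoveStory', 'RandomBook']
sections = ['Introduction', 'MainPart', 'Plottwist', 'SurprisingEnd']
librarynr = list(range(30000))
nrofletters = 3000
review = 'positive'


def build_summary(librarynr, books, sections, nrofletters, review):
    """Hypothetical helper: same rows and row order as the nested loops."""
    n_lib, n_book, n_sec = len(librarynr), len(books), len(sections)

    # Expand each loop variable to one value per row.
    # Outer loop varies slowest, inner loop varies fastest.
    lib_col = np.repeat(np.array(librarynr), n_book * n_sec)
    book_col = np.tile(np.repeat(np.array(books), n_sec), n_lib)
    sec_col = np.tile(np.array(sections), n_lib * n_book)

    # Per-book part of the condition, evaluated once per book
    # and then expanded to one value per row.
    book_flag = np.array([book[1] == 'L' for book in books])
    book_flag_col = np.tile(np.repeat(book_flag, n_sec), n_lib)

    # Same rule as the if/else in the loop, applied to whole arrays.
    excitment_col = ~((lib_col % 2 == 0) | book_flag_col)

    # Element-wise string concatenation for the index column.
    index_col = (pd.Series(lib_col).astype(str)
                 + pd.Series(book_col)
                 + pd.Series(sec_col))

    # Scalars (nrofletters, review) are broadcast to every row.
    return pd.DataFrame({
        'index': index_col,
        'nrofletters': nrofletters,
        'excitment': excitment_col,
        'review': review,
    })


summarydf = build_summary(librarynr, books, sections, nrofletters, review)
```

The row order relies on np.repeat for the slower-varying loops and np.tile for the faster-varying ones, so the result should line up with the loop version row by row. Calculations that cannot be written as array operations would still need a loop or a different approach.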