I am pulling a large amount of data. It arrives as a list of lists of objects.
Example: [[objectA, objectB], [objectC], [], [objectD], ...]
Each object has a lot of attributes, but for my DataFrame I only need name, value, timestamp, and description. I tried two things:
    for events in events_list:
        if len(events) > 0:
            for event in events:
                df = DataFrame([])
                df['timestamp'] = event.timestamp
                df['value'] = event.value
                df['name'] = event.name
                df['desc'] = event.desc
                final_df = final_df.append(df)
This takes around 15 minutes to complete.
I then changed the code to collect the rows in a Python list first:
    df_list = list()
    for events in events_list:
        if len(events) > 0:
            for event in events:
                df_list.append([event.timestamp, event.value, event.name, event.desc])
    final_df = pd.DataFrame(df_list, columns=['timestamp', 'value', 'name', 'desc'])
With this change I managed to reduce the time to approximately 10-11 minutes.
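To check whether the remaining time is spent in the Python loop or in building the DataFrame, the two steps can be timed separately. This is only a minimal sketch: the `Event` namedtuple and the sample data are hypothetical stand-ins for the real API objects, which I cannot reproduce here.

```python
import time
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for the API objects (the real ones have many more attributes).
Event = namedtuple("Event", ["timestamp", "value", "name", "desc"])

# Small sample with the same shape as the real data, including an empty group.
events_list = [
    [Event(i, float(i), "n", "d") for i in range(3)],
    [],
    [Event(9, 9.0, "m", "d")],
]

t0 = time.perf_counter()
# Step 1: collect the four attributes of every event into plain rows.
rows = [
    [event.timestamp, event.value, event.name, event.desc]
    for events in events_list
    for event in events
]
t1 = time.perf_counter()
# Step 2: build the DataFrame once from all rows.
final_df = pd.DataFrame(rows, columns=["timestamp", "value", "name", "desc"])
t2 = time.perf_counter()

print(f"collect rows:    {t1 - t0:.6f} s")
print(f"build DataFrame: {t2 - t1:.6f} s")
```

If the first number dominates on the real data, the attribute access on the API objects would be the bottleneck rather than pandas itself.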
I am still researching whether there is a faster way. Before switching to a Python list I tried a dictionary, but it was much slower than I expected. I am currently reading about pandas vectorization, which seems really fast, but I am not sure whether I can use it for my purpose. I know that Python loops are somewhat slow and there is not much I can do about that, so I am also trying to figure out a way to do these loops inside the DataFrame.
My question is: has anyone tackled this problem before, and is there a better way to do it?
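One variant I am considering is to flatten the nested lists with `itertools.chain.from_iterable` and pull the four attributes with `operator.attrgetter`, so that the explicit nested loops and the `len(events) > 0` check disappear. The sketch below assumes the objects expose plain attributes; the `Event` namedtuple and the sample values are hypothetical and only illustrate the shape of the data.

```python
from collections import namedtuple
from itertools import chain
from operator import attrgetter

import pandas as pd

# Hypothetical stand-in for the API objects (the real ones have many more attributes).
Event = namedtuple("Event", ["timestamp", "value", "name", "desc"])

events_list = [
    [Event(1, 10.0, "a", "first"), Event(2, 11.0, "a", "second")],
    [Event(3, 20.0, "b", "third")],
    [],  # no data for this name
    [Event(4, 30.0, "d", "fourth")],
]

columns = ["timestamp", "value", "name", "desc"]
# attrgetter with several names returns a tuple of those attributes per object.
get_fields = attrgetter(*columns)

# chain.from_iterable flattens the groups; empty groups simply contribute nothing.
rows = [get_fields(event) for event in chain.from_iterable(events_list)]
final_df = pd.DataFrame(rows, columns=columns)
```

This still iterates over every object in Python, so I do not know how much it would save on the real data; it mainly removes per-row overhead from the loop body.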
EDIT: There were questions about the data. It comes from an API and is structured this way because each inner list groups the objects that share a name. For example:
[[objectA, objectB (both have the same name)], [objectC], [EMPTY - there is no data for this name], [objectD], ...]
Because I cannot change the way I get the data, I have to work with this data structure.