
I'm writing code that collects data from an online source in a loop and manipulates it with pandas inside each iteration. My first idea was to initialise a dict outside the loop, append the new data to it, convert the dict to a DataFrame inside the loop, and run my operations on that. But it feels strange to build a dictionary instead of creating a DataFrame and appending to it in the loop. On the other hand, as I understand it, pandas is not really "designed" for cell-by-cell updates (it works vector-wise). What would be the most efficient approach?

    import numpy as np
    import pandas as pd

    # Column lists that grow by one entry per iteration
    d = {'a': [], 'b': [], 'c': [], 'x': [], 'z': []}
    for i in range(100):
        d['a'].append(f'some info {i}')
        d['b'].append(f'more info {i}')
        d['c'].append(i)
        d['x'].append(i * 2)
        d['z'].append(np.nan)  # ???

        # Rebuild the DataFrame from the dict on every iteration
        df = pd.DataFrame(d)
        # Some function that does calculations on df cols and returns df with new cols
        df['z'] = 1
rafaelc
fffrost

1 Answer


Pandas is normally used for data manipulation and modelling, so adding data to the DataFrame on every pass through the loop can be inefficient. How much this matters depends heavily on the number of iterations: if there are very few compared to the final length of the DataFrame, you can of course do that. Otherwise, it seems best to collect all the data in the dictionary inside the loop and, once you are done collecting, convert it into a DataFrame for analysis and delete the dictionary.
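A minimal sketch of the collect-first approach, assuming all the data can be gathered before any analysis runs. The column names are taken from the question; the calculation for `z` is a made-up placeholder, not the asker's actual function.

```python
import pandas as pd

rows = []  # plain Python list; appending one dict per iteration is cheap
for i in range(100):
    rows.append({
        'a': f'some info {i}',
        'b': f'more info {i}',
        'c': i,
        'x': i * 2,
    })

# Build the DataFrame once, after the loop has finished collecting
df = pd.DataFrame(rows)

# Hypothetical vectorised calculation standing in for the real analysis
df['z'] = df['c'] + df['x']
```

A list of per-row dicts is used here instead of a dict of lists; both work with `pd.DataFrame`, and the row form avoids having to append a placeholder such as `np.nan` for columns that are only computed later.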

Parijat Bhatt
  • Well actually I still will need a dataframe inside the loop because I plan to run some analyses. It's just I wonder whether it's better to collect a growing dict and convert it to a df each time, or whether I should just append to a dataframe and ditch the dict. – fffrost Aug 09 '19 at 07:37
  • Why don't you collect the data once in the loop, convert it to a dataframe, and then run the loop again to perform the operations? – Parijat Bhatt Aug 09 '19 at 18:06
  • The data are timeseries, and I'm collecting and analysing in real-time. I need the df to do the analyses in each iteration. So there is no possibility to collect it just once as it is constantly growing. – fffrost Aug 09 '19 at 18:11
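For the real-time case described in these comments, one possible sketch is to keep the raw rows in a bounded buffer and rebuild a small DataFrame from it on each iteration. This assumes the per-iteration analysis only needs a recent window of the time series; `WINDOW`, the rolling mean, and the `latest` list are hypothetical stand-ins, not something the thread specifies.

```python
from collections import deque

import pandas as pd

WINDOW = 20  # hypothetical: number of most recent rows the analysis needs
buffer = deque(maxlen=WINDOW)  # old rows are dropped automatically when full
latest = []  # most recent analysis result from each iteration

for i in range(100):
    # Stand-in for fetching one new observation from the online source
    buffer.append({'c': i, 'x': i * 2})

    # Rebuild a DataFrame from at most WINDOW rows, so the cost per
    # iteration stays bounded instead of growing with the full history
    df = pd.DataFrame(list(buffer))

    # Hypothetical analysis: rolling mean of 'x' over the last 5 rows
    df['z'] = df['x'].rolling(5, min_periods=1).mean()
    latest.append(df['z'].iloc[-1])
```

If the analysis really needs the full history on every iteration, the same pattern works with a plain list instead of the `deque`, but the cost of rebuilding the DataFrame then grows with the amount of data collected.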