
I'm parsing data in a loop and once it's parsed and structured I would like to then add it to a data frame.

The end format of the data frame I would like is something like the following:

df:

id   2018-01 2018-02 2018-03  
234     2       1       3
345     4       5       1
534     5       3       4
234     2       2       3

When I iterate through the data in the loop I have a dictionary with the id, the month and the value for the month, for example:

{'id':234,'2018-01':2}
{'id':534,'2018-01':5}
{'id':534,'2018-03':4}
           .
           .
           .

What is the best way to take an empty data frame and add rows and columns with their values to it in a loop?

Essentially as I iterate it would look something like this

df:

id   2018-01   
234     2       

then

df:

id   2018-01   
234     2  
534     5

then

df:

id   2018-01   2018-03
234     2  
534     5         4

and so on...

Mustard Tiger

3 Answers


IIUC, you need to convert each single dict to a DataFrame first, then append it. In case there are duplicate 'id' values, we use groupby to keep the first value:

import pandas as pd

df = pd.DataFrame()
l = [{'id': 234, '2018-01': 2},
     {'id': 534, '2018-01': 5},
     {'id': 534, '2018-03': 4}]

for x in l:
    df = df.append(pd.Series(x).to_frame().T.set_index('id')).groupby(level=0).first()
    print(df)

     2018-01
id          
234        2
     2018-01
id          
234        2
534        5
     2018-01  2018-03
id                   
234      2.0      NaN
534      5.0      4.0
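Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same per-iteration update can be written with pd.concat instead (a sketch using the same input list):

```python
import pandas as pd

l = [{'id': 234, '2018-01': 2},
     {'id': 534, '2018-01': 5},
     {'id': 534, '2018-03': 4}]

df = pd.DataFrame()
for x in l:
    row = pd.DataFrame([x]).set_index('id')
    # pd.concat replaces the removed df.append; groupby keeps the
    # first non-null value when the same id appears more than once
    df = pd.concat([df, row]).groupby(level=0).first()

print(df)
```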
BENY

It is not advisable to generate a new data frame at each iteration and append it; this is quite expensive. If your data is not too big and fits into memory, you can build a list of dictionaries first, and then pandas allows you to simply do:

df = pd.DataFrame(your_list_of_dicts)
df = df.set_index('id')
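Applied to the dicts from the question, this looks like the following (a sketch; the groupby step merges rows that share an id, since dicts for the same id arrive separately):

```python
import pandas as pd

records = [{'id': 234, '2018-01': 2},
           {'id': 534, '2018-01': 5},
           {'id': 534, '2018-03': 4}]

# one DataFrame built in a single call; missing keys become NaN
df = pd.DataFrame(records).set_index('id')

# merge rows that share the same id, keeping the first non-null value
df = df.groupby(level=0).first()
print(df)
```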

If making a list is too expensive (because you'd like to save memory for the data frame), consider using a generator instead of a list. The basic anatomy of a generator function is this:

def datagen(your_input):
    for item in your_input:
        # your code to build a dict called record
        yield record

The generator object data = datagen(your_input) does not store the dicts but yields one dict at each iteration; it generates items on demand. When you do pd.DataFrame(data), pandas will consume the generator and build the data frame from it. Generators can be chained into data pipelines (like pipes in UNIX) and are very powerful for big-data workflows. Be aware, however, that a generator object can be consumed only once: if you run pd.DataFrame(data) again, you will get an empty data frame.
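A minimal runnable sketch of the generator approach (the parsing step is simulated with a list of tuples, since the question doesn't show the actual parsing code):

```python
import pandas as pd

# stand-in for the data being parsed in the loop
raw = [(234, '2018-01', 2), (534, '2018-01', 5), (534, '2018-03', 4)]

def datagen(rows):
    for id_, month, value in rows:
        # build one dict per parsed record and hand it to the consumer
        yield {'id': id_, month: value}

# pandas streams the generator once and builds the frame
df = pd.DataFrame(datagen(raw)).set_index('id')
print(df)
```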

okartal
  • Can you give more info on why adding data to an existing DF in a loop is "expensive"? As you mentioned, for *memory* the in-place adding of rows would be more efficient; it would be useful to know how time-efficient it is. I understand that time per iteration could be a concern for millions of rows, but perhaps for 1000's it is a non-issue. – Demis Mar 22 '21 at 16:28

The easiest way I've found in Pandas (although not intuitive) to iteratively append new data rows to a dataframe is using df.loc[ ] to reference the last (nonexistent) row, with len(df) as the index:

df.loc[ len(df) ] = [new, row, of, data]

This will "append" the new data row to the end of the dataframe in-place.

The above example is for an empty DataFrame with exactly 4 columns, such as:

df = pandas.DataFrame(columns=["col1", "col2", "col3", "col4"])

df.loc[ ] indexing can insert data at any row at all, whether or not it exists yet. It seems it will never give an IndexError, like a numpy.array or List would if you tried to assign to a nonexistent row. For a brand-new, empty DataFrame, len(df) returns 0, and thus references the first, blank row, and then increases by one each time you add a row.
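A small runnable sketch of this pattern (column names taken from the example above):

```python
import pandas as pd

df = pd.DataFrame(columns=["col1", "col2", "col3", "col4"])

# each assignment writes to the row index equal to the current length,
# growing the frame by one row in place
df.loc[len(df)] = [1, 2, 3, 4]
df.loc[len(df)] = [5, 6, 7, 8]

print(df)
```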

–––––

I do not know the speed/memory efficiency cost of this method, but it works great for my modest datasets (a few thousand rows). At least from a memory perspective, I imagine that a large loop appending data to the target DataFrame directly would use less memory than generating an intermediate List of duplicate data first, then generating a DataFrame from that list. Time "efficiency" could be a different question entirely, one for the other SO gurus to comment on.

–––––

However, for the OP's specific case, where you also requested that values be combined into an existing identically-named column, you'd need some logic during your for loop.

Instead I would make the DataFrame "dumb" and just import the data as-is, repeating dates as they come, e.g. your post-loop DataFrame would look like this, with simple column names describing the raw data:

df:

id   date      data
234  2018-01   2
534  2018-01   5
534  2018-03   4

(has two entries for the same date).

Then I would use the DataFrame's databasing functions to organize this data how you like, probably using df.pivot() or some combination of df.drop_duplicates() and df.sort_values(). Will look into that more later.
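For the reshaping step, one option (not necessarily the only one) is DataFrame.pivot, which turns the long "dumb" table above into the wide format the question asks for. A sketch, assuming the column names id, date, and data from the table above:

```python
import pandas as pd

# the "dumb" long-format table built during the loop
long_df = pd.DataFrame({'id':   [234, 534, 534],
                        'date': ['2018-01', '2018-01', '2018-03'],
                        'data': [2, 5, 4]})

# one row per id, one column per date, NaN where no value exists
wide = long_df.pivot(index='id', columns='date', values='data')
print(wide)
```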

Demis
  • BTW, significant confusion appears to come from the fact that the `DataFrame.append()` function operates differently from `List.append()` - Lists will `append()` in-place, while DataFrame's `append()` returns a new object. Probably both have the same "cost", requiring the allocation of a whole new array's worth of memory, but `pandas` makes this obvious while `List` hides this from the user. – Demis Apr 08 '21 at 17:00