The easiest way I've found in Pandas (although not intuitive) to iteratively append new data rows to a DataFrame is using df.loc[ ] to reference the last (nonexistent) row, with len(df) as the index:
df.loc[ len(df) ] = [new, row, of, data]
This will "append" the new data row to the end of the DataFrame in-place.
The above example is for an empty DataFrame with exactly 4 columns, such as:
df = pandas.DataFrame( columns=["col1", "col2", "col3", "col4"] )
df.loc[ ] indexing can insert data at any row label at all, whether or not it exists yet. It seems it will never raise an IndexError, as a numpy.array or list would if you tried to assign to a nonexistent row.
For a brand-new, empty DataFrame, len(df) returns 0, so it references the first (not yet existing) row; it then increases by one each time you add a row.
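Here is a minimal sketch of that loop, to show the row labels counting up from 0. The `incoming_rows` values are made up for illustration. Note that this relies on the DataFrame keeping its default 0, 1, 2, ... index; if rows have been dropped or the index is custom, len(df) may match an existing label and overwrite that row instead of appending.

```python
import pandas as pd

# Empty DataFrame with exactly 4 columns, as in the example above.
df = pd.DataFrame(columns=["col1", "col2", "col3", "col4"])

# Hypothetical input rows, invented for this illustration.
incoming_rows = [
    [1, "a", 0.5, True],
    [2, "b", 1.5, False],
    [3, "c", 2.5, True],
]

for row in incoming_rows:
    # len(df) is one past the last existing label (0, then 1, then 2),
    # so each assignment enlarges the DataFrame by one row.
    df.loc[len(df)] = row

print(df)
```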
–––––
I do not know the speed/memory efficiency cost of this method, but it works great for my modest datasets (a few thousand rows). At least from a memory perspective, I imagine that a large loop appending data to the target DataFrame directly would use less memory than generating an intermediate list of duplicate data first, then generating a DataFrame from that list. Time "efficiency" could be a different question entirely, one for the other SO gurus to comment on.
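For comparison, this is roughly what the intermediate-list alternative mentioned above looks like. It is only a sketch with invented row values, and I have not benchmarked it against the df.loc[ ] approach, so treat it as something to time yourself on your own data rather than a recommendation.

```python
import pandas as pd

columns = ["col1", "col2", "col3", "col4"]

# Collect plain Python lists first (values are invented for this sketch) ...
rows = []
for i in range(5):
    rows.append([i, i * 2, i * 3, i * 4])

# ... then build the DataFrame once, after the loop has finished.
df_from_list = pd.DataFrame(rows, columns=columns)

print(df_from_list)
```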
–––––
However, for the OP's specific case, where you also requested to combine the columns if the data is for an existing identically-named column, you'd need some logic during your for loop.
Instead, I would make the DataFrame "dumb" and just import the data as-is, repeating dates as they come, e.g. your post-loop DataFrame would look like this, with simple column names describing the raw data:
df:
id     date       data
234    2018-01    2
534    2018-01    5
535    2018-03    4
(has two entries for the same date).
Then I would use the DataFrame's databasing functions to organize this data how you like, probably using some combination of df.groupby() or df.drop_duplicates() and df.sort_values(). Will look into that more later.
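As a rough sketch of that second step, here is one way to regroup the "dumb" table by date after the loop. I am assuming that "combining" means collecting all data values that share a date into one list; if the OP wants a sum or a wide table instead, the aggregation would need to change.

```python
import pandas as pd

# The "dumb" post-loop DataFrame from the example above.
df = pd.DataFrame(
    {
        "id": [234, 534, 535],
        "date": ["2018-01", "2018-01", "2018-03"],
        "data": [2, 5, 4],
    }
)

# Group rows that share a date and collect their data values into a list
# (an assumed meaning of "combine"), then order the result by date.
by_date = df.groupby("date")["data"].apply(list).sort_index()

print(by_date)
```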