0

I continue to use pandas.dataframe.append, however appending in the for-loop continues to get written over and information is duplicated. The out put of the for-loop should look exactly like the original dataframe (I am not including functions to eliminate any complexity). Any help would be appreciated.

import pandas as pd
df = pd.DataFrame({
    'date': ['2019-01-01','2019-01-01','2019-01-01',
             '2019-02-01','2019-02-01','2019-02-01',
             '2019-03-01','2019-03-01','2019-03-01',],
    'Asset': ['Asset A', 'Asset A', 'Asset A', 'Asset B', 'Asset B', 'Asset B',
             'Asset C', 'Asset C', 'Asset C'],
    'Monthly Value': [2100, 8100, 1400, 1400, 3100, 1600, 2400, 2100, 2100]
})
print(df.sort_values(by=['Asset']))

         date    Asset  Monthly Value
0  2019-01-01  Asset A           2100
1  2019-01-01  Asset A           8100
2  2019-01-01  Asset A           1400
3  2019-02-01  Asset B           1400
4  2019-02-01  Asset B           3100
5  2019-02-01  Asset B           1600
6  2019-03-01  Asset C           2400
7  2019-03-01  Asset C           2100
8  2019-03-01  Asset C           2100

This for loop creates multiple appendages to the df and duplicates rows

assetlist = list(df['Asset'].unique())

for asset in assetlist:
    df_subset = df[df['Asset'] == asset]
    dfcopy = df_subset.copy()
    newdf = newdf.append(dfcopy)
print(newdf)

This output is incorrect, it should look exactly like the original dataframe.

date    Asset   Monthly Value
6   2019-03-01  Asset C 2400
7   2019-03-01  Asset C 2100
8   2019-03-01  Asset C 2100
6   2019-03-01  Asset C 2400
7   2019-03-01  Asset C 2100
8   2019-03-01  Asset C 2100
0   2019-01-01  Asset A 2100
1   2019-01-01  Asset A 8100
2   2019-01-01  Asset A 1400
3   2019-02-01  Asset B 1400
4   2019-02-01  Asset B 3100
5   2019-02-01  Asset B 1600
6   2019-03-01  Asset C 2400
7   2019-03-01  Asset C 2100
8   2019-03-01  Asset C 2100
0   2019-01-01  Asset A 2100
1   2019-01-01  Asset A 8100
2   2019-01-01  Asset A 1400
3   2019-02-01  Asset B 1400
4   2019-02-01  Asset B 3100
5   2019-02-01  Asset B 1600
6   2019-03-01  Asset C 2400
7   2019-03-01  Asset C 2100
8   2019-03-01  Asset C 2100
0   2019-01-01  Asset A 2100
1   2019-01-01  Asset A 8100
2   2019-01-01  Asset A 1400
3   2019-02-01  Asset B 1400
4   2019-02-01  Asset B 3100
5   2019-02-01  Asset B 1600
6   2019-03-01  Asset C 2400
7   2019-03-01  Asset C 2100
8   2019-03-01  Asset C 2100
0   2019-01-01  Asset A 2100
1   2019-01-01  Asset A 8100
2   2019-01-01  Asset A 1400
3   2019-02-01  Asset B 1400
4   2019-02-01  Asset B 3100
5   2019-02-01  Asset B 1600
6   2019-03-01  Asset C 2400
7   2019-03-01  Asset C 2100
8   2019-03-01  Asset C 2100
Starbucks
  • 1,448
  • 3
  • 21
  • 49
  • Perhaps you meant to append to `newdf` instead of `dfcopy`? – 0x5453 Jan 15 '20 at 21:41
  • That was a type-o. Edited. Thank you, but still doesnt work. – Starbucks Jan 15 '20 at 21:44
  • I'm a bit confused what you are trying to accomplish... You are iterating over "Asset"s, filtering down to distinct groups, and then recombining those groups into a single DataFrame... won't your desired result be exactly the same as `df`, just in a different order? Can you give an example of the output you expect? – 0x5453 Jan 15 '20 at 21:46
  • Yes, it should look exactly like the same. I have several functions that I took out to save time. – Starbucks Jan 15 '20 at 21:47
  • 1
    It might be easier (and faster) to throw all of your `dfcopy`s into a list and then use [`pd.concat`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) to combine them at the end. – 0x5453 Jan 15 '20 at 21:49
  • Can you provide an example? – Starbucks Jan 15 '20 at 21:50

1 Answers1

2

You are missing one line I think:

assetlist = list(df['Asset'].unique())
newdf = pd.DataFrame()   # <-- define it as a data frame
for asset in assetlist:
    df_subset = df[df['Asset'] == asset]
    dfcopy = df_subset.copy()
    newdf = newdf.append(dfcopy)
print(newdf)

         date    Asset  Monthly Value
0  2019-01-01  Asset A           2100
1  2019-01-01  Asset A           8100
2  2019-01-01  Asset A           1400
3  2019-02-01  Asset B           1400
4  2019-02-01  Asset B           3100
5  2019-02-01  Asset B           1600
6  2019-03-01  Asset C           2400
7  2019-03-01  Asset C           2100
8  2019-03-01  Asset C           2100

However, an easier way to do this is:

newdf = pd.concat([df.query("Asset == @asset") for asset in assetlist])
YOLO
  • 20,181
  • 5
  • 20
  • 40
  • 1
    Not only easier, but entirely more efficient. `pd.concat`+ `list.append` doesn't lead to the [quadratic copying](https://stackoverflow.com/a/37009561/4333359) of `DataFrame.append` in a loop – ALollz Jan 15 '20 at 21:58