Unnecessary duplication is created while creating new dataframe that takes values from another by iterating over column values

Question

I am trying to add values taken from one dataframe column by iterating over unique values (contract numbers). For smaller numbers of iteration, the script works perfectly. However, iterating over 1000 unique values, it creates duplicate values in the resulting dataframe, which in turn slows the processing speed and taking an unnecessarily long time for a processing. How should I make it more efficient?

https://i.stack.imgur.com/AxcqY.jpg - original dataframe

https://i.stack.imgur.com/Qtu3x.jpg - unnecessary duplicate dataframes in new dataframe

https://imgur.com/3i5gMoJ- unnecessary duplicate dataframes in new dataframe

import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame([["AB1111",'2018-08-15 00:00:00','164','123','123'],
                   ["AB1111",'2018-08-15 00:03:00','564','453','126'],
                   ["AB1111",'2018-08-15 00:10:00','364','1231','1223'],
                   ["AB1111",'2018-08-15 00:01:00','564','575','1523'],
                   ["CD1111",'2018-08-16 00:12:00','514','341','1213'],
                   ["CD1111",'2018-08-15 00:02:00','564','1234','123'],
                   ["CD1111",'2018-08-16 00:05:00','564','341','124'],
                   ["CD1111",'2018-08-16 00:03:00','64','341','123'],
                   ["EF1111",'2018-08-15 00:00:00','534','341','121'],
                   ["EF1111",'2018-08-17 00:01:00','564','341','163'],
                   ["EF1111",'2018-08-15 00:09:00','524','341','129']],
                   columns = ['contract', 'datetime',
                              'real_cons','solar_gen','battery_charge'])


# converting datetime column datatype to "datetime"
df['datetime'] = pd.to_datetime(df['datetime']) 

#aggregation dataframe (new dataframe)
df_agg1 = pd.DataFrame()

for contract in df['contract'].unique()[:1500]:
    print(contract)
    df_contract = df.copy()[df['contract']==contract]    # selecting each full dataframe from the main DF
    df_contract.set_index('datetime', inplace=True)      # set "datetime" column as an index
    df_contract.sort_index(inplace=True)                 # sort index
    df_contract = df_contract.loc['2018-8-15']           # select timeframe       
    # creating GB61074_cons column, which will be added to df_agg, from df_contract 'real_cons' column
    df_contract[f'{contract}_con'] = df_contract['real_cons']   

    if df_agg1.empty:
        df_agg1 = df_contract[[f'{contract}_con']]        # first column 
    else:
        df_agg1 = df_agg1.join(df_contract[f'{contract}_con'])     # subsequent columns 

df_agg1

How to create the new dataframe without creating these unnecessary duplicates? What is causing them to be created?

https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — Mahendra Singh, Nov 07 '19 at 07:10
I can´t see any duplicates! Can you specify what you mean by duplicates? — mgruber, Nov 07 '19 at 07:57
@mgruber, hi, if you refer to the second and third images, you will see the duplicates created in the new dataframe. — Amarbold Altangerel, Nov 07 '19 at 08:09
@AmarboldAltangerel do you want ```contract``` to be the new column headers or ```real_cons```? Because for me ```contract``` when I print out the result — mgruber, Nov 07 '19 at 08:12

score 1 · Accepted Answer · answered Nov 07 '19 at 08:37

This is a way without using the for loop to achieve the exact same result. For readability I used multiple lines to add explantion.

df = pd.DataFrame([["AB1111",'2018-08-15 00:00:00','164'],
                   ["AB1111",'2018-08-15 00:03:00','564'],
                   ["AB1111",'2018-08-15 00:10:00','364'],
                   ["AB1111",'2018-08-15 00:01:00','564'],
                   ["CD1111",'2018-08-16 00:12:00','514'],
                   ["CD1111",'2018-08-15 00:02:00','564'],
                   ["CD1111",'2018-08-16 00:05:00','564'],
                   ["CD1111",'2018-08-16 00:03:00','64'],
                   ["EF1111",'2018-08-15 00:00:00','534'],
                   ["EF1111",'2018-08-17 00:01:00','564'],
                   ["EF1111",'2018-08-15 00:09:00','524']],
                   columns = ['contract', 'datetime','real_cons'])


df = df.set_index(['datetime','contract']).unstack().add_suffix('_con')
df = df.droplevel(level=0,axis=1) #drops the 'real_cons' index
df = pd.DataFrame(df.to_records()) #workaround the remove multiindex
df['datetime'] = pd.to_datetime(df['datetime']) #change datetime column to datetime datatype
df = df.set_index('datetime').loc['2018-08-15'] #filter data on date

print(df.reset_index())

Result:

             datetime AB1111_con CD1111_con EF1111_con
0 2018-08-15 00:00:00        164        NaN        534
1 2018-08-15 00:01:00        564        NaN        NaN
2 2018-08-15 00:02:00        NaN        564        NaN
3 2018-08-15 00:03:00        564        NaN        NaN
4 2018-08-15 00:09:00        NaN        NaN        524
5 2018-08-15 00:10:00        364        NaN        NaN

Sorry, I edited the dataframe a bit. Dataframe has multiple columns. How does it work when I try to select a particular "real_cons" column and add it to a new dataframe? When I try your code, it gets values from the columns it isn't supposed to get data from. — Amarbold Altangerel, Nov 12 '19 at 05:13
Which columns should appear as a header? Which column should be the value? — mgruber, Nov 12 '19 at 07:36
Sorry for the trouble. I have found a different approach with the pivot_table method. Thank you. — Amarbold Altangerel, Nov 12 '19 at 08:07

Unnecessary duplication is created while creating new dataframe that takes values from another by iterating over column values

1 Answers1