0

I am trying to add values taken from one dataframe column by iterating over unique values (contract numbers). For smaller numbers of iteration, the script works perfectly. However, iterating over 1000 unique values, it creates duplicate values in the resulting dataframe, which in turn slows the processing speed and taking an unnecessarily long time for a processing. How should I make it more efficient?

https://i.stack.imgur.com/AxcqY.jpg - original dataframe

https://i.stack.imgur.com/Qtu3x.jpg - unnecessary duplicate dataframes in new dataframe

https://imgur.com/3i5gMoJ- unnecessary duplicate dataframes in new dataframe

import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame([["AB1111",'2018-08-15 00:00:00','164','123','123'],
                   ["AB1111",'2018-08-15 00:03:00','564','453','126'],
                   ["AB1111",'2018-08-15 00:10:00','364','1231','1223'],
                   ["AB1111",'2018-08-15 00:01:00','564','575','1523'],
                   ["CD1111",'2018-08-16 00:12:00','514','341','1213'],
                   ["CD1111",'2018-08-15 00:02:00','564','1234','123'],
                   ["CD1111",'2018-08-16 00:05:00','564','341','124'],
                   ["CD1111",'2018-08-16 00:03:00','64','341','123'],
                   ["EF1111",'2018-08-15 00:00:00','534','341','121'],
                   ["EF1111",'2018-08-17 00:01:00','564','341','163'],
                   ["EF1111",'2018-08-15 00:09:00','524','341','129']],
                   columns = ['contract', 'datetime',
                              'real_cons','solar_gen','battery_charge'])


# converting datetime column datatype to "datetime"
df['datetime'] = pd.to_datetime(df['datetime']) 

#aggregation dataframe (new dataframe)
df_agg1 = pd.DataFrame()

for contract in df['contract'].unique()[:1500]:
    print(contract)
    df_contract = df.copy()[df['contract']==contract]    # selecting each full dataframe from the main DF
    df_contract.set_index('datetime', inplace=True)      # set "datetime" column as an index
    df_contract.sort_index(inplace=True)                 # sort index
    df_contract = df_contract.loc['2018-8-15']           # select timeframe       
    # creating GB61074_cons column, which will be added to df_agg, from df_contract 'real_cons' column
    df_contract[f'{contract}_con'] = df_contract['real_cons']   

    if df_agg1.empty:
        df_agg1 = df_contract[[f'{contract}_con']]        # first column 
    else:
        df_agg1 = df_agg1.join(df_contract[f'{contract}_con'])     # subsequent columns 

df_agg1

How to create the new dataframe without creating these unnecessary duplicates? What is causing them to be created?

1 Answers1

1

This is a way without using the for loop to achieve the exact same result. For readability I used multiple lines to add explantion.

df = pd.DataFrame([["AB1111",'2018-08-15 00:00:00','164'],
                   ["AB1111",'2018-08-15 00:03:00','564'],
                   ["AB1111",'2018-08-15 00:10:00','364'],
                   ["AB1111",'2018-08-15 00:01:00','564'],
                   ["CD1111",'2018-08-16 00:12:00','514'],
                   ["CD1111",'2018-08-15 00:02:00','564'],
                   ["CD1111",'2018-08-16 00:05:00','564'],
                   ["CD1111",'2018-08-16 00:03:00','64'],
                   ["EF1111",'2018-08-15 00:00:00','534'],
                   ["EF1111",'2018-08-17 00:01:00','564'],
                   ["EF1111",'2018-08-15 00:09:00','524']],
                   columns = ['contract', 'datetime','real_cons'])


df = df.set_index(['datetime','contract']).unstack().add_suffix('_con')
df = df.droplevel(level=0,axis=1) #drops the 'real_cons' index
df = pd.DataFrame(df.to_records()) #workaround the remove multiindex
df['datetime'] = pd.to_datetime(df['datetime']) #change datetime column to datetime datatype
df = df.set_index('datetime').loc['2018-08-15'] #filter data on date

print(df.reset_index())

Result:

             datetime AB1111_con CD1111_con EF1111_con
0 2018-08-15 00:00:00        164        NaN        534
1 2018-08-15 00:01:00        564        NaN        NaN
2 2018-08-15 00:02:00        NaN        564        NaN
3 2018-08-15 00:03:00        564        NaN        NaN
4 2018-08-15 00:09:00        NaN        NaN        524
5 2018-08-15 00:10:00        364        NaN        NaN
mgruber
  • 751
  • 1
  • 9
  • 26
  • Sorry, I edited the dataframe a bit. Dataframe has multiple columns. How does it work when I try to select a particular "real_cons" column and add it to a new dataframe? When I try your code, it gets values from the columns it isn't supposed to get data from. – Amarbold Altangerel Nov 12 '19 at 05:13
  • Which columns should appear as a header? Which column should be the value? – mgruber Nov 12 '19 at 07:36
  • 1
    Sorry for the trouble. I have found a different approach with the pivot_table method. Thank you. – Amarbold Altangerel Nov 12 '19 at 08:07