How to stop Python data frame column duplication in a loop

Question

I am currently learning Python and pandas.

I am creating a DataFrame whose columns involve a lot of repeated calculations. I have written a loop that runs those calculations on a selection of columns and creates the corresponding new columns. The first run works as intended, but the script has to run repeatedly as new data arrives. On the second run, the loop duplicates the new columns instead of reusing the columns created previously.

I'm sure I'm missing something simple, but I can't find anything in the SO archives that shows how to make the loop reuse the existing named columns instead of duplicating them.

import pandas as pd

result = pd.read_csv('/Users/Documents/Base.csv')

smas = [100, 50]
headers_to_calc = ['nupl', 'funding rate']

h_count = len(headers_to_calc)
s_count = len(smas)

for h in headers_to_calc:
    for s in smas:
        sma = 'sma'

        result[h, sma, s] = result[h].rolling(s).mean()

        if s == s_count:
            break

    if h == h_count:
        break

result

result.to_csv('/Users/Documents/Base.csv')

This creates the columns with the correct rolling averages (100 and 50) for both the nupl and funding rate columns: nupl sma 100, nupl sma 50, funding rate sma 100 and funding rate sma 50.

When the script is run again, however, all of the above columns are duplicated rather than recalculated and written into the existing columns.

I'm thinking I may need an if statement so that columns which already exist are not recreated, or maybe the loop could merge the duplicate columns straight away based on their nearly identical titles.

Hi and welcome James. Please have a look at [this thread](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples/20159305#20159305) and provide some reproducible data. In general, looping over a dataset is mostly not best practice and should be avoided. — Marco_CH, Jan 20 '22 at 13:52

Kcode · Accepted Answer · 2022-01-20T14:19:33.653

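Currently the column labels that your loop creates are tuples: result[h, sma, s] uses a key such as ('nupl', 'sma', 100) as the column name. When the frame is written to CSV, each tuple is saved as text, so on the next run the tuple key no longer matches the saved string header and a new column is added. A simple solution might be to turn the generated column names into strings, like so: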

result["_".join([str(h),str(sma), str(s)])] = result[h].rolling(s).mean()
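As a rough illustration of why tuple labels lead to duplicates, the sketch below creates one tuple-named column and round-trips the frame through CSV text. The data is a hypothetical stand-in (the original Base.csv is not shown), and an in-memory buffer is used instead of a file.

```python
import io

import pandas as pd

# Hypothetical stand-in data; the real Base.csv is not shown in the question.
df = pd.DataFrame({"nupl": [1.0, 2.0, 3.0, 4.0]})

# Indexing with several values builds a tuple key, so the new label is a tuple.
df["nupl", "sma", 2] = df["nupl"].rolling(2).mean()
tuple_label = df.columns[-1]

# Round-trip through CSV text (an in-memory buffer rather than a file on disk).
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
string_label = reloaded.columns[-1]

# The tuple was saved as plain text, so the tuple lookup on the next run misses
# and an assignment with the same tuple key would add a new column.
tuple_found_after_reload = ("nupl", "sma", 2) in reloaded.columns
```

Here tuple_label is a tuple while string_label is an ordinary string, which is why the second run no longer finds the column it created on the first run.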

With string names, the next run overwrites the existing columns instead of duplicating them. I also had to add index_col=[0] to pd.read_csv; otherwise the index that to_csv writes out by default is read back as an extra unnamed column on each run:

result = pd.read_csv('/Users/Documents/Base.csv', index_col=[0])
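Putting both changes together, a minimal sketch of two consecutive runs might look like the following. The helper name add_smas, the sample CSV text and the small window sizes are assumptions made for illustration; the real script would read and write Base.csv with windows of 100 and 50.

```python
import io

import pandas as pd


def add_smas(frame, headers, windows):
    """Add or overwrite one string-named rolling-mean column per header and window."""
    for h in headers:
        for s in windows:
            # Same name as "_".join([str(h), "sma", str(s)]) in the answer.
            frame[f"{h}_sma_{s}"] = frame[h].rolling(s).mean()
    return frame


# Hypothetical stand-in for Base.csv, with small windows so results are easy to check.
csv_text = "nupl,funding rate\n1.0,0.1\n2.0,0.2\n3.0,0.3\n4.0,0.4\n"
headers_to_calc = ["nupl", "funding rate"]
smas = [3, 2]

# First run: read, compute, save (to_csv includes the index by default).
first = add_smas(pd.read_csv(io.StringIO(csv_text)), headers_to_calc, smas)
saved = first.to_csv()

# Second run: index_col=[0] restores the saved index instead of adding "Unnamed: 0".
second = add_smas(pd.read_csv(io.StringIO(saved), index_col=[0]), headers_to_calc, smas)

columns_after_first = list(first.columns)
columns_after_second = list(second.columns)
```

Both runs end with the same six columns. The break checks from the question are left out here because a for loop stops on its own once the list is exhausted.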

edited Jan 20 '22 at 14:19

answered Jan 20 '22 at 14:13

Thank you for your help, this has solved it. Apologies for my poorly formed question too; I have since read up and understand where I went wrong, so thanks again for bearing with me. – James Cabourne Jan 20 '22 at 15:57
