0

I get "ValueError: columns overlap but no suffix specified" when I run my code for .xlsx files but not for .txt files. The data in these two different filetypes are identical. The following works fine:

import os
import pandas as pd

path = r'C:\Users\Me\1Test'
filelist = []
for root, dirs, files in os.walk(path):
    for f in files:
        if not f.endswith('.txt'):
            continue
        filelist.append(os.path.join(root, f))
for f in filelist:
    df = pd.read_table(f)
    col = df.iloc[ : , : -3] 
    df['Average'] = col.mean(axis = 1)
    col1 = df.iloc[ :, 1 : -3]
    df['Err'] = col1.sem(axis = 1)
    out = (df.join(df.drop(df.columns[[-3,-1]], axis=1)
               .sub(df[df.columns[-3]], axis=0)
               .add_suffix(' - Background')))
    out.to_excel(f.replace('txt', 'xlsx'), 'Analyzed Data')

The following gets the ValueError:

import os
import pandas as pd

path = r'C:\Users\Me\1Test'
filelist = []
for root, dirs, files in os.walk(path):
    for f in files:
        if not f.endswith('.xlsx'):
            continue
        filelist.append(os.path.join(root, f))
for f in filelist:
    df = pd.read_excel(f)
    col = df.iloc[ : , : -3] 
    df['Average'] = col.mean(axis = 1)
    col1 = df.iloc[ :, 1 : -3]
    df['Err'] = col1.sem(axis = 1)
    out = (df.join(df.drop(df.columns[[-3,-1]], axis=1)
               .sub(df[df.columns[-3]], axis=0)
               .add_suffix(' - Background')))
    out.to_excel('Analyzed Data')

Each file has a different amount of columns named 'ROI' + numbers and the 3rd to last column has a random name, which is the background. I want to run through the above functions for each file. Example df:

ROI005 ROI008 53141 AVG ERR
1 2 5 1 2.67 1.2
2 4 2 2 2.67 .67
3 3 3 1 3 0

Desired output:

ROI005 ROI008 53141 AVG ERR ROI005 - Background ROI008 - Background Average - Background
1 2 5 1 3.5 1.5 1 4 2.5
2 4 2 2 3 1 2 0 1
3 3 3 1 3 0 2 2 2

1 Answers1

1

I do not understand what you are trying to do, so i only can give some general tips.

The given error raises, if you join dataframes which have one or more equal column names, so pandas can not distinguish them. And if i read your code correct you join df with itself, so there will be of course same columns. To better see what happens you can add lsuffix and rsuffix like the Error tries to tell you (it's a bit cryptic). This will "fix" the error and creates suffixes on the columns.

df.join(..., lsuffix='_left', rsuffix='_right')

Under the hood join uses mergeand merges on the indexes. You also can use merge directly (with the same output). This is sometimes easier because it handles the columns better (it will not get the specific error) (see other question in SO)

Here is a simple example with selfjoin: SO question

And here they use the suffixes, too.

I do not know why the error only raises from the excel file, but i think the data is not the same (after reading).

Edit:

i have tested your code with your data and get no errors. I created the data with this:

df = pd.DataFrame(columns=["ROI005", "ROI008", "53141", "AVG", "ERR"],
                  data=[
                      [2, 5, 1, 2.67, 1.2],
                      [4, 2, 2, 2.67, 0.67],
                      [3, 3, 1, 3, 0]]
                  )

which leads to:

   ROI005  ROI008  53141   AVG   ERR
0       2       5      1  2.67  1.20
1       4       2      2  2.67  0.67
2       3       3      1  3.00  0.00

and after your script (second):

   ROI005  ROI008  53141   AVG   ERR  Average  Err  ROI005 - Background  ROI008 - Background  53141 - Background  AVG - Background  Average - Background
0       2       5      1  2.67  1.20      3.5  2.0                 0.80                 3.80               -0.20              1.47                  2.30
1       4       2      2  2.67  0.67      3.0  0.0                 3.33                 1.33                1.33              2.00                  2.33
2       3       3      1  3.00  0.00      3.0  1.0                 3.00                 3.00                1.00              3.00                  3.00

Is AVG and Average the same? Is Err and ERR the same? I think you have to procide more data and be more specific. with the two columns used as same i get:

   ROI005  ROI008  53141  AVG  ERR  ROI005 - Background  ROI008 - Background  AVG - Background
0       2       5      1  3.5  NaN                    1                    4               2.5
1       4       2      2  3.0  NaN                    2                    0               1.0
2       3       3      1  3.0  NaN                    2                    2               2.0

This is nearly what you want (only ERR is wrong)

You should also check, if you save the index, but not read it (so it generates a new index and takes the old index as extra column)

D-E-N
  • 1,242
  • 7
  • 14
  • `lsuffix` made the code run but it added two unnamed columns and added "_left" to my new column names. Can I eliminate these? I want to create new columns by subtracting the 3rd to right column from every other column except the most left and most right columns. The excel and text files should be the exact same data formatted the same way since I had previously used `out.to_excel` to convert my .txt to .xlsx files without running any functions through it. I ran the same code through these .xlsx files since I realized it's easier to work with my data starting from .xlsx. – CodingIsFun Apr 12 '22 at 18:50
  • 1
    The addition of ```lsuffix``` is the intended action, so you have no duplicate column names. You should split your actions in single steps and print data after each step to make you clear what is happening. ```df2 = df.drop(df.columns[[-3,-1]], axis=1)``` then ```df3 = df2.sub(df[df.columns[-3]], axis=0)```and so on. (For me the second value of ROI005 - Background looks wrong in the desired output. – D-E-N Apr 12 '22 at 20:08
  • Thank you for the suggestions. I will give them a try. I have edited my example df to fix the errors in the desired output since I quickly typed this out by hand – CodingIsFun Apr 13 '22 at 17:19