I have the following code:
import glob
import pandas as pd

files = glob.glob(path + "/*.csv")  # every CSV file in the folder
df = pd.concat([pd.read_csv(f) for f in files], axis=1)  # merge side by side
df1 = df.loc[:, ~df.columns.duplicated()]  # keep first occurrence of each column name
df1.to_csv("userpath.csv")
The purpose of this code is to take multiple CSV files, all exported from the same database, and merge them together side by side. The files all have the same rows but different column names, and each file carries its codes in the first row. For example, CSV file one will have J1_01,J1_02,J2_01,J2_02,...
and then the pattern continues in the next merged CSV file: J1_01,J1_02,J2_01,J2_02,J3_01,...
All the CSV files have varying columns. The second row provides the title description of each column's value. Each CSV file also has three columns that give the name and ID number of each row, for example: Id, Id2, Label Name.
I want to keep the first instance of these three and delete the remaining duplicates. I used df.loc[:,~df.columns.duplicated()];
however, since the J1_01,J1_02,J2_01,J2_02,J3_01,... codes eventually duplicate as each new CSV file is merged, I lose some columns. Is there any way to tell df.loc[:,~df.columns.duplicated()]
to drop only the duplicates of the three Id, Id2, Label Name
columns, keeping the first instance of each? Thanks! As a follow-up question, if anyone is willing to help: if I want to replace specific characters present in each column (":", ";", or spaces) with, say, an underscore, is there any way to do this with pandas? Thanks again!
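For context on the follow-up, here is a minimal sketch of the kind of thing I'm imagining, assuming the characters to replace live in the column labels (the labels below are made up; the same .str.replace call works on a column of values as well):

import pandas as pd

df = pd.DataFrame(columns=["GEO id", "Label:Name;x", "J1_01"])  # hypothetical labels
# Replace ":", ";", and spaces in every column label with an underscore
df.columns = df.columns.str.replace(r"[:; ]", "_", regex=True)
print(list(df.columns))  # ['GEO_id', 'Label_Name_x', 'J1_01']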
Edit: Here's a screenshot of the merged CSV file.
I want to keep the first instance of 'GEO.id', 'GEO.id2', and 'Geo.displ' and delete these three columns anywhere they are repeated.
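To make the goal concrete, here is a minimal sketch of the behavior I'm after; the mask-based approach is my best guess at a solution, and the sample frame is made up (I'm using 'Geo.displ' as the label appears in the screenshot):

import pandas as pd

# Made-up merged frame: the three ID columns repeat, and a data column can repeat too
df = pd.DataFrame(
    [[1, 2, "a", 10, 1, 2, "a", 20]],
    columns=["GEO.id", "GEO.id2", "Geo.displ", "J1_01",
             "GEO.id", "GEO.id2", "Geo.displ", "J1_01"],
)

id_cols = ["GEO.id", "GEO.id2", "Geo.displ"]
# Drop a column only when it is BOTH a repeat and one of the three ID columns
mask = df.columns.duplicated() & df.columns.isin(id_cols)
df1 = df.loc[:, ~mask]
print(list(df1.columns))  # ['GEO.id', 'GEO.id2', 'Geo.displ', 'J1_01', 'J1_01']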