
I have the following code:

import glob
import pandas as pd

files = glob.glob(path + "/*.csv")
df = pd.concat([pd.read_csv(f) for f in files], axis=1, ignore_index=False)
df1 = df.loc[:, ~df.columns.duplicated()]
df1.to_csv("userpath.csv")

The purpose of this code is to merge multiple CSV files, all exported from the same database, side by side. The files all have the same rows but different column names, and the first row in each file holds a code. For example, the first CSV file will have columns J1_01,J1_02,J2_01,J2_02..... and the pattern then repeats in the next merged CSV file: J1_01,J1_02,J2_01,J2_02,J3_01.... The number of columns varies from file to file. The second row provides the title description of each column's value.

Each CSV file also has three columns that give the name and ID number of each row, for example: Id, Id2, Label Name. I want to keep the first instance of these three and delete the remaining duplicates. I used the code df.loc[:,~df.columns.duplicated()]; however, since the J1_01,J1_02,J2_01,J2_02,J3_01.... names eventually repeat as each new CSV file is merged in, I lose some columns. Is there any way to restrict df.loc[:,~df.columns.duplicated()] so it only drops the duplicated Id, Id2, Label Name columns after keeping the first three? Thanks!

As a follow-up question, if anyone is willing to help: if I want to replace specific characters present in each column (":", ";" or spaces) with, say, an underscore, is there any way to do this with pandas? Thanks again!

Edit: Here's a screenshot of the merged csv file.

I want to keep the first instance of 'GEO.id','GEO.id2' and 'Geo.displ' and delete anytime these three columns are repeated.

BoniPhila
  • Could you just rename the first 3 columns in the CSV file and then read it in? Then drop the remaining duplicates like you're doing. For your second question, look [here](https://stackoverflow.com/questions/28986489/python-pandas-how-to-replace-a-characters-in-a-column-of-a-dataframe). – m13op22 Mar 26 '19 at 19:37
  • Alternatively, you could get the indices of `df.columns`, then select the indices of the columns you do want, then use `df.iloc[:, idxs]` to get the data you want. – m13op22 Mar 26 '19 at 19:45
  • @HS-nebula thanks for the link! The problem is that with the `J1_01,J1_02,J2_01,J2_02,J3_01....`, the columns names vary. For example, the first csv file could have 20 columns following the code pattern but then the next csv file might have only 2 columns, and then the next csv might have 6 for example. It's never going to be the exact number. The only three constant columns are the ID columns such as `Id, Id2, Label Name`. The id columns are the ones I want to get rid of the duplicates, I want to keep the other columns if I make sense. – BoniPhila Mar 26 '19 at 19:47
  • @HS-nebula for your second reply. The index will change depending on what csv files are in the folder so it won't always be the same number. Depending on how many columns the csv file has it may always change. The only thing that remains the same are the ID columns. – BoniPhila Mar 26 '19 at 19:48
  • Okay, I think I see. Could you post [an example](https://stackoverflow.com/help/mcve) of two of your CSV files? – m13op22 Mar 26 '19 at 19:52
  • I'd also suggest renaming this question, since you're not looking to delete the columns, but rather remove certain duplicate columns. – m13op22 Mar 26 '19 at 19:53
  • @HS-nebula Sorry I have no idea how really portray my question. I've been fighting myself too on that. I'll add a screenshot of part of the merged file on the original. Thanks again! I'll add the table as i am able to format it. – BoniPhila Mar 26 '19 at 19:58
  • Based on your image, the column names that are duplicated are really `GEO.id` and `GEO.id2`, right? – m13op22 Mar 26 '19 at 20:08
  • @HS-nebula I re-uploaded the screenshot to better show the full column name. Yes, these get repeated as the next merged csv file comes up. The `HD01_VD01` code starts over every time the next merged csv files also comes up. Hence why the `df.loc[:,~df.columns.duplicated()]` code isn't that useful unless I can specify only the the 'GEO.id', `'GEO.id2', 'GEO.disp'` columns. I would simply use the second row since it has less duplicates but some columns names use the same format for "Total" which I also want to keep. – BoniPhila Mar 26 '19 at 20:17

1 Answer


From your image it seems that the columns you want to keep are the columns that begin with GEO. To do this, you can use a regex to match the names, get the indices of the matching columns, and then slice the dataframe by column index.

import re

pattern = r'GEO' # or just "id" or whatever pattern best matches your data

# Returns list of indices that match your pattern
match_idx = [i for i, e in enumerate(df.columns) if re.search(pattern, e)]

# Keep the first three matches (GEO.id, GEO.id2, GEO.displ) and drop the rest
drop_cols = match_idx[3:]

# Now choose all columns that don't match the indices of the columns you're dropping
usecols = [idx for idx, e in enumerate(df.columns) if idx not in drop_cols]

# Then select your data
df1 = df.iloc[:, usecols]

Note: if you try to select a single column like df['GEO.id'], it will return all the columns called GEO.id, which is why we have to drop the columns by index and not their name.
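For the follow-up question about replacing ":", ";" or spaces with an underscore, pandas' `str.replace` with a regex character class handles it. A minimal sketch, assuming string-valued columns (the `GEO.id` values below are made up for illustration):

```python
import pandas as pd

# Hypothetical data standing in for one of the merged columns
df = pd.DataFrame({"GEO.id": ["0500000US:42", "0500000US;43", "0500000US 44"]})

# Replace ":", ";" and spaces with "_" in every value of the column
df["GEO.id"] = df["GEO.id"].str.replace(r"[:; ]", "_", regex=True)

# The same idea works on the column names themselves
df.columns = df.columns.str.replace(r"[:; ]", "_", regex=True)

print(df["GEO.id"].tolist())  # → ['0500000US_42', '0500000US_43', '0500000US_44']
```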

m13op22
  • This is it!! Thank you so much @HS-nebula – BoniPhila Mar 26 '19 at 20:46
  • @BoniPhila Glad to help! – m13op22 Mar 26 '19 at 20:51
  • sorry to bother you again, but for the line `pattern = r'GEO'`, if I want to match multiple names, how would that be formatted? I tried `pattern = [r'GEO', 'HD0']` and get an unhashable type: 'list' error. At the moment I just rerun the code separately for `HD0`, so it's not a biggie; I just wanted to see if it was possible to loop over a list of multiple values to clean up the code. Thanks again! – BoniPhila Mar 27 '19 at 20:18
  • @BoniPhila you can look [here](https://docs.python.org/3/library/re.html) for more information about regular expressions in Python. Since you want columns that match GEO **or** HD0, you can use the `|` operator which means `or`, so that the pattern becomes `r'GEO|HD0'` – m13op22 Mar 27 '19 at 20:47
  • thank you! I really haven't touched Python in years, so it's like learning all the basics again. Thank you! – BoniPhila Mar 28 '19 at 12:51
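The alternation suggested in the last comment can be sketched like this (the column names here are invented to mimic the merged file's layout):

```python
import re

# Invented column names mimicking the merged file's layout
cols = ["GEO.id", "GEO.id2", "GEO.displ", "HD01_VD01", "Margin"]

# "|" means "or": match names containing either GEO or HD0
pattern = r"GEO|HD0"
matches = [c for c in cols if re.search(pattern, c)]
print(matches)  # → ['GEO.id', 'GEO.id2', 'GEO.displ', 'HD01_VD01']
```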