0

hope you are having a good day.

I have an issue with my code, I have to merge different datasets that I in a list called all_csv. The thing is that the datasets are something like this:

Dataset 1

index   device1
GR03    1110
GR20    1121
*GR08*  1109


Dataset 2

index   device2   
*GR08*  1112
GR01    1114
GR04    1123

(In the code the index is a column called list1)

As you can see, there are sometimes that the index may be the same (GR08). So I made this:

all_files = glob.glob(path + "/*data.csv") # This selects all the .csv files in the path
all_csv = [pd.read_csv(f, sep=',') for f in all_files]
all_csv = [df.set_index(df["list1"]) for df in all_csv] #list1 is the index shown above
all_csv = [df.drop(df.columns[0], axis=1) for df in all_csv] #delete the duplication

df_merged   = pd.concat(all_csv, axis=1) # Here is the error "Reindexing only valid with uniquely valued Index objects"

I understand that to use concat there must be all the same index values, so I tried before to use it without setting the index but I got something like this:

all_files = glob.glob(path + "/*data.csv") 
all_csv = [pd.read_csv(f, sep=',') for f in all_files]
df_merged   = pd.concat(all_csv, axis=1)

Resulting dataset

index  device1  device2  device3
nan    1110     nan      1092
nan    1121     nan      nan
*GR08* 1109     1112     1098
nan    nan      1114     nan
nan    nan      1123     1111

This is correct, but I don't know why is the index values that do not connect in all the datasets appear with a nan.

Has anyone an idea how can I solve this? or an different strategy to solve it.

Thanks for all your answers.

Juank
  • 55
  • 7

1 Answers1

1

I believe a merge is the desired operation here. You can do an outer merge on the list1 column.

df1
    list1  device1
0    GR03     1110
1    GR20     1121
2  *GR08*     1109
df2
    list1  device2
0  *GR08*     1112
1    GR01     1114
2    GR04     1123
df1.merge(df2, how='outer', on=['list1'])
    list1  device1  device2
0    GR03   1110.0      NaN
1    GR20   1121.0      NaN
2  *GR08*   1109.0   1112.0
3    GR01      NaN   1114.0
4    GR04      NaN   1123.0
MYousefi
  • 978
  • 1
  • 5
  • 10
  • It may be a solution, but the thing is that df1, df2, df3...are in a list called all_csv. So technically I don't have the df1 as an individual dataframe – Juank Feb 01 '22 at 00:07
  • 1
    Then you want to take a look at [this](https://stackoverflow.com/questions/44327999/python-pandas-merge-multiple-dataframes). It basically does the above operation with `reduce` on the list of data frames. – MYousefi Feb 01 '22 at 00:10