
I am trying to merge two DataFrames so that they merge only on matching values in a column 'diagnosis'. I am currently getting duplicates from the first dataset, and I only need to add the columns whose diagnosis matches 0 or 1.

I have tried inner, outer, left, and right joins.

```
diagnosis   radius_mean texture_mean    perimeter_mean  area_mean   smoothness_mean compactness_mean    concavity_mean  concave points_mean symmetry_mean   ... fractal_dimension_worst thickness   size    shape   adhesion    single  nuclei  chromatin   nucleoli    mitosis
0   1   17.99   10.38   122.80  1001.0  0.11840 0.27760 0.3001  0.1471  0.2419  ... 0.11890 8   10  10  8   7   10.0    9   7   1
1   1   17.99   10.38   122.80  1001.0  0.11840 0.27760 0.3001  0.1471  0.2419  ... 0.11890 5   3   3   3   2   3.0 4   4   1
2   1   17.99   10.38   122.80  1001.0  0.11840 0.27760 0.3001  0.1471  0.2419  ... 0.11890 8   7   5   10  7   9.0 5   5   4
3   1   17.99   10.38   122.80  1001.0  0.11840 0.27760 0.3001  0.1471  0.2419  ... 0.11890 7   4   6   4   6   1.0 4   3   1
4   1   17.99   10.38   122.80  1001.0  0.11840 0.27760 0.3001  0.1471  0.2419  ... 0.11890 10  7   7   6   4   10.0    4   1   2
5   1   17.99   10.38   122.80  1001.0  0.11840 0.27760 0.3001  0.1471  0.2419  ... 0.11890 7   3   2   10  5   10.0    5   4   4
6   1   17.99
```
Chique_Code
  • I have tried the following code. According to the docs, an inner join was supposed to merge only the values in the column that match: ```merged_df = pd.merge(new_df_32, new_df_10, on="diagnosis", how='right')``` – Chique_Code Aug 01 '19 at 01:49
  • It would help if you gave some examples of what your two dataframes are. – niuer Aug 01 '19 at 03:19

1 Answer


I'm replying this way because longer texts are not convenient in comments.

What you observe is the typical behavior of merge when both dataframes contain multiple rows for the columns you specify with on. If, for example, you have m records with diagnosis=1 in new_df_32 and n such records in new_df_10, you will get n*m records with diagnosis=1 in your merged dataframe.
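
To see why, here is a minimal sketch with made-up data (the names `left`, `right`, `a` and `b` are just placeholders, not columns from your dataframes):

```
import pandas as pd

# m = 3 rows with diagnosis=1 on the left, n = 2 on the right
left = pd.DataFrame({'diagnosis': [1, 1, 1], 'a': [10, 20, 30]})
right = pd.DataFrame({'diagnosis': [1, 1], 'b': [100, 200]})

merged = pd.merge(left, right, on='diagnosis')
print(len(merged))  # 6 rows: m * n = 3 * 2, one per combination
```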

If that is not what you intended and you just want to add fields from new_df_10 to new_df_32, there are two possibilities:

  • Either the join criteria you use are incomplete and there are other columns you need to include in the on parameter,
  • or you join new_df_32 with an aggregated dataframe produced from new_df_10. This means joining all rows of new_df_32 with a dataframe that contains just one row for diagnosis=0 and one for diagnosis=1.

The second case would look like this:

```
# Collapse new_df_10 to one row per diagnosis value, keeping the first row of each group
df10_aggregated = new_df_10.groupby(['diagnosis']).agg('first')

# diagnosis is now the index of df10_aggregated, hence right_index=True
merged_df = pd.merge(new_df_32, df10_aggregated, left_on="diagnosis", right_index=True, how='left')
```

You would just have to work out how you actually need to aggregate the dataframe (which aggregation function for which column) and take into account that with first, the ordering of the dataframe matters, so you might also need to sort it first (using sort_values).
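
For example, a sketch of that (`some_order_column` is a hypothetical placeholder; use whichever column defines the order first should respect):

```
# Sort before grouping so 'first' picks a deterministic row per diagnosis group
df10_aggregated = (new_df_10
                   .sort_values('some_order_column')
                   .groupby('diagnosis')
                   .agg('first'))
```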

jottbe
  • That definitely helps! df_10 still has duplicate values though, but it doesn't give me 209 thousand rows in return (only 503, which is much better for my purposes). I was wondering if there is any way to add all the columns from df_10 to df_32 only if they have matching diagnoses (0 or 1), so that instead of having 32 columns in df_32 and 10 columns in df_10 I have 42 columns in my merged table (this is for the machine learning part). I want each row to be unique, so to speak, with different features (the only thing they'll have in common is diagnosis). Really do appreciate your help! – Chique_Code Aug 01 '19 at 20:32
  • Hi, I don't understand why you still have duplicates after aggregating the dataframe. How did you aggregate it? If you just aggregate on `diagnosis`, you should get exactly two result rows. If you can aggregate df_10 like this, you should get exactly what you described after merging the two dataframes. – jottbe Aug 01 '19 at 20:41
  • I mean group on `diagnosis` and apply an aggregate to the rest of the columns. Oops, sorry, I wrote `how='right'` above; it should read `how='left'`. But I guess that doesn't change anything, because given your data you could even do an inner join (if both sides only contain 0 and 1 in the column). – jottbe Aug 01 '19 at 20:45
  • This is the code I used: ```df_10_aggregated = dropped_df_10.groupby(['diagnosis']).agg('first') merged_df = pd.merge(dropped_df_32, df_10_aggregated, left_on="diagnosis", right_index=True, how='inner')``` The left side of my merged table (which is df_32) has all the rows with unique values, and the right side of the table (the new columns I merged from df_10) has all the same values, meaning I have 200 rows with the same values and another 200 with other duplicated values. I have tried how='left', 'right', and 'inner'. Banging my head against the wall at this point. – Chique_Code Aug 01 '19 at 21:38
  • Your `df_10_aggregated` should only contain as many records as there are distinct values of `diagnosis` in `dropped_df_10`. Can you check that? Are there really 200 distinct values? If there are more rows than distinct values, then something strange is going on (maybe a bug in pandas, but that still seems unlikely to me). – jottbe Aug 01 '19 at 22:40
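
A sketch of that check, using the variable names from the comments above:

```
# Both numbers should agree: one aggregated row per distinct diagnosis value
print(dropped_df_10['diagnosis'].nunique())
print(len(df_10_aggregated))
```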