9

I have two data frames A and B of unequal dimensions. I would like to create a data frame C such that it ONLY contains rows that are unique between A and B. I tried to follow this solution (excluding rows from a pandas dataframe based on column value and not index value) but could not get it to work.

Here's example:

Assume this to be DF_A:

    Star_ID         Loc_ID      pmRA        pmDE    Field     Jmag    Hmag  
 2M00000032+5737103  4264    0.000000    0.000000    N7789   10.905  10.635
 2M00000068+5710233  4264    8.000000    -18.000000  N7789   10.664  10.132
 2M00000222+5625359  4264    0.000000    0.000000    N7789   11.982  11.433
 2M00000818+5634264  4264    0.000000    0.000000    N7789   12.501  11.892
 2M00001242+5524391  4264    0.000000    -4.000000   N7789   12.091  11.482

And this to be DF_B:

2M00000032+5737103  
2M00000068+5710233
2M00001242+5524391

So, the first two and last Star_ID are common between DF_A and DF_B. I would like to create DF_C such that:

DF_C:

        Star_ID         Loc_ID      pmRA        pmDE    Field     Jmag    Hmag
     2M00000222+5625359  4264    0.000000    0.000000    N7789   11.982  11.433
     2M00000818+5634264  4264    0.000000    0.000000    N7789   12.501  11.892
Community
  • 1
  • 1
Rohit
  • 5,840
  • 13
  • 42
  • 65

1 Answers1

14

This worked for me:

In [7]:

df1[~df1.Star_ID.isin(df2.Star_ID)]

Out[7]:

              Star_ID  Loc_ID  pmRA  pmDE  Field    Jmag    Hmag
2  2M00000222+5625359    4264     0     0  N7789  11.982  11.433
3  2M00000818+5634264    4264     0     0  N7789  12.501  11.892

[2 rows x 7 columns]

So what we do here is we create a boolean mask, we ask for where Star_ID values is in both dataframes, however by using the ~ we NOT the condition which in effect negates it. The one you linked to is pretty much the same thing but I think you maybe didn't understand the syntax?

EDIT

In order to get both values that are only in df1 and values that are only in df2 you could do this

unique_vals = df1[~df1.Star_ID.isin(df2.Star_ID)].append(df2[~df2.Star_ID.isin(df1.Star_ID)], ignore_index=True)

Further edit

So the problem was that the csv had leading spaces, this caused all values to be unique in both datasets, to correct this you need to do this:

df1.Apogee_ID = df1.Apogee_ID.str.lstrip()
EdChum
  • 376,765
  • 198
  • 813
  • 562
  • Well, I tried the linked example this way: idx = df_mds['2MASS_ID'].isin(df_survey['Apogee_ID']) df_new = df_survey[~idx]. This gave me the following error: "Unalignable boolean Series key provided". Perhaps, the names have to be the same in two dataframes? – Rohit May 04 '14 at 19:25
  • Have you tried my method above, assuming your column data is correct then what you wrote probably should've worked, can you post your problematic data if the above does not work – EdChum May 04 '14 at 19:28
  • How you actually format the comment? My above comment not in the format I like it to be in. – Rohit May 04 '14 at 19:31
  • No, I am about to try. Will let you know in few minutes. Thanks. – Rohit May 04 '14 at 19:31
  • Which comment? you mean like `this`? you just put ` accents around the text (grave accents I think they are called) – EdChum May 04 '14 at 19:32
  • 1
    I don't think this is quite right; it doesn't capture ids in df2 not in df1? – exp1orer May 04 '14 at 19:32
  • @exp1orer yes that is true but it does work on the example data, I'll update my answer to satisfay that condition – EdChum May 04 '14 at 19:34
  • yeah, it does not work on my data. I will send you my files as dropbox link. Thanks! Oh, will try the edit first. – Rohit May 04 '14 at 19:41
  • No, it does not work. I JUST want stars that are NOT common between the two lists. Here are the two lists I'm working on (https://db.tt/qoC51XES). Note that DF_A = 'List_A' and DF_B = 'MDWARF_CURRENT_LIST.txt'. I expect the unique number of stars to be about 95. Your example either gives me 348 or 608 stars. Thank you SO MUCH for working this out for me! – Rohit May 04 '14 at 19:54
  • Well the first problem I can see is that you have leading spaces in your ListA.csv for `Apogee_ID`, is that your problem? – EdChum May 04 '14 at 19:59
  • No, this is the command I have used: "unique_vals = df_Mdwarfs[~df_Mdwarfs.Apogee_ID.isin(df_mds.Apogee_ID)]" – Rohit May 04 '14 at 20:02
  • where df_Mdwarfs is DF_A and DF_B = df_mds – Rohit May 04 '14 at 20:03
  • OK I get for Mdwarfs 255 total, of which 186 are present in the other df, 69 are not for mds dataframe I get 348 total, of which 186 are present in Mdwarfs, leaving 162 not present, so I would expect the total of unique vals from both dfs to be 162 + 69 = 231. Is this what you are expecting? – EdChum May 04 '14 at 20:12
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/51983/discussion-between-aging-gorrila-and-edchum) – Rohit May 04 '14 at 20:14
  • Great answer... Exactly what I have searched for – chainstair May 04 '21 at 09:41