1

Given a dataframe with the following format:

TEST_ID | ATOMIC_NUMBER | COMPOSITION_PERCENT | POSITION
1       | 28            | 49.84               | 0
1       | 22            | 50.01               | 0
1       | 47            | 0.06                | 1
2       | 22            | 49.84               | 0
2       | 47            | 50.01               | 1
3       | 28            | 49.84               | 0
3       | 22            | 50.01               | 0
3       | 47            | 0.06                | 0

I want to select only the tests that have ATOMIC_NUMBER of 22 AND 28 in POSITION 0, no more no less. So I'd like a filter that returns:

TEST_ID | ATOMIC_NUMBER | COMPOSITION_PERCENT | POSITION
1       | 28            | 49.84               | 0
1       | 22            | 50.01               | 0
1       | 47            | 0.06                | 1

EDIT: I'm trying to convert this logic from SQL into python. Here's the SQL code:

select * from compositions 
where compositions.test_id in (

  select a.test_id from (

    select test_id from compositions
    where test_id in (
      select test_id from (
        select * from COMPOSITIONS where position == 0 )
      group by test_id
      having count(test_id) = 2 )
    and atomic_number = 22) a

  join (

    select test_id from compositions
    where test_id in (
      select test_id from (
        select * from COMPOSITIONS where position == 0 )
      group by test_id
      having count(test_id) = 2 )
    and atomic_number = 28) b

  on a.test_id = b.test_id )
Brandon
  • 232
  • 2
  • 11

1 Answers1

1

You can create a boolean series to capture test_ids and then index the df using the same.

s = df[df['POSITION'] == 0].groupby('TEST_ID').apply(lambda x: ((x['ATOMIC_NUMBER'].count() == 2 ) & (sorted(x['ATOMIC_NUMBER'].values.tolist()) == [22,28])).all())

test_id = s[s].index.tolist()

df[df['TEST_ID'].isin(test_id)]

    TEST_ID ATOMIC_NUMBER   COMPOSITION_PERCENT POSITION
0   1       28              49.84               0
1   1       22              50.01               0
2   1       47              0.06                1
Vaishali
  • 37,545
  • 5
  • 58
  • 86
  • Thanks, I'm going to try this out. – Brandon Oct 03 '18 at 20:54
  • @Brandon, sure. This solution checks for exactly two values in atomic_number - one 22 and another 28 for position 0. If the cond is met, it returns the test_ids which are used to filter the dataframe. – Vaishali Oct 03 '18 at 20:57
  • @Brandon, great that its worked and thank you for accepting:) – Vaishali Oct 03 '18 at 21:10