1

I have below data:

year code   value  
2003  A      12  
2003  B      11  
2003  C      12  
2004  A      14  
2004  B      15  
2004  C      13  
2004  E      16  
2005  A      9  
2005  B      18  
2005  C      16  
2005  F      8  
2005  G      19  

I WANT TO RETAIN ONLY THOSE CODES THAT ARE PRESENT FOR EVERY YEAR.

From the above dataframe I need to extract all the rows that have codes appear in the years (2003, 2004, 2005). Which means I should have a new df with 9 rows for codes A, B and C. I tried using groupby and isin() but unable to get exactly what I need.

cs95
  • 379,657
  • 97
  • 704
  • 746
Techflu
  • 55
  • 1
  • 8

4 Answers4

2

Without groupby

df.set_index(['year','code']).unstack().dropna(axis=1).stack().reset_index()
Out[528]: 
   year code  value
0  2003    A   12.0
1  2003    B   11.0
2  2003    C   12.0
3  2004    A   14.0
4  2004    B   15.0
5  2004    C   13.0
6  2005    A    9.0
7  2005    B   18.0
8  2005    C   16.0
BENY
  • 317,841
  • 20
  • 164
  • 234
  • I would love with shorter code but this did not work for me :( . I get an error "Index contains duplicate entries, cannot reshape. – Techflu Jan 23 '18 at 15:49
  • @Techflu It should be work :-) Base on your sample data , unless you have multiple key in code – BENY Jan 23 '18 at 15:59
  • I will try again and get back. Maybe other columns in my dataset are messing it up. I will slice it further and test your code. – Techflu Jan 23 '18 at 16:08
  • it works for the exact dataset as above but even if I slice my dataset to 3 columns it throws me that error I described earlier. I believe it is something to do with the index columns I had kept. – Techflu Jan 23 '18 at 16:33
1

I believe you need filtering by isin, but if want dynamically get all values which are in all years use reduce:

s = df.groupby('year')['code'].apply(list)

from functools import reduce
a = reduce(lambda x, y: set(x) & set(y), s)
print (list(a))
['C', 'A', 'B']

df = df[df['code'].isin(list(a))]
print (df)
   year code  value
0  2003    A     12
1  2003    B     11
2  2003    C     12
3  2004    A     14
4  2004    B     15
5  2004    C     13
7  2005    A      9
8  2005    B     18
9  2005    C     16
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I say inefficient in the limited context of the problem described here. This might in fact solve OP's actual problem, but we'll never know what it is since they never made it clear. Based on this limited context there are 10 equally good solutions to the problem. Highly vexing. – cs95 Jan 23 '18 at 15:57
  • Also, ["hi", "thanks" and "welcome" are all noise](https://meta.stackexchange.com/questions/2950/should-hi-thanks-taglines-and-salutations-be-removed-from-posts). – cs95 Jan 23 '18 at 16:01
1

You can also try a query-based method.

df.query("2005 >= year >= 2003 and code in ['A', 'B', 'C']")


    year    code    value
0   2003    A       12
1   2003    B       11
2   2003    C       12
3   2004    A       14
4   2004    B       15
5   2004    C       13
7   2005    A       9
8   2005    B       18
9   2005    C       16
Tai
  • 7,684
  • 3
  • 29
  • 49
1

You could use

Option 1

In [647]: codes = pd.crosstab(df.year, df.code).replace({0: np.nan}).dropna(axis=1).columns

In [648]: df.query('code in @codes')
Out[648]:
   year code  value
0  2003    A     12
1  2003    B     11
2  2003    C     12
3  2004    A     14
4  2004    B     15
5  2004    C     13
7  2005    A      9
8  2005    B     18
9  2005    C     16

Option 2

In [657]: codes = df.groupby(['year', 'code']).size().unstack().dropna(axis=1).columns

In [658]: df[df.code.isin(codes)]
Out[658]:
   year code  value
0  2003    A     12
1  2003    B     11
2  2003    C     12
3  2004    A     14
4  2004    B     15
5  2004    C     13
7  2005    A      9
8  2005    B     18
9  2005    C     16
Zero
  • 74,117
  • 18
  • 147
  • 154
  • I am not sure what this, or the other answer is doing, but my guess is it's better – cs95 Jan 23 '18 at 16:11
  • @cᴏʟᴅsᴘᴇᴇᴅ -- So what I 'guessed' was, OP needs set of codes that at present in all/given years. – Zero Jan 23 '18 at 16:17
  • How hard would it have been to _say_ that from the start, I wonder – cs95 Jan 23 '18 at 16:18
  • @Zero - you got the question correct. Much simpler and easier code :) By the way I used option - 2 (to be clear..!) – Techflu Jan 23 '18 at 16:38