0

Data

I have the following data:

data = {'state': ['Alabama', 'Alabama', 'Alabama', 'Alabama', 'Alabama', 'Wisconsin', 'Wisconsin'],
        'year': [1989, 1989, 1989, 1989, 1990, 2016, 1970],
        'quarter': [1, 2, 3, 4, 1, 4, 4],
        'v': [3.984353, 4.427839, 4.173073, 3.485882, 3.865541, 0.168776, 0.168776]}
df = pd.DataFrame(data)

       state  year  quarter         v
0    Alabama  1989        1  3.984353
1    Alabama  1989        2  4.427839
2    Alabama  1989        3  4.173073
3    Alabama  1989        4  3.485882
4    Alabama  1990        1  3.865541
5  Wisconsin  2016        4  0.168776
6  Wisconsin  1970        4  0.168776

The data includes observed values for states and dates back to 1970. For some states it starts later than 1970.

Goal

I want to keep the states for which I can observe the data for both 1970 and 2016.

Code

The code below includes but doesn't subset the data at all:

df.loc[(df['year'] >= 1970) & (df['year'] <= 2016)]

How can I do that in python?

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
  • What does your _filter_ using `df.loc()` print ? What is wrong with the result? – hc_dev Jun 30 '22 at 18:43
  • That result is wrong because it's not the question he wanted to ask. It just picks all rows between 1970 and 2016, which is essentially everything. You, unfortunately, need a two-step process. You need to construct the list of states for which there is a year 1970 entry and a year 2016 entry, then you need to select the subset of all rows for those states. – Tim Roberts Jun 30 '22 at 18:45
  • when I store it in a new data set, it gives exactly the same data. I wanna delete the states for which data doesn't start from 1970 but at a later that, say 1989. @hc_dev – kiyis_stats Jun 30 '22 at 18:48

1 Answers1

2

If I follow the explanation of @TimRoberts, I think your are looking for:

issubset = lambda x: set([1970, 2016]).issubset(x)
out = df[df.groupby('state')['year'].transform(issubset)]
Corralien
  • 109,409
  • 8
  • 28
  • 52