I have a DataFrame that contains an name, a year, a tag and a bunch of other variables. So it might look like this
df = pd.DataFrame({
"name": 4*["A"] + 5*["B"],
"year": [1999,2000,2001,2002,2010,2011,2012,2013,2014],
"tag": [0,1,0,0,1,0,0,1,0],
"x1": np.random.normal(size=9),
"x2": np.random.uniform(size=9)
})
print df
name tag x1 x2 year
0 A 0 -1.352707 0.932559 1999
1 A 1 -1.359828 0.724635 2000
2 A 0 1.289980 0.477135 2001
3 A 0 -0.409960 0.863443 2002
4 B 1 -1.469220 0.324349 2010
5 B 0 0.372617 0.871734 2011
6 B 0 -0.047398 0.307596 2012
7 B 1 1.240108 0.667082 2013
8 B 0 0.558432 0.284363 2014
I am looking for a way to either group or split the DataFrame into chunks, where each chunk should contain
- one row with tag==1 and
- all rows where tag==0, row[year+1] and row[year-1] exist, row[[year+-1,"tag"]]==1 and row[[year+-1,"name"]]==row[[year,"name"]].
Simpy put, I want chunks of size 3 where the middle row is tagged and is surrounded by two untagged rows of the same company. So in the example above the only two chunks that pass these conditions are
name tag x1 x2 year
0 A 0 -1.352707 0.932559 1999
1 A 1 -1.359828 0.724635 2000
2 A 0 1.289980 0.477135 2001
and
7 B 0 -0.047398 0.307596 2012
8 B 1 1.240108 0.667082 2013
9 B 0 0.558432 0.284363 2014
I have thought about grouping by multiple columns but the problem is that the rows I need to group don't have anything in common but the name. I have also thought about manually introducing (in a for loop) another column that gives each chunk a new ID which I could then group by. I am however very unhappy with that approach as it does not seem either efficient nor elegant.
I'd appreciate any ideas.