
I have a DataFrame that contains a name, a year, a tag and a bunch of other variables. It might look like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": 4*["A"] + 5*["B"],
    "year": [1999,2000,2001,2002,2010,2011,2012,2013,2014],
    "tag": [0,1,0,0,1,0,0,1,0],
    "x1": np.random.normal(size=9),
    "x2": np.random.uniform(size=9)
})

print(df)

  name  tag        x1        x2  year
0    A    0 -1.352707  0.932559  1999
1    A    1 -1.359828  0.724635  2000
2    A    0  1.289980  0.477135  2001
3    A    0 -0.409960  0.863443  2002
4    B    1 -1.469220  0.324349  2010
5    B    0  0.372617  0.871734  2011
6    B    0 -0.047398  0.307596  2012
7    B    1  1.240108  0.667082  2013
8    B    0  0.558432  0.284363  2014

I am looking for a way to either group or split the DataFrame into chunks, where each chunk should contain

  1. exactly one row with tag==1, and
  2. the two rows with tag==0 from the years directly before and after it (year-1 and year+1), which must both exist and have the same name as the tagged row.

Simply put, I want chunks of size 3 where the middle row is tagged and is surrounded by two untagged rows of the same company. So in the example above, the only two chunks that pass these conditions are

  name  tag        x1        x2  year
0    A    0 -1.352707  0.932559  1999
1    A    1 -1.359828  0.724635  2000
2    A    0  1.289980  0.477135  2001

and

6    B    0 -0.047398  0.307596  2012
7    B    1  1.240108  0.667082  2013
8    B    0  0.558432  0.284363  2014

I have thought about grouping by multiple columns but the problem is that the rows I need to group don't have anything in common but the name. I have also thought about manually introducing (in a for loop) another column that gives each chunk a new ID which I could then group by. I am however very unhappy with that approach as it does not seem either efficient nor elegant.

I'd appreciate any ideas.

Lxndr
  • Your question is a bit unclear about how you want the data. Do you want syntax that returns one of the chunks or do you want all chunks in one `DataFrame`? – sobek Sep 22 '18 at 19:22
  • I would like code that either returns the next chunk or a DataFrame of all chunks. – Lxndr Sep 22 '18 at 19:25

2 Answers


Let's try this bit of logic:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": 4*["A"] + 5*["B"],
    "year": [1999,2000,2001,2002,2010,2011,2012,2013,2014],
    "tag": [0,1,0,0,1,0,0,1,0],
    "x1": np.random.normal(size=9),
    "x2": np.random.uniform(size=9)
})

grp = df.groupby(['name',
                df.tag.cumsum().rolling(3, center=True, min_periods=1).max()])

chunks_df = {}
for n, g in grp:
    if g.shape[0] >= 3:
        chunks_df[n] = g

Where chunks_df is a dictionary of your broken up data frame:

chunks_df[('A', 1.0)]

  name  year  tag        x1        x2
0    A  1999    0 -0.015852  0.553314
1    A  2000    1  0.367290  0.245546
2    A  2001    0  0.605592  0.524358

chunks_df[('B', 3.0)]

  name  year  tag        x1        x2
6    B  2012    0 -0.750010  0.432032
7    B  2013    1 -0.682009  0.971042
8    B  2014    0  1.066113  0.179048

Details:

  • Use cumsum to uniquely identify/label each tag == 1.
  • Use rolling with a centered window of 3 and take the max of that window, to select the rows at positions -1, 0, and +1 around each tagged row.
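To see why this key works, here are the intermediate values of the two steps above for the example frame (a small sketch; only the columns needed for the key are included):

```python
import pandas as pd

df = pd.DataFrame({
    "name": 4*["A"] + 5*["B"],
    "year": [1999,2000,2001,2002,2010,2011,2012,2013,2014],
    "tag": [0,1,0,0,1,0,0,1,0],
})

# Step 1: cumsum gives each tagged row (and every row after it) a running label.
labels = df.tag.cumsum()
print(labels.tolist())  # [0, 1, 1, 1, 2, 2, 2, 3, 3]

# Step 2: the centered rolling max pulls each label one row backwards as well,
# so the row directly before a tagged row receives the same key.
key = labels.rolling(3, center=True, min_periods=1).max()
print(key.tolist())     # [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0]
```

Grouping by `name` and this key, then keeping only groups with at least 3 rows, yields exactly the two chunks shown above.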
Scott Boston

Although @ScottBoston's answer works great for the DataFrame I gave in the question, it does not work when a year is missing. For example, in the case of

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": 4*["A"] + 6*["B"],
    "year": [1999,2000,2001,2002,2008,2010,2011,2012,2013,2014],
    "tag": [0,1,0,0,0,1,0,0,1,0],
    "x1": np.random.normal(size=10),
    "x2": np.random.uniform(size=10)
})


print(df)

  name  tag        x1        x2  year
0    A    0 -0.387840  0.729721  1999
1    A    1 -0.112094  0.813332  2000
2    A    0  0.913186  0.115521  2001
3    A    0 -1.088056  0.983111  2002
4    B    0  0.037521  0.743706  2008
5    B    1  0.602878  0.007256  2010
6    B    0 -0.340498  0.961602  2011
7    B    0  0.170654  0.293789  2012
8    B    1  0.973555  0.942687  2013
9    B    0 -0.643503  0.133091  2014

the code would give

grp = df.groupby(['name',
                df.tag.cumsum().rolling(3, center=True, min_periods=1).max()])

chunks_df = {}
for n, g in grp:
    if g.shape[0] >= 3:
        chunks_df[n] = g
        print(n)
        print(g, "\n")


('A', 1.0)
  name  tag        x1        x2  year
0    A    0 -0.387840  0.729721  1999
1    A    1 -0.112094  0.813332  2000
2    A    0  0.913186  0.115521  2001
3    A    0 -1.088056  0.983111  2002 

('B', 2.0)
  name  tag        x1        x2  year
4    B    0  0.037521  0.743706  2008
5    B    1  0.602878  0.007256  2010
6    B    0 -0.340498  0.961602  2011 

('B', 3.0)
  name  tag        x1        x2  year
7    B    0  0.170654  0.293789  2012
8    B    1  0.973555  0.942687  2013
9    B    0 -0.643503  0.133091  2014

which shows that the first chunk has the wrong size, and the second chunk should not be there at all according to the second condition in the original question (its years are 2008, 2010 and 2011).

The two main issues are:

  1. The question explicitly keeps open the possibility for a row to be in more than one chunk, so one additional index can't generally be enough.
  2. The condition on years must be included, so the rolling calculation needs to be on two columns (tag and year) at the same time, which is currently not supported by pandas according to https://stackoverflow.com/a/37491779/2336654.
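For completeness, one vectorized way around the two-column limitation is to test the year and name conditions with a per-group `shift` instead of `rolling`. This is only a sketch of that idea, not the workaround I actually used:

```python
import pandas as pd

df = pd.DataFrame({
    "name": 4*["A"] + 6*["B"],
    "year": [1999,2000,2001,2002,2008,2010,2011,2012,2013,2014],
    "tag": [0,1,0,0,0,1,0,0,1,0],
})

s = df.sort_values(["name", "year"]).reset_index(drop=True)
g = s.groupby("name")

# A tagged row is a valid chunk centre only if the previous/next row of the
# same name exists, is untagged, and lies exactly one year away.
prev_ok = (g["year"].shift(1) == s["year"] - 1) & (g["tag"].shift(1) == 0)
next_ok = (g["year"].shift(-1) == s["year"] + 1) & (g["tag"].shift(-1) == 0)
centers = s.index[(s["tag"] == 1) & prev_ok & next_ok]

# Each chunk is the centre row plus its positional neighbours.
chunks = [s.iloc[i - 1:i + 2] for i in centers]
print([c["year"].tolist() for c in chunks])  # [[1999, 2000, 2001], [2012, 2013, 2014]]
```

On the frame with the missing year, this correctly rejects the 2010 tagged row (its predecessor is 2008) and keeps only the 2000 and 2013 chunks.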

So my workaround now is the following

def rolling(df, func, window_size=3):
    # Number of rows before (dxl) and after-inclusive (dxu) the window centre.
    dxl = window_size // 2
    if window_size % 2 == 0:
        dxu = dxl
    else:
        dxu = dxl + 1
    xmin = dxl
    xmax = len(df) - dxu + 1

    for i in range(xmin, xmax):
        chunk = df.iloc[i-dxl:i+dxu, :]
        if func(chunk):
            yield chunk



def valid(chunk):
    # All rows must belong to the same name.
    if chunk["name"].nunique() != 1:
        return False
    # The middle row must be tagged.
    if chunk.tag.iloc[1] != 1:
        return False
    # The years must be consecutive.
    if chunk.year.iloc[2] - chunk.year.iloc[0] != 2:
        return False
    return True



chunks = []
for ichunk, chunk in enumerate(rolling(df, window_size=3, func=valid)):
    chunks.append(chunk.assign(new_tag=ichunk))
new_df = pd.concat(chunks, ignore_index=True)

for name, g in new_df.groupby(["name","new_tag"]):
    print(name)
    print(g, "\n")


('A', 0)
  name  tag        x1        x2  year  new_tag
0    A    0 -1.046241  0.692206  1999        0
1    A    1  0.373060  0.919130  2000        0
2    A    0  1.316474  0.463517  2001        0 

('B', 1)
  name  tag        x1        x2  year  new_tag
3    B    0  0.376408  0.743188  2012        1
4    B    1  0.019062  0.647851  2013        1
5    B    0 -0.442368  0.506169  2014        1 

Just thought I should add this in case anyone in the future wonders why the accepted answer doesn't work for a similar problem.

Lxndr