Pandas & python: split dataframe into many dataframes based on column value containing substring

Question

I have a dataframe with potentially millions of rows like the following:

df:
     name value
1     bob1   abc
2     bob3   def
3     jake2  ghi
4     jake   jkl 
5     sam1   mno
6     bob5   pqr

How can I split this into multiple dataframes based on the name column values containing some substring, such as 'bob', 'jake', and 'sam' in this example?

The new dataframes can be still kept in one data structure such as a dictionary if this changes anything.

Desired dataframes:

df1:
     name value
1     bob1   abc
2     bob3   def
3     bob5   pqr

df2:
     name value
1     jake2  ghi
2     jake   jkl 

df3:
     name value
1     sam1   mno

What rule do you propose for splitting? Name (minus any integers)? — jpp, Feb 14 '18 at 19:22
Yes precisely, name minus and trailing integers. I do not know the names ahead of time however, but they will always be letters , possibly containing an integer at the end. — bwrabbit, Feb 14 '18 at 19:24

score 1 · Answer 1 · answered Feb 14 '18 at 19:40

here is another approach:

get all different values :

def matching_function(x):
    match = re.match(r"([a-z]+)([0-9]+)", x, re.I)
    if match:
        return match.group(1)

The function remove the mumber from string , thanks for this answer Get all possibles values of names :

set(df.name.apply(matching_function))

Loop to those values and split the df:

df_list= []
for x in set(df.name.apply(matching_function)):
    if x :
        df_list.append(df.loc[df.name.apply(lambda y : y.startswith( x ))])

df_list contains splited dataframes

score 0 · Answer 2 · answered Feb 14 '18 at 19:27

This works. Note my dictionary keys are names, because that seemed most logical.

# get set of names
names = set(df.name.str.replace('\d+', ''))

# make dictionary
dfs = {n: df[df.name.str.replace('\d+', '') == n] for n in names}

# {'jake':     name value
# 3  jake2   ghi
# 4   jake   jkl,
#  'bob':    name value
# 1  bob1   abc
# 2  bob3   def
# 6  bob5   pqr,
#  'sam':    name value
# 5  sam1   mno}

score 0 · Answer 3 · answered Feb 14 '18 at 19:27

0

IIUC

l=[y for _,y in df.groupby(df.name.str.replace('\d+', ''))]
Out[207]: 
l
[   name value
 1  bob1   abc
 2  bob3   def
 6  bob5   pqr,     name value
 3  jake2   ghi
 4   jake   jkl,    name value
 5  sam1   mno]

answered Feb 14 '18 at 19:27

BENY

317,841
20
164
234

Pandas & python: split dataframe into many dataframes based on column value containing substring

3 Answers3

Linked