2

I have a dataframe with potentially millions of rows like the following:

df:
     name value
1     bob1   abc
2     bob3   def
3     jake2  ghi
4     jake   jkl 
5     sam1   mno
6     bob5   pqr

How can I split this into multiple dataframes based on the name column values containing some substring, such as 'bob', 'jake', and 'sam' in this example?

The new dataframes can be still kept in one data structure such as a dictionary if this changes anything.

Desired dataframes:

df1:
     name value
1     bob1   abc
2     bob3   def
3     bob5   pqr

df2:
     name value
1     jake2  ghi
2     jake   jkl 

df3:
     name value
1     sam1   mno
bwrabbit
  • 529
  • 1
  • 4
  • 25
  • What rule do you propose for splitting? Name (minus any integers)? – jpp Feb 14 '18 at 19:22
  • Yes precisely, name minus and trailing integers. I do not know the names ahead of time however, but they will always be letters , possibly containing an integer at the end. – bwrabbit Feb 14 '18 at 19:24

3 Answers3

1

here is another approach:

get all different values :

def matching_function(x):
    match = re.match(r"([a-z]+)([0-9]+)", x, re.I)
    if match:
        return match.group(1)

The function remove the mumber from string , thanks for this answer Get all possibles values of names :

set(df.name.apply(matching_function))

Loop to those values and split the df:

df_list= []
for x in set(df.name.apply(matching_function)):
    if x :
        df_list.append(df.loc[df.name.apply(lambda y : y.startswith( x ))])

df_list contains splited dataframes

Espoir Murhabazi
  • 5,973
  • 5
  • 42
  • 73
0

This works. Note my dictionary keys are names, because that seemed most logical.

# get set of names
names = set(df.name.str.replace('\d+', ''))

# make dictionary
dfs = {n: df[df.name.str.replace('\d+', '') == n] for n in names}

# {'jake':     name value
# 3  jake2   ghi
# 4   jake   jkl,
#  'bob':    name value
# 1  bob1   abc
# 2  bob3   def
# 6  bob5   pqr,
#  'sam':    name value
# 5  sam1   mno}
jpp
  • 159,742
  • 34
  • 281
  • 339
0

IIUC

l=[y for _,y in df.groupby(df.name.str.replace('\d+', ''))]
Out[207]: 
l
[   name value
 1  bob1   abc
 2  bob3   def
 6  bob5   pqr,     name value
 3  jake2   ghi
 4   jake   jkl,    name value
 5  sam1   mno]
BENY
  • 317,841
  • 20
  • 164
  • 234