1

pls review the code below, is there a more efficient way of splitting one DF into two? In the code below, the query is run twice. Would it be faster to just run the query once, and basically say if true send to DF1, else to DF2 ; or maybe after DF1 is created, someway to say that DF2 = DF minus DF1

code:

x1='john'
df = pd.read_csv(file, sep='\n', header=None, engine='python', quoting=3)
df = df[0].str.strip(' \t"').str.split('[,|;: \t]+', 1, expand=True).rename(columns={0: 'email', 1: 'data'}) 
df1= df[df.email.str.startswith(x1)]
df2= df[~df.email.str.startswith(x1)]
rogerwhite
  • 335
  • 4
  • 16

1 Answers1

2

There's no need to compute the mask df.emailclean.str.startswith(x1) twice.

mask = df.emailclean.str.startswith(x1)
df1 = df[mask].copy() # in order not have SettingWithCopyWarning 
df2 = df[~mask].copy() # https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
BENY
  • 317,841
  • 20
  • 164
  • 234
timgeb
  • 76,762
  • 20
  • 123
  • 145