Python Pandas Split DF

Question

Please review the code below. Is there a more efficient way of splitting one DataFrame into two? In the code below, the query is run twice. Would it be faster to run the query once and say: if true, send the row to DF1, else to DF2? Or, after DF1 is created, is there some way to say that DF2 = DF minus DF1?

code:

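import pandas as pd

x1 = 'john'
# file: path to the input file (defined elsewhere)
df = pd.read_csv(file, sep='\n', header=None, engine='python', quoting=3)
# Split each line once, on the first run of delimiters, into email and data
df = df[0].str.strip(' \t"').str.split('[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})
df1 = df[df.email.str.startswith(x1)]
df2 = df[~df.email.str.startswith(x1)]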

score 2 · Accepted Answer · edited Jun 13 '20 at 14:19

There's no need to compute the mask df.email.str.startswith(x1) twice. Compute it once, then index with the mask and with its negation.

mask = df.email.str.startswith(x1)
df1 = df[mask].copy()  # copy to avoid a SettingWithCopyWarning when df1 is modified later
df2 = df[~mask].copy()  # see https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
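
As a minimal sketch of the single-mask split, the snippet below uses a small made-up DataFrame in place of the parsed file (the sample rows are assumptions, not data from the question). It passes na=False so that missing emails count as non-matches, and it also shows the "DF minus DF1" idea from the question via df.drop, which only works this way when the index has no duplicate labels.

```python
import pandas as pd

# Hypothetical sample data standing in for the parsed file from the question.
df = pd.DataFrame({
    "email": ["john@a.com", "mary@b.com", "johnny@c.com", "bob@d.com"],
    "data": ["x", "y", "z", "w"],
})
x1 = "john"

# Compute the boolean mask once; na=False treats missing emails as non-matches.
mask = df["email"].str.startswith(x1, na=False)

df1 = df[mask].copy()   # rows whose email starts with x1
df2 = df[~mask].copy()  # all remaining rows

# "DF minus DF1": drop df1's index labels from df (assumes a unique index).
df2_alt = df.drop(df1.index)
```

Both routes give the same df2 here; reusing the mask avoids the extra index lookup that drop performs.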

edited Jun 13 '20 at 14:19 by BENY

answered Jun 13 '20 at 14:12 by timgeb

Kindly add .copy() at the end :-) – BENY Jun 13 '20 at 14:15
@YOBEN_S Good suggestion, but do we know if OP needs a copy? – timgeb Jun 13 '20 at 14:16
With this df1, setting a new column will trigger a copy warning. Adding .copy() is just my coding habit. – BENY Jun 13 '20 at 14:17
@YOBEN_S I'm not entirely sure what you mean but feel free to edit your suggestion into my answer. – timgeb Jun 13 '20 at 14:18
Thanks both. How do I delete 'mask' from memory after df1 and df2 are split? I am trying to be as efficient as possible (big data). – rogerwhite Jun 13 '20 at 14:31
@rogerwhite With `del mask` the object will be garbage collected eventually if `mask` was the only reference. – timgeb Jun 13 '20 at 14:53
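
A rough sketch of that cleanup, using a tiny made-up frame (the data is an assumption). It measures the mask with Series.memory_usage before deleting it; since the mask holds one boolean per row, deleting the original df, if it is no longer needed, is likely the larger saving.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real data.
df = pd.DataFrame({"email": ["john@a.com", "mary@b.com"]})
mask = df["email"].str.startswith("john")
df1, df2 = df[mask].copy(), df[~mask].copy()

# Size of the mask's values in bytes (one byte per row for a bool Series).
mask_bytes = mask.memory_usage(index=False)

del mask  # removes the name; memory is released once no reference remains
del df    # safe here because df1 and df2 are independent copies
```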
