0

I wrote code that finds all 'Contact' entries (grouped by Name) that were made via email. It uses .isin() and then extracts the True booleans to create a new dataframe. Is there a faster and simpler way to do this?

import numpy as np
import pandas as pd

df = pd.DataFrame({'Name':['adam','ben','ben','ben','adam','adam','adam'],
                   'Date':['2014-06-01 18:47:05.069722','2014-06-01 18:47:05.069722','2014-06-30 13:47:05.069722',
                      '2013-06-01 18:47:05.069722','2014-01-01 18:47:05.06972','2014-06-01 18:47:05.06972',
                      '2014-06-02 18:47:05.06972'],
                   'Contact':['phone','email','email','email','email','email','Nan']})

# Pull only those rows where the form of Contact is 'email', to construct a new dataframe

# group_keys=False keeps the result aligned with df's original index
emails = df.groupby('Name', group_keys=False)['Contact'].apply(lambda i: i.isin(['email']))
a = list(np.where(emails))  # create list of positions of True booleans
lst = a[0]
df = df.iloc[lst, :]  # new dataframe
Adam Schroeder

2 Answers

1

You could in fact use loc with boolean indexing:

df = df.loc[df.Contact == "email"]

or, as mentioned by @Sergey Bushmanov, even a bit faster using str.contains:

df = df.loc[df.Contact.str.contains("email")]

which gives the exact same output on this data, should be noticeably faster if you're using a big set of data, and is a lot simpler, I believe.
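One caveat worth sketching: the two filters agree on the question's data, but they are not equivalent in general, because str.contains matches substrings and needs to be told how to treat missing values. The frame below is hypothetical sample data (not from the question), made up only to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not from the question: it adds a missing value
# and a value that merely contains the substring "email".
contacts = pd.DataFrame({
    "Name": ["adam", "ben", "ben", "adam"],
    "Contact": ["email", "work email", np.nan, "phone"],
})

# Exact comparison: a missing value compares as False, so it is simply dropped.
exact = contacts.loc[contacts["Contact"] == "email"]

# Substring match: na=False treats missing values as non-matches, and
# regex=False makes "email" a literal string rather than a pattern.
partial = contacts.loc[
    contacts["Contact"].str.contains("email", na=False, regex=False)
]

print(exact)    # only the row whose Contact is exactly "email"
print(partial)  # also includes the "work email" row
```

If only exact matches are wanted, the == form is probably the safer choice.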

Vectorized methods are generally faster than apply, which calls a Python function once per element (or once per group).
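As a rough illustration of the difference, here is a minimal sketch that builds the same boolean mask twice, once with apply and once with a vectorized comparison. The names df_demo, mask_apply and mask_vectorized are made up for this example; the data is the question's Name and Contact columns.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["adam", "ben", "ben", "ben", "adam", "adam", "adam"],
    "Contact": ["phone", "email", "email", "email", "email", "email", "Nan"],
})

# Element by element: the lambda is called once per value, in Python.
mask_apply = df_demo["Contact"].apply(lambda value: value == "email")

# Vectorized: a single comparison expressed over the whole column.
mask_vectorized = df_demo["Contact"] == "email"

# Both masks mark the same rows, so either one selects the email contacts.
print((mask_apply == mask_vectorized).all())
emails_only = df_demo.loc[mask_vectorized]
print(emails_only)
```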

You could also refer to this link for more information about the speed and performance of pandas methods.

Other documentation about enhancing performance.

Rayhane Mama
  • Thank you Rayhane. This is definitely a lot cleaner and quicker. I appreciate it. What does Vectorized methods mean? Why is .loc[] called vectorized? – Adam Schroeder Jul 09 '17 at 20:37
  • @AdamSchroeder, you're welcome. You'll also see I added more documentation, if you're interested. Vectorized methods are basically those that do not require you to loop over the rows explicitly but are applied to all of them automatically. The documentation I provided is a lot more detailed; I strongly recommend checking it. – Rayhane Mama Jul 09 '17 at 20:42
1

For the sake of completeness:

df = df.loc[df.Contact.str.contains("email")]

Runtime:

%timeit df.loc[df.Contact.str.contains("email")]
646 µs ± 20 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.loc[df.Contact == "email"]
750 µs ± 19.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

PS

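str methods are usually optimized for dealing with text. For big DataFrames the time difference may be even bigger, though it is worth re-timing on your own data: str.contains does a substring (by default regex) match rather than an exact comparison, so the two filters do different work.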
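To re-check the timings outside IPython and on a larger frame, something along these lines should work. It is only a sketch: the frame big and the helpers by_equality and by_contains are hypothetical names, the data is just the question's Contact column repeated, and the numbers will depend on your machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical larger frame: the question's Contact values repeated many times.
big = pd.DataFrame({
    "Contact": ["phone", "email", "email", "email", "email", "email", "Nan"] * 10_000
})


def by_equality():
    """Filter with an exact, vectorized comparison."""
    return big.loc[big["Contact"] == "email"]


def by_contains():
    """Filter with a substring match via the .str accessor."""
    return big.loc[big["Contact"].str.contains("email")]


# Both filters select the same rows on this particular data.
assert by_equality().equals(by_contains())

runs = 20
t_eq = timeit.timeit(by_equality, number=runs)
t_contains = timeit.timeit(by_contains, number=runs)
print(f"==           : {t_eq / runs * 1e3:.3f} ms per call")
print(f"str.contains : {t_contains / runs * 1e3:.3f} ms per call")
```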

Sergey Bushmanov