I have a pandas DataFrame with two columns that I want to combine into a new column. I already have a for loop that does what I need, but I would rather use a list comprehension and cannot figure out how to write it. My code also takes a considerable amount of time to execute. I have read that list comprehensions run faster; is there a quicker way?

If a value from 'lead_owner' matches one of the distinct/unique values in 'agent_final', use that value. Otherwise, use the value from 'agent_final' in the same row.

my_list = []
for x, y in zip(list(df['lead_owner']), list(df['agent_final'])):
    if x in set(df['agent_final']):
        my_list.append(x)
    else:
        my_list.append(y)
Ryan Davies
  • did you try `df['concatenated_col'] = df['lead_owner'] + df['agent_final']` – ksha Sep 30 '19 at 12:41
  • looks like you want the intersection plus the agent list. check this out: [SO Answer](https://stackoverflow.com/questions/18079563/finding-the-intersection-between-two-series-in-pandas) – lwileczek Sep 30 '19 at 12:43
  • I don't want them concatenated. If the values from 'lead_owner' match the distinct/unique values from 'agent_final' use that value. Otherwise use the values from 'agent_final'. – Ryan Davies Sep 30 '19 at 12:45
  • Can you post some sample data? – Chris Sep 30 '19 at 12:46

4 Answers

Here is how to do this with a list comprehension:

my_list = [x if x in set(df['agent_final']) else y for (x,y) in zip(list(df['lead_owner']), list(df['agent_final']))]

It is hard to say why your code is slow without knowing how large your data is.

One sure way to speed it up is to avoid constructing the set every time you check whether x is in it. Construct the set once, outside the for loop or list comprehension:

agent_final_set = set(df['agent_final'])
my_list = [x if x in agent_final_set else y for (x,y) in zip(list(df['lead_owner']), list(df['agent_final']))]
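
To get a feel for how much building the set once can matter, here is a minimal timing sketch. The DataFrame contents, the row count, and the function names `rebuild_each_time` and `build_once` are made up for illustration; actual timings will depend on your data and machine.

```python
import timeit

import pandas as pd

# Hypothetical data: 2,000 rows with values drawn from small pools.
n = 2000
df = pd.DataFrame({
    "lead_owner": [f"agent_{i % 50}" for i in range(n)],
    "agent_final": [f"agent_{i % 30}" for i in range(n)],
})


def rebuild_each_time():
    # set(...) is evaluated again for every row, so the whole
    # column is scanned once per element.
    return [x if x in set(df["agent_final"]) else y
            for x, y in zip(df["lead_owner"], df["agent_final"])]


def build_once():
    # The set is built a single time and reused for every lookup.
    agent_final_set = set(df["agent_final"])
    return [x if x in agent_final_set else y
            for x, y in zip(df["lead_owner"], df["agent_final"])]


slow = timeit.timeit(rebuild_each_time, number=1)
fast = timeit.timeit(build_once, number=1)
print(f"rebuild per row: {slow:.4f}s, build once: {fast:.4f}s")
```

Both functions return the same list; only the amount of repeated work differs.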
Nico Griffioen

I removed some unnecessary code and moved the creation of the set outside the main loop. Let's see if this runs faster:

agents = set(df['agent_final'])
data = zip(df['lead_owner'], df['agent_final'])
result = [x if x in agents else y for x, y in data]
Óscar López

I would suggest you try pandas apply and share how it performs:

agents = set(df['agent_final'])
df['result'] = df.apply(lambda x: x['lead_owner'] if x['lead_owner'] in agents else x['agent_final'], axis=1)

Then call to_list() on the result if you need a list.
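
As a self-contained sketch of the apply approach, assuming a small made-up DataFrame (the sample values and the name `result_list` are illustrative and not taken from the question):

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "lead_owner": ["a", "b", "c", "e"],
    "agent_final": ["1", "2", "a", "c"],
})

# Build the lookup set once, before applying the row-wise function.
agents = set(df["agent_final"])

# axis=1 passes each row to the lambda as a Series.
df["result"] = df.apply(
    lambda row: row["lead_owner"] if row["lead_owner"] in agents else row["agent_final"],
    axis=1,
)

# Convert the new column to a plain Python list if one is needed.
result_list = df["result"].to_list()
print(result_list)
```

Note that apply with axis=1 still calls a Python function once per row, so it may not be faster than the list comprehension on large frames.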

ksha

A one-liner with numpy.where (assuming numpy is imported as np); note that it returns a NumPy array rather than a list:

my_list = np.where(df.lead_owner.isin(df.agent_final), df.lead_owner, df.agent_final)

Simple example:

In [284]: df
Out[284]: 
  lead_owner agent_final
0          a           1
1          b           2
2          c           a
3          e           c

In [285]: np.where(df.lead_owner.isin(df.agent_final), df.lead_owner, df.agent_final)
Out[285]: array(['a', '2', 'c', 'c'], dtype=object)
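
The session above can be reproduced as a standalone script. The sketch below assumes the 'agent_final' values are strings, and it also shows a pandas-only variant using Series.where, which is not part of the original answer. The names `mask`, `as_array`, `as_list`, and `as_series` are illustrative.

```python
import numpy as np
import pandas as pd

# Same sample frame as in the session above (values assumed to be strings).
df = pd.DataFrame({
    "lead_owner": ["a", "b", "c", "e"],
    "agent_final": ["1", "2", "a", "c"],
})

# True where the lead_owner value occurs anywhere in agent_final.
mask = df["lead_owner"].isin(df["agent_final"])

# np.where picks lead_owner where mask is True, agent_final otherwise.
as_array = np.where(mask, df["lead_owner"], df["agent_final"])
as_list = as_array.tolist()

# pandas-only alternative: keep lead_owner where mask holds,
# otherwise fall back to agent_final. The result is a Series.
as_series = df["lead_owner"].where(mask, df["agent_final"])

print(as_list)
print(as_series.tolist())
```

Either result can be assigned straight back to the frame, for example `df["result"] = as_array`.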
RomanPerekhrest