
My current code:

# Column 11 holds the space-separated country codes.
for row in range(df1.shape[0]):
    words = df1.iloc[row, 11].split()
    df1.iloc[row, 11] = " ".join(sorted(set(words), key=words.index))

What it does is remove duplicate country codes within a string in a pandas DataFrame column, so that each code appears only once, in the order of the sentence. E.g.

Countries
US CN US
US CN EU
US CN US EU
US US US US

To be:

Countries
US CN
US CN EU
US CN EU
US

As can be seen, iterating through 400k rows of data and editing them one by one is extremely slow: an average of 20 minutes per dataset.

Hoping for any kind souls who could help me refine this further.

bumbie
  • At first glance, a pandas expert (not me!) might suggest `apply` or something. – user202729 Feb 24 '21 at 04:55
  • What's the average number of words on a row? – user202729 Feb 24 '21 at 04:55
  • By the way, it's best if you describe what you're trying to do with the code (remove duplicate words from a string while keeping the original order). – user202729 Feb 24 '21 at 04:56
  • Searching for that gives https://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-whilst-preserving-order -- that method happens to be more efficient than the one you're using, so give it a try. – user202729 Feb 24 '21 at 04:56
  • The other is [python - How can I use the apply() function for a single column? - Stack Overflow](https://stackoverflow.com/questions/34962104/how-can-i-use-the-apply-function-for-a-single-column) -- with these two you should be able to write an answer yourself. – user202729 Feb 24 '21 at 04:57
  • try this : `df['Countries']= [*map(' '.join,df['Countries'].str.split().apply(pd.unique))]` – anky Feb 24 '21 at 05:52
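
Putting the two threads linked in the comments above together gives something like the sketch below (untested on the 400k-row data; it assumes the column is named Countries rather than being addressed as column 11):

# dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+),
# and Series.apply runs it over the single column without a manual row loop.
df1['Countries'] = df1['Countries'].apply(lambda s: ' '.join(dict.fromkeys(s.split())))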

4 Answers

2

A combination of a list comprehension and df.apply should be faster:

In [448]: df.Countries = [' '.join(map(str, i)) for i in df.Countries.str.split().apply(lambda x: set(x))]

In [449]: df
Out[449]: 
  Countries
0     CN US
1  CN US EU
2  CN US EU
3        US
Mayank Porwal
  • Thank you for the answer! Really appreciate it! But I first need it to appear in the order of the sentence! Your answer still provides insight into what I could use to improve upon. Tyvm!! – bumbie Feb 24 '21 at 05:36
  • Yes, if the order of the strings doesn't matter to you, my answer's performance should be the best. – Mayank Porwal Feb 24 '21 at 05:38
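
If the first-seen order raised in these comments does matter, one possible tweak (a sketch, not part of the original answer) is to replace set(x) with dict.fromkeys(x), which drops duplicates while preserving insertion order:

import pandas as pd

df = pd.DataFrame({'Countries': ['US CN US', 'US CN EU', 'US CN US EU', 'US US US US']})

# Same list-comprehension shape as the answer above, but dict.fromkeys keeps
# the order in which each country code first appears in the row.
df.Countries = [' '.join(dict.fromkeys(i)) for i in df.Countries.str.split()]
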
2

Solutions based on apply are generally slow and should be avoided if there is a faster method to do the job.

Define a vectorized function:

import numpy as np

@np.vectorize
def dropDupl(tt):
    # dict.fromkeys deduplicates while preserving first-seen order
    return ' '.join(dict.fromkeys(tt.split()).keys())

Then call it:

df1.Countries = dropDupl(df1.Countries.values)

Try my solution and the others on a larger source sample. In my opinion, my (vectorized) approach should be faster than any solution based on applying a function to each row.
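
A rough way to run that comparison (a sketch: the sample size, repetition count, and the apply-based alternative are my own assumptions, and dropDupl is reused from above):

import timeit
import pandas as pd

# Synthetic sample built by repeating the example rows (~400k rows, like the OP's data).
sample = pd.DataFrame({'Countries': ['US CN US', 'US CN EU', 'US CN US EU', 'US US US US'] * 100_000})

# Time the vectorized function against an apply-based equivalent.
t_vec = timeit.timeit(lambda: dropDupl(sample.Countries.values), number=3)
t_apply = timeit.timeit(lambda: sample.Countries.apply(lambda s: ' '.join(dict.fromkeys(s.split()))), number=3)
print(f'vectorized: {t_vec:.2f}s  apply: {t_apply:.2f}s')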

Valdi_Bo
1

Try this and see if it's faster. It applies a helper function:

def helper(row):
    words = row['Countries'].split()
    # Return the deduplicated string instead of mutating the row copy,
    # so the result can be assigned back to the column.
    return " ".join(sorted(set(words), key=words.index))

df1['Countries'] = df1.apply(helper, axis=1)

Output:

  Countries
0     US CN
1  US CN EU
2  US CN EU
3        US
pakpe
  • You are welcome. Here are the rules of etiquette on Stack Overflow for accepting or upvoting an answer that has helped you. – pakpe Feb 24 '21 at 05:42
  • Helper functions only add more function-call overhead and increase execution time. If speed is the concern, helper functions are not the answer. – Boskosnitch Feb 24 '21 at 16:31
  • @Boskosnitch you are coming to this thread late. In a previous post, now deleted, the OP came up with code identical to mine, tested it, and found it to be very fast for his large dataset. Any (unproven) incremental improvement in performance from using a lambda instead of a named function is irrelevant to this particular problem. – pakpe Feb 24 '21 at 17:45
-1

Helper functions only add function-call overhead. I believe this should be the fastest solution so far:

df['Countries'] = df['Countries'].str.split(' ').apply(lambda x: ' '.join(list(set(x))))

Example:

>>> df = pd.DataFrame(data={'Countries':['US CN US','US CN EU','US CN US EU','US']})
>>> df
     Countries
0     US CN US
1     US CN EU
2  US CN US EU
3           US
>>> df['Countries'] = df['Countries'].str.split(' ').apply(lambda x: ' '.join(list(set(x))))
>>> df
  Countries
0     US CN
1  US EU CN
2  US EU CN
3        US
Boskosnitch