My current code:

```python
for row in range(df1.shape[0]):
    words = df1.iloc[row, 11].split()
    df1.iloc[row, 11] = " ".join(sorted(set(words), key=words.index))
```
It removes duplicate country codes within a string in a pandas DataFrame column, so each code appears only once, in the order it first occurs. E.g.
| Countries |
| --- |
| US CN US |
| US CN EU |
| US CN US EU |
| US US US US |
To be:
| Countries |
| --- |
| US CN |
| US CN EU |
| US CN EU |
| US |
As you can see, iterating through 400k rows and editing them one cell at a time is extremely slow: around 20 minutes per dataset on average.
Hoping some kind soul can help me refine this further.
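For comparison, here is a sketch of a vectorized approach that avoids the per-cell `.iloc` writes (the column name `Countries` is an assumption; the original code addresses the column by position 11). It maps a small dedupe function over the whole column once, using `dict.fromkeys` to drop duplicates while preserving first-seen order; this is typically far faster than 400k individual `.iloc` assignments.

```python
import pandas as pd

# Toy frame mirroring the example column; "Countries" is an assumed name
df1 = pd.DataFrame(
    {"Countries": ["US CN US", "US CN EU", "US CN US EU", "US US US US"]}
)

def dedupe(text):
    # dict.fromkeys keeps first-seen order and drops duplicates (Python 3.7+)
    return " ".join(dict.fromkeys(text.split()))

# One pass over the column instead of row-by-row .iloc writes
df1["Countries"] = df1["Countries"].map(dedupe)
print(df1["Countries"].tolist())  # → ['US CN', 'US CN EU', 'US CN EU', 'US']
```

If the real column holds NaNs, guard with `df1["Countries"].map(dedupe, na_action="ignore")` so missing values pass through untouched.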