I am working with a CSV file in which many rows contain duplicated words. I want to remove the duplicates while keeping the order of the sentences.

CSV file example (userID and description are the column names):

userID, description

12, hello world hello world

13, I will keep the 2000 followers same I will keep the 2000 followers same

14, I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car

.

.

I would like to have the output as:

userID, description

12, hello world 

13, I will keep the 2000 followers same

14, I paid $2000 to the car 

.

.

I already tried posts such as 1, 2, and 3, but none of them fixed my problem; they did not change anything. (The order in my output file matters, and I don't want to lose it.) It would be great if you could provide a code sample that I can run on my side and learn from. Thank you.

[I am using Python 3.7]

Bilgin

4 Answers

To remove duplicates, I'd suggest a solution involving the OrderedDict data structure:

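from collections import OrderedDict

# split into words, keep the first occurrence of each word in order, then rejoin
df['Desired'] = (df['Current'].str.split()
                          .apply(lambda x: OrderedDict.fromkeys(x).keys())
                          .str.join(' '))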
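
As a minimal, self-contained sketch of the same idea, the snippet below applies it to the sample rows from the question. It assumes the column is named description, uses a hypothetical helper called drop_repeated_words, and swaps OrderedDict for a plain dict, which preserves insertion order in Python 3.7 and later. Note that this removes every repeated word, not only repeated phrases.

```python
import pandas as pd


def drop_repeated_words(text):
    # dict.fromkeys keeps the first occurrence of each word, in order
    return " ".join(dict.fromkeys(text.split()))


# Sample rows copied from the question
df = pd.DataFrame({
    "userID": [12, 13, 14],
    "description": [
        "hello world hello world",
        "I will keep the 2000 followers same I will keep the 2000 followers same",
        "I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car",
    ],
})

df["description"] = df["description"].apply(drop_repeated_words)
print(df["description"].tolist())
```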
Display name

The code below works for me:

import pandas as pd

a = pd.Series(["hello world hello world", 
               "I will keep the 2000 followers same I will keep the 2000 followers same",
               "I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car"])
# keep a word only where its position equals the index of its first occurrence
a.apply(lambda x: " ".join([w for i, w in enumerate(x.split()) if x.split().index(w) == i]))

The idea is to keep each word only at its first position in the list obtained by splitting the string on whitespace. If a word occurs a second (or later) time, .index() returns an index smaller than the position of the current occurrence, so that occurrence is dropped.

This will give you:

0                            hello world
1    I will keep the 2000 followers same
2                I paid $2000 to the car
dtype: object
TYZ
  • Your code would give a wrong answer for "I will keep the 2000 followers the same I will keep the 2000 followers the same" (notice that there are two `the`s in the sentence). – Quang Hoang May 27 '19 at 21:02
  • @Yilun Zhang Thanks for your comments and code. If I modify your code as: `data = pd.read_csv('someCSv.csv', error_bad_lines=False); a = pd.Series(data) a.apply(lambda x: " ".join([w for i, w in enumerate(x.split()) if x.split().index(w) == i]))` I get `ValueError: Wrong number of items passed 2`. Can you modify your code to read the input from a CSV and then save the result to a CSV? (I am new to Python!) – Bilgin May 27 '19 at 21:10
  • @Bilgin instead of `a=pd.Series(data)`, try `data['description']` – Ricky Kim May 27 '19 at 21:30
  • @QuangHoang Good point, I will need to revisit it to see how to solve it. – TYZ May 28 '19 at 12:46
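
The counterexample raised in the comments can be checked directly. This is a small sketch with a hypothetical helper name, keep_first_occurrences, that applies the same first-occurrence filter to a sentence containing a legitimate second "the":

```python
def keep_first_occurrences(text):
    words = text.split()
    # same filter as in the answer: keep a word only at its first position
    return " ".join(w for i, w in enumerate(words) if words.index(w) == i)


s = ("I will keep the 2000 followers the same "
     "I will keep the 2000 followers the same")
print(keep_first_occurrences(s))
```

The second "the" is treated as a duplicate and dropped, so word-level filtering is only safe when no word legitimately appears twice within a single phrase.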

Solution taken from here:

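def principal_period(s):
    s = s.strip() + ' '  # trailing space so the repeats line up on a separator
    i = (s + s).find(s, 1)  # first position where the string repeats
    return s[:i].strip()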

df['description'].apply(principal_period)

Output:

0                                 hello world
1     I will keep the 2000 followers the same
2                     I paid $2000 to the car
Name: description, dtype: object

Since this calls apply on every string, it might be slow on large data.
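
A quick check of how the separator affects this trick may help. The sketch below uses a hypothetical helper, first_repeat_index, and compares a string whose repeats are joined by a single space with the same string padded by a trailing space:

```python
def first_repeat_index(s):
    # index where s first reappears inside s + s, skipping the trivial match at 0
    return (s + s).find(s, 1)


plain = "hello world hello world"
padded = plain + " "  # now every repeat, including the last, ends with a space

print(first_repeat_index(plain), len(plain))
print(first_repeat_index(padded), repr(padded[:first_repeat_index(padded)]))
```

Without the trailing space, the only match is at len(s), so nothing is shortened; with it, the match lands at the start of the second "hello world".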

Quang Hoang
  • Thanks for your comment. I tried your code as: `import pandas as pd data = pd.read_csv('some.csv', error_bad_lines=False); def principal_period(s): i = (s+s).find(s, 1, -1) return None if i == -1 else s[:i] k = data['description'].apply(principal_period)` and I am getting this output: `0 None 1 None 2 None Name: description, dtype: object`. Can you please modify your code so that it takes a CSV as input and writes a CSV as output? (I am new to Python.) Thanks – Bilgin May 27 '19 at 21:37
  • Sorry, check update. The function should just `return s[:i]` – Quang Hoang May 27 '19 at 21:39
  • Thanks for your code. Now I am using `import pandas as pd data = pd.read_csv('input.csv', error_bad_lines=False); def principal_period(s): i = (s+s).find(s, 1, -1) return s[:i] k = data['description'].apply(principal_period)` but it is not removing duplicates on my side. Can you please check my code? Thanks – Bilgin May 27 '19 at 21:46
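
The CSV-in, CSV-out flow requested in these comments could look roughly like the sketch below. It uses in-memory buffers as stand-ins for the real files, and it assumes that every description is a string and that repeats are separated by a single space.

```python
import io

import pandas as pd


def principal_period(s):
    s = s.strip() + " "  # assumption: repeats are separated by one space
    i = (s + s).find(s, 1)
    return s[:i].strip()


# Stand-in for the input file; with a real file, pass its path to read_csv.
csv_in = io.StringIO(
    "userID,description\n"
    "12,hello world hello world\n"
    "14,I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car\n"
)
data = pd.read_csv(csv_in)

# Select the column by name and assign the result back to it.
data["description"] = data["description"].apply(principal_period)

# Stand-in for the output file; with a real file, pass its path to to_csv.
csv_out = io.StringIO()
data.to_csv(csv_out, index=False)
print(csv_out.getvalue())
```

Assigning the result back to data["description"] is the step that the snippets quoted in the comments leave out; without it the DataFrame is unchanged.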

Answer taken from How can I tell if a string repeats itself in Python?

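import pandas as pd

def principal_period(s):
    s += ' '  # repeated phrases are separated by a space
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]

df = pd.read_csv(r'path\to\filename_in.csv')
# assign the result back; keep the original text when no repeat is found
df['description'] = df['description'].apply(lambda s: (principal_period(s) or s).strip())
df.to_csv(r'output\path\filename_out.csv', index=False)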

Explanation:

I added a space at the end because the repeated strings are delimited by spaces. The code then concatenates the string with itself and searches for the string inside the result, excluding the first and last characters: skipping the first character avoids the trivial match at position 0, and excluding the last avoids the match at the end when nothing repeats. This efficiently finds the position where the second occurrence starts, which is where the shortest repeating unit ends. That unit is then returned.
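
To make the effect of the two search bounds concrete, here is a small trace of the function on one repeating and one non-repeating string. The inputs are illustrative only.

```python
def principal_period(s):
    s += " "
    # start=1 skips the match at index 0; end=-1 excludes the match at len(s)
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]


repeating = principal_period("hello world hello world")
non_repeating = principal_period("hello world")

print(repr(repeating))
print(repr(non_repeating))
```

The repeating input comes back as the first unit with its trailing space, and the non-repeating input comes back as None, so callers may want to strip the result and fall back to the original text.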

Ricky Kim
  • While this code snippet may solve the problem, it doesn't explain why or how it answers the question. Please include an explanation for your code, as that really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – Bryce Howitson May 30 '19 at 14:24
  • @BryceHowitson added explanation – Ricky Kim May 30 '19 at 14:45