When using set in DataFrame to remove duplicate words in list, the words changed (not about the order)

Question

I am using set in a dataframe to remove duplicate words in a list, but the original words changed in the result.

These are the words shown in the dataframe:

[Which, one, dissolve, in, water, quickly, sugar, ,, salt, ,, methane, and, carbon, di, oxide, ?]

Note: words like 'sugar,' and 'salt,' include a comma.

This is the result shown in the dataframe after using set: {oxide, sugar, Which, di, water, in, ,, salt, carbon, dissolve, one, ?, methane, quickly, and}

data['sent1']=data['sent1'].apply(lambda x : set(x))

I want the words to keep the same order after using set. I am also puzzled about why set changes the original words (from 'sugar,' to 'sugar').

no, I mean the set operation changes 'sugar, ' to 'sugar' in the result. note that there is a comma in the original words — jonathanschum, Aug 26 '19 at 07:35
doesn't sound like the behavior of a set operation. i think you need to double check your code — kerwei, Aug 26 '19 at 07:44

Ted · Answer 1 · 2019-08-26T08:07:00.653

If each row in your data frame looks like this:

data.loc[0, "sent1"] = ["Which", "one", "dissolve", "in", "water", "quickly", "sugar", ",", "salt", ",", "methane", "and", "carbon", "di", "oxide", "?"]

Then you could append the comma before applying the set operation, like:

data['sent1'] = data['sent1'].apply(lambda x: set([i + "," for i in x]))
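As a quick sanity check, and assuming the row really is a list of separate tokens as shown above (with "," stored as its own element), set does not seem to alter any word. It only collapses the repeated "," into one. A minimal sketch in plain Python, where the variable names are just for illustration:

```python
# Assumed row content: each comma is its own token, not part of "sugar" or "salt"
tokens = ["Which", "one", "dissolve", "in", "water", "quickly", "sugar", ",",
          "salt", ",", "methane", "and", "carbon", "di", "oxide", "?"]

unique = set(tokens)

# Every original token is still in the set, spelled exactly as before
unchanged = all(tok in unique for tok in tokens)
print(unchanged)  # True

# The only thing lost is the second "," (and the ordering)
print(len(tokens), len(unique))  # 16 15
```

If that matches your data, then 'sugar' never had a comma attached in the first place; the dataframe just prints list elements without quotes, which makes "sugar, ," look like 'sugar,'.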

On the other hand, if each row in `data['sent1']` is one long string of words:

data.loc[0, "sent1"] = "Which one dissolve in water quickly sugar, salt, methane and carbon di oxide ?"

then try:

data['sent1'] = data['sent1'].apply(lambda x: set(x.split(" ")))
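Since the question also asks to keep the original order, note that a set is unordered, so none of the set-based lines above will preserve it. One possible approach, sketched here under the assumption that each row is a list of words and that you are on Python 3.7 or later, is to deduplicate with dict.fromkeys instead. The helper name dedupe_keep_order is made up for this example:

```python
def dedupe_keep_order(words):
    """Return the words with duplicates removed, keeping first-seen order."""
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(words))


tokens = ["Which", "one", "dissolve", "in", "water", "quickly", "sugar", ",",
          "salt", ",", "methane", "and", "carbon", "di", "oxide", "?"]

result = dedupe_keep_order(tokens)
print(result)
```

On the dataframe this could be applied the same way as before, e.g. data['sent1'] = data['sent1'].apply(dedupe_keep_order). The result is a list rather than a set, which is what makes the order stable.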

@jonathanschum Glad it helped and welcome to SO! Feel free to accept the answer and upvote. — Ted, Aug 26 '19 at 08:06
