
Hello guys, I am using the RCV1 dataset. I want to remove duplicate words (tokens) from a text file, but I am not sure how to do it. These are not duplicate rows; they are duplicate words within articles. I am using Python, please help me with this. Please see the attached image to get an idea of the text file.

  • Possible duplicate of [How might I remove duplicate lines from a file?](http://stackoverflow.com/questions/1215208/how-might-i-remove-duplicate-lines-from-a-file) – LuFFy Apr 29 '17 at 07:05

1 Answer


Assuming that the words in the text file are separated only by blank spaces (i.e., no attached commas or periods), the following code should work for you.

items = []
with open("data.txt") as f:
    for line in f:
        items += line.split()  # split each line on whitespace

newItemList = list(set(items))  # set() drops duplicates (order is not preserved)

If you would like to have the items as a single string:

newItemList = " ".join(list(set(items)))

If you want the order to be preserved as well, then do

newItemList = []
for item in items:
    if item not in newItemList:  # keep only the first occurrence of each word
        newItemList.append(item)

newItemList = " ".join(newItemList)
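As a side note, on Python 3.7+ (where dictionaries preserve insertion order) the order-preserving de-duplication can also be written with `dict.fromkeys`, which avoids the quadratic membership test in the loop above; a minimal sketch using the sample data from the comments:

```python
# Order-preserving de-duplication via dict.fromkeys: each word becomes
# a dictionary key the first time it is seen, later repeats are ignored.
items = "low low low low high high different than than".split()
newItemList = " ".join(dict.fromkeys(items))
print(newItemList)  # low high different than
```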
Ébe Isaac
  • Hi, this code is running really well, but it is eliminating all the duplicated words instead of keeping one copy of each. – subuktageen shaikh Apr 29 '17 at 07:45
  • @subuktageenshaikh, sorry but I don't get your objective, didn't you *want* to remove duplicates from your data? Could you give a sample (simple) input-output pair to explain what you require? – Ébe Isaac Apr 29 '17 at 07:56
  • @subuktageenshaikh ...and the expected output? – Ébe Isaac Apr 29 '17 at 08:02
  • @ebeIsaac I know my comments are confusing; thanks for bearing with me. Here is a simple example: data = "low low low low high high different than than". My data is like this, and I want each word only once in my data set, so the output should be "low high different than". The data set I am working on is the RCV1 data set; you might have come across it. If not, here is a link to part one of the data [link](www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt0.dat.gz) – subuktageen shaikh Apr 29 '17 at 08:11
  • @subuktageenshaikh: Should the ordering of the items be preserved for your purpose? – Ébe Isaac Apr 29 '17 at 08:14
  • @ebeIsaac If there isn't much of a problem then it should be preserved; otherwise I think it will work with no order too. – subuktageen shaikh Apr 29 '17 at 08:19
  • @subuktageenshaikh: Does the edited answer work for you? – Ébe Isaac Apr 29 '17 at 08:26
  • The first version with no preserved order is working well, but the next one with preserved order is not working. Thank you so much for your help and for giving me your precious time; I think this code will work. – subuktageen shaikh Apr 29 '17 at 08:43
  • @subuktageenshaikh: Did you check the spelling of each term, especially `item` and `items`? It's working fine for me. (PS: If you really did find it helpful, you may consider accepting the answer). – Ébe Isaac Apr 29 '17 at 08:46
  • I am new to Stack Overflow; I am not sure how to accept the answer. – subuktageen shaikh Apr 29 '17 at 08:54
  • @subuktageenshaikh To accept an answer, you have to click on the tick mark below the number of votes next to the answer. – Ébe Isaac Apr 29 '17 at 08:55
  • @ebeisaac Hi, I want to create a binary matrix from the modified file. Can you help me with how to do it? I want a matrix like this (ID number = row, words = columns), where the entry is 1 if the respective word lies in that ID number and 0 otherwise. – subuktageen shaikh May 01 '17 at 02:23
  • @subuktageenshaikh I believe that what you ask is a simple task, but it would take some time to explain in words, and I'm unavailable at the moment. If you cannot find a direct solution over the Internet, you may post another question (and share it via LinkedIn if required :-)). Be warned: most users on StackOverflow expect you to do some level of research before asking and show what you've tried. (PS: I simply upvoted your question so that the initial downvote would not lead to an avalanche of downvotes as some questions do). – Ébe Isaac May 02 '17 at 04:37
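For reference, the binary matrix described in the last comment (one row per document ID, one column per word, with a 1 where the word occurs in that document) could be sketched roughly as below. The input format here is an assumption, each line starting with a document ID followed by its tokens; the actual RCV1 token-file layout may differ, so adjust the parsing accordingly.

```python
# Sketch: build a binary document-term matrix from lines of the form
# "<doc_id> <word> <word> ...". Hypothetical sample data stands in for
# the real file.
lines = [
    "101 low low high",
    "102 high than",
]

docs = {}       # doc_id -> set of words in that document
vocab = set()   # all distinct words seen across documents
for line in lines:
    doc_id, *words = line.split()
    docs[doc_id] = set(words)
    vocab |= set(words)

columns = sorted(vocab)  # fixed column order for the matrix
matrix = {doc_id: [1 if w in words else 0 for w in columns]
          for doc_id, words in docs.items()}

print(columns)        # ['high', 'low', 'than']
print(matrix["101"])  # [1, 1, 0]
print(matrix["102"])  # [1, 0, 1]
```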