
Hello guys, I am using the RCV1 dataset. I want to remove duplicate words (tokens) from a text file, but I am not sure how to do it. These are not duplicate rows; they are duplicate words within articles. I am using Python, please help me with this. Please see the attached image to get an idea of the text file.

  • Possible duplicate of [How might I remove duplicate lines from a file?](http://stackoverflow.com/questions/1215208/how-might-i-remove-duplicate-lines-from-a-file) – LuFFy Apr 29 '17 at 07:05

1 Answer


Assuming that the words in the text file are separated only by blank spaces (i.e., no attached commas or periods), the following code should work for you.

items = []
with open("data.txt") as f:
    for line in f:
        items += line.split()  # split each line on whitespace

newItemList = list(set(items))  # set() drops duplicates (order is not preserved)

If you would like to have the items as a single string:

newItemList = " ".join(list(set(items)))

If you want the order to be preserved as well, then do

newItemList = []
for item in items:
    if item not in newItemList:  # keep only the first occurrence of each word
        newItemList.append(item)

newItemList = " ".join(newItemList)
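As a side note, on Python 3.7+ (where dictionaries preserve insertion order) the order-preserving de-duplication can also be written with `dict.fromkeys`, which avoids the quadratic membership test in the loop above; a minimal sketch using the sample data from the comments:

```python
# Order-preserving de-duplication via dict.fromkeys: each word becomes
# a dictionary key the first time it is seen, later repeats are ignored.
items = "low low low low high high different than than".split()
newItemList = " ".join(dict.fromkeys(items))
print(newItemList)  # low high different than
```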
Ébe Isaac
  • Hi, this code is running really well, but it is eliminating all the duplicated words instead of keeping one copy of each. – subuktageen shaikh Apr 29 '17 at 07:45
  • @subuktageenshaikh, sorry but I don't get your objective, didn't you *want* to remove duplicates from your data? Could you give a sample (simple) input-output pair to explain what you require? – Ébe Isaac Apr 29 '17 at 07:56
  • @subuktageenshaikh ...and the expected output? – Ébe Isaac Apr 29 '17 at 08:02
  • @ebeIsaac I know my comments are confusing; thanks for bearing with me. Here is a simple example: data = "low low low low high high different than than". My data is like this, and I want each word only once in my data set, so the output should be "low high different than". The data set I am working on is the RCV1 data set; you might have come across it. If not, here is a link to part one of the data [link](www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt0.dat.gz) – subuktageen shaikh Apr 29 '17 at 08:11
  • @subuktageenshaikh: Should the ordering of the items be preserved for your purpose? – Ébe Isaac Apr 29 '17 at 08:14
  • @ebeIsaac If there isn't much of a problem then it should be preserved; otherwise I think it will work with no order too. – subuktageen shaikh Apr 29 '17 at 08:19
  • @subuktageenshaikh: Does the edited answer work for you? – Ébe Isaac Apr 29 '17 at 08:26
  • The first version with no preserved order is working well, but the next one with preserved order is not working. Thank you so much for your help and for giving me your precious time; I think this code will work. – subuktageen shaikh Apr 29 '17 at 08:43
  • @subuktageenshaikh: Did you check the spelling of each term, especially `item` and `items`? It's working fine for me. (PS: If you really did find it helpful, you may consider accepting the answer). – Ébe Isaac Apr 29 '17 at 08:46
  • I am new to Stack Overflow; I am not sure how to accept the answer. – subuktageen shaikh Apr 29 '17 at 08:54
  • @subuktageenshaikh To accept an answer, you have to click on the tick mark below the number of votes next to the answer. – Ébe Isaac Apr 29 '17 at 08:55
  • @ebeisaac Hi, I want to create a binary matrix from the modified file. Can you help me with how to do it? I want a matrix like this (ID number = row, words = columns), where the entry is 1 if the respective word lies in that ID number and 0 otherwise. – subuktageen shaikh May 01 '17 at 02:23
  • @subuktageenshaikh I believe that what you ask is a simple task, but it would take some time to explain in words, and I'm unavailable at the moment. If you cannot find a direct solution over the Internet, you may post another question (and share it via LinkedIn if required :-)). Be warned: most users on StackOverflow expect you to do some level of research before asking and show what you've tried. (PS: I simply upvoted your question so that the initial downvote would not lead to an avalanche of downvotes as some questions do). – Ébe Isaac May 02 '17 at 04:37
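For reference, the binary matrix described in the last comment (one row per document ID, one column per word, with a 1 where the word occurs in that document) could be sketched roughly as below. The input format here is an assumption, each line starting with a document ID followed by its tokens; the actual RCV1 token-file layout may differ, so adjust the parsing accordingly.

```python
# Sketch: build a binary document-term matrix from lines of the form
# "<doc_id> <word> <word> ...". Hypothetical sample data stands in for
# the real file.
lines = [
    "101 low low high",
    "102 high than",
]

docs = {}       # doc_id -> set of words in that document
vocab = set()   # all distinct words seen across documents
for line in lines:
    doc_id, *words = line.split()
    docs[doc_id] = set(words)
    vocab |= set(words)

columns = sorted(vocab)  # fixed column order for the matrix
matrix = {doc_id: [1 if w in words else 0 for w in columns]
          for doc_id, words in docs.items()}

print(columns)        # ['high', 'low', 'than']
print(matrix["101"])  # [1, 1, 0]
print(matrix["102"])  # [1, 0, 1]
```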