From a text file I want to extract a single column and remove duplicates, ending up with a column with unique strings - python

Question

I have a text file separated by tabs and newlines. The first column contains sample IDs, but these are duplicated:

1/16    info    info    info
1/16    info    info    info
2/16    info    info    info
2/16    info    info    info
2/16    info    info    info
3/16    info    info    info
3/16    info    info    info

I need to extract the first column of IDs so that I end up with a single column, i.e.:

1/16
2/16
3/16

I have managed to extract the column, but I am having difficulty removing the duplicates. Here is what I have:

import glob

path = './Documents/*txt'
for filename in glob.glob(path):
    my_file = open(filename, 'r+')
    for line in my_file:
        line = line.split('\t')
        id = line[0]
        print(id)

I have tried using another list and adding the IDs to it:

s=[]
if id not in s:
    s.append(id)

But I am stuck on how to remove the duplicates from here.

Possible duplicate of [How to remove duplicates from Python list and keep order?](http://stackoverflow.com/questions/479897/how-to-remove-duplicates-from-python-list-and-keep-order) — timgeb, Feb 05 '16 at 10:33

ghovat · Accepted Answer · 2019-11-01T14:39:53.600

I hope I understand what you want: you can remove duplicates from a list with just

list(set(foo))

For example:

t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
s = [1, 2, 3]  # assumed: s is not defined in the original; inferred from the output below
list(set(t))
[1, 2, 3, 5, 6, 7, 8]
list(set(t) - set(s))
[8, 5, 6, 7]
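
A set does not keep the original order, and the question's expected output lists the IDs in the order they first appear. If that matters, one option is `dict.fromkeys`. This is only a minimal sketch, assuming Python 3.7+ (where dicts preserve insertion order); the list `ids` is made-up sample data, not the asker's real input:

```python
# Hypothetical sample data standing in for the first column of the file.
ids = ["1/16", "1/16", "2/16", "2/16", "2/16", "3/16", "3/16"]

# Dict keys are unique and keep insertion order (Python 3.7+),
# so this drops duplicates while preserving first-seen order.
unique_ids = list(dict.fromkeys(ids))
print(unique_ids)  # ['1/16', '2/16', '3/16']
```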

edited Nov 01 '19 at 14:39

answered Feb 05 '16 at 10:25

Thanks, I understand this, but the problem I am having is that my 'list' of IDs is not a list. It is a set of strings – trouselife Feb 05 '16 at 10:57
I think it's much easier in this case to work with a list: in your for loop, just append every string to a list. After that you'll have a list of all the IDs, and you can remove the duplicates with the code snippet above – ghovat Feb 05 '16 at 11:10
Yes, I have tried this. However, when I try: lst =[] lst.append(id) I get each ID in its own list? Like: [1/16] [2/16] etc.? How can I change this? Thanks for your help :) – trouselife Feb 05 '16 at 11:16
Could you post your complete code? If I add the declaration of an empty list and replace the print with the append call in the snippet from your question, I get a single correct list. You could also try during the append: lst.append(str[0]) – ghovat Feb 05 '16 at 11:41
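
The "each ID in its own list" symptom from the comments is often what you see when the list is created inside the loop, so it is reset on every line. That is only a guess, since the full code was never posted. A minimal sketch of the approach suggested above, with hypothetical sample lines standing in for the file:

```python
# Hypothetical sample lines, tab-separated like the file in the question.
lines = [
    "1/16\tinfo\tinfo\tinfo\n",
    "1/16\tinfo\tinfo\tinfo\n",
    "2/16\tinfo\tinfo\tinfo\n",
    "2/16\tinfo\tinfo\tinfo\n",
    "3/16\tinfo\tinfo\tinfo\n",
]

ids = []  # created once, BEFORE the loop, so it is not reset per line
for line in lines:
    first = line.split("\t")[0]  # first tab-separated field
    ids.append(first)            # every ID goes into the same list

# Deduplicate afterwards, as in the answer; a set does not keep order.
unique_ids = list(set(ids))
print(unique_ids)
```

Here `ids` ends up holding all five values, and `unique_ids` holds the three distinct ones in arbitrary order.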

score 0 · Answer 2 · answered Feb 05 '16 at 11:55

For text file processing (if you use Linux), the standard tools are a better choice. In your case you could use awk, like this:

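# quick and dirty
import subprocess

def get_uniqid(path, suff):
    # awk prints the first whitespace-separated field of every line.
    # uniq only collapses adjacent repeats; set() removes the rest.
    return set(subprocess.check_output(
        "awk '{print $1}' %s/*.%s | uniq" % (path, suff), shell=True).splitlines())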

It returns the set of IDs from the files in the folder path that have the suffix suff.

With your code, just do:

import glob

def get_ids():
    ids = []
    path = "./Documents/*txt"
    for filename in glob.glob(path):
        with open(filename, 'r') as fin:
            for line in fin:
                line = line.split('\t', maxsplit=2)
                id_ = line[0]
                if id_ not in ids:  # skip IDs that were already collected
                    ids.append(id_)
    # set() is redundant here: the "if id_ not in ids" check already removed duplicates
    return set(ids)

score 0 · Answer 3 · answered Sep 21 '17 at 11:22

Using sets and a set comprehension, assuming you have tabs as separators:

print ({element.split("\t")[0] for element in set(open("sample.txt").readlines())})

Output:

{'2/16', '1/16', '3/16'}
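
The set output above comes back in arbitrary order, and the one-liner never closes the file. If you want the IDs as a column in first-seen order, something like the following might work. It is a sketch under assumptions: `unique_first_column` is a hypothetical helper name, and the temporary file just recreates the sample rows from the question.

```python
import os
import tempfile


def unique_first_column(path, sep="\t"):
    """Return the distinct first-column values of a file, in first-seen order."""
    seen = set()   # fast membership test
    ordered = []   # keeps the order in which IDs first appear
    with open(path) as handle:  # the file is closed when the block ends
        for line in handle:
            key = line.rstrip("\n").split(sep, 1)[0]
            if key not in seen:
                seen.add(key)
                ordered.append(key)
    return ordered


# Usage with a throwaway file holding the sample rows from the question.
rows = ["1/16", "1/16", "2/16", "2/16", "2/16", "3/16", "3/16"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for row_id in rows:
        tmp.write(row_id + "\tinfo\tinfo\tinfo\n")

result = unique_first_column(tmp.name)
os.remove(tmp.name)
print("\n".join(result))  # one unique ID per line
```

The separate `seen` set keeps the membership check cheap on large files, while the `ordered` list preserves the order of first appearance.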

Raju Pitta
