I have a text file separated by tabs and newlines. The first column contains sample IDs, but these are duplicated:

1/16    info    info    info
1/16    info    info    info
2/16    info    info    info
2/16    info    info    info
2/16    info    info    info
3/16    info    info    info
3/16    info    info    info

I need to extract the first column of IDs so that I end up with a single column of unique values, i.e.:

1/16
2/16
3/16

I have managed to extract the column, but I am having difficulty removing the duplicates. Here is what I have:

import glob

path = './Documents/*txt'
for filename in glob.glob(path):
    my_file = open(filename, 'r+')
    for line in my_file:
        line = line.split('\t')
        id = line[0]
        print(id)

I have tried using another list and adding the IDs to it, like this:

s=[]
if id not in s:
    s.append(id)

But I am stuck on how to remove the duplicates from here.

trouselife
    Possible duplicate of [How to remove duplicates from Python list and keep order?](http://stackoverflow.com/questions/479897/how-to-remove-duplicates-from-python-list-and-keep-order) – timgeb Feb 05 '16 at 10:33

3 Answers

0

I hope I understand what you want. You can remove duplicates from a list with just:

list(set(foo))

For example:

t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
s = [1, 2, 3]
list(set(t))
[1, 2, 3, 5, 6, 7, 8]
list(set(t) - set(s))
[8, 5, 6, 7]
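
Note that a set does not keep the original order of the items. If the order of the IDs matters, one common approach is to track what has already been seen while building a new list. The following is only a minimal sketch; the function name `unique_in_order` and the sample `ids` list are made up for illustration, with values taken from the question.

```python
def unique_in_order(items):
    """Return the items with duplicates dropped, keeping first-seen order."""
    seen = set()      # fast membership checks
    result = []       # preserves the order in which items first appear
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


# Hypothetical sample data mirroring the first column in the question.
ids = ["1/16", "1/16", "2/16", "2/16", "2/16", "3/16", "3/16"]
print(unique_in_order(ids))  # ['1/16', '2/16', '3/16']
```

Using a set for the membership check keeps each lookup cheap even when the list of IDs grows large.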
ghovat
  • Thanks, I understand this, but the problem I am having is that my 'list' of IDs is not a list. It is a set of strings – trouselife Feb 05 '16 at 10:57
  • I think it is much easier in this case to work with a list: in your for loop, just append every string to a list. After that you'll have a list of all IDs, and you can remove the duplicates with the code snippet above – ghovat Feb 05 '16 at 11:10
  • Yes, I have tried this. However, when I try: lst = [] lst.append(id) I get each ID in its own list, like: [1/16] [2/16] etc. How can I change this? Thanks for your help :) – trouselife Feb 05 '16 at 11:16
  • Could you post your complete code? If I declare an empty list and replace the print in the snippet from your question with an append, I get a single correct list. You could also try this when appending: lst.append(str[0]) – ghovat Feb 05 '16 at 11:41
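
The full code was never posted, so this is only a guess, but getting "each ID in its own list" is what typically happens when the empty list is created inside the loop rather than before it. A small sketch of the difference, using a made-up in-memory `lines` list in place of the real file:

```python
# Hypothetical stand-in for the lines of the tab-separated file.
lines = ["1/16\tinfo\tinfo\n", "1/16\tinfo\tinfo\n", "2/16\tinfo\tinfo\n"]

# List created INSIDE the loop: it is reset on every line,
# so each pass produces a fresh one-element list.
per_line = []
for line in lines:
    lst = []
    lst.append(line.split("\t")[0])
    per_line.append(lst)
print(per_line)  # [['1/16'], ['1/16'], ['2/16']]

# List created ONCE, before the loop: all IDs end up in the same list.
all_ids = []
for line in lines:
    all_ids.append(line.split("\t")[0])
print(all_ids)  # ['1/16', '1/16', '2/16']

# Duplicates can then be removed in one step (sorted for a stable order).
unique_ids = sorted(set(all_ids))
print(unique_ids)  # ['1/16', '2/16']
```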
0

For text file processing (if you use Linux), the standard tools are a better choice. In your case you could use awk, like this:

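# quick and dirty
import subprocess

def get_uniqid(path, suff):
    # uniq only collapses adjacent duplicates; set() removes any that remain.
    # universal_newlines=True makes check_output return text instead of bytes.
    return set(subprocess.check_output(
        "awk '{print $1}' %s/*.%s | uniq" % (path, suff),
        shell=True, universal_newlines=True).splitlines())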

It returns the set of IDs found in the files in the folder path that have the suffix suff.

With your code, just do:

import glob

def get_ids():
    ids = []
    path = "./Documents/*txt"
    for filename in glob.glob(path):
        with open(filename, 'r') as fin:
            for line in fin:
                # only the first field is needed, so split once
                line = line.split('\t', 1)
                id_ = line[0]
                if id_ not in ids:
                    ids.append(id_)
    return set(ids)  # set() removes duplicates; not strictly needed because of the `if id_ not in ids` check
Ali SAID OMAR
0

Using sets and a set comprehension, assuming you have tabs as separators:

print ({element.split("\t")[0] for element in set(open("sample.txt").readlines())})

Output:

>>>>
{'2/16', '1/16', '3/16'}
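
The one-liner above reads a single file and leaves it to the interpreter to close it. As a rough sketch of the same set-based idea applied to every file matched by the glob pattern from the question, something like the following might work. The function name `collect_ids` is hypothetical, and it assumes the ID is always the first tab-separated field and that blank lines should be skipped.

```python
import glob


def collect_ids(pattern):
    """Collect the unique first-column values from every file matching pattern."""
    ids = set()
    for filename in glob.glob(pattern):
        # The with block closes each file as soon as it has been read.
        with open(filename) as handle:
            ids.update(
                line.rstrip("\n").split("\t")[0]
                for line in handle
                if line.strip()  # skip blank lines
            )
    return ids


# Pattern taken from the question; sorted() gives a stable order for printing.
print(sorted(collect_ids("./Documents/*txt")))
```

Because the result is a set, the order is arbitrary until it is sorted. The IDs are sorted here as plain strings, so "10/16" would come before "2/16".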
Raju Pitta