How to get unique values/elements of a column?

Question

I am trying to get the unique values of a column from a tab-delimited file. The values repeat and the file has 1,000+ lines; I just want each distinct value listed once, not every occurrence. When I run my code, it instead prints the individual letters of each value, in random order (see the example under 'Output' below). I hope someone can help me find my mistake. Thank you very much!

Code:

# Open file
file = open('SGD_features.tab')

# Demonstrate the use of a data structure to represent all unique feature types (column 2).

# Iterate for just one iteration
for line in file:

      # Get rid of new lines at the end.
      line = line.strip()

      # File is tab-delimited.
      elems = line.split("\t")
      features = elems[1]
      unique_list = str(set(features))

      print(unique_list)

Output:

{'O', 'F', 'R'}
{'S', 'C', 'D'}
{'O', 'F', 'R'}
{'S', 'C', 'D'}
{'S', 'A', 'R'}
{'e', 'l', 'o', 'm', 'r', 't'}
{'e', 'l', 'i', 'p', 'a', 'o', 'm', 'r', '_', 't', 'c'}
{'X', 'e', 'l', 'm', '_', 't', 'n'}
{'X', 'e', 'b', 'l', 'i', 'p', 'a', 'o', 'm', 'r', '_', 't', 'c', 'n'}
{'O', 'F', 'R'}
{'S', 'C', 'D'}
{'O', 'F', 'R'}
{'S', 'C', 'D'}

And so on...

DESIRED OUTPUT:

ORF
CDS
ARS
telomere
telomeric_repeat
X_element
X_element_combinatorial_repeat

EX. FILE

S000036595  noncoding_exon                  snR18       1   142367  142468  W       2011-02-03  2000-05-19|2007-05-08   
S000000002  ORF Verified    YAL002W VPS8    CORVET complex membrane-binding subunit VPS8|VPL8|VPT8|FUN15    chromosome 1    L000003013  1   143707  147531  W       2011-02-03  2004-01-14|1996-07-31   Membrane-binding component of the CORVET complex; involved in endosomal vesicle tethering and fusion in the endosome to vacuole protein targeting pathway; interacts with Vps21p; contains RING finger motif
S000031737  CDS                 YAL002W     1   143707  147531  W       2011-02-03  2004-01-14|1996-07-31   
S000121255  ARS     ARS108      ARSI-147    chromosome 1        1   147398  147717          2014-11-18  2014-11-18|2007-03-07   Autonomously Replicating Sequence
S000000001  ORF Verified    YAL001C TFC3    transcription factor TFIIIC subunit TFC3|tau 138|TSV115|FUN24   chromosome 1    L000000641|L000002287   1   151166  147594  C   -1  2011-02-03  1996-07-31  Subunit of RNA polymerase III transcription initiation factor complex; part of the TauB domain of TFIIIC that binds DNA at the BoxB promoter sites of tRNA and similar genes; cooperates with Tfc6p in DNA binding; largest of six subunits of the RNA polymerase III transcription initiation factor complex (TFIIIC)
S000030735  CDS                 YAL001C     1   151006  147594  C       2011-02-03  1996-07-31  
S000030734  CDS                 YAL001C     1   151166  151097  C       2011-02-03  1996-07-31  
S000030736  intron                  YAL001C     1   151096  151007  C       2011-02-03  1996-07-31

Please add a short example of the `SGD_features.tab` file's contents and the desired output from processing *it* to your question. — martineau, Apr 27 '21 at 02:04

Barmar · Answer 1 · 2021-04-27T14:11:59.237

1

features is just one string in one line of the file, not all the strings in that column.

Add each word to the unique_list set in the loop, and print the set at the end.

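unique_list = set()
with open('SGD_features.tab') as file:
    for line in file:
        line = line.strip()
        # Split on tabs only, so values that contain spaces stay whole.
        unique_list.add(line.split('\t')[1])

print(unique_list)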
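
To illustrate the point above, here is a minimal sketch of why the original code printed letters and how one shared set avoids it. The rows in `sample_lines` are abbreviated, hypothetical stand-ins for `SGD_features.tab`, and the names `letters` and `feature_types` are only for illustration.

```python
# Hypothetical, abbreviated rows standing in for SGD_features.tab
# (tab-separated; column 2 holds the feature type).
sample_lines = [
    "S000000002\tORF\tVerified\tYAL002W\n",
    "S000031737\tCDS\t\tYAL002W\n",
    "S000000001\tORF\tVerified\tYAL001C\n",
    "S000099999\tnot physically mapped\t\tEXAMPLE\n",
]

# Calling set() on a single string splits it into its characters,
# which is what happened on every line of the original loop.
letters = set("ORF")

# Adding each whole string to one set, created before the loop,
# keeps the values intact and drops the repeats.
feature_types = set()
for line in sample_lines:
    columns = line.rstrip("\n").split("\t")
    feature_types.add(columns[1])

print(sorted(letters))        # ['F', 'O', 'R']
print(sorted(feature_types))  # ['CDS', 'ORF', 'not physically mapped']
```

Because the split is on `'\t'` only, a value such as "not physically mapped" stays in one piece.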

edited Apr 27 '21 at 14:11

answered Apr 27 '21 at 01:03

Barmar


Thanks, this worked for me in part, because there are certain data such as "not physically mapped" that contains space between them, and with this help you gave me, it only gives me the "not" and the "physically mapped" is left out. * I updated my question. – Valy1004 Apr 27 '21 at 13:18

I forgot the `\t` delimiter in `split()` – Barmar Apr 27 '21 at 14:12
Thanks for your time and help! – Valy1004 Apr 27 '21 at 21:24

oreopot · Answer 2 · 2021-04-27T01:13:59.800

1

Try the following:

Replace the following line of your code:

unique_list = str(set(features))

with the following:

unique_list = ' '.join(set(features))

edited Apr 27 '21 at 01:13

answered Apr 27 '21 at 01:07

oreopot


Thanks for the help, but it didn't work, I get the same. I updated my question. – Valy1004 Apr 27 '21 at 13:17

score 1 · Accepted Answer · answered Apr 27 '21 at 15:49


If order doesn't matter, you could do it by creating a set from the items in column 2 of the lines in the file:

with open('SGD_features.tab') as file:
    unique_features = set(line.split('\t')[1] for line in file)

for feature in unique_features:
    print(feature)
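
If the order of first appearance does matter, one possible alternative (not part of the answer above) is to collect the values as dictionary keys, since a `dict` keeps insertion order in Python 3.7+. In this sketch, `sample_file` is a hypothetical in-memory stand-in for `open('SGD_features.tab')`, and its rows are made up.

```python
import io

# Hypothetical in-memory stand-in for open('SGD_features.tab').
sample_file = io.StringIO(
    "S1\tORF\tx\n"
    "S2\tCDS\tx\n"
    "S3\tORF\tx\n"
    "S4\tARS\tx\n"
)

# Dictionary keys are unique and remember the order they were first
# inserted in, so this drops repeats while keeping file order.
ordered_features = list(
    dict.fromkeys(line.split("\t")[1] for line in sample_file)
)

for feature in ordered_features:
    print(feature)
```

With the sample rows this prints ORF, CDS, ARS, one per line, matching the order in which each type first occurs.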

answered Apr 27 '21 at 15:49

martineau


And how would it be to put it in order? It doesn't matter if it is in alphabetical order or in order of appearance. I tried with `unique_features.sorted()`, but it doesn't work for me, it keeps printing random, different orders. I'm sorry for the inconvenience, I'm new in this Python world, in fact, thanks, this answer worked for me. – Valy1004 Apr 27 '21 at 21:23
Ok thanks. I tried with the `sorted(unique_features)` but it keeps printing randomly. But it's okay I will keep trying. Thank you so much for your help and time! – Valy1004 Apr 27 '21 at 23:40

Valy1004: Hmm, that's *very* strange and shouldn't happen if the `set` contains the same elements. Are you doing a `print(sorted(unique_features))`? You could make it permanent using `unique_features = sorted(unique_features)`. There's a highly rated third-party module named [`sortedcontainers`](https://pypi.org/project/sortedcontainers/) you could try that has a `SortedSet` — although frankly I'd expect the same results. – martineau Apr 27 '21 at 23:49
Oh! It just works with `unique_features = sorted(unique_features)` I think that I had a little confusion with the variables and the functions, thanks a lot! Really appreciated! – Valy1004 Apr 28 '21 at 00:04

Indeed, it can be confusing that the `list.sort()` **method** changes the container in-place and doesn't return anything whereas the `sorted()` **function** creates and returns a new `list` object (but can be applied to anything iterable, such as a `set`). – martineau Apr 28 '21 at 00:24
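
The difference described in the comment above can be seen in a small sketch. The set contents here are sample values taken from the question's desired output; `ordered`, `as_list`, and `result` are illustrative names only.

```python
unique_features = {"telomere", "ORF", "CDS", "ARS"}

# sorted() builds and returns a new list. If the return value is not
# stored or printed, the set itself is left exactly as it was.
sorted(unique_features)            # result discarded
ordered = sorted(unique_features)  # result kept

# list.sort() reorders the list in place and returns None.
as_list = list(unique_features)
result = as_list.sort()

print(ordered)  # ['ARS', 'CDS', 'ORF', 'telomere']
print(result)   # None
print(as_list)  # ['ARS', 'CDS', 'ORF', 'telomere']
```

A `set` has neither a `.sort()` nor a `.sorted()` method, so `unique_features.sorted()` raises `AttributeError`.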
Sorry, I'm not sure I understand your follow-on question, so you should probably formally ask a new one (and give examples of input and desired output). Offhand, it sounds like maybe you could use a dictionary that had feature types as keys that were each associated with a `list` of feature names. Something like that should be relatively easy to implement, too. I'm sure virtually every Python tutorial in existence has information on dictionaries and examples of using them, since they're such an important part of the language. – martineau Apr 28 '21 at 02:40
Hello! I just posted the new question, so I think now I explain myself much better. Thanks for the help! – Valy1004 Apr 29 '21 at 01:25
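
The dictionary idea from martineau's comment above might look something like the following. The follow-on question is not shown here, so this is only a guess at what was meant. It assumes the feature name is in column 4, and `sample_lines` holds abbreviated, hypothetical rows based on the example file.

```python
from collections import defaultdict

# Hypothetical, abbreviated rows (tab-separated). Assumption: column 2
# is the feature type and column 4 is the feature name.
sample_lines = [
    "S000000002\tORF\tVerified\tYAL002W",
    "S000031737\tCDS\t\tYAL002W",
    "S000000001\tORF\tVerified\tYAL001C",
]

# Map each feature type to the list of feature names seen for it.
names_by_type = defaultdict(list)
for line in sample_lines:
    columns = line.split("\t")
    feature_type, feature_name = columns[1], columns[3]
    names_by_type[feature_type].append(feature_name)

for feature_type, names in names_by_type.items():
    print(feature_type, names)
```

The keys of `names_by_type` are the unique feature types, so the same structure also answers the original question.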
