Get unique elements after parsing a column from a file in python

Question

I am working with Python 3.6. I have a TSV file with 5 columns and more than 100k rows. I split each line on the delimiter and pick out specific columns by index. The column I am working on looks like this:

CSF3R
DNMT3A
DNMT3A
DNMT3A
DNMT3A
CBLB
PDGFRA
KIT
TET2
TET2
CUX1
CUX1
CUX1
CUX1
CUX1
CUX1
CUX1
CUX1
EZH2
EZH2
RAD21
ABL1
NOTCH1
NOTCH1
ETV6
ETV6
ETV6
FLT3
FLT3
TP53
TP53

What I need to do is get all the unique elements present in this column and print each of them only once. I have tried several approaches such as join and set, and followed many other Stack Overflow posts, but none of them solved my problem.

Also, the data I receive is of type 'str', not a list. I tried to collect all of the values into a single list, but failed at that as well. I cannot use pandas because none of my colleagues are familiar with the package.

The plain procedural code I tried is:

file=open('filename.txt')
next(file)  # skip the header line
stripped=()
pos=()
s="-"

for line in file:
    stripped=line.strip()
    pos=stripped.split("\t")

    pos[2]= [y for y in (x.strip() for x in pos[2].splitlines()) if y]
    print(pos[2])

The output is a separate one-element list for each string, rather than a single list containing all of them.

For the column shown above, my expected output is:

CSF3R
DNMT3A
CBLB
PDGFRA
KIT
TET2
CUX1
EZH2
RAD21
ABL1
NOTCH1
ETV6
FLT3
TP53

That is, only the unique elements.

To get the unique elements, do I have to collect all of these into a single list, or is there a better way to do that?

The file that I am working on.

@Zdar : As to what I have understood, to use the `set` function, you need a list. The problem is I won't get a list of all of these. I had referred to this link - http://stackoverflow.com/questions/12897374/get-unique-values-from-a-list-in-python — Srk, May 11 '17 at 06:29
If your fellow associates are familiar with NumPy, there is: [`numpy.unique`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html) — tuomastik, May 11 '17 at 06:30
@numpy.unique - Well none of them are really familiar with any extra packages other than the default ones and neither am I. — Srk, May 11 '17 at 06:34
@Srk If you could, spend some time to have a look at pandas I think, any package I think is easy to learn as a lot of tutorial was created, and the best thing is they handle lots of thing in a neat way, efficient too so you could work easily — Phung Duy Phong, May 11 '17 at 06:39
So you're saying you want a unique output of the genes column, correct? — pylang, May 11 '17 at 07:20

score 1 · Accepted Answer · edited May 23 '17 at 11:54


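Adapted from the answer to "Reading a text file columnwise and storing in a list in python":

with open('test.txt', 'r') as file:
    # Strip the trailing newline so it does not end up in the last column
    rows = [line.rstrip('\n').split('\t') for line in file]
    # Transpose the rows into columns
    cols = [list(col) for col in zip(*rows)]

# Print the unique values of each column
for i in cols:
    print(set(i))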
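
The snippet above prints a set for every column. If only one column is wanted, with each value printed once in the order it first appears (as in the expected output in the question), a minimal sketch could look like the following. The function name `unique_in_column`, the sample rows, the header names, and the assumption that the gene sits in the third column (index 2) are illustrative and not taken from the actual file.

```python
import io

def unique_in_column(lines, col, sep="\t"):
    """Return the values of column `col`, each once, in first-seen order."""
    seen = set()
    ordered = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if col >= len(fields):
            continue  # skip blank or short lines
        value = fields[col].strip()
        if value and value not in seen:
            seen.add(value)
            ordered.append(value)
    return ordered

# Hypothetical stand-in for the real file
sample = io.StringIO(
    "CHROM\tPOS\tGENE\tREF\tALT\n"
    "1\t100\tCSF3R\tA\tG\n"
    "2\t200\tDNMT3A\tC\tT\n"
    "2\t210\tDNMT3A\tG\tA\n"
    "3\t300\tCBLB\tT\tC\n"
)
next(sample)  # skip the header row
genes = unique_in_column(sample, col=2)
print("\n".join(genes))
```

With a real file, `sample` would be replaced by the object returned from `open(...)`. A plain set is enough when the order of the output does not matter.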

answered May 11 '17 at 06:30

Phung Duy Phong


The author mentioned that he wants to avoid Pandas. – tuomastik May 11 '17 at 06:31
Oops sorry, I will try to fix and delete for now – Phung Duy Phong May 11 '17 at 06:32
@PhongPhung : I got the answer. I get a list of the exact unique elements from the column! And by manipulating a few commands, I can make different operations as well to your solution! Big Big thank you to you! – Srk May 11 '17 at 07:26

pylang · Answer 2 · 2017-05-11T07:51:19.053

1

filename = "path/to/Post.txt"

with open(filename) as f:
    header = next(f)
    col = 2                                                # gene column
    unique_genes = {line.split()[col] for line in f.readlines()}

print(unique_genes)
# {'KIT', 'PDGFRA', 'CUX1', 'CBLB', 'DNMT3A', 'RAD21', 'CSF3R', 'NOTCH1', 'GENE', 'ABL1', 'TET2', 'EZH2'}

The steps for getting unique items from the 3rd column in your data are as follows:

Open the file (with)
Skip the header (next())
Iterate over the rows of the file (readlines)
Split each line on whitespace, the default for split(), which includes tab (\t)
Only extract data from the third column ([col])
Return unique values on the extracted data (set comprehension, {...}).
Safely close the file (with)

Select a different column by changing the col value.
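
One caveat on the splitting step: `split()` with no argument collapses runs of whitespace, so an empty field shifts every later column one position to the left. Here is a small sketch of the difference, using a made-up row, with the standard-library `csv` module shown as one possible alternative for tab-delimited data:

```python
import csv
import io

row = "1\t\tCSF3R\tA\tG\n"   # made-up row whose second field is empty

by_whitespace = row.split()              # the empty field disappears
by_tab = row.rstrip("\n").split("\t")    # the empty field is kept

print(by_whitespace)
print(by_tab)

# csv.reader with delimiter="\t" also keeps empty fields
reader = csv.reader(io.StringIO(row), delimiter="\t")
unique_genes = {fields[2] for fields in reader if len(fields) > 2}
print(unique_genes)
```

If the file never contains empty fields, both ways of splitting give the same columns.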

edited May 11 '17 at 07:51

answered May 11 '17 at 06:40

pylang


This gives a separate set of characters for each gene. I tried it and got this output: `{'F', 'R', 'C', '3', 'S'} {'D', '3', 'N', 'A', 'M', 'T'}` – Srk May 11 '17 at 06:47
Give an example of the output please. – pylang May 11 '17 at 06:54
No problem. Please include the actual output you are getting. – pylang May 11 '17 at 07:03
So the comment I mentioned earlier trying your code is the output I get from your code. The output which I get from my code is something like this - `['CSF3R'] ['DNMT3A'] ['DNMT3A']` – Srk May 11 '17 at 07:06

Stephen Rauch · Answer 3 · 2017-05-11T07:04:46.970

0

To convert the file to a list of strings, one per line, use:

with open('filename.txt') as f:
    list_from_file = [x.strip() for x in f.readlines()]

print(set(list_from_file))

And for a tab-separated file with five columns, try:

with open('file1') as f:
    col1, col2, col3, col4, col5 = zip(
        *(y.split('\t') for y in (x.strip() for x in f.readlines())))

Then you can use set() on the desired columns.
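
As a rough illustration of that last step, the sketch below applies `set()` to the third column. The in-memory data is made up and stands in for 'file1', and the choice of `col3` as the gene column is an assumption based on the question.

```python
import io

# Made-up five-column, tab-separated data standing in for 'file1'
f = io.StringIO(
    "1\t100\tTET2\tA\tG\n"
    "1\t150\tTET2\tC\tT\n"
    "2\t200\tCUX1\tG\tA\n"
)
col1, col2, col3, col4, col5 = zip(
    *(line.rstrip("\n").split("\t") for line in f))

# Sets are unordered, so sort to get a stable output
unique_col3 = sorted(set(col3))
for gene in unique_col3:
    print(gene)
```

Note that the five-name unpacking raises a ValueError if any row has fewer than five fields.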

edited May 11 '17 at 07:04

answered May 11 '17 at 06:39

Stephen Rauch


Raunch - This is an awesome explanation so as to parse the data row by row. So how can go about it for parsing it column by column? – Srk May 11 '17 at 06:52
That is why I asked earlier if the file was exactly as shown. If it is not as shown then we need to know how the columns are separated. – Stephen Rauch May 11 '17 at 06:54
Raunch - As I have mentioned in the question, the file consists of 5 columns and lakhs of rows, the mentioned snippet of the file is just one column and what I want is the unique elements from that column only and the output should be only those unique elements. – Srk May 11 '17 at 07:02
Maybe you can post the first 10 rows of the file, say which delimiter it uses, and list in bullet points what the file looks like and what you want to do. @Srk – Phung Duy Phong May 11 '17 at 07:08
@PhongPhung : I am sorry to be very unclear with the question, but I will edit the question and post the first ten lines of the file. Thank you for guiding me to do that! – Srk May 11 '17 at 07:11
I think my answer or @StephenRauch answer could do what you need now, welcome – Phung Duy Phong May 11 '17 at 07:17

score 0 · Answer 4 · answered May 11 '17 at 07:04

I think the easiest way to do this is to use a set. You currently have a list of lists, [['CSF3R'], ['DNMT3A'], ['DNMT3A'], ...], which cannot be turned into a set directly because lists are unhashable. If you parse your text file into a single list of strings, ['CSF3R', 'DNMT3A', 'DNMT3A', ...], you can pass it to set() and the problem is solved.
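
To make the difference between the nested and the flat list concrete, here is a small sketch. The values are copied from the column in the question, and the variable names are illustrative only.

```python
# One list per line, as in the output described in the question
nested = [["CSF3R"], ["DNMT3A"], ["DNMT3A"]]

try:
    set(nested)
    nested_error = None
except TypeError as exc:
    nested_error = str(exc)   # lists are unhashable, so this branch runs
print(nested_error)

# Flatten to a single list of strings, then deduplicate
flat = [inner[0] for inner in nested]
unique = set(flat)
print(unique)

# Calling set() on a single string splits it into characters instead
chars = set("CSF3R")
print(chars)
```

The last call shows one way to end up with sets of single characters like those reported in the comments on another answer: `set()` was given one string at a time instead of a list of strings.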

You can just look at the implementations above for help. Also, if you want better help, just post the format of your text-file, so others can poke around and maybe find even better solutions.

All the best
