I am working with Python 3.6. I have a TSV file with 5 columns and more than 100k rows. I split each line
on the tab delimiter and pick out specific columns by index. The column I am working on looks like this:
CSF3R
DNMT3A
DNMT3A
DNMT3A
DNMT3A
CBLB
PDGFRA
KIT
TET2
TET2
CUX1
CUX1
CUX1
CUX1
CUX1
CUX1
CUX1
CUX1
EZH2
EZH2
RAD21
ABL1
NOTCH1
NOTCH1
ETV6
ETV6
ETV6
FLT3
FLT3
TP53
TP53
What I need to do is get all the unique elements in this column and print each of them only once. I have tried several functions such as join and set, and followed many other Stack Overflow posts, but none of them solved my problem.
Also, the data I get is of type 'str' and not a list. I tried collecting all the values into a single list, but failed at that as well. I cannot use pandas because my colleagues are not familiar with the package.
The plain procedural code I tried is:
file=open('filename.txt')
next(file)
stripped=()
pos=()
s="-"
for line in file:
    stripped=line.strip()
    pos=stripped.split("\t")
    pos[2]= [y for y in (x.strip() for x in pos[2].splitlines()) if y]
    print(pos[2])
The output is a separate list for every line, that is, each string is enclosed in its own list instead of all the strings being collected in a single list.
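To illustrate what I mean, here is a small sketch that uses made-up in-memory rows instead of my real file (the `rows` sample and the values in its other columns are hypothetical). Since the list is rebuilt inside the loop, each iteration produces its own one-element list:

```python
# Hypothetical sample rows standing in for the real TSV file (header already skipped).
rows = [
    "1\t100\tCSF3R\tA\tB\n",
    "2\t200\tDNMT3A\tA\tB\n",
    "3\t300\tDNMT3A\tA\tB\n",
]

per_line = []
for line in rows:
    pos = line.strip().split("\t")
    # The comprehension runs once per line, so it only ever wraps one gene name.
    gene_list = [y for y in (x.strip() for x in pos[2].splitlines()) if y]
    per_line.append(gene_list)
    print(gene_list)
```

Each printed value is a list holding a single string, and duplicates such as DNMT3A still appear once per row.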
From the above list, my expected output is:
CSF3R
DNMT3A
CBLB
PDGFRA
KIT
TET2
CUX1
EZH2
RAD21
ABL1
NOTCH1
ETV6
FLT3
TP53
That is, each unique element should appear only once.
To get the unique elements, do I have to collect all of these into a single list, or is there a better way to do it?
The file that I am working on.
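For reference, the kind of thing I am after might look roughly like the sketch below. It is only a sketch under assumptions: the gene name is in the third column (index 2), the first line of the file is a header, and `unique_column_values` is a name I made up for illustration. It remembers already-seen values in a set and keeps the order of first appearance in a list, using only the standard library.

```python
def unique_column_values(lines, column=2, delimiter="\t"):
    """Return the values of one column, each once, in order of first appearance."""
    seen = set()      # fast membership test for values already collected
    ordered = []      # keeps the order in which values were first seen
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) <= column:
            continue  # skip blank or short lines
        value = fields[column].strip()
        if value and value not in seen:
            seen.add(value)
            ordered.append(value)
    return ordered


# Hypothetical in-memory sample; with a real file, pass the open file object
# after skipping the header with next(file).
sample = [
    "1\t100\tCSF3R\tA\tB\n",
    "2\t200\tDNMT3A\tA\tB\n",
    "3\t300\tDNMT3A\tA\tB\n",
    "4\t400\tCBLB\tA\tB\n",
]
result = unique_column_values(sample)
for gene in result:
    print(gene)
```

Because the function reads line by line, the whole file would never need to be held in memory; only the unique values are stored.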