How to get the unique elements of the first column of a text file?

Question

I am processing a text file whose columns are separated by tabs. I want to get all the unique values of the first column.

Example text input:

"a\t\xxx\t..\zzz\n
 a\t\xxx\t....\n
 b\t\xxx\t.....\n
 b\t\xxx\t.....\n
 c\t\xxx\t.....\n"

So in this case I would like to get an array: uniques=["a","b","c"]

Code:

def getData(fin):
    input = open(fin, 'r',encoding='utf-16')
    headers=input.readline().split()
    lines=input.readlines()[1:]
    uniques=[(lambda line: itertools.takewhile(lambda char: char!='\t',line))for line in lines]

Instead of the desired values, I get a list of:

<function getData.<locals>.<listcomp>.<lambda> at 0x000000000C46DB70>

I have already read the article "Python: Lambda function in List Comprehensions", and I understood that you have to use parentheses to ensure the right execution order. Still, I get the same result.

IMO, not a good idea to write such a complicated thing in a list comprehension. — Tai, Jan 01 '18 at 21:39

score 3 · Answer 1 · answered Jan 01 '18 at 21:30

You can just use split():

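def getData(fin):
    with open(fin, 'r', encoding='utf-16') as input:
        headers = input.readline().split()   # consumes the header row
        lines = input.readlines()            # remaining rows; an extra [1:] would drop the first data row
    # first tab-separated field of every line (duplicates included)
    uniques = [line.split('\t')[0] for line in lines]
    return uniques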

Note that this will not produce unique values, it will produce every line's value. To make this unique, do:

uniques = list(set(uniques))
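
Putting the two steps together, a minimal sketch might look like the one below. The function name `unique_first_column` and the `has_header` flag are hypothetical names introduced for illustration; the UTF-16 encoding and the presence of a header row are assumptions taken from the question's code.

```python
def unique_first_column(path, encoding="utf-16", has_header=True):
    """Return the distinct first-column values of a tab-separated file.

    Hypothetical helper: the name and the has_header flag are assumptions.
    """
    with open(path, "r", encoding=encoding) as f:
        if has_header:
            next(f, None)  # skip the header row, if the file has one
        # A set comprehension drops duplicates; blank lines are ignored.
        values = {line.rstrip("\n").split("\t")[0] for line in f if line.strip()}
    return list(values)  # order is not guaranteed
```

Because the result comes from a set, sort it or use an order-preserving approach if the order of the values matters.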

score 2 · Answer 2 · answered Jan 01 '18 at 21:32

Maybe the csv module can simplify your problem:

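>>> import csv
>>> # Python 3: open in text mode with newline='' (binary mode 'rb' is for Python 2)
>>> with open(fin, newline='', encoding='utf-16') as csvfile:
...     spamreader = csv.reader(csvfile, delimiter='\t')
...     list(set(row[0] for row in spamreader))
['a', 'c', 'b']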
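
As a self-contained sketch of the same idea, the snippet below reads from an in-memory stream instead of a file, skips a header row, and ignores blank rows. The function name `first_column_values` and the `has_header` flag are hypothetical, and the sample data is made up for illustration.

```python
import csv
import io

def first_column_values(text_stream, has_header=True):
    """Collect distinct first-column values from a tab-delimited text stream."""
    reader = csv.reader(text_stream, delimiter="\t")
    if has_header:
        next(reader, None)  # discard the header row
    # csv.reader yields an empty list for a blank line, so skip those rows.
    return {row[0] for row in reader if row}

sample = io.StringIO("name\tvalue\na\txxx\na\tyyy\nb\tzzz\nc\twww\n")
print(sorted(first_column_values(sample)))  # ['a', 'b', 'c']
```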

dani herrera


By the way, in Python 2.7 you may need [unicodecsv](https://github.com/jdunck/python-unicodecsv) – dani herrera Jan 01 '18 at 21:43

score 1 · Answer 3 · answered Jan 01 '18 at 21:24

You can use regex:

import re
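# Sample input: each row starts with three spaces after a newline.
s = """
   a\txxx\t..\\zzz\n
   a\txxx\t....\n
   b\txxx\t.....\n
   b\txxx\t.....\n
   c\txxx\t.....\n"
   """
# A raw string avoids invalid-escape warnings; the lookbehind anchors on
# a newline followed by three whitespace characters.
new_data = re.findall(r'(?<=\n\s\s\s)[a-zA-Z]', s)
# Keep only the first occurrence of each value, preserving order.
uniques = [a for i, a in enumerate(new_data) if a not in new_data[:i]]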

Output:

['a', 'b', 'c']

Ajax1234


I know, and thank you for pointing that out, but I felt comfortable with lambda constructs in other languages, so I thought it would be similar here too. – Bercovici Adrian Jan 01 '18 at 21:30

Patrick Artner · Answer 4 · 2018-01-01T21:40:41.017

After

lines=input.readlines()[1:]         # reads the remaining lines; the header was already
                                    # consumed, so [1:] also skips the first data row

you can use:

uniques = list(set(x.split('\t')[0] for x in lines))

Caveat: This might reorder your uniques, because a set does not keep insertion order.
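
If the original order matters, one possible workaround is to deduplicate with `dict.fromkeys` instead of `set`. This is only a sketch; the helper name `unique_in_order` and the sample lines are made up for illustration.

```python
def unique_in_order(values):
    """Drop duplicates while keeping first-seen order."""
    # dict keys are unique and, since Python 3.7, keep insertion order.
    return list(dict.fromkeys(values))

lines = ["b\txxx\n", "a\txxx\n", "b\tyyy\n", "c\tzzz\n"]
uniques = unique_in_order(line.split("\t")[0] for line in lines)
print(uniques)  # ['b', 'a', 'c']
```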

edited Jan 01 '18 at 21:40

answered Jan 01 '18 at 21:27

It is `list(set(x.split('\t')[0] for x in lines))`, without the brackets. – dani herrera Jan 01 '18 at 21:39
Thank you for your answer; it works like a charm. I was interested in itertools because I believe it lets you manipulate data generically with ease. – Bercovici Adrian Jan 01 '18 at 21:48

score 1 · Answer 5 · answered Jan 01 '18 at 21:34

Try pandas:

import pandas as pd

df = pd.read_csv(filename, sep='\t')
uniques = df[df.columns[0]].unique()

Trung Lê Hoàng


score 1 · Accepted Answer · answered Jan 01 '18 at 21:40

When looking for unique elements, set() is a good solution:

def getData(fin):
    with open(fin, 'r') as input:
        # split each line on the tab delimiter and keep the first column
        first_cols = list(set(line.split("\t")[0] for line in input))
    return first_cols

Tomasz Wiśniewski


Matthias Fripp · Answer 7 · 2018-01-02T06:29:02.473

Your list comprehension needs to start with an expression rather than a lambda. Currently your code just creates a list of lambdas (note that the outermost parentheses enclose a lambda, not an expression). You could fix it like this:

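import itertools

def getData(fin):
    input = open(fin, 'r',encoding='utf-16')
    headers=input.readline().split()
    lines=input.readlines()[1:]
    # takewhile returns a lazy iterator, so join its characters into a string
    uniques=[''.join(itertools.takewhile(lambda char: char!='\t',line)) for line in lines]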

There are still a couple of bugs in this code: (1) by the time you get to readlines(), the first row will already have been removed from the input buffer, so you should probably drop the [1:]. (2) your uniques variable will have all the entries from the first column, including duplicates.

You could fix these bugs and streamline the code a little more like this:

with open(fin, 'r',encoding='utf-16') as input:
    headers=next(input).split('\t')   # Python 3: next(input), not input.next()
    uniques = set(line.split('\t')[0] for line in input)
    uniques = list(uniques)
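
To make the difference between a list of lambdas and a list of values concrete, here is a small sketch on made-up sample lines. It also shows that `itertools.takewhile` returns a lazy iterator of characters, which has to be joined to get a string back.

```python
import itertools

lines = ["a\txxx\n", "b\tyyy\n"]  # made-up sample rows

# Each element is the lambda itself; nothing ever calls it.
funcs = [(lambda line: line.split("\t")[0]) for line in lines]
print(callable(funcs[0]))  # True

# Here the comprehension starts with an expression that is evaluated per line.
firsts = ["".join(itertools.takewhile(lambda ch: ch != "\t", line)) for line in lines]
print(firsts)  # ['a', 'b']
```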

score 0 · Answer 8 · answered Jan 02 '18 at 13:41

If order doesn't matter, try this approach.

Open the file, split each line on the tab delimiter, and keep only the first column, discarding the rest.

with open('file.txt','r') as f:
    # line.split('\t')[0] is the first column; list(line)[0] would only be the first character
    print(set([line.split('\t')[0] for line in f]))

output:

{'b', 'a', 'c'}
