1

I am processing a text file whose columns are separated by tabs. I want to get all the unique values of the first column.

Text Input e.g:

"a\t\xxx\t..\zzz\n
 a\t\xxx\t....\n
 b\t\xxx\t.....\n
 b\t\xxx\t.....\n
 c\t\xxx\t.....\n"

So in this case I would like to get a list: uniques=["a","b","c"]

Code:

def getData(fin):
    input = open(fin, 'r',encoding='utf-16')
    headers=input.readline().split()
    lines=input.readlines()[1:]
    uniques=[(lambda line: itertools.takewhile(lambda char: char!='\t',line))for line in lines]

Instead of the desired values, I get a list of:

<function getData.<locals>.<listcomp>.<lambda> at 0x000000000C46DB70>

I have already read the question "Python: Lambda function in List Comprehensions" and understood that you have to use parentheses to ensure the right execution order. Still, I get the same result.

Bercovici Adrian
  • IMO, not a good idea to write such a complicated thing in a list comprehension. – Tai Jan 01 '18 at 21:39

8 Answers

3

You can just use split():

def getData(fin):
    with open(fin, 'r', encoding='utf-16') as f:
        headers = f.readline().split()  # first row holds the column names
        lines = f.readlines()           # readline() already consumed the header
    uniques = [line.split('\t')[0] for line in lines]
    return uniques

Note that this will not produce unique values, it will produce every line's value. To make this unique, do:

uniques = list(set(uniques))
Chris Applegate
2

Maybe the csv module can simplify your problem:

>>> import csv
>>> with open(fin, newline='', encoding='utf-16') as csvfile:
...      spamreader = csv.reader(csvfile, delimiter='\t')
...      list(set( row[0] for row in spamreader ))
['a', 'c', 'b']
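
For a version that can be tried without a file on disk, here is a minimal sketch that feeds csv.reader from an in-memory string. The sample data, the presence of a header row, and the helper name first_column_values are assumptions for illustration; with a real file you would pass the object returned by open(fin, newline='', encoding='utf-16') instead of the StringIO.

```python
import csv
import io


def first_column_values(rows):
    """Return the set of distinct first-column values from an iterable of rows."""
    return {row[0] for row in rows if row}  # 'if row' skips empty lines


# Hypothetical sample standing in for the tab-separated file.
sample = "col1\tcol2\tcol3\na\txxx\t1\na\txxx\t2\nb\txxx\t3\nc\txxx\t4\n"

reader = csv.reader(io.StringIO(sample), delimiter='\t')
next(reader)  # skip the header row (assumed to be present)
uniques = sorted(first_column_values(reader))  # sort for a stable order
print(uniques)  # ['a', 'b', 'c']
```

Sorting is only there to make the result reproducible, since a set has no defined order.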
dani herrera
1

You can use regex:

import re
s = """
   a\txxx\t..\zzz\n
   a\txxx\t....\n
   b\txxx\t.....\n
   b\txxx\t.....\n
   c\txxx\t.....\n"
   """
new_data = re.findall(r'(?<=\n\s\s\s)[a-zA-Z]', s)  # raw string avoids invalid-escape warnings
uniques = [a for i, a in enumerate(new_data) if a not in new_data[:i]]

Output:

['a', 'b', 'c']
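
Note that the lookbehind depends on every line being indented by exactly three whitespace characters and on the first column being a single letter. A sketch of a less layout-dependent pattern, assuming unindented lines with a tab after the first column, anchors at each line start with re.MULTILINE. The sample text is made up for illustration.

```python
import re

# Hypothetical sample: the first column may be longer than one character.
text = "a\txxx\t1\nab\txxx\t2\nb\txxx\t3\nab\txxx\t4\n"

# Under re.MULTILINE, ^ matches at the start of every line;
# the group captures everything up to the first tab.
first_col = re.findall(r'^([^\t\n]*)\t', text, flags=re.MULTILINE)

uniques = sorted(set(first_col))
print(uniques)  # ['a', 'ab', 'b']
```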
Ajax1234
  • I know, thank you for pointing that out, but I felt comfortable with lambda constructs in other languages, so I thought it should be similar here too. – Bercovici Adrian Jan 01 '18 at 21:30
1

After

lines=input.readlines()[1:]         # readline() already consumed the header,
                                    # so [1:] also skips the first data row

uniques = list(set(x.split('\t')[0] for x in lines)) 

Caveat: a set is unordered, so this may return the unique values in a different order than they appear in the file.
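
If the first-seen order matters, one possible workaround is to deduplicate with dict.fromkeys instead of set, since dict keys are unique and keep insertion order (guaranteed from Python 3.7 on). This is a minimal sketch; the lines list is a hypothetical stand-in for the rows read from the file.

```python
# Hypothetical rows, deliberately not in alphabetical order.
lines = ["b\txxx\t1\n", "a\txxx\t2\n", "b\txxx\t3\n", "c\txxx\t4\n"]

first_col = (line.split('\t')[0] for line in lines)

# dict.fromkeys drops duplicates but keeps the order of first appearance.
uniques = list(dict.fromkeys(first_col))
print(uniques)  # ['b', 'a', 'c']
```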

Patrick Artner
1

Try Pandas

import pandas as pd

df = pd.read_csv(filename, sep='\t', encoding='utf-16')  # encoding as in the question's open() call
uniques = df[df.columns[0]].unique()
1

When looking for unique elements, set() is a good solution:

def getData(fin):
    with open(fin, 'r', encoding='utf-16') as infile:
        # split each line on the tab character and keep the first field
        first_cols = list(set(line.split('\t')[0] for line in infile))
    return first_cols
0

Your list comprehension needs to start with an expression that computes each value, not with a lambda that is never called. Currently your code just creates a list of lambda objects (note that the outermost parentheses enclose a lambda definition, not a call to it). You could fix it like this:

import itertools

def getData(fin):
    input = open(fin, 'r',encoding='utf-16')
    headers=input.readline().split()
    lines=input.readlines()[1:]
    # takewhile yields characters lazily, so join them back into a string
    uniques=[''.join(itertools.takewhile(lambda char: char!='\t',line)) for line in lines]

There are still a couple of bugs in this code: (1) by the time you get to readlines(), the first row will already have been removed from the input buffer, so you should probably drop the [1:]. (2) your uniques variable will have all the entries from the first column, including duplicates.

You could fix these bugs and streamline the code a little more like this:

with open(fin, 'r',encoding='utf-16') as input:
    headers=next(input).split('\t')  # Python 3: next(f) replaces f.next()
    uniques = set(line.split('\t')[0] for line in input)
    uniques = list(uniques)
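
To see why the original comprehension yields function objects, the following sketch contrasts defining a lambda per element with actually calling it. The two sample rows are hypothetical, and the lambdas are simplified to a split for brevity.

```python
import itertools

lines = ["a\txxx\n", "b\tyyy\n"]  # hypothetical sample rows

# A lambda used as the element expression is only defined, never called.
funcs = [(lambda line: line.split('\t')[0]) for line in lines]
print(callable(funcs[0]))  # True: the list holds function objects

# Calling the lambda inside the comprehension produces the values.
called = [(lambda l: l.split('\t')[0])(line) for line in lines]

# takewhile returns a lazy iterator of characters, so join it to get a string.
joined = [''.join(itertools.takewhile(lambda ch: ch != '\t', line))
          for line in lines]
print(called, joined)  # ['a', 'b'] ['a', 'b']
```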
Matthias Fripp
0

If order doesn't matter, try this approach: open the file, split each line on tabs, and keep only the first column, discarding the rest.

with open('file.txt','r') as f:
    print(set(line.split('\t')[0] for line in f))  # first tab-separated field, not just the first character

output:

{'b', 'a', 'c'}