Read a text file with tab and semicolon in python

Question

Is there a way to read a table that uses both tabs and semicolons as delimiters in Python?

The table looks like below:

chr1    match    158337    160567    .    -    .    fam=LINE;Target=RIL 356 2619;ID=RIL-map20;Order=TE;Class=Unknown;Identity=93.9881;Name=chr1_RIL-Map20

I'd have thought so! What have you tried, and what exactly is the problem with it? — jonrsharpe, Apr 20 '15 at 14:06
Aside from the CSV module, check out `split`: http://stackoverflow.com/a/7215696/1890512 — Curtis Mattoon, Apr 20 '15 at 14:09

mhawke · Accepted Answer · 2015-04-20T14:48:12.720

Use the regular expression pattern '\t|;' with re.split():

import re

s = 'chr1\tmatch\t158337\t160567\t.\t-\t.\tfam=LINE;Target=RIL 356 2619;ID=RIL-map20;Order=TE;Class=Unknown;Identity=93.9881;Name=chr1_RIL-Map20'
l = re.split('\t|;', s)

>>> l
['chr1', 'match', '158337', '160567', '.', '-', '.', 'fam=LINE', 'Target=RIL 356 2619', 'ID=RIL-map20', 'Order=TE', 'Class=Unknown', 'Identity=93.9881', 'Name=chr1_RIL-Map20']

The pattern matches a single tab or a single semi-colon (that's what the | means), and so the input string is split on either of these characters.
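Since the question is about reading a file rather than splitting a single string, here is a minimal sketch that applies the same split to every line of a file. The helper name read_table and the temporary file are hypothetical, added only so the example runs on its own; the character class [\t;] is equivalent to the '\t|;' pattern above.

```python
import os
import re
import tempfile

# Sample row from the question (tabs between columns, semicolons in the last one).
ROW = ('chr1\tmatch\t158337\t160567\t.\t-\t.\t'
       'fam=LINE;Target=RIL 356 2619;ID=RIL-map20;Order=TE;'
       'Class=Unknown;Identity=93.9881;Name=chr1_RIL-Map20')


def read_table(path):
    """Return one list of fields per non-empty line, split on tab or semicolon."""
    delimiters = re.compile(r'[\t;]')  # same effect as '\t|;'
    rows = []
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            line = line.rstrip('\n')
            if line:  # skip blank lines
                rows.append(delimiters.split(line))
    return rows


# Write the sample row to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(ROW + '\n')
    path = tmp.name

rows = read_table(path)
os.remove(path)

print(rows[0][:3])   # first three fields of the first line
print(len(rows[0]))  # number of fields on the first line
```

Compiling the pattern once avoids re-parsing it for every line, which only matters for large files.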

An alternative is to use pandas.read_csv() with sep set to the same regex pattern.
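A minimal sketch of the pandas route, assuming pandas is installed. The in-memory io.StringIO buffer stands in for a real file path, and header=None and dtype=str are assumptions about the data (no header row, every field kept as text). A regex separator is handled by pandas' Python parsing engine, so engine='python' is set explicitly here.

```python
import io

import pandas as pd

# Sample row from the question; in practice, pass a file path instead of a buffer.
text = ('chr1\tmatch\t158337\t160567\t.\t-\t.\t'
        'fam=LINE;Target=RIL 356 2619;ID=RIL-map20;Order=TE;'
        'Class=Unknown;Identity=93.9881;Name=chr1_RIL-Map20\n')

df = pd.read_csv(
    io.StringIO(text),
    sep=r'\t|;',      # split on a tab or a semicolon
    header=None,      # assumption: the file has no header row
    dtype=str,        # assumption: keep every field as a string
    engine='python',  # regex separators need the Python engine
)

print(df.shape)       # (rows, columns)
print(df.iloc[0, 7])  # first semicolon-separated attribute
```

Note that every line must split into the same number of fields for this to load cleanly into a DataFrame.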

Deacon · Answer 2 · 2015-04-24T03:05:31.333

As @mhawke pointed out, my original solution using the csv module missed the requirement to split on both \t and ;. This version handles both:

import csv
import itertools

# One input line: tab-separated columns, with a semicolon-separated last column.
data = ['chr1\tmatch\t158337\t160567\t.\t-\t.\tfam=LINE;Target=RIL 356 2619;ID=RIL-map20;Order=TE;Class=Unknown;Identity=93.9881;Name=chr1_RIL-Map20']
# First pass: split each line on tabs.
reader = csv.reader(data, delimiter='\t')
# Second pass: split every tab-separated item on semicolons, then flatten.
record = list(itertools.chain.from_iterable(
    fields
    for row in reader
    for item in row
    for fields in csv.reader([item], delimiter=';')))
print(record)
# ['chr1', 'match', '158337', '160567', '.', '-', '.', 'fam=LINE', 'Target=RIL 356 2619', 'ID=RIL-map20', 'Order=TE', 'Class=Unknown', 'Identity=93.9881', 'Name=chr1_RIL-Map20']

I like using the csv module because the solution then gets that module's functionality, such as its handling of quoted fields, for free.

Update

Having taken a moment to think about it, I rewrote it so that it no longer needs the itertools module:

import csv

# One input line: tab-separated columns, with a semicolon-separated last column.
data = ['chr1\tmatch\t158337\t160567\t.\t-\t.\tfam=LINE;Target=RIL 356 2619;ID=RIL-map20;Order=TE;Class=Unknown;Identity=93.9881;Name=chr1_RIL-Map20']
reader = csv.reader(data, delimiter='\t')
# Split on tabs, then on semicolons, flattening in a single comprehension.
record = [field
          for row in reader
          for item in row
          for fields in csv.reader([item], delimiter=';')
          for field in fields]
print(record)
# ['chr1', 'match', '158337', '160567', '.', '-', '.', 'fam=LINE', 'Target=RIL 356 2619', 'ID=RIL-map20', 'Order=TE', 'Class=Unknown', 'Identity=93.9881', 'Name=chr1_RIL-Map20']

I think that the OP needs columns delimited by both tab and semicolon on the _same_ line. This answer works for one or the other. — mhawke, Apr 20 '15 at 14:52
