Parse text with uncertain number of fields

Question

I have a file (~50,000 lines) text.txt as below, which contains some gene info from five individuals (AB, BB, CA, DD, GG). The \t in the file is a tab seperator. There are also a lot of info that are not useful in the file, and I would like to clean it up. So What I need is to extract the species name with 'transcript=' id, if they have one, and also extract the 'DD:' and 'GG:' parts.

$head text.txt

GeneA\tAB:xrbyk | jdnif | otherText\tBB:abdf | jdhfkc | otherDifferentText\tCA:bdmf | nfjvks | transcript=aaabb.1\tDD:hudnf.1 type=cds\tGG:jdubf.1 type=cds
GeneB\tBB:dfsfg | dfsfvdf | otherDifferent\tCA:zdcfsdf | xfgdfgs | transcript=sdfs.1\tDD:sdfsw.1 type=cds\tGG:fghfg.1 type=cds
GeneC\tAB:dsfsdf | xdvv | otherText1\tBB:xdsd | sdfsdf | otherDifferentText2\tDD:hudnf.1 type=cds\tGG:jdubf.1 type=cds
GeneD\tAB:dfsdf | Asda | transcript=asdasd.2\tCA:bdmf | nfjvks | transcript=aaabb.1\tDD:hudnf.1 type=cds\tGG:jdubf.1 type=cds

and I would like the output to be

GeneA\tCA:transcript=aaabb.1\tDD:hudnf.1\tGG:jdubf.1
GeneB\tCA:transcript=sdfs.1\tDD:sdfsw.1\tGG:fghfg.1
GeneC\tDD:hudnf.1\tGG:jdubf.1
GeneD\tAB:transcript=asdasd.2\tCA:transcript=aaabb.1\tDD:hudnf.1\tGG:jdubf.1

I have been searching and thinking for a whole afternoon already, and only have the idea of tearing this file apart by species with the first column Gene name, then separately modify each file, and finally merge files together according to the gene name. But as you see, each line of the file does not necessary have the same number of fields, and so I can't simply use awk to print a certain column. I'm not sure how I can tear them up by species.

I tried to mimic the use of this one How to use sed/grep to extract text between two words?, but did not come with success. I also read a bit about Python in how to split text, (as I'm starting to learn this language), but still can't figure it out. Could anyone please help? Thanks a lot!

UPDATE OF CLARIFICATION OF THE INPUT DATA: In the example showed above, the gene info of each individual is separated by tab (\t), which means that all the text after the inidividual name plus colon (e.g. AB:) belongs to the individual (e.g. "xrbyk | jdnif | otherText" for AB). Whether to keep the individual in the final output depends on whether there is the information of "transcript=" for that individual, except for DD and GG. This is why in the final output the 1st line start with CA but not with AB.

please update the question to include details on when/how you change the 'individual' for a line (eg, 1st line starts with `AB` but ends up with `CA`); I'm assuming this has something to do with the existence of a 'otherXXXX` field but it's not clear from the sample input which such fields to process — markp-fuso, Sep 21 '22 at 20:43
Sorry for the confusion. I added a bit more clarification of the data. Hope it's better now? — zzz, Sep 21 '22 at 21:33
still confused; the update doesn't explain 1) why the 1st 3 individuals are changed but the 4th is not and 2) it doesn't explain how to determine what the new individual value should be; for example: for line 1 ... how do you know to replace `AB` and how do you know to replace it with `CA`? for line 3 there is no `transcript=` entry but you still replace `AB` with `DD`; for line 4 ... has multiple `transcript=` entries but you do *not* replace `AB` — markp-fuso, Sep 21 '22 at 21:47

Hai Vu · Accepted Answer · 2022-09-21T22:50:16.800

3

This solution is a bit long, but should be easy to work with:

#!/usr/bin/env python3

# main.py

import csv
import fileinput
import re


def filter_fields(row):
    output = []
    for field_number, field in enumerate(row, 1):
        if field_number == 1:
            output.append(field)
        elif "DD:" in field or "GG:" in field:
            output.append(field.split()[0])
        elif "transcript=" in field:
            # Remove stuff from after the colon to the last space
            output.append(re.sub(r":.* ", ":", field))

    return "\t".join(output)


reader = csv.reader(fileinput.input(), delimiter="\t")
for row  in reader:
    print(filter_fields(row))

How to run it:

# Output to the screen
python3 main.py text.txt

# Output to a file
python3 main.py text.txt > out.txt

# Use as a filter
cat text.txt | python3 main.py

Notes

In this solution, each line of text is broken into a row of fields.
The function filter_fields will take each row, decide what field to to keep and reformat. It then return those fields, tab separated.
The re.sub(...) call says: Delete everything after the colon, up to the last space.

edited Sep 21 '22 at 22:50

answered Sep 21 '22 at 22:42

Hai Vu

37,849
11
66
93

Thanks! This solution is the easiest for me to understand! Thanks a lot! – zzz Sep 22 '22 at 12:35
Hello, sorry to bother you again, and hope you don't mind. I amended a bit your code main.py for my real data, by deleting the `if field_number == 1: output.append(field)`, because I don't want to keep the lines if the rows do not match the conditions. But the code still keeps all those unmatching rows as empty lines in the output. I don't really understand why. Do you mind to give me some hints? Thanks in advance – zzz Sep 23 '22 at 15:38
So, you want to print the line only if one of these conditions are true: (a) has "DD:", (b) has "GG:", or (c) has "transcript:". If none of these is true, then you don't output. Is this a correct assumption? – Hai Vu Sep 24 '22 at 02:51

lordadmira · Answer 2 · 2022-09-22T02:17:19.863

Assuming those \t in your sample text are real tabs, this Perl one liner will do it. If they're literal \t text then this needs to be tweaked a tad. Put each field you want to grab in the regex alternation after GG:.

perl -lne '@wanted = $_ =~ m{(^Gene[ABCD]|(?:transcript=|DD:|GG:)\S+)}g; print join "\t", @wanted if @wanted; ' inputfile.txt > outputfile.txt

Output:

GeneA   transcript=aaabb.1      DD:hudnf.1      GG:jdubf.1
GeneB   transcript=sdfs.1       DD:sdfsw.1      GG:fghfg.1
GeneC   DD:hudnf.1      GG:jdubf.1
GeneD   transcript=asdasd.2     transcript=aaabb.1      DD:hudnf.1      GG:jdubf.1

With regard to your clarification, this one liner will give you very fine control over what gets captured. This is more of a full parse approach.

perl -lne '($gene, @parts) = split/\t/,$_; my @wanted; foreach $p (@parts) { $p =~ m{(?|^([A-Z]{2}:).*?(transcript=\S+)|(DD:\S+)|(GG:\S+))} and push @wanted, "$1$2"; }; print join "\t", $gene, @wanted if @wanted; ' inputfile.txt > outputfile.txt

Output:

GeneA   CA:transcript=aaabb.1   DD:hudnf.1      GG:jdubf.1
GeneB   CA:transcript=sdfs.1    DD:sdfsw.1      GG:fghfg.1
GeneC   DD:hudnf.1      GG:jdubf.1
GeneD   AB:transcript=asdasd.2  CA:transcript=aaabb.1   DD:hudnf.1      GG:jdubf.1

This gets all DD and GG first components as well as any ID with a transcript= in it.

It works like this.

split out the gene and break up each tab seperated part
within each part grab the bits we care about with regex
save the bits that matched and push it into @wanted only if the regex matched
the (?| regex term causes the numbered capture variables to remain the same across alternations rather than keep going up. i.e. we only have $1 and $2 instead of $1, $2, $3, $4.
print out the line if there is anything @wanted

Bear in mind that with each "simplification" of the task you risk losing data that you actually want. You might end up with a full fledged parser if the data is complex enough.

Click here to read more about how to use Perl for pipelining. https://perldoc.perl.org/perlrun

Thanks! That looks great! exactly what I was looking for, except that the individual names are lacking before the transcript. Do you mind to amend the code a bit? Also, do you mind to explain your code? Thanks! — zzz, Sep 21 '22 at 21:36
So do you want anything preceding `transcript=`? Or only if it's certain letter codes. — lordadmira, Sep 21 '22 at 22:24
Thanks a lot for amending the code and also adding the explanation! That's really helpful! — zzz, Sep 22 '22 at 12:46

Polar Bear · Answer 3 · 2022-09-22T01:26:38.477

The task is trivial and can be solved in one line of code by defining regular expression specifying information of interest.

To test on input data substitute while <DATA> with while <> and give filename as an argument to the script.

use strict;
use warnings;
use feature 'say';

my $re = qr/(Gene.|[A-Z]{2}:(?=[^:]*transcript=)|transcript=\S+|GG:\S+|DD:\S+)/;

say join("\t",/$re/g) while <DATA>;

exit 0;

__DATA__
GeneA   AB:xrbyk | jdnif | otherText    BB:abdf | jdhfkc | otherDifferentText   CA:bdmf | nfjvks | transcript=aaabb.1   DD:hudnf.1 type=cds GG:jdubf.1 type=cds
GeneB   BB:dfsfg | dfsfvdf | otherDifferent CA:zdcfsdf | xfgdfgs | transcript=sdfs.1    DD:sdfsw.1 type=cds GG:fghfg.1 type=cds
GeneC   AB:dsfsdf | xdvv | otherText1   BB:xdsd | sdfsdf | otherDifferentText2  DD:hudnf.1 type=cds GG:jdubf.1 type=cds
GeneD   AB:dfsdf | Asda | transcript=asdasd.2   CA:bdmf | nfjvks | transcript=aaabb.1   DD:hudnf.1 type=cds GG:jdubf.1 type=cds

Output

GeneA   CA:     transcript=aaabb.1      DD:hudnf.1      GG:jdubf.1
GeneB   CA:     transcript=sdfs.1       DD:sdfsw.1      GG:fghfg.1
GeneC   DD:hudnf.1      GG:jdubf.1
GeneD   AB:     transcript=asdasd.2     CA:     transcript=aaabb.1      DD:hudnf.1      GG:jdubf.1

With a small code change the exact desired output can be achieved

use strict;
use warnings;
use feature 'say';

my $re = qr/(Gene.|[A-Z]{2}:transcript=\S+|GG:\S+|DD:\S+)/;

while( <DATA> ) {
    s/[^:]*(?=transcript=)//g;
    say join("\t",/$re/g);
}

exit 0;

__DATA__
GeneA   AB:xrbyk | jdnif | otherText    BB:abdf | jdhfkc | otherDifferentText   CA:bdmf | nfjvks | transcript=aaabb.1   DD:hudnf.1 type=cds GG:jdubf.1 type=cds
GeneB   BB:dfsfg | dfsfvdf | otherDifferent CA:zdcfsdf | xfgdfgs | transcript=sdfs.1    DD:sdfsw.1 type=cds GG:fghfg.1 type=cds
GeneC   AB:dsfsdf | xdvv | otherText1   BB:xdsd | sdfsdf | otherDifferentText2  DD:hudnf.1 type=cds GG:jdubf.1 type=cds
GeneD   AB:dfsdf | Asda | transcript=asdasd.2   CA:bdmf | nfjvks | transcript=aaabb.1   DD:hudnf.1 type=cds GG:jdubf.1 type=cds

Output

GeneA   CA:transcript=aaabb.1   DD:hudnf.1      GG:jdubf.1
GeneB   CA:transcript=sdfs.1    DD:sdfsw.1      GG:fghfg.1
GeneC   DD:hudnf.1      GG:jdubf.1
GeneD   AB:transcript=asdasd.2  CA:transcript=aaabb.1   DD:hudnf.1      GG:jdubf.1

Parse text with uncertain number of fields

3 Answers3