Parsing through .txt file to create tab delimited output file

Question

[MacOS, Python 2.7]

I am trying to parse through a .txt file and pull out the strings I want to create a tab-delimited table. I will have to do this for many files, but I'm having trouble selecting some strings.

The following is an input file example:

# Assembly name:  ASM1844v1
# Organism name:  Acinetobacter baumannii ACICU (g-proteobacteria)
# Infraspecific name:  strain=ACICU
# Taxid:          405416
# BioSample:      SAMN02603140
# BioProject:     PRJNA17827
# Submitter:      CNR - National Research Council
# Date:           2008-4-15
# Assembly type:  n/a
# Release type:   major
# Assembly level: Complete Genome
# Genome representation: full
# GenBank assembly accession: GCA_000018445.1
# RefSeq assembly accession: GCF_000018445.1
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession   RefSeq Unit Accession   Assembly-Unit name
## GCA_000018455.1  GCF_000018455.1 Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role   Assigned-Molecule   Assigned-Molecule-Location/Type GenBank-Accn    Relationship    RefSeq-Accn Assembly-Unit   Sequence-Length UCSC-style-name
ANONYMOUS   assembled-molecule  na  Chromosome
CP000863.1  =   NC_010611.1 Primary Assembly    3904116 na
pACICU1 assembled-molecule  pACICU1 Plasmid CP000864.1  =   NC_010605.1 Primary Assembly    28279   na
pACICU2 assembled-molecule  pACICU2 Plasmid CP000865.1  =   NC_010606.1 Primary Assembly    64366   na

And my code so far looks like the following, with headstring indicating the column headers:

# Open the input file for reading 
InFile = open(InFileName, 'r')
#f = open(InFileName, 'r')

# Write the header
Headstring= "GenBank_Assembly_ID    RefSeq_Assembly_ID  Assembly_level  Chromosome Plasmid  Refseq_chromosome   Refseq_plasmid1 Refseq_plasmid2 Refseq_plasmid3 Refseq_plasmid4 Refseq_plasmid5"

# Set up chromosome and plasmid count
ccount = 0
pcount = 0

# Look for corresponding data from each file
with open(InFileName, 'r') as searchfile:
    for line in searchfile:
        if re.search( r': (GCA_[\d\.]+)', line, re.M|re.I):
            GCA = re.search( r': (GCA_[\d\.]+)', line, re.M|re.I)
            print GCA.group(1)
            GCA = GCA.group(1)
        if re.search( r': (GCF_[\d\.]+)', line, re.M|re.I):
            GCF = re.search( r': (GCF_[\d\.]+)', line, re.M|re.I)
            print GCF.group(1)
            GCF = GCF.group(1) 
        if re.search ( r'level: (.+$)', line, re.M|re.I):
            assembly = re.search( r'level: (.+$)', line, re.M|re.I)
            print assembly.group(1)
            assembly = assembly.group(1)
        if "Chromosome" in line:
            ccount += 1
            print ccount
        if "Plasmid" in line:
            pcount += 1
            print pcount
    
        

OutputString = "%s\t%s\t%s\t%s\t%s\t" % (GCA, GCF, assembly, ccount, pcount)


OutFile=open(OutFileName, 'w')
OutFile.write(Headstring+'\n'+OutputString)


InFile.close()
OutFile.close()

The main issue I'm having is I want to extract the strings NC_010611.1, NC_010605.1, and NC_010606.1, and have tab spaces in between them on the same line so they end up under the headers Refseq_chromosome, Refseq_plasmid1, and Refseq_plasmid2 respectively. But I only want the script to search for these if assembly = "Chromosome" or "Complete Genome". I'm not sure how to search for a string only if this condition is true.

I know the regex expression for getting these strings could be =\t(\w+..), but that's as far as I got.

I'm very new to Python, so explanations would be great.

score 3 · Accepted Answer · answered May 14 '18 at 13:30

Have a look at this example:

import re

InFileName  = 'YOUR_INPUT_FILE_NAME'
OutFileName = 'YOUR_OUTPUT_FILE_NAME'

# Write the header
Headstring= "GenBank_Assembly_ID\tRefSeq_Assembly_ID\tAssembly_level\tChromosome\tPlasmid\tRefseq_chromosome\tRefseq_plasmid1\tRefseq_plasmid2\tRefseq_plasmid3\tRefseq_plasmid4\tRefseq_plasmid5"

# Look for corresponding data from each file
with open(InFileName, 'r') as InFile, open(OutFileName, 'w') as OutFile:
    chromosomes = []
    plasmids = []
    for line in InFile:
        if line.lstrip()[0] == '#':
            # Process header part of the file differently from the data part
            if re.search( r': (GCA_[\d\.]+)', line, re.M|re.I):
                GCA = re.search( r': (GCA_[\d\.]+)', line, re.M|re.I)
                print GCA.group(1)
                GCA = GCA.group(1)
            if re.search( r': (GCF_[\d\.]+)', line, re.M|re.I):
                GCF = re.search( r': (GCF_[\d\.]+)', line, re.M|re.I)
                print GCF.group(1)
                GCF = GCF.group(1)
            if re.search ( r'level: (.+$)', line, re.M|re.I):
                assembly = re.search( r'level: (.+$)', line, re.M|re.I)
                print assembly.group(1)
                assembly = assembly.group(1)
        elif assembly in ['Chromosome', 'Complete Genome']:
            # Process each data line separately
            split_line = line.split()
            Type = split_line[3]
            RefSeq_Accn = split_line[6]
            if Type == "Chromosome":
                chromosomes.append(RefSeq_Accn)
            if Type == "Plasmid":
                plasmids.append(RefSeq_Accn)

    # Merge names of up to N chromosomes
    N = 1
    cstr = ''
    for i in range(N):
        if i < len(chromosomes):
            nextChromosome = chromosomes[i]
        else:
            nextChromosome = ''
        cstr += '\t' + nextChromosome

    # Merge names of up to M plasmids
    M = 5
    pstr = ''
    for i in range(M):
        if i < len(plasmids):
            nextPlasmid = plasmids[i]
        else:
            nextPlasmid = ''
        pstr += '\t' + nextPlasmid

    OutputString = "%s\t%s\t%s\t%s\t%s" % (GCA, GCF, assembly, len(chromosomes), len(plasmids))
    OutputString += cstr
    OutputString += pstr

    OutFile.write(Headstring+'\n'+OutputString)

Input:

# Assembly name:  ASM1844v1
# Organism name:  Acinetobacter baumannii ACICU (g-proteobacteria)
# Infraspecific name:  strain=ACICU
# Taxid:          405416
# BioSample:      SAMN02603140
# BioProject:     PRJNA17827
# Submitter:      CNR - National Research Council
# Date:           2008-4-15
# Assembly type:  n/a
# Release type:   major
# Assembly level: Complete Genome
# Genome representation: full
# GenBank assembly accession: GCA_000018445.1
# RefSeq assembly accession: GCF_000018445.1
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession   RefSeq Unit Accession   Assembly-Unit name
## GCA_000018455.1  GCF_000018455.1 Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role   Assigned-Molecule   Assigned-Molecule-Location/Type GenBank-Accn     Relationship    RefSeq-Accn Assembly-Unit   Sequence-Length UCSC-style-name
ANONYMOUS   assembled-molecule  na  Chromosome CP000863.1  =   NC_010611.1 Primary Assembly    3904116 na
pACICU1 assembled-molecule  pACICU1 Plasmid CP000864.1  =   NC_010605.1 Primary Assembly    28279   na
pACICU2 assembled-molecule  pACICU2 Plasmid CP000865.1  =   NC_010606.1 Primary Assembly    64366   na

Output:

GenBank_Assembly_ID  RefSeq_Assembly_ID      Assembly_level  Chromosome  Plasmid Refseq_chromosome  Refseq_plasmid1 Refseq_plasmid2  Refseq_plasmid3 Refseq_plasmid4  Refseq_plasmid5
GCA_000018445.1      GCF_000018445.1         Complete Genome 1           2       NC_010611.1        NC_010605.1     NC_010606.1

The main differences from your script:

I use condition if line.lstrip()[0] == '#' to process the "header" lines (the lines starting with a hash character) differently from the "table rows" at the bottom (the lines actually containing data for each sequence).
I use the condition if assembly in ['Chromosome', 'Complete Genome'] - this is the condition you specified in your question
I split each table row into values like this split_line = line.split(). After that I acquire the type by Type = split_line[3] (this is the fourth column in the table data) and RefSeq_Accn = split_line[6] gives me the seventh column in the table.

Thank you, this is exactly the output I'm looking for! But I think that something is going wrong for me- my output gives me 0 chromosomes and 0 plasmids, and there is "¿" following "Complete Genome" in the output file as well. Is there a reason it may not be working for me? — D.Parker, May 14 '18 at 15:32
@D.Parker, I don't see anything that might cause your problem. I would suggest to run my code against my input and see what happens. If you are experiencing this problem for some other input, it might just have other format, not specified in your question. — Andriy Makukha, May 14 '18 at 16:08
Ah, it was a problem with the input file I was using, the format was not exactly the same, I apologize! Thanks for the help! — D.Parker, May 15 '18 at 07:45
Sorry, I have one more question- so actually in my input file, the table rows are parsed by tab spaces (versus just spaces)- I thought adding split_line = line.split('\t') would solve this problem, but it doesn't...do you have any advice on what I could try, or if I'm making a syntax error? — D.Parker, May 15 '18 at 11:10
@D.Parker, `line.split()` works for both tabs and spaces. I don't know what's going on without seeing the input that fails. You can update your question by adding the input file that causes you trouble. — Andriy Makukha, May 15 '18 at 14:44
Ah I just had to specify the input file as "rU"- because I have a Mac I guess, so the newline character is not the same? But regardless, thank you so much! — D.Parker, May 16 '18 at 09:16

score 0 · Answer 2 · answered May 14 '18 at 12:39

You could just read all of the data into a pandas dataframe first to start with. Then you could process one column (whatever contains 'NC_010611.1') in a way that is conditional on another column. See examples here: Pandas conditional creation of a series/dataframe column.

It's probably possible to get what you want in one pass through the data but it may be easier to write and read if you do 2 passes through the data.

Parsing through .txt file to create tab delimited output file

2 Answers2