[MacOS, Python 2.7]
I am trying to parse a .txt file and pull out the strings I need to build a tab-delimited table. I will have to do this for many files, but I'm having trouble selecting some of the strings.
The following is an input file example:
# Assembly name: ASM1844v1
# Organism name: Acinetobacter baumannii ACICU (g-proteobacteria)
# Infraspecific name: strain=ACICU
# Taxid: 405416
# BioSample: SAMN02603140
# BioProject: PRJNA17827
# Submitter: CNR - National Research Council
# Date: 2008-4-15
# Assembly type: n/a
# Release type: major
# Assembly level: Complete Genome
# Genome representation: full
# GenBank assembly accession: GCA_000018445.1
# RefSeq assembly accession: GCF_000018445.1
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession RefSeq Unit Accession Assembly-Unit name
## GCA_000018455.1 GCF_000018455.1 Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role Assigned-Molecule Assigned-Molecule-Location/Type GenBank-Accn Relationship RefSeq-Accn Assembly-Unit Sequence-Length UCSC-style-name
ANONYMOUS assembled-molecule na Chromosome CP000863.1 = NC_010611.1 Primary Assembly 3904116 na
pACICU1 assembled-molecule pACICU1 Plasmid CP000864.1 = NC_010605.1 Primary Assembly 28279 na
pACICU2 assembled-molecule pACICU2 Plasmid CP000865.1 = NC_010606.1 Primary Assembly 64366 na
My code so far looks like the following, with Headstring holding the column headers:
import re

# InFileName and OutFileName are assumed to be defined earlier (not shown)

# Write the header (tab-delimited column names)
Headstring = "GenBank_Assembly_ID\tRefSeq_Assembly_ID\tAssembly_level\tChromosome\tPlasmid\tRefseq_chromosome\tRefseq_plasmid1\tRefseq_plasmid2\tRefseq_plasmid3\tRefseq_plasmid4\tRefseq_plasmid5"

# Set up chromosome and plasmid count
ccount = 0
pcount = 0

# Look for corresponding data from each file
# (the with-block opens the input file and closes it automatically)
with open(InFileName, 'r') as searchfile:
    for line in searchfile:
        # GenBank assembly accession, e.g. GCA_000018445.1
        if re.search(r': (GCA_[\d\.]+)', line, re.M|re.I):
            GCA = re.search(r': (GCA_[\d\.]+)', line, re.M|re.I)
            print GCA.group(1)
            GCA = GCA.group(1)
        # RefSeq assembly accession, e.g. GCF_000018445.1
        if re.search(r': (GCF_[\d\.]+)', line, re.M|re.I):
            GCF = re.search(r': (GCF_[\d\.]+)', line, re.M|re.I)
            print GCF.group(1)
            GCF = GCF.group(1)
        # Assembly level, e.g. Complete Genome
        if re.search(r'level: (.+$)', line, re.M|re.I):
            assembly = re.search(r'level: (.+$)', line, re.M|re.I)
            print assembly.group(1)
            assembly = assembly.group(1)
        # Count chromosome and plasmid rows
        if "Chromosome" in line:
            ccount += 1
            print ccount
        if "Plasmid" in line:
            pcount += 1
            print pcount

# Build the data row and write header plus row to the output file
OutputString = "%s\t%s\t%s\t%s\t%s\t" % (GCA, GCF, assembly, ccount, pcount)
with open(OutFileName, 'w') as OutFile:
    OutFile.write(Headstring + '\n' + OutputString)
The main issue I'm having is that I want to extract the strings NC_010611.1, NC_010605.1, and NC_010606.1 and put tabs between them on the same line, so they end up under the headers Refseq_chromosome, Refseq_plasmid1, and Refseq_plasmid2 respectively. But I only want the script to search for these if assembly = "Chromosome" or "Complete Genome". I'm not sure how to search for a string only if this condition is true.
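To make the goal concrete, here is a rough sketch of the behaviour I'm after. It is only an illustration under some assumptions: the real file is tab-delimited (the tabs may have been lost when I pasted the example), the assembly level line always comes before the data rows, and the RefSeq accession is the seventh column. The function name refseq_accessions and the sample list are made up for this example.

```python
def refseq_accessions(lines):
    """Return (assembly_level, chromosome_accns, plasmid_accns) for one report."""
    level = None
    chromosomes = []
    plasmids = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Header comment lines: remember the assembly level when it appears.
            if line.startswith("# Assembly level:"):
                level = line.split(":", 1)[1].strip()
            continue
        # Data rows follow the header, so the level is already known here.
        if level not in ("Chromosome", "Complete Genome"):
            continue
        fields = line.split("\t")  # assumption: columns are tab-separated
        if len(fields) < 7:
            continue
        # Column 4 is the molecule type, column 7 the RefSeq accession.
        if fields[3] == "Chromosome":
            chromosomes.append(fields[6])
        elif fields[3] == "Plasmid":
            plasmids.append(fields[6])
    return level, chromosomes, plasmids


# Hypothetical sample built from the example file above, with explicit tabs.
sample = [
    "# Assembly level: Complete Genome\n",
    "ANONYMOUS\tassembled-molecule\tna\tChromosome\tCP000863.1\t=\tNC_010611.1\tPrimary Assembly\t3904116\tna\n",
    "pACICU1\tassembled-molecule\tpACICU1\tPlasmid\tCP000864.1\t=\tNC_010605.1\tPrimary Assembly\t28279\tna\n",
    "pACICU2\tassembled-molecule\tpACICU2\tPlasmid\tCP000865.1\t=\tNC_010606.1\tPrimary Assembly\t64366\tna\n",
]

level, chroms, plasmids = refseq_accessions(sample)
# Chromosome accession first, then plasmids, joined with tabs for the output row.
row = "\t".join(chroms + plasmids)
```

The idea is that the condition is checked once per data row, and rows are skipped entirely when the assembly level is anything else.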
I know the regular expression for getting these strings could be =\t(\w+..), but that's as far as I got.
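As a quick check of that pattern, here is a small sketch run against one data row, again assuming the columns are separated by tabs. It also tries a slightly stricter variant that escapes the dot and allows a version number longer than one digit. The variable names and the accession NC_000001.10 are hypothetical and only there to show the difference.

```python
import re

# One data row from the example file, with explicit tabs (assumed delimiter).
line = ("pACICU1\tassembled-molecule\tpACICU1\tPlasmid\tCP000864.1\t=\t"
        "NC_010605.1\tPrimary Assembly\t28279\tna")

# Pattern from the question: "=", a tab, word characters, then any two characters.
loose_pattern = r'=\t(\w+..)'
# Stricter variant: a literal dot followed by one or more digits.
strict_pattern = r'=\t(\w+\.\d+)'

loose = re.search(loose_pattern, line).group(1)
strict = re.search(strict_pattern, line).group(1)

# Hypothetical accession with a two-digit version, to compare the two patterns.
other = "CP000000.1\t=\tNC_000001.10\tPrimary Assembly"
loose_other = re.search(loose_pattern, other).group(1)
strict_other = re.search(strict_pattern, other).group(1)
```

Both patterns return NC_010605.1 for the example row. On the two-digit version, the original pattern stops after the first digit, because `..` matches exactly two characters, while the stricter one keeps the whole version.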
I'm very new to Python, so explanations would be great.