I am looking to optimize the performance of a big-data parsing problem I have, using Python. In case anyone is interested: the data shown below are segments of whole-genome DNA sequence alignments for six primate species.

Currently, the best way I know how to proceed with this type of problem is to open each of my ~250 files (20-50 MB each), loop through them line by line, and extract the data I want. The formatting (shown in the examples) is fairly regular, although there are important changes at each 10-100 thousand-line segment. Looping works fine, but it is slow.

I have been using numpy recently for processing massive (>10 GB) numerical data sets, and I am really impressed at how quickly I can perform different computations on arrays. I wonder whether there are high-powered solutions for processing formatted text that circumvent tedious for-looping?
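To give a concrete idea of the kind of bulk operation I am hoping for, here is a toy numpy sketch that compares two alignment columns in one vectorized step (seg_bytes, n_cols, hom and mac are made-up placeholders, not part of my real code):

import numpy as np

# Toy stand-in for one segment's data lines (6 alignment columns per line):
seg_bytes = b"GGGGGG\nCCCCTC\n"
n_cols, hom, mac = 6, 0, 4  # column count and 0-based species column indices

# View the block as a (lines x columns) array of single characters;
# the +1 accounts for the newline terminating each line.
arr = np.frombuffer(seg_bytes, dtype='S1').reshape(-1, n_cols + 1)
human, macaque = arr[:, hom], arr[:, mac]      # one species per column
mismatch_rows = np.where(human != macaque)[0]  # vectorized comparison
print(mismatch_rows)  # [1], i.e. only the second data line differs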
My files contain multiple segments with the following pattern:
<MULTI-LINE HEADER> # number of header lines mirrors number of data columns
<DATA BEGIN FLAG> # the word 'DATA'
<DATA COLUMNS> # variable number of columns
<DATA END FLAG> # the pattern '//'
<EMPTY LINE>
Example:
# key to the header fields:
# header_flag species chromosome segment_start segment_end quality_flag chromosome_data
SEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)
SEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)
SEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)
SEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)
SEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)
SEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)
DATA
GGGGGG
CCCCTC
...... # continue for 10-100 thousand lines
//
SEQ homo_sapiens 1 11345717 11361846 1 (chr_length=249250621)
SEQ pan_troglodytes 1 11474525 11490638 1 (chr_length=229974691)
SEQ gorilla_gorilla 1 11562256 11579393 1 (chr_length=229966203)
SEQ pongo_pygmaeus 1 219047970 219064053 -1 (chr_length=229942017)
DATA
CCCC
GGGG
.... # continue for 10-100 thousand lines
//
<ETC>
I will only use segments where the species homo_sapiens and macaca_mulatta are both present in the header, and where field 6, which I called the quality flag in the comments above, equals '1' for each of those species. Since macaca_mulatta does not appear in the second example, I would ignore that segment completely.
I care about the segment_start and segment_end coordinates for homo_sapiens only, so in segments where homo_sapiens is present I will record these fields and use them as keys to a dict(). segment_start also gives me the first positional coordinate for homo_sapiens, which increases strictly by 1 for each line of data in the current segment.
I want to compare the letters (DNA bases) for homo_sapiens and macaca_mulatta. The header lines where homo_sapiens and macaca_mulatta appear (i.e. lines 1 and 5 in the first example) correspond to the columns of data representing their respective sequences. Importantly, these columns are not always the same, so I need to check the header to get the correct indices for each segment, and to check that both species are even in the current segment.
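To make the header logic concrete, the per-segment check I have in mind looks roughly like this (segment_indices and its return convention are invented for illustration):

def segment_indices(header_lines):
    """Return (hom_idx, mac_idx, hom_start, hom_end) for one segment, or None
    if either species is missing or its quality flag is not '1'. The i-th
    header line (0-based) corresponds to data column i."""
    found = {}
    for i, line in enumerate(header_lines):
        fields = line.split()
        if fields[1] in ('homo_sapiens', 'macaca_mulatta') and fields[5] == '1':
            found[fields[1]] = (i, int(fields[3]), int(fields[4]))
    if len(found) < 2:
        return None  # skip segments lacking either species
    hom_idx, hom_start, hom_end = found['homo_sapiens']
    return hom_idx, found['macaca_mulatta'][0], hom_start, hom_end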
Looking at the two lines of data in the first example, the relevant information for me is:
# homo_sapiens_coordinate homo_sapiens_base macaca_mulatta_base
11388669 G G
11388670 C T
For each segment containing info for homo_sapiens and macaca_mulatta, I will record the start and end for homo_sapiens from the header, plus each position where the two do NOT match, into a list. Finally, some positions have "gaps" or lower-quality data, e.g.:
aaa--A
I will only record positions where homo_sapiens and macaca_mulatta both have valid bases (which must be in the set ACGT), so the last variable I consider is a counter of valid bases per segment.
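So per segment the scan boils down to something like this sketch (score_segment and its arguments are invented names):

def score_segment(data_lines, hom_idx, mac_idx, start):
    """Count valid sites and list divergent human coordinates for one segment."""
    bases = set('ACGT')
    valid, mismatches = 0, []
    for offset, line in enumerate(data_lines):  # human coordinate advances by 1 per line
        human, macaque = line[hom_idx], line[mac_idx]
        if human in bases and macaque in bases:  # both species have a valid base
            valid += 1
            if human != macaque:  # divergent site
                mismatches.append(start + offset)
    return valid, mismatches

Applied to the first example, score_segment(['GGGGGG', 'CCCCTC'], 0, 4, 11388669) returns (2, [11388670]).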
My final data structure for a given file is a dictionary which looks like this:
{(segment_start=i, segment_end=j, valid_bases=N): list(mismatch positions),
(segment_start=k, segment_end=l, valid_bases=M): list(mismatch positions), ...}
Here is the function I have written to carry this out using a for-loop:
import gzip as gz

bases = set('ACGT')  # valid bases (referenced below; the set of acceptable nucleotides)


def human_macaque_divergence(chromosome):
    """
    A function for finding the positions of human-macaque divergent sites within segments of species alignment tracts
    :param chromosome: chromosome number (integer)
    :return div_dict: a dictionary with tuple(segment_start, segment_end, valid_bases_in_segment) for keys and list(divergent_sites) for values
    """
    ch = str(chromosome)
    div_dict = {}
    # pd.align is my own path prefix to the alignment files, defined elsewhere
    with gz.open('{al}Compara.6_primates_EPO.chr{c}_1.emf.gz'.format(al=pd.align, c=ch), 'rb') as f:
        # key to the header fields:
        # header_flag species chromosome segment_start segment_end quality_flag chromosome_info
        # SEQ homo_sapiens 1 14163 24841 1 (chr_length=249250621)
        # flags, containers, counters and indices:
        species = []
        starts = []
        ends = []
        passed = set()  # species on the desired chromosome whose quality_flag is '1'
        mismatch = []
        valid = 0
        pos = -1
        hom = None
        mac = None
        species_data = False  # a flag signalling that the lines we are viewing are alignment columns
        for line in f:
            if 'SEQ' in line:  # 'SEQ' signifies a segment info field
                assert species_data is False
                fields = line.split()
                # every header line corresponds to one data column, so always
                # record the species to keep the column indices aligned:
                species += [fields[1]]
                starts += [int(fields[3])]  # collect starts and ends
                ends += [int(fields[4])]
                if fields[2] == ch and fields[5] == '1':  # desired chromosome and quality_flag '1'
                    passed.add(fields[1])
            elif 'DATA' in line and {'homo_sapiens', 'macaca_mulatta'}.issubset(passed):
                species_data = True
                # get the column indices to scan in the data lines:
                hom = species.index('homo_sapiens')
                mac = species.index('macaca_mulatta')
                pos = starts[hom]  # first homo_sapiens positional coordinate
            elif species_data and '//' not in line:
                assert pos > 0
                # record the relevant bases:
                human = line[hom]
                macaque = line[mac]
                if {human, macaque}.issubset(bases):
                    valid += 1
                    if human != macaque:
                        mismatch += [pos]
                pos += 1
            elif '//' in line:  # '//' signifies a segment boundary
                if species_data:
                    # store segment results once the boundary has been reached:
                    div_dict[(starts[hom], ends[hom], valid)] = mismatch
                # reset flags, containers, counters and indices
                species = []
                starts = []
                ends = []
                passed = set()
                mismatch = []
                valid = 0
                pos = -1
                hom = None
                mac = None
                species_data = False
    return div_dict
This code works fine (perhaps it could use some tweaking), but my real question is whether there might be a faster way to pull this data without running the for-loop over every line. For example, loading the whole file using f.read() takes less than a second, although it creates a pretty complicated string. In principle, I assume that I could use regular expressions to parse at least some of the data, such as the header info, but I'm not sure whether this alone would increase performance without some bulk method to process each data column in each segment.
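For example, I picture something along these lines, splitting the decompressed text at segment boundaries and pulling all header fields per segment with one regex (an untested sketch; segments and the pattern are placeholders):

import re

header_re = re.compile(r'^SEQ (\S+) (\S+) (\d+) (\d+) (-?\d+)', re.MULTILINE)

def segments(raw_text):
    """Yield (header_fields, data_lines) for each segment of one file read."""
    for chunk in raw_text.split('//\n'):  # '//' marks segment boundaries
        if 'DATA\n' not in chunk:
            continue  # e.g. trailing text after the last '//'
        header, data = chunk.split('DATA\n', 1)
        yield header_re.findall(header), data.splitlines()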
Does anyone have suggestions for how I can circumvent looping through billions of lines and parse this kind of text file in a more bulk manner?
Please let me know in the comments if anything is unclear; I am happy to edit or respond directly to improve the post!