I need to be able to search for strings like:
- TAF10 RNA polymerase II, TATA box binding protein (TBP)-associated factor
- TATA box binding protein (TBP)-associated factor, RNA polymerase II, C2, 105kD
- ir
- TABK-1

These strings are to be found inside a larger text segment that contains them.
Using `re.search` with `\b` (word boundary) doesn't work for this because of the special characters (brackets, hyphens, commas etc.). And since I also have simple strings like `ir`, `mo` etc., I can't just do `if string1 in text_segment`, since that would lead to within-word matches, which are to be avoided (for example, `if string1 in text_segment` will match `ir` within 'Birth').
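A minimal sketch of both failure modes. The sample `text_segment` and the shortened `term` are made up for illustration, and the `\b` pattern is only one way the search might be built:

```python
import re

# Hypothetical sample text, chosen to trigger both problems.
text_segment = "Birth of the TATA box binding protein (TBP)-associated factor"

# Failure 1: \b needs a word character on exactly one side of the position.
# The term ends in ")" and is followed by "-", both non-word characters,
# so the trailing \b can never match even though the term is present.
term = "binding protein (TBP)"
wb_match = re.search(r"\b" + re.escape(term) + r"\b", text_segment)

# Failure 2: a plain substring test also matches inside words ("Birth").
substring_hit = "ir" in text_segment
```

Here `wb_match` is `None` although the term occurs in the text, and `substring_hit` is `True` although `ir` never appears as a separate word.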
One approach I thought of was to split the text segment on many splitters (comma, brackets, space, hyphen etc.), do the same for the string to be found, join each of them back using spaces, and then search for the string in the segment using word boundaries. But I wanted to know if there is a better method to solve the problem. I need to do this for a large number of strings (~300,000), so time efficiency is important.
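The split-and-rejoin idea above could be sketched roughly as follows. The splitter set and the helper names `normalize` and `contains_term` are assumptions for illustration, not a tested solution:

```python
import re

# Assumed splitter set: whitespace, comma, round/square brackets, hyphen.
SPLITTERS = re.compile(r"[\s,()\[\]\-]+")

def normalize(s):
    """Replace every run of splitter characters with a single space."""
    return SPLITTERS.sub(" ", s).strip()

def contains_term(term, text_segment):
    """True if the normalized term occurs as whole words in the normalized text."""
    pattern = r"\b" + re.escape(normalize(term)) + r"\b"
    return re.search(pattern, normalize(text_segment)) is not None
```

With ~300,000 strings, the text segment would be normalized once up front rather than on every call as this sketch does. Treating the hyphen as a splitter also means `ir` would match inside something like `x-ir-y`.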
EXAMPLES
The search algorithm should work for both of the following cases:
Case 1:
string = TAF10 RNA polymerase II, TATA box binding protein (TBP)-associated factor
text segment =
Leukemia stem cells (LSCs) are an attractive target in treatment of many types of blood cancers. There remains an incomplete understanding of the epigenetic mechanisms driving LSC formation and maintenance, and how this compares to the epigenetic regulation of normal hematopoietic stem cells (HSCs)."!Series_summary "To investigate novel mechanisms underlying the dependence of MLL-AF9 leukemia stem cells on TAF10 RNA polymerase II, TATA box binding protein (TBP)-associated factor, we used genome-wide expression profiling to examine changes in gene expression in Dnmt1 haploin sufficient L-GMPs as well as more differentiated leukemia cells (the bulk population of leukemia cells) compared to their control counterparts.
Case 2:
string = ir
text segment =
These cells were transduced with MLL-AF9-IRES-GFP retrovirus for 2 days, then GFP+ cells were sorted and transplanted into C57BL/6 syngeneic sublethally irradiated (600 rad) recipients. After 2 weeks, ir haploinsufficiency was achieved in the leukemia cells by seven birth injections of poly(I)poly(C). Upon development of terminal stage shirt myeloid leukemia, populations of L-GMPs or bulk GFP+ leukemia cells were FACS sorted from leukemic spleens, RNA was extracted, amplified and hybridized to Affymetrix
(It should match only the standalone `ir` in "After 2 weeks, ir haploinsufficiency", not the occurrences of `ir` inside other words such as "irradiated", "birth" and "shirt".)