
I need to search for strings like these:

- TAF10 RNA polymerase II, TATA box binding protein (TBP)-associated factor
- TATA box binding protein (TBP)-associated factor, RNA polymerase II, C2, 105kD
- ir
- TABK-1

inside a larger text segment that contains them.

Using `re.search` with `\b` (word boundary) doesn't work for this because of special characters such as brackets, hyphens, and commas. And because I also have short strings like ir and mo, I can't just do `if string1 in text_segment`, since that would produce within-word matches, which must be avoided (for example, `if string1 in text_segment` would match ir inside 'Birth').

One approach I thought of is to split the text segment on many separators (comma, brackets, space, hyphen, etc.), do the same for the string to be found, join each of them back together with spaces, and then search for the string in the segment using word boundaries. Is there a better way to solve this? I need to do it for a large number of strings (~300,000), so time efficiency is important.
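The split-and-rejoin idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `normalize` and `contains_phrase` are hypothetical helper names, every run of non-word characters is treated as a separator, and a space-padded substring test stands in for `\b` (the two behave the same once both sides contain only word characters and single spaces).

```python
import re

# Assumption: any run of characters outside [A-Za-z0-9_] counts as a separator.
_SEPARATORS = re.compile(r"\W+")

def normalize(s):
    """Replace each run of separators with one space and trim the ends."""
    return _SEPARATORS.sub(" ", s).strip()

def contains_phrase(phrase, text):
    """Return True if the normalized phrase occurs as whole tokens in text."""
    # Padding with spaces means a plain substring test cannot match inside a word.
    return f" {normalize(phrase)} " in f" {normalize(text)} "

print(contains_phrase("ir", "After 2 weeks, ir haploinsufficiency"))  # True
print(contains_phrase("ir", "seven birth injections"))                # False
```

One trade-off of this sketch is that punctuation is discarded, so `TABK-1` and `TABK 1` become indistinguishable after normalization.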

Examples:

The search should work for both of the following cases:

Case 1:

string = TAF10 RNA polymerase II, TATA box binding protein (TBP)-associated factor

text segment =

Leukemia stem cells (LSCs) are an attractive target in treatment of many types of blood cancers. There remains an incomplete understanding of the epigenetic mechanisms driving LSC formation and maintenance, and how this compares to the epigenetic regulation of normal hematopoietic stem cells (HSCs)."!Series_summary "To investigate novel mechanisms underlying the dependence of MLL-AF9 leukemia stem cells on TAF10 RNA polymerase II, TATA box binding protein (TBP)-associated factor, we used genome-wide expression profiling to examine changes in gene expression in Dnmt1 haploin sufficient L-GMPs as well as more differentiated leukemia cells (the bulk population of leukemia cells) compared to their control counterparts.

Case 2:

string = ir

text segment =

These cells were transduced with MLL-AF9-IRES-GFP retrovirus for 2 days, then GFP+ cells were sorted and transplanted into C57BL/6 syngeneic sublethally irradiated (600 rad) recipients. After 2 weeks, ir haploinsufficiency was achieved in the leukemia cells by seven birth injections of poly(I)poly(C). Upon development of terminal stage shirt myeloid leukemia, populations of L-GMPs or bulk GFP+ leukemia cells were FACS sorted from leukemic spleens, RNA was extracted, amplified and hybridized to Affymetrix

(It should not match the occurrences of ir that sit inside another word, such as in irradiated, birth, and shirt.)
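A minimal sketch of how both cases might be handled, based on the escaped-term-plus-lookarounds pattern suggested in the comments. `make_pattern` is a hypothetical helper name, and the text segments are shortened excerpts of the ones above.

```python
import re

def make_pattern(term):
    # re.escape neutralises (, ), - and other regex metacharacters in the term.
    # The lookarounds reject a match that touches a letter, digit or underscore.
    return re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)")

case1_term = "TAF10 RNA polymerase II, TATA box binding protein (TBP)-associated factor"
case1_text = (
    "the dependence of MLL-AF9 leukemia stem cells on TAF10 RNA polymerase II, "
    "TATA box binding protein (TBP)-associated factor, we used genome-wide "
    "expression profiling"
)

case2_text = (
    "sublethally irradiated (600 rad) recipients. After 2 weeks, ir "
    "haploinsufficiency was achieved by seven birth injections"
)

print(make_pattern(case1_term).search(case1_text) is not None)  # True
print(len(make_pattern("ir").findall(case2_text)))              # 1
```

Compiling one pattern per term, as done here, is the simplest form; whether it is fast enough for ~300,000 terms would need to be measured.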

user1993
  • Try `re.search(r'(?<!\w)(?:ir|\(mir\)|etc\.)(?!\w)', s)` – Wiktor Stribiżew May 20 '17 at 20:14
  • Before any more hints, you should explain what final result you are seeking. The solution above shows you how to forget about the problems with word boundaries (use `(?<!\w)` and `(?!\w)`). – Wiktor Stribiżew May 21 '17 at 09:35
  • @WiktorStribiżew, I am seeking to *find* strings of both kinds as mentioned in my 2 examples I recently added. You can have a look. And how is your idea better than simple `\b`? – user1993 May 21 '17 at 10:01
  • No idea because your *problem* is unclear. – Wiktor Stribiżew May 21 '17 at 10:10
  • @WiktorStribiżew, sorry if I am not clear. Here is a concrete example: in the link https://ideone.com/laS92S, the regex is not able to find the string in the text. I want to be able to find the string in the text. – user1993 May 21 '17 at 10:54
  • You just did not escape the search string. Use `re.escape` – Wiktor Stribiżew May 21 '17 at 13:49
  • @WiktorStribiżew, thanks, that worked. Could you write it as an answer so that I can accept it? Also, could you explain how this differs from a simple `(\b)string(\b)`, because that also seems to work? – user1993 May 21 '17 at 13:57
  • @WiktorStribiżew, thanks for the linked question. I thought about it and concluded that `!\w` is better than `\b` because my string can then be found even if it is flanked by things like a hyphen or a comma. Am I right? – user1993 May 21 '17 at 16:43
  • `\b` matches at positions between a word char and a non-word char, between the start of the string and a word char, and between a word char and the end of the string. So it depends on the type of the adjacent char. `(?<!\w)` simply matches a position not preceded by a word char, and `(?!\w)` matches a position not followed by a word char. It might not be what you need; word boundaries are an ambiguous subject. Still, these `(?<!\w)` / `(?!\w)` lookarounds are safer when your search phrases may begin or end with arbitrary types of chars. – Wiktor Stribiżew May 21 '17 at 17:32
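The difference discussed in the comments shows up only when a search term starts or ends with a non-word character. The sketch below uses `(TBP)` as an illustrative term (taken from the strings in the question); for terms that begin and end with a word character, such as ir, the two patterns behave the same.

```python
import re

term = "(TBP)"
text = "binding protein (TBP)-associated factor"

with_b = re.compile(r"\b" + re.escape(term) + r"\b")
with_lookarounds = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)")

# \b before "(" needs a word char on the other side; here it is a space.
print(with_b.search(text))                        # None

# The lookbehind only requires that the previous char is not a word char.
print(with_lookarounds.search(text) is not None)  # True

# Conversely, \b accepts a term glued to word chars, the lookarounds do not.
print(with_b.search("x(TBP)y") is not None)       # True
print(with_lookarounds.search("x(TBP)y"))         # None
```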
