2

I am searching an XML file generated from Ms word for some phrases. The thing is that any phrase can be interrupted with some XML tags, that can come between words, or even inside words, as you can see in the example:

</w:rPr><w:t> To i</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:lang w:fareast="JA"/></w:rPr><w:t>ncrease knowledge of and acquired skills for implementing social policies with a view to strengthening the capacity of developing countries at the national and community level.</w:t></w:r></w:p>

So my approach to handle this problem was to simply reduce all XML tags into clusters of # characters of the same length, so that when I can find any phrase, the regex would ignore all the XML tags between each two characters.

What I need basically is the span of this phrase within the actual xml document, so I will use this span into later processing with the xml document, I cannot use clones.

This approach works remarkablly, but some phrases cause catastropic backtracking, such as the following example, so I need someone to point out where does the backtracking come from, or suggest a better solution to the problem.

================================

Here is an example:

I have this text where there are some clusters of # characters within it (which I want to keep), and the spaces are also unpredictable, such as the following:

Relationship to the #################strategic framework ################## for the period 2014-2015####################: Programme 7, Economic and Social Affairs, subprogramme 3, expected

accomplishment (c)#######

In order to match the following phrase:

Relationship to the strategic framework for the period 2014-2015: programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)

I came up with this regex to accommodate the unpredictable # and space characters:

u'R#*e#*l#*a#*t#*i#*o#*n#*s#*h#*i#*p#*\\s*#*t#*o#*\\s*#*t#*h#*e#*\\s*#*s#*t#*r#*a#*t#*e#*g#*i#*c#*\\s*#*f#*r#*a#*m#*e#*w#*o#*r#*k#*\\s*#*f#*o#*r#*\\s*#*t#*h#*e#*\\s*#*p#*e#*r#*i#*o#*d#*\\s*#*2#*0#*1#*4#*\\-#*2#*0#*1#*5#*:#*\\s*#*p#*r#*o#*g#*r#*a#*m#*m#*e#*\\s*#*7#*\\,#*\\s*#*E#*c#*o#*n#*o#*m#*i#*c#*\\s*#*a#*n#*d#*\\s*#*S#*o#*c#*i#*a#*l#*\\s*#*A#*f#*f#*a#*i#*r#*s#*\\,#*\\s*#*s#*u#*b#*p#*r#*o#*g#*r#*a#*m#*m#*e#*\\s*#*3#*\\,#*\\s*#*e#*x#*p#*e#*c#*t#*e#*d#*\\s*#*a#*c#*c#*o#*m#*p#*l#*i#*s#*h#*m#*e#*n#*t#*\\s*#*\\(#*c#*\\)'

And it works fine in all the other phrases that I want to match, but this one has a problem leading to some catastrophic backtracking, can anyone spot it?

The original text is separated with xml tags, so to make it simpler for the regex, I replaced the tags with these # clusters, here is the original text:

</w:rPr><w:t>Relationship to the </w:t></w:r><w:r><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>strategic framework </w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t> for the period 2014-2015</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>: Programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)</w:t>

mishik
  • 9,973
  • 9
  • 45
  • 67
hmghaly
  • 1,411
  • 3
  • 29
  • 47
  • 5
    Why not remove all `#` and normalize consecutive spaces into one? This way you can do a simple substring match instead of using a regex. – NPE Jun 29 '13 at 15:57
  • *“some clusters of # characters within it (which I want to keep)”* – Do you want to keep them now or not? – poke Jun 29 '13 at 15:57
  • but why are you doing this? – Anirudha Jun 29 '13 at 15:57
  • I want to keep these clusters, the integrity of the text depends on them – hmghaly Jun 29 '13 at 15:58
  • also, general question: why do you double-escape everything instead of just using a raw string? – Martin Ender Jun 29 '13 at 15:58
  • These clusters are in place of some xml tags which I want to keep, the phrases are simply split with these tags, and I want to find the phrase but keep the tags – hmghaly Jun 29 '13 at 15:59
  • @hmghaly so are they not actually `#` in your input but complex tags instead? – Martin Ender Jun 29 '13 at 16:00
  • Can you show an *actual* example then? Because there is a huge difference between a simple `#*` and something more complex that has a tag-like format. – poke Jun 29 '13 at 16:00
  • 8
    That's bad design. Very bad design. – mishik Jun 29 '13 at 16:01
  • Please all see my edit – hmghaly Jun 29 '13 at 16:04
  • 2
    I don't understand why it's OK to replace the tags with "#" blocks, but not OK to get rid of them entirely. What exactly are you doing with your regex? Are you scanning a larger document for this particular passage (which you need to know the location of), or are you testing many separate documents to find one that matches? – Blckknght Jun 29 '13 at 16:09
  • I am searching an XML document for a phrase, which may have some xml tags within it (and I am replacing these tags with clusters of # with equal character length), and I need to get the exact location/span of this phrase, because I will use it in later processing – hmghaly Jun 29 '13 at 16:13
  • @m.buettner yes, they are originally xml tags, but the clusters of # are of the same length – hmghaly Jun 29 '13 at 16:15
  • How are you replacing the XML tags with #? Doesn't this seem more like an XML traversal problem than a regex one? – le3th4x0rbot Jun 29 '13 at 16:15
  • @mishik I'd appreciate if you can suggest a better design – hmghaly Jun 29 '13 at 16:16
  • It has been suggested already. Remove the "#" or clone the string and remove "#" from the clone. – mishik Jun 29 '13 at 16:18
  • @BaileyS I don't think so, that's why I reduced the xml tags to these clusters to avoid questions about xml cannot be parsed with regex... what we have is a text separated with # characters, how can we search for a phrase within it and get its span? – hmghaly Jun 29 '13 at 16:19
  • @mishik That does not serve the purpose, I need to get the actual span of the phrase within the xml document, if I remove # even from the clone, this will not give me the correct span – hmghaly Jun 29 '13 at 16:21
  • Why would you have xml tags _inside_ words anyway? – mishik Jun 29 '13 at 16:23
  • 1
    @mishik That's what you get when you're saving ms word as xml – hmghaly Jun 29 '13 at 16:25
  • There's xml _between_ words, not _inside_ them. You should be able to just use: `Relationship[#\s]+to[#\s]+the...` – mishik Jun 29 '13 at 16:27
  • @mishik you also get xml inside words too :) To increase knowledge of and acquired skills for implementing social policies with a view to strengthening the capacity of developing countries at the national and community level. – hmghaly Jun 29 '13 at 16:28
  • Perhaps you should use a package to read a DOCX format. – dawg Jun 29 '13 at 16:31
  • @hmghaly My point is that however you add the # signs would be a great place to start, not with this goofy regex. The regex is bad, if it needs to look like that, use something else. – le3th4x0rbot Jun 29 '13 at 16:43
  • @BaileyS, you stackoverflow people are weird, when I cannot find a solution you say I want you to write the code for me, and if I do, you call it bad design and goofy regex (if I know how to do it perfectly, why would I ask?? :)) – hmghaly Jun 29 '13 at 16:50
  • 3
    STOP parsing XML with regexes. – sleeplessnerd Jun 29 '13 at 17:14
  • @sleeplessnerd who was parsing XML with regexes? – Martin Ender Jun 29 '13 at 18:03
  • @hmghaly I never said you wanted anyone to write code for you. I just looked at your question, saw a very complicated regex, and read that you are essentially trying to do something which seems like a good idea at first, but which in the long run is way more trouble than you think. I have learned this lesson at great personal expense, you are welcome to do the same. Otherwise listen to the several experienced programmers here who are giving you great advice. – le3th4x0rbot Jun 30 '13 at 06:01
  • @m.buettner This whole business about # signs replacing xml tags and regex patterns is a strange attempt to avoid being told that regex is NOT for parsing XML. – le3th4x0rbot Jun 30 '13 at 06:06
  • @hmgally Hello. I corrected and improved my answer. I also posted another answer in this thread concerning the code of *MFc*. Your question was a hard one to answer in my opinion. You'll see in my answers that things aren't as easy as they seem to be when reading the programming programm of mishik, or as you seem to think. If you ever read a minimum my answers , will you continue to pretend they are more feeble and uninteresting than the accepted pseudo-answer ? – eyquem Jul 04 '13 at 17:49
  • heretic solution: use Tkinters regexp for the match, it doesn't have this kind of issue. – schlenk Jul 04 '13 at 18:14
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1 – Elazar Jul 04 '13 at 23:11

8 Answers8

3

Since the situation is that complex - don't use regex, just go through your line symbol by symbol:

etalone = "String to find"
etalone_length = len(etalone)
counter = 0
for symbol in your_line:
    if symbol == etalone[counter]:
        counter += 1
        if counter == etalone_length:
            print("String matches")
            break
    elif symbol != " " and sybmol != "#":
        # Bad char found
        print("Does not match!")
else:  # exited 'for' before full etalone matched
    print("Does not match!")

I just figured out, that above will not, actually, work if the very first symbol we match is not the one we're looking for. How about this instead:

  1. Clone your string
  2. Remove "#" from clone
  3. Match against pattern
  4. If pattern matches - get the location of matched result
  5. By that location - find which exact occurrence of the first symbol was matched. Like if full line is a#b##ca#d#f and the line we're looking for is adf then we would start matching from the second a symbol.
  6. Find nth occurrence of symbol a in the original line. Set counter =
  7. Use above algorithm (storing as span start and counter before break as span end)
mishik
  • 9,973
  • 9
  • 45
  • 67
  • interesting, but how to get the span from this? – hmghaly Jun 29 '13 at 16:42
  • 1
    You could find the location of the match in the input string by iterating using `enumerate` and simply saving the index of the first and last matched characters. – Blckknght Jun 29 '13 at 16:55
  • @Blckknght, before that you need to know _which_ first char to match. If you need to find "abc" in the line, you should not simply take the very first "a" – mishik Jun 29 '13 at 17:05
  • Ah, that's a good point. If there is a partial match (of even just the first pattern character) then you'll run into trouble. I guess you could reset the `counter` variable, but it would still be sort of messy. – Blckknght Jun 29 '13 at 17:10
3

If I understand the problem correctly, here's a way to tackle the problem without resorting to pathological regular expressions or character-by-character parsing:

def do_it(search, text_orig, verbose = False):
    # A copy of the text without any "#" markers.
    text_clean = text_orig.replace('#', '')

    # Start position of search text in the cleaned text.
    try:               i = text_clean.index(search)
    except ValueError: return [None, None]

    # Collect the widths of the runs of markers and non-markers.
    rgx    = re.compile(r'#+|[^#]+')
    widths = [len(m.group()) for m in rgx.finditer(text_orig)]

    # From that data, we can compute the span.
    return compute_span(i, len(search), widths, text_orig[0] == '#')

And here's a fairly simple way to compute the spans from the width data. My first attempt was incorrect, as noted by eyquem. The second attempt was correct but complex. This third approach seems both simple and correct.

def compute_span(span_start, search_width, widths, is_marker):
    span_end       = span_start + search_width - 1
    to_consume     = span_start + search_width
    start_is_fixed = False

    for w in widths:
        if is_marker:
            # Shift start and end rightward.
            span_start += (0 if start_is_fixed else w)
            span_end   += w
        else:
            # Reduce amount of non-marker text we need to consume.
            # As that amount gets smaller, we'll first fix the
            # location of the span_start, and then stop.
            to_consume -= w
            if to_consume < search_width:
                start_is_fixed = True
                if to_consume <= 0: break
        # Toggle the flag.
        is_marker = not is_marker

    return [span_start, span_end]

And a bunch of tests to keep the critics at bay:

def main():
    tests = [
        #                0123456789012345678901234567890123456789
        ( [None, None], '' ),
        ( [ 0,  5],     'foobar' ),
        ( [ 0,  5],     'foobar###' ),
        ( [ 3,  8],     '###foobar' ),
        ( [ 2,  7],     '##foobar###' ),
        ( [25, 34],     'BLAH ##BLAH fo####o##ba##foo###b#ar' ),
        ( [12, 26],     'BLAH ##BLAH fo####o##ba###r## BL##AH' ),
        ( [None, None], 'jkh##jh#f' ),
        ( [ 1, 12],     '#f#oo##ba###r##' ),
        ( [ 4, 15],     'a##xf#oo##ba###r##' ),
        ( [ 4, 15],     'ax##f#oo##ba###r##' ),
        ( [ 7, 18],     'ab###xyf#oo##ba###r##' ),
        ( [ 7, 18],     'abx###yf#oo##ba###r##' ),
        ( [ 7, 18],     'abxy###f#oo##ba###r##' ),
        ( [ 8, 19],     'iji#hkh#f#oo##ba###r##' ),
        ( [ 8, 19],     'mn##pps#f#oo##ba###r##' ),
        ( [12, 23],     'mn##pab###xyf#oo##ba###r##' ),
        ( [12, 23],     'lmn#pab###xyf#oo##ba###r##' ),
        ( [ 0, 12],     'fo##o##ba###r## aaaaaBLfoob##arAH' ),
        ( [ 0, 12],     'fo#o##ba####r## aaaaaBLfoob##ar#AH' ),
        ( [ 0, 12],     'f##oo##ba###r## aaaaaBLfoob##ar' ),
        ( [ 0, 12],     'f#oo##ba####r## aaaaBL#foob##arAH' ),
        ( [ 0, 12],     'f#oo##ba####r## aaaaBL#foob##ar#AH' ),
        ( [ 0, 12],     'foo##ba#####r## aaaaBL#foob##ar' ),
        ( [ 1, 12],     '#f#oo##ba###r## aaaBL##foob##arAH' ),
        ( [ 1, 12],     '#foo##ba####r## aaaBL##foob##ar#AH' ),
        ( [ 2, 12],     '#af#oo##ba##r## aaaBL##foob##ar' ),
        ( [ 3, 13],     '##afoo##ba###r## aaaaaBLfoob##arAH' ),
        ( [ 5, 17],     'BLAHHfo##o##ba###r aaBLfoob##ar#AH' ),
        ( [ 5, 17],     'BLAH#fo##o##ba###r aaBLfoob##ar' ),
        ( [ 5, 17],     'BLA#Hfo##o##ba###r###BLfoob##ar' ),
        ( [ 5, 17],     'BLA#Hfo##o##ba###r#BL##foob##ar' ),
    ]
    for exp, t in tests:
        span = do_it('foobar', t, verbose = True)
        if exp != span:
            print '\n0123456789012345678901234567890123456789'
            print t
            print n
            print dict(got = span, exp = exp)

main()
FMc
  • 41,963
  • 13
  • 79
  • 132
  • Hello. I'm sorry to inform you that your code doesn't work at all. It's a pretty subtle algorithm, but it's too subtle: it took me a long time before light came in my mind and before I could find what corrections to do to make it operative, and you didn't succeed to write it completely OK. – eyquem Jul 04 '13 at 17:37
  • In fact, it's because I examined your code that I realized that a previous code of mine was also wrong and I corrected it. So there are two answer of me in this thread: one that contains my own code, and another in which I explained why and how your subtle code doesn't work completely well. I'm still surprised that people post codes that doesn't work. – eyquem Jul 04 '13 at 17:38
  • @eyquem Thanks for noticing the problem. I have fixed the algorithm (but I don't really like it). – FMc Jul 04 '13 at 20:36
  • I'm impressed bu your ability to rewrite a new code so rapidly. But it still has several failings. - It is more complex. I will need a long examination to understand the algorithm. - By the way, why did you write another code while your previous one, after the corrections I said, works well ? - It still manages the correspondance between a clean text and the original text in which there are a particular sign; not between a clean text and its original XML document. - It runs 65 times slower than my solution. I tested. – eyquem Jul 04 '13 at 21:18
  • 1
    @eyquem *Why did you write another code while your previous one, after the corrections I said, works well*. Because I don't think your corrections were correct. :) Perhaps I didn't understand your corrections, but when I tried to implement them, the test cases did not pass. – FMc Jul 04 '13 at 21:42
  • @eyquem ... and because I had a feeling there was a better way to do this, as shown in the new version of the code. – FMc Jul 04 '13 at 22:46
  • And you wrote a third code, wow ! I confess I am unable to follow. For three reasons. - One is that the principle of your algorithm is complex. See in my answer about your code the edit in which I explain that I even no more understand the algorithm of your very first code ! – eyquem Jul 05 '13 at 09:43
  • - The second reason is that your code don't give spans in an original XML document having tags, but in an intermediary state of the original text after having replaced the tags with **#** signs. That doesn't answer to the OP's question in my opinion, even if he seems to be satisfied with the answer of mishik which is far under yours and mine. – eyquem Jul 05 '13 at 09:44
  • And - The third reason is that my own code has a more comprehensible algorithm, it exactly answers to the OP's question, and it runs 65 times faster. -- That said, I would be very interested to know what do you refer to when saying that the test cases did not pass on your code corrected by me. – eyquem Jul 05 '13 at 09:47
1

Another simpler solution is to remove the pound keys with

your_string.replace('#', '')

And test your regex (without all the #*) against the string returned by replace.

neumino
  • 4,342
  • 1
  • 18
  • 17
  • No, I want to keep these pound keys, removing them will affect the integrity of the text that I am searching, which I need for further steps – hmghaly Jun 29 '13 at 16:01
  • 2
    `your_string.replace('#', '')` doesn't overwrite your string but create a new one so you can keep the old one for later and use a simpler regex on the new returned string. – neumino Jun 29 '13 at 16:04
1

The backtracking catastrophe may be caused because your regex contains multiple instances of the pattern #*\\s*#*: Each of these will match any block of repeated #, but it can match the same text in multiple ways. When you have several of these patterns in your regex, the numbers of possibilities multiply.

Are your searching in a larger body of text? If so, does the text contain phrases which coincide with the beginning of your search text? If so, the regex engine matches the beginning of the pattern, and on finding a mismatch, starts backtracking.

Note that the text framework ################## for is not matched by the regex f#*r#*a#*m#*e#*w#*o#*r#*k#*\\s*#*f#*o#*r because of an unmatched space characters.

Possible solutions with regexes:

1 Use possessive quantifiers instead of the standard greedy quantifiers. Unfortunately, according to this page, Python does not support possessive quantifiers.

2 Replace the pattern #*\\s*#* with (#|\\s)*, which will reduce the number of ways your regex can match a text. Note that this changed regex can match more than your original text (specifically, the suggested pattern will match the text ## ## ## which the original pattern does not match).

Christian Semrau
  • 8,913
  • 2
  • 32
  • 39
1

In a previous answer, I used re and difflib modules, and your principle of replacing every tag with a character.
But I realized that your problem can be solved using only re and without having to do substitution with an arbitrary character.

Import

import re

Data

I used tuples to be able to display data in a more readable form during execution

Note that I slightly modified the data to avoid some problems:
put only one blank between framework and for the period,
major P at Programm 7 in the two strings, etc

Norte also that I added a sequence of characters ### in phrase and xmltext (in front of the date 2014-2015) to show that my code still works in this case. Other answers don't manage this eventuality.

Phrase

tu_phrase = ('Relationship to the ',
             'strategic framework ',
             'for the period ###2014-2015',
             ': Programme 7, Economic and Social Affairs, ',
             'subprogramme 3, expected accomplishment (c)')
phrase = ''.join(tu_phrase)

XML text

tu_xmltext = ('EEEEEEE',
              '<w:rPr>',
              'AAAAAAA',
              '</w:rPr><w:t>',
              'Relationship to the ',
              '</w:t></w:r><w:r>',
              '<w:rPr><w:i/>',
              '<w:sz w:val="17"/><w:sz-cs w:val="17"/>'
              'strategic framework ',
              '</w:t></w:r><w:r wsp:rsidRPr="00EC3076">',
              '<w:sz w:val="17"/><w:sz-cs w:val="17"/>',
              '</w:rPr><w:t>',
              'for the period ###2014-2015',
              '</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr>',
              '<w:sz w:val="17"/><w:sz-cs w:val="17"/>',
              '</w:rPr><w:t>',
              ': Programme 7, Economic and Social Affairs, ',
              'subprogramme 3, expected accomplishment (c)',
              '</w:t>',
              '321354641331')
xmltext = ''.join(tu_xmltext)

The working functions

The function olding_the_new(stuvw , pat_for_sub) returns a list of triples (pmod,w,pori)
expressing correspondance of positions of common sequences
in stuvw and re.sub(pat_for_sub, stuvw).
These sequences are the ones in stuvw that aren't catched by group(1) of pat_for_sub :
- (pmod,w) describes a sequence in re.sub(pat_for_sub, stuvw)
- pmod is its position in re.sub(pat_for_sub, stuvw)
- w is its width [it's the same in re.sub(pat_for_sub, stuvw) and stuvw]
- pori is the position of this sequence in original stuvw

def olding_the_new(stuvw,pat_for_sub):
    triples = []
    pmod = 0 # pmod = position in modified stuvw,
             # that is to say in re.sub(pat_for_sub,'',stuvw)
    for mat in re.finditer('{0}|([\s\S]+?)(?={0}|\Z)'.format(pat_for_sub),
                           stuvw):
        if mat.group(1):
            triples.append((pmod,mat.end()-mat.start(),mat.start()))
            pmod += mat.end()-mat.start()
    return triples


def finding(LITTLE,BIG,pat_for_sub,
            olding_the_new=olding_the_new):
    triples = olding_the_new(BIG,'(?:%s)+' % pat_for_sub)
    modBIG = re.sub(pat_for_sub,'',BIG)
    modLITTLE = re.escape(LITTLE)
    for mat in re.finditer(modLITTLE,modBIG):
        st,nd = mat.span() # in modBIG
        sori = -1 # start original, id est in BIG
        for tr in triples:
            if st < tr[0]+tr[1] and sori<0:
                sori = tr[2] + st - tr[0] 
            if nd<=tr[0]+tr[1]:
                yield(sori, tr[2] + nd - tr[0])
                break

Execution

if __name__ == '__main__':
    print ('---------- phrase ----------\n%s\n'
           '\n------- phrase written in a readable form --------\n'
           '%s\n\n\n'
           '---------- xmltext ----------\n%s\n'
           '\n------- xmltext written in a readable form --------\n'
           '%s\n\n\n'
           %
           (phrase  , '\n'.join(tu_phrase),
            xmltext , '\n'.join(tu_xmltext))    )

    print ('*********************************************************\n'
           '********** Searching for phrase in xmltext **************\n'
           '*********************************************************')

    spans = finding(phrase,xmltext,'</?w:[^>]*>')
    if spans:
        for s,e in spans:
            print ("\nspan in string 'xmltext' :  (%d , %d)\n\n"
                   'xmltext[%d:%d] :\n%s'
                   % (s,e,s,e,xmltext[s:e]))
    else:
        print ("-::: The first string isn't in second string :::-")

RESULT

*********************************************************
********** Searching for phrase in xmltext **************
*********************************************************

span in string 'xmltext' :  (34 , 448)

xmltext[34:448] :
Relationship to the </w:t></w:r><w:r><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/>strategic framework </w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>for the period ###2014-2015</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>: Programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)

Nota Bene

My code can't detect a phrase in an XML document when the sequences of whitespaces between two words are not exactly the same in phrase and in the XML text.
I tried to obtain this possibility, but it's too much complicated.
In your example, in the XML sequence you show, there's a blank between strategic framework and the following tags, and another blank between these tags and the following for the period. In this condition, my code couldn't work (I doubt that other answers can do better on this point), then I used an xmltext without a blank in front of for the period.

On the other side, my code doesn't use a replacement character, then any character may be in the XML document and the phrase, without having any problem with a character that would be in them while used as the replacement character.

My code directly gives span in the original XML document, not in an intermediary text modified with a replacement character.

It gives all the occurences of phrase in the XML document, not only the first.

................................

With following data:

print ('\n*********************************************************\n'
       "********* Searching for 'foobar' in samples *************\n"
       '*********************************************************')

for xample in ('fo##o##ba###r## aaaaaBLfoob##arAH',
               '#fo##o##ba###r## aaaaaBLfoob##arAH',
               'BLAHHfo##o##ba###r   BLfoob##arAH',
               'BLAH#fo##o##ba###rBLUHYfoob##arAH',
               'BLA# fo##o##ba###rBLyyyfoob##ar',
               'BLA# fo##o##ba###rBLy##foob##ar',
               'kjhfqshqsk'):
    spans = list(finding('foobar',xample,'#'))
    if spans:
        print ('\n%s\n%s'
               %
               (xample,
                '\n'.join('%s  %s'
                          % (sp,xample[sp[0]:sp[1]])
                          for sp in spans))
               )
    else:
        print ("\n%s\n-::: Not found :::-" % xample)

the results are:

*********************************************************
********* Searching for 'foobar' in samples *************
*********************************************************

fo##o##ba###r## aaaaaBLfoob##arAH
(0, 13)  fo##o##ba###r
(23, 31)  foob##ar

#fo##o##ba###r## aaaaaBLfoob##arAH
(1, 14)  fo##o##ba###r
(24, 32)  foob##ar

BLAHHfo##o##ba###r   BLfoob##arAH
(5, 18)  fo##o##ba###r
(23, 31)  foob##ar

BLAH#fo##o##ba###rBLUHYfoob##arAH
(5, 18)  fo##o##ba###r
(23, 31)  foob##ar

BLA# fo##o##ba###rBLyyyfoob##ar
(5, 18)  fo##o##ba###r
(23, 31)  foob##ar

BLA# fo##o##ba###rBLy##foob##ar
(5, 18)  fo##o##ba###r
(23, 31)  foob##ar

kjhfqshqsk
-::: Not found :::-

..........................................

With the following code, I examined your question:

import urllib

sock = urllib.urlopen('http://stackoverflow.com/'
                      'questions/17381982/'
                      'python-regex-catastrophic-backtracking-where')
r =sock.read()
sock.close()

i = r.find('unpredictable, such as the following')
j = r.find('in order to match the following phrase')
k = r.find('I came up with this regex ')

print 'i == %d   j== %d' % (i,j)
print repr(r[i:j])


print
print 'j == %d   k== %d' % (j,k)
print repr(r[j:k])

The result is:

i == 10408   j== 10714
'unpredictable, such as the following:</p>\n\n<blockquote>\n  Relationship to the #################strategic framework ################## for the period 2014-2015####################: Programme 7, Economic and Social Affairs, subprogramme 3, expected\n  \n  <p>accomplishment (c)#######</p>\n</blockquote>\n\n<p>so '

j == 10714   k== 10955
'in order to match the following phrase:</p>\n\n<blockquote>\n  <p>Relationship to the strategic framework for the period 2014-2015:\n  programme 7, Economic and Social Affairs, subprogramme 3, expected\n  accomplishment (c)</p>\n</blockquote>\n\n<p>'

Note the additional \n in front of programme 7 , additional \n <p> in front of accomplishment, the difference Programme 7 and programme 7 , and the presence of two blanks between framework and for the period in the string framework ################## for the period
This may explain your difficulties with your example.

eyquem
  • 26,771
  • 7
  • 38
  • 46
  • 1
    It's pleasing effort because I like Python. And it is a simple code indeed. - But you don't seem particularly satisfied to have now a solution that gives exactly the result you asked for. – eyquem Jul 01 '13 at 21:56
  • The thing is that there was already an answer that I found to work already and I accepted it, so I really appreciate you giving such good effort to come up with another answer :) thank you – hmghaly Jul 02 '13 at 01:12
  • _"What I need basically is the span of this phrase within the actual xml document"_ I wrote my second code (the one above) with this remark of you in mind: one passes the phrase and the xmltext to **finding()** function, and one gets the span(s) in the xml document. I don't see that mishik's or anyone else's answer does the same – eyquem Jul 02 '13 at 11:42
  • My code doesn't use a replacement character like **#** and has no problem when there already are characters **#** in the XML document. What does happen in this case with other answers ?? – eyquem Jul 02 '13 at 11:48
1

The following code shows that FMc 's code doesn't work.

The line
from name_of_file import olding_the_new,finding refers to my code in my personal answer in this thread to this question.
* Give the name name_of_file to the file containing the script of my code (lying in my other answer in this thread) and it will run.
* Or if you are annoyed to copy-paste my code, just comment this line of import, and the following code will run because I put a try-except instruction that will react correctly to the absence of olding_the_new and finding

I used two ways to verify the results of FMc 's code:
-1/ comparing the span returned by his code with index of 'f' and index of 'r', as we search for phrase 'foobar' and I managed that there are no f and r other than the ones in foobar
-2/ comparing with the first span returned by my code, hence the need of the above import from name_of_file

Nota Bene

If disp = None is changed to disp == True, the execution displays intermediary results that help to understand the algorithm.

.

import re
from name_of_file import olding_the_new,finding

def main():
    # Two versions of the text: the original,
    # and one without any of the "#" markers.
    for text_orig  in ('BLAH ##BLAH fo####o##ba###r## BL##AH',
                       'jkh##jh#f',
                       '#f#oo##ba###r##',
                       'a##xf#oo##ba###r##',
                       'ax##f#oo##ba###r##',
                       'ab###xyf#oo##ba###r##',
                       'abx###yf#oo##ba###r##',
                       'abxy###f#oo##ba###r##',
                       'iji#hkh#f#oo##ba###r##',
                       'mn##pps#f#oo##ba###r##',
                       'mn##pab###xyf#oo##ba###r##',
                       'lmn#pab###xyf#oo##ba###r##',
                       'fo##o##ba###r## aaaaaBLfoob##arAH',
                       'fo#o##ba####r## aaaaaBLfoob##ar#AH',
                       'f##oo##ba###r## aaaaaBLfoob##ar',
                       'f#oo##ba####r## aaaaBL#foob##arAH',
                       'f#oo##ba####r## aaaaBL#foob##ar#AH',
                       'foo##ba#####r## aaaaBL#foob##ar',
                       '#f#oo##ba###r## aaaBL##foob##arAH',
                       '#foo##ba####r## aaaBL##foob##ar#AH',
                       '#af#oo##ba##r## aaaBL##foob##ar',
                       '##afoo##ba###r## aaaaaBLfoob##arAH',
                       'BLAHHfo##o##ba###r aaBLfoob##ar#AH',
                       'BLAH#fo##o##ba###r aaBLfoob##ar',
                       'BLA#Hfo##o##ba###r###BLfoob##ar',
                       'BLA#Hfo##o##ba###r#BL##foob##ar',
                       ):

        text_clean = text_orig.replace('#', '')
        # Collect data on the positions and widths
        # of the markers in the original text.
        rgx     = re.compile(r'#+')
        markers = [(m.start(), len(m.group()))
                   for m in rgx.finditer(text_orig)]

        # Find the location of the search phrase in the cleaned text.
        # At that point you'll have all the data you need to compute
        # the span of the phrase in the original text.
        search = 'foobar'
        try:
            i = text_clean.index(search)
            print ('text_clean == %s\n'
                   "text_clean.index('%s')==%d   len('%s') == %d\n"
                   'text_orig  == %s\n'
                   'markers  == %s'
                   % (text_clean,
                      search,i,search,len(search),
                      text_orig,
                      markers))
            S,E = compute_span(i, len(search), markers)
            print "span = (%d,%d)  %s %s     %s"\
                  % (S,E,
                     text_orig.index('f')==S,
                     text_orig.index('r')+1==E,
                     list(finding(search,text_orig,'#+')))
        except ValueError:
            print ('text_clean == %s\n'
                   "text_clean.index('%s')   ***Not found***\n"
                   'text_orig  == %s\n'
                   'markers  == %s'
                   % (text_clean,
                      search,
                      text_orig,
                      markers))
        print '--------------------------------'

.

def compute_span(start, width, markers):
    # start and width are in expurgated text
    # markers are in original text
    disp = None # if disp==True => displaying of intermediary results
    span_start = start
    if disp:
        print ('\nAt beginning in compute_span():\n'
               '  span_start==start==%d   width==%d'
               % (start,width))
    for s, w in markers: # s and w are in original text
        if disp:
            print ('\ns,w==%d,%d'
                   '   s+w-1(%d)<start(%d) %s'
                   '   s(%d)==start(%d) %s'
                   % (s,w,s+w-1,start,s+w-1<start,s,start,s==start))
        if s + w - 1 < start:
            #mwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmmwmwmwmwmwm
            # the following if-else section is justified to be used
            # only after correction of the above line to this one:
            # if s+w-1 <= start or s==start:
            #mwmwmwmwmwmwmwmwmwmwmwmwmwmwmwm
            if s + w - 1 <= start and disp:
                print '  1a) s + w - 1 (%d) <= start (%d)   marker at left'\
                      % (s+w-1, start)
            elif disp:
                print '  1b) s(%d) == start(%d)' % (s,start)
            #mwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmmwmwmwmwmwm
            # Situation: marker fully to left of our text.
            # Adjust our start points rightward.
            start      += w
            span_start += w
            if disp:
                print '  span_start == %d   start, width == %d, %d' % (span_start, start, width)
        elif start + width - 1 < s:
            if disp:
                print ('  2) start + width - 1 (%d) < s (%d)   marker at right\n'
                       '  break' % (start+width-1, s))
            # Situation: marker fully to the right of our text.
            break
        else:
            # Situation: marker interrupts our text.
            # Advance the start point for the remaining text
            # rightward, and reduce the remaining width.
            if disp:
                print "  3) In 'else': s - start == %d   marker interrupts" % (s - start)
            start += w
            width = width - (s - start)
            if disp:
                print '  span_start == %d   start, width == %d, %d' % (span_start, start, width)
    return (span_start, start + width)

.

main()

Result

>>> 
text_clean == BLAH BLAH foobar BLAH
text_clean.index('foobar')==10   len('foobar') == 6
text_orig  == BLAH ##BLAH fo####o##ba###r## BL##AH
markers  == [(5, 2), (14, 4), (19, 2), (23, 3), (27, 2), (32, 2)]
span = (12,26)  True False     [(12, 27)]
--------------------------------
text_clean == jkhjhf
text_clean.index('foobar')   ***Not found***
text_orig  == jkh##jh#f
markers  == [(3, 2), (7, 1)]
--------------------------------
text_clean == foobar
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == #f#oo##ba###r##
markers  == [(0, 1), (2, 1), (5, 2), (9, 3), (13, 2)]
span = (0,11)  False False     [(1, 13)]
--------------------------------
text_clean == axfoobar
text_clean.index('foobar')==2   len('foobar') == 6
text_orig  == a##xf#oo##ba###r##
markers  == [(1, 2), (5, 1), (8, 2), (12, 3), (16, 2)]
span = (2,16)  False True     [(4, 16)]
--------------------------------
text_clean == axfoobar
text_clean.index('foobar')==2   len('foobar') == 6
text_orig  == ax##f#oo##ba###r##
markers  == [(2, 2), (5, 1), (8, 2), (12, 3), (16, 2)]
span = (2,15)  False False     [(4, 16)]
--------------------------------
text_clean == abxyfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == ab###xyf#oo##ba###r##
markers  == [(2, 3), (8, 1), (11, 2), (15, 3), (19, 2)]
span = (4,19)  False True     [(7, 19)]
--------------------------------
text_clean == abxyfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == abx###yf#oo##ba###r##
markers  == [(3, 3), (8, 1), (11, 2), (15, 3), (19, 2)]
span = (4,18)  False False     [(7, 19)]
--------------------------------
text_clean == abxyfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == abxy###f#oo##ba###r##
markers  == [(4, 3), (8, 1), (11, 2), (15, 3), (19, 2)]
span = (4,19)  False True     [(7, 19)]
--------------------------------
text_clean == ijihkhfoobar
text_clean.index('foobar')==6   len('foobar') == 6
text_orig  == iji#hkh#f#oo##ba###r##
markers  == [(3, 1), (7, 1), (9, 1), (12, 2), (16, 3), (20, 2)]
span = (7,18)  False False     [(8, 20)]
--------------------------------
text_clean == mnppsfoobar
text_clean.index('foobar')==5   len('foobar') == 6
text_orig  == mn##pps#f#oo##ba###r##
markers  == [(2, 2), (7, 1), (9, 1), (12, 2), (16, 3), (20, 2)]
span = (7,18)  False False     [(8, 20)]
--------------------------------
text_clean == mnpabxyfoobar
text_clean.index('foobar')==7   len('foobar') == 6
text_orig  == mn##pab###xyf#oo##ba###r##
markers  == [(2, 2), (7, 3), (13, 1), (16, 2), (20, 3), (24, 2)]
span = (9,24)  False True     [(12, 24)]
--------------------------------
text_clean == lmnpabxyfoobar
text_clean.index('foobar')==8   len('foobar') == 6
text_orig  == lmn#pab###xyf#oo##ba###r##
markers  == [(3, 1), (7, 3), (13, 1), (16, 2), (20, 3), (24, 2)]
span = (9,24)  False True     [(12, 24)]
--------------------------------
text_clean == foobar aaaaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == fo##o##ba###r## aaaaaBLfoob##arAH
markers  == [(2, 2), (5, 2), (9, 3), (13, 2), (27, 2)]
span = (0,9)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == fo#o##ba####r## aaaaaBLfoob##ar#AH
markers  == [(2, 1), (4, 2), (8, 4), (13, 2), (27, 2), (31, 1)]
span = (0,7)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaaBLfoobar
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == f##oo##ba###r## aaaaaBLfoob##ar
markers  == [(1, 2), (5, 2), (9, 3), (13, 2), (27, 2)]
span = (0,11)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == f#oo##ba####r## aaaaBL#foob##arAH
markers  == [(1, 1), (4, 2), (8, 4), (13, 2), (22, 1), (27, 2)]
span = (0,8)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == f#oo##ba####r## aaaaBL#foob##ar#AH
markers  == [(1, 1), (4, 2), (8, 4), (13, 2), (22, 1), (27, 2), (31, 1)]
span = (0,8)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaBLfoobar
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == foo##ba#####r## aaaaBL#foob##ar
markers  == [(3, 2), (7, 5), (13, 2), (22, 1), (27, 2)]
span = (0,7)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == #f#oo##ba###r## aaaBL##foob##arAH
markers  == [(0, 1), (2, 1), (5, 2), (9, 3), (13, 2), (21, 2), (27, 2)]
span = (0,11)  False False     [(1, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == #foo##ba####r## aaaBL##foob##ar#AH
markers  == [(0, 1), (4, 2), (8, 4), (13, 2), (21, 2), (27, 2), (31, 1)]
span = (0,12)  False False     [(1, 13), (23, 31)]
--------------------------------
text_clean == afoobar aaaBLfoobar
text_clean.index('foobar')==1   len('foobar') == 6
text_orig  == #af#oo##ba##r## aaaBL##foob##ar
markers  == [(0, 1), (3, 1), (6, 2), (10, 2), (13, 2), (21, 2), (27, 2)]
span = (2,10)  True False     [(2, 13), (23, 31)]
--------------------------------
text_clean == afoobar aaaaaBLfoobarAH
text_clean.index('foobar')==1   len('foobar') == 6
text_orig  == ##afoo##ba###r## aaaaaBLfoob##arAH
markers  == [(0, 2), (6, 2), (10, 3), (14, 2), (28, 2)]
span = (1,14)  False True     [(3, 14), (24, 32)]
--------------------------------
text_clean == BLAHHfoobar aaBLfoobarAH
text_clean.index('foobar')==5   len('foobar') == 6
text_orig  == BLAHHfo##o##ba###r aaBLfoob##ar#AH
markers  == [(7, 2), (10, 2), (14, 3), (27, 2), (31, 1)]
span = (5,14)  True False     [(5, 18), (23, 31)]
--------------------------------
text_clean == BLAHfoobar aaBLfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == BLAH#fo##o##ba###r aaBLfoob##ar
markers  == [(4, 1), (7, 2), (10, 2), (14, 3), (27, 2)]
span = (4,16)  False False     [(5, 18), (23, 31)]
--------------------------------
text_clean == BLAHfoobarBLfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == BLA#Hfo##o##ba###r###BLfoob##ar
markers  == [(3, 1), (7, 2), (10, 2), (14, 3), (18, 3), (27, 2)]
span = (5,14)  True False     [(5, 18), (23, 31)]
--------------------------------
text_clean == BLAHfoobarBLfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == BLA#Hfo##o##ba###r#BL##foob##ar
markers  == [(3, 1), (7, 2), (10, 2), (14, 3), (18, 1), (21, 2), (27, 2)]
span = (5,14)  True False     [(5, 18), (23, 31)]
--------------------------------
>>> 

.

---------------------------------------------

The code of FMc is very subtle, I had a long hard time to understand its principle and then to be able to correct it.
I will let anybody the task to understand the algorithm. I only say the corrections required to make the code of FMc to work correctly:

.

First correction:

if s + w - 1 < start:
# must be changed to  
if s + w - 1 <= start or (s==start):

EDIT

In my initial present answer,
I had written ... or (s<=start).
That was an error of me, in fact I had the intention to write
.. or (s==start)

NOTA BENE about this EDIT:

This error had no consequence in the code corrected with the two corrections I describe here to correct the initial code of FMc (the very first one, because at present he has changed it two times).
Indeed, if you correct the code with the two corrections, you'll obtain correct results with all the 25 examples taken for text_orig, as well with ... or (s <= start) as with ... or (s==start).
So I thought that the case in which s < start is True could never happen when the first condition s+w-1 <= start is False, presumely based on the fact that w is always greater than 0 and some other reason due to the configuration of the markers and non-marker sequences.....
So I tried to find the demonstration of this impression... and I failed.
Moreover, I reached a state of mind in which I even no more understand the algorithm of FMc (the very first one before any edit he did) !!
Despite this, I let this answer as it is, and I post, at the end of this answer, the explanations trying to explain why these corrections are needed.
But I warn: the very first algorithm of FMc is very outlandish and difficult to comprehend because it does comparison of indices belonging to two different strings, one which is text_orig with markers #### and another one cleaned of all these markers.... and now I am no more convinced that may have a sense....

.

Second correction::

start += w
width = width - (s - start)
# must be changed to   
width -= (s-start) # this line MUST BE before the following one
start = s + w # because start += (s-start) + w

-------------------

I am stunned that 2 people upvoted the answer of FMc though it gives a wrong code. It means that they upvoted an answer without having tested the given code.

----------------------------------------

.

EDIT

Why must the condition if s + w - 1 < start: must be changed to this one:
if s + w - 1 <= start or (s==start): ?

Because it may happen that s + w - 1 < start should be False and s equals start together.
In this case, the execution goes to the else section and executes (in corrected code):

width -= (s - start)
start = s + w

Consequently, width doesn't change while it evidently should change when we see the sequence concerned.

This case may occur at the moment the first marker is examined, as with the following sequences:

'#f#oo##ba###r##' : s,w==0,1 , 0==s==start==0  
'ax##f#oo##ba###r##' : s,w==2,2 , 2==s==start==2    
'abxy###f#oo##ba###r##' : s,w==4,3 , 4==s==start==4  
'#f#oo##ba###r## aaaBL##foob##arAH' : s,w==0,1 , 0==s==start==0  
'BLAH#fo##o##ba###r aaBLfoob##ar' : s,w==4,1 4==s==start==4

With the following ones, it occurs for the examination of the second marker:

'iji#hkh#f#oo##ba###r##' : s,w==7,1 , 7==s==start==7  
'mn##pps#f#oo##ba###r##' : s,w==7,1 , 7==s==start==7  

It can be understood better by executing my code with disp = True setted.

When s + w - 1 <= start is verified, the fact that s may equal start isn't troublesome because the execution doesn't go to the else section, it goes to the first section in which there's only the addition of w to s and to start.
But when s + w - 1 <= start is False while s equals start, the execution goes to the else section where the execution of instruction width -= (s-start) doesn't change anything to the value of width and that's troublesome.
So the condition or (s==start) must be added to prevent this else destination, and it is needed to put it after an or to prevent this destination even when s+w-1 <= start is False, which can happen as some examples show it.

.

Concerning the fact that the instruction s+w-1 < start must be changed to s+w-1 <= start (with =),
it's because of the case w==1 corresponding to 1 character # only ,
as for the cases
mn##pps#f#oo##ba###r## (second marker)
and BLAH#fo##o##ba###r (first marker).

eyquem
  • 26,771
  • 7
  • 38
  • 46
0

Without using regex you can obtain what you want doing:

text.replace('#','').replace('  ',' ')
Saullo G. P. Castro
  • 56,802
  • 26
  • 179
  • 234
0

Depth first search with XML Parser ?

Maybe remember the position in the xml document where the text node was found, for later reverse lookup. You actual goal is still unclear.

sleeplessnerd
  • 21,853
  • 1
  • 25
  • 29