1

When I use the following Python regex to do what is described below, I get the error "unexpected end of pattern".

Regex:

modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
(CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)

Purpose of this regex:

INPUT:

CODE876
CODE223
matchjustCODE657
CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665

Should match:

CODE876
CODE223
CODE657
CODE697

and replace occurrences with

http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743

Should Not match:

code876
testing1CODE888
testing2CODE776
example3CODE654
example2CODE098
http://replaced/CODE665

FINAL OUTPUT

http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665

EDIT and UPDATE 1

modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)

The error no longer occurs, but the regex does not match any of the patterns as needed. Is there a problem with the matching groups or with the matching itself? When I compile this regex as is, I get no match for my input.

EDIT AND UPDATE 2

import re

# Read the whole file into one string; the with-block closes the file.
with open("/Users/mymac/Desktop/regex.txt") as f:
    s = f.read()

s1 = re.sub(r'((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)',
            r'\g<1><a href="http://productcode/\g<2>">\g<2></a>', s)
print(s1)

INPUT

CODE123 CODE765 testing1CODE123 example1CODE345 http://www.coding.com/CODE333 CODE345

CODE234

CODE333

OUTPUT

<a href="http://productcode/CODE123">CODE123</a> <a href="http://productcode/CODE765">CODE765</a> testing1<a href="http://productcode/CODE123">CODE123</a> example1<a href="http://productcode/CODE345">CODE345</a> http://www.coding.com/<a href="http://productcode/CODE333">CODE333</a> <a href="http://productcode/CODE345">CODE345</a>

<a href="http://productcode/CODE234">CODE234</a>

<a href="http://productcode/CODE333">CODE333</a>

The regex works for raw input, but not for a string read from a text file.

See Input 4 and 5 for more results http://ideone.com/3w1E3

c_prog_90
  • general disclaimer about **not** [processing HTML/XHTML/XML with regular expressions](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). –  Jul 20 '11 at 20:25
  • What should it do with `code876`? `CODE8765`? – John Machin Jul 20 '11 at 22:45
  • @thinkcool: edit your question to include code876 and CODE8765 examples. Note: your pattern does not attempt to restrict the number of digits after CODE. Also as suggested, use re.VERBOSE so that you can better see for yourself what it is doing. – John Machin Jul 20 '11 at 23:17
  • @thinkcool: CODE69743 in desired output but not in input – John Machin Jul 20 '11 at 23:39
  • @thinkcool: what to do with input of CODE123XYZ? – John Machin Jul 21 '11 at 04:35
  • @John Machin Replace it with http:// productcode/CODE123XYZ – c_prog_90 Jul 21 '11 at 05:45

4 Answers

5

Your main problem is the (?-i) thingy which is wishful thinking as far as Python 2.7 and 3.2 are concerned. For more details, see below.

import re
# modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
# (CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)
# observation 1: as presented, pattern has a line break in the middle, just after (?-i)
# ob 2: rather hard to read, should use re.VERBOSE
# ob 3: not obvious whether it's a compile-time or run-time problem
# ob 4: (?i) should be at the very start of the pattern (see docs)
# ob 5: what on earth is (?-i) ... not in 2.7 docs, not in 3.2 docs
pattern = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)'
#### rx = re.compile(pattern)
# above line failed with "sre_constants.error: unexpected end of pattern"
# try without the (?-i)
pattern2 = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)'
rx = re.compile(pattern2)
# This works, now you need to work on observations 1 to 4,
# and rethink your CODE/code strategy
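As a quick check of the observations above, here is a minimal sketch. The names BROKEN, FIXED and compiles are hypothetical helpers, not part of the original code. It assumes (?i) is moved to the very start of the pattern, as observation 4 suggests; newer Python 3 releases reject a global flag placed after ^.

```python
import re

# Hypothetical names for illustration. BROKEN keeps the bare (?-i);
# FIXED drops it. Both put (?i) at the very start of the pattern.
BROKEN = r'(?i)^((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)'
FIXED = r'(?i)^((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)'

def compiles(pattern):
    """Return True if the re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print(compiles(BROKEN))   # the bare (?-i) is rejected at compile time
print(compiles(FIXED))    # compiles once (?-i) is removed

# With (?i) applied to the whole pattern, lowercase "code876" matches too,
# which is why the CODE/code strategy needs rethinking.
print(re.match(FIXED, "code876") is not None)
```

The last line shows the side effect of the global flag: the lowercase code is matched even though the question lists it under "Should Not match".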

Looks like suggestions fall on deaf ears ... Here's the pattern in re.VERBOSE format:

pattern4 = r'''
    ^
    (?i)
    (
        (?:
            (?!http://)
            (?!testing[0-9])
            (?!example[0-9])
            . #### what is this for?
        )*?
    ) ##### end of capturing group 1
    (CODE[0-9]{3}) #### not in capturing group 1
    (?!</a>)
    '''
John Machin
  • 2
    @thinkcool: This answer answers correctly the question that you asked. Is it not worth *at least* an upvote? – John Machin Jul 20 '11 at 23:19
  • I use the regex you posted with this code: `prog = re.compile(pattern4, re.VERBOSE); result = prog.match(mytext); print result`. Am I doing it the right way? I get no match for my input. – c_prog_90 Jul 20 '11 at 23:52
  • @thinkcool: the regex `pattern4` posted by me is functionally equivalent to yours, together with comments hinting why it *doesn't* work. You are invited to try to work out for yourself how to fix it. – John Machin Jul 21 '11 at 00:24
2

Okay, it looks like the problem is the (?-i), which is surprising. The purpose of the inline-modifier syntax is to let you apply modifiers to selected portions of the regex. At least, that's how it works in most flavors. In Python 2.7 and 3.2 it seems they always modify the whole regex, same as the external flags (re.I, re.M, etc.). The alternative (?i:xyz) syntax doesn't work in those versions either.
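For readers on a newer interpreter: scoped groups such as (?i:...) and (?-i:...) were added to the re module in Python 3.6, so the limitation described here applies to the older versions discussed in this thread. The sketch below assumes Python 3.6 or later; the pattern, the "item-" prefix and the helper name code_of are made up for illustration.

```python
import re

# Assumes Python 3.6+. Only the "item" prefix is case-insensitive;
# the CODE part stays case-sensitive because it is outside the group.
SCOPED = re.compile(r'(?i:item)-(CODE[0-9]{3})')

def code_of(text):
    """Return the captured product code, or None if there is no match."""
    m = SCOPED.search(text)
    return m.group(1) if m else None

print(code_of("ITEM-CODE123"))   # prefix matches in any case
print(code_of("item-code123"))   # lowercase "code" is not accepted
```

A bare (?-i) with no colon and group body is still rejected, so the original pattern would need to be rewritten in the scoped form rather than used as is.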

On a side note, I don't see any reason to use three separate lookaheads, as you did here:

(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?

Just OR them together:

(?:(?!http://|testing[0-9]|example[0-9]).)*?
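A small sketch to check that the two forms behave the same on the sample inputs from the question. The names SEPARATE, MERGED, SAMPLES and matched_code are hypothetical, and the patterns are trimmed to the part under discussion (no flags, no trailing lookahead).

```python
import re

# Three separate lookaheads versus one lookahead with alternation.
SEPARATE = re.compile(r'^((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})')
MERGED = re.compile(r'^((?:(?!http://|testing[0-9]|example[0-9]).)*?)(CODE[0-9]{3})')

SAMPLES = [
    "CODE876",
    "matchjustCODE657",
    "testing1CODE888",
    "example2CODE098",
    "http://replaced/CODE665",
]

def matched_code(rx, text):
    """Return the product code the pattern finds, or None."""
    m = rx.match(text)
    return m.group(2) if m else None

for sample in SAMPLES:
    print(sample, matched_code(SEPARATE, sample), matched_code(MERGED, sample))
```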

EDIT: We seem to have moved from the question of why the regex throws exceptions, to the question of why it doesn't work. I'm not sure I understand your requirements, but the regex and replacement string below return the results you want.

s1 = re.sub(r'^((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)', 
            r'\g<1><a href="http://productcode/\g<2>">\g<2></a>', s)

See it in action on ideone.com
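For a local check without ideone, here is a minimal sketch that applies the same pattern and replacement to each sample line from the question on its own. It assumes every input is a standalone string, since the ^ anchor only matches at the start; PATTERN, REPLACEMENT, linkify_line and SAMPLE_LINES are hypothetical names.

```python
import re

PATTERN = r'^((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)'
REPLACEMENT = r'\g<1><a href="http://productcode/\g<2>">\g<2></a>'

def linkify_line(line):
    """Wrap the first product code of a standalone line in an anchor."""
    return re.sub(PATTERN, REPLACEMENT, line)

SAMPLE_LINES = [
    "CODE876",
    "matchjustCODE657",
    "CODE69743",
    "code876",
    "testing1CODE888",
    "example2CODE098",
    "http://replaced/CODE665",
]

for line in SAMPLE_LINES:
    print(linkify_line(line))
```

Note that CODE69743 is linked as CODE697 with 43 left outside the anchor, because the pattern captures exactly three digits.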

Is that what you're after?


EDIT: We now know that the replacements are being done within a larger text, not on standalone strings. That makes the problem much more difficult, but we also know the full URLs (the ones that start with http://) only occur in already-existing anchor elements. That means we can split the regex into two alternatives: one to match complete <a>...</a> elements, and one to match our target strings.

(?s)(?:(<a\s+[^>]*>.*?</a>)|\b((?:(?!testing[0-9]|example[0-9])\w)*?)(CODE[0-9]{3}))

The trick is to use a function instead of a static string for the replacement. Whenever the regex matches an anchor element, the function will find it in group(1) and return it unchanged. Otherwise, it uses group(2) and group(3) to build a new one.
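The replacement function described above could look roughly like this. It is a sketch, not the code from the linked demo: LINKIFY, replace_code and linkify are hypothetical names, and it relies on the stated assumption that full URLs only appear inside existing anchor elements.

```python
import re

# Alternative 1 (group 1): a complete existing <a>...</a> element.
# Alternative 2 (groups 2 and 3): an allowed prefix plus the product code.
LINKIFY = re.compile(
    r'(?s)(?:(<a\s+[^>]*>.*?</a>)'
    r'|\b((?:(?!testing[0-9]|example[0-9])\w)*?)(CODE[0-9]{3}))'
)

def replace_code(match):
    """Return existing anchors unchanged; wrap bare codes in a new anchor."""
    if match.group(1) is not None:
        return match.group(1)
    prefix, code = match.group(2), match.group(3)
    return '%s<a href="http://productcode/%s">%s</a>' % (prefix, code, code)

def linkify(text):
    return LINKIFY.sub(replace_code, text)

print(linkify('CODE123 testing1CODE123 matchjustCODE657'))
print(linkify('<a href="http://www.coding.com/CODE333">CODE333</a>'))
```

A bare URL outside an anchor element is not protected by this approach, which is why the assumption about existing anchors matters.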

Here's another demo. (I know that's horrible code, but I'm too tired right now to learn a more Pythonic way.)

Alan Moore
  • When I tried this regex, it matched all strings containing CODE[0-9]{3} and replaced them with http: //productcode/CODE[0-9]{3}. Is there a problem with my matching group? – c_prog_90 Jul 21 '11 at 00:12
  • I updated my answer to include a full solution. Let me know what you think. – Alan Moore Jul 21 '11 at 01:27
  • Thanks a lot. This works for me. But since I am parsing the contents of a text file, I read the file into a string and apply this regex to it. It replaces all the occurrences of CODE[0-9]{3} with http://productcode/CODE[0-9]{3}. It does not take care of the special cases. – c_prog_90 Jul 21 '11 at 04:29
  • This doesn't handle the OP's CODE69743 case. – John Machin Jul 21 '11 at 04:33
  • @thinkcool: You say it works for you, but it does not take care of the "special cases"? What special cases?? It works for all your test cases EXCEPT where CODE is followed by more than 3 digits. – John Machin Jul 21 '11 at 04:42
  • @thinkcool: Consider editing your question to show your code, your test file, and the output. – John Machin Jul 21 '11 at 04:44
  • @John Machin Actually, I am using this regex to do find and replace in a text file. I pass the contents of the text file to this regex as a string. It replaces all the occurrences of CODE[0-9]{3} with http: //productcode/CODE[0-9]{3}. It does not take care of any of the special cases (see input and output in the question) when used with a string taken from a text file. The reason is that the text file does not contain these codes in an orderly manner. They are all jumbled, like "CODE123 CODE345 example1CODE345 http://www.code.com/CODE987". – c_prog_90 Jul 21 '11 at 04:50
  • @John Machin So in the regex given by Alan, I removed '^' and tried, but it did not work. – c_prog_90 Jul 21 '11 at 04:51
  • If the `http://` part is present, can we assume it's already in an `<a>...</a>` tag? That would make this job a lot easier--and we need all the help we can get! ;) – Alan Moore Jul 21 '11 at 08:49
  • @Alan Yes, that can be assumed :). The regex you provided worked for all raw inputs, but not when the text is passed as one entire string. What could be causing this? – c_prog_90 Jul 21 '11 at 11:15
0

The only problem I see is that you replace using the wrong capturing group.

modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)  
                       ^                                                        ^                                                        ^
                    first capturing group                                  second one                                         using the first group

Here I made the first group non-capturing as well:

^(?i)(?:(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)

See it here on Regexr

stema
  • 1
    I get the same error when I use your modified regex. Does Python use a different regex engine than RegExr? – c_prog_90 Jul 20 '11 at 22:25
  • I removed (?-i) as John suggested and I no longer get the error, but I am not able to match anything with this regex. – c_prog_90 Jul 20 '11 at 23:03
0

For complex regexes, use the re.X flag to document what you're doing and to make sure the brackets match up correctly (i.e. by using indentation to indicate the current level of nesting).

Karl Knechtel