I am new to HTTP and Regex. I have a piece of code which I have ported to Delphi which works partially. The exception 'lookbehind not of fixed length' is raised on a particular statement:

'(?<=image\\?c=)[^\"]+'

The statement is there to extract an image link from an HTML form. After some research here and on the web, I have come to understand that the '+' at the end causes this in some implementations of Regex. What I couldn't find is how to change it to work in Delphi's implementation. As the code works in C#, can somebody help and explain?

Umair Ahmed
  • Why don't you use an html parser? – David Heffernan Feb 15 '14 at 11:36
  • @DavidHeffernan do you have any suggestions? I have not heard of any before. It's my 3rd day with html. – Umair Ahmed Feb 15 '14 at 11:51
  • http://stackoverflow.com/questions/2733972/best-lightweight-html-parser-for-delphi – David Heffernan Feb 15 '14 at 11:56
  • The error will be because of the optional backslash, the `+` is fine because it's not in the look behind. – OGHaza Feb 15 '14 at 12:02
  • @OGHaza Can you give a suitable replacement and/or additional filter for that or explain what is going on? – Umair Ahmed Feb 15 '14 at 12:14
  • Umair, I assume this is meant to match `image?c=...` right? What happens if you change the double backslash in the middle to just a single backslash? (because the error sounds like you only need one to escape the question mark - though I have almost never written regex in delphi so I can't say for sure) – OGHaza Feb 15 '14 at 12:22
  • @OGHaza I really don't know C# either, this was piled on me after the original developer said: "I don't have time, take the source code and fix it yourself!". The expression is a direct conversion from C#, conversion being adding '' instead of '. – Umair Ahmed Feb 15 '14 at 12:26
  • @OGHaza yes you have correctly interpreted the purpose, I'll try it now. – Umair Ahmed Feb 15 '14 at 12:26
  • Umair, David's link above to an HTML parser is well worth checking out. Also, [this famous question about using regular expressions to parse HTML may amuse you.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) If you want to really fix your co-worker's code, fix it to parse HTML using an HTML parser instead of regexes. – David Feb 15 '14 at 13:00
  • @DavidM Yes, I intend to do just that, it appears to be a lot better than scavenging parts of HTML script and manually cutting it down to the required bits. But for the immediate future, I need this piece up and running as we are losing money, minute by minute. – Umair Ahmed Feb 15 '14 at 13:05
  • Whoever wrote the C# should learn about verbatim strings – David Heffernan Feb 15 '14 at 14:54

1 Answer

The lookbehind section doesn't have fixed length. That has nothing to do with the + at the end. The lookbehind portion is (?<=image\\?c=). You copied that from C#. In C#, the regex wants to look for a literal question mark. That's a special character in regex, so it needs a backslash in front of it. Backslash is special in C# strings, though, so that backslash needs another backslash, all just to represent a single question mark.

In Delphi strings, backslashes aren't special, so the two of them are treated as a literal backslash to search for in the regex. The question mark isn't escaped, so the Delphi regex treats it as an instruction to make the literal backslash optional. The optional character makes the lookbehind have variable length.
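
The variable-length problem can be reproduced outside Delphi. The sketch below uses Python purely as a stand-in, on the assumption that the fixed-width lookbehind rule in its `re` module is close enough to the one Delphi's regex engine applies here; the wording of the error differs from Delphi's. The raw string hands the regex engine the same characters that the Delphi literal does. The names `PORTED_PATTERN` and `compile_error` are made up for this illustration.

```python
import re

# Same characters the Delphi literal hands to the regex engine:
# two backslashes followed by an unescaped question mark.
PORTED_PATTERN = r'(?<=image\\?c=)[^\"]+'


def compile_error(pattern):
    """Return the compile error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


# The optional backslash makes the lookbehind 7 or 8 characters long,
# so compilation is rejected.
error = compile_error(PORTED_PATTERN)
print(error)
```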

To solve this, simply remove one backslash, so that the Delphi string becomes '(?<=image\?c=)[^\"]+'.

You can also remove the one before the quotation mark, but it should have no effect since quotation marks aren't special in regex.
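
As a rough check of both points, here is a minimal sketch in Python, used only as a stand-in on the assumption that its `re` module treats these two patterns the way Delphi's engine does. The sample markup and the names `SAMPLE_HTML`, `FIXED`, `FIXED_PLAIN_QUOTE`, and `extract` are hypothetical.

```python
import re

# Hypothetical sample markup; the real page will differ.
SAMPLE_HTML = '<img src="image?c=abc123">'

# One backslash: the question mark is now a literal character,
# so the lookbehind has a fixed length.
FIXED = r'(?<=image\?c=)[^\"]+'
# Same pattern without the backslash before the quotation mark.
FIXED_PLAIN_QUOTE = r'(?<=image\?c=)[^"]+'


def extract(pattern, text):
    """Return the first match of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


with_escape = extract(FIXED, SAMPLE_HTML)
without_escape = extract(FIXED_PLAIN_QUOTE, SAMPLE_HTML)
print(with_escape, without_escape)
```

Both variants should extract the same text, which is why the backslash before the quotation mark is optional.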

Even if you use an HTML parser to identify the HTML element that contains this URL fragment, you may still need the right regex to recognize which HTML element is your target.
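
One way that combination might look is sketched below in Python, with the standard-library `html.parser` standing in for whichever Delphi parser you pick. The markup, the `ImageFinder` class, and the `TARGET` pattern are assumptions made for illustration, not part of the original code. Because the parser already isolates the attribute value, a plain capture group is enough and no lookbehind is needed.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup; the real form will differ.
SAMPLE_HTML = '<form><img src="logo.png"><img src="image?c=abc123"></form>'

# The regex only has to recognize the target attribute value.
TARGET = re.compile(r'image\?c=(.+)')


class ImageFinder(HTMLParser):
    """Collect the part after 'image?c=' from matching <img> src values."""

    def __init__(self):
        super().__init__()
        self.codes = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        src = dict(attrs).get('src') or ''
        match = TARGET.search(src)
        if match:
            self.codes.append(match.group(1))


finder = ImageFinder()
finder.feed(SAMPLE_HTML)
print(finder.codes)
```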

Rob Kennedy