
I'm trying to use regular expressions in Delphi to extract some data from HTML.

My objective is to build a query string with the following syntax:

?namedGroup1=valueNamedGroup1&namedGroup2=valueNamedGroup2

I have an array of regex patterns:

array[0] = '<div (id="(?<id>[a-zA-Z0-9]+)"|name="(?<name>[a-zA-Z0-9]+))"';

My html:

<h1>bla bla bla</h1> <div id="home">

If I apply this regex using PHP's built-in regex functions, it returns an associative array:

RegArray[0] = '<div id="home">'
RegArray['id'] = 'home'

If I do a foreach, I can easily list the named groups and build my query string:

?id=home

The thing is, I don't know in advance whether the regex will match the named group id or the named group name, and I need to know that.

Delphi only returns a plain array:

RegArray[0] = '<div id="home">'
RegArray[1] = 'home'  // ID or NAME?

So, how do I get both the name of the named group and its value?

Here is my code:

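uses
  System.RegularExpressions;

var
  RegEx: TRegEx;
  Match: TMatch;
begin
  // Patterns[0] holds the pattern shown above ("array" is a reserved word in Delphi)
  RegEx := TRegEx.Create(Patterns[0], [roIgnoreCase, roMultiLine]);
  Match := RegEx.Match(html);
  if Match.Success then
  begin
    // get the group here
  end;
end;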

I also tried this class: http://www.regular-expressions.info/delphi.html

But without success.

Patrick Nogueira
  • I don't know if you saw this [thread and if it helps](http://stackoverflow.com/questions/7906974/named-capture-group-in-regex?rq=1) – Merlin W. Dec 25 '13 at 04:14
  • You may try http://www.yunqa.de/delphi/doku.php/products/regex/index which is not only a component but also an interactive editor that lets you tune patterns by trial and error. OTOH, if you have to parse HTML, why not use an HTML parser? That should work faster than an all-purpose regexp. – Arioch 'The Dec 25 '13 at 07:10
  • Look at [`the answer`](http://stackoverflow.com/a/1732454/960757) of a related post... For parsing HTML you should really use HTML parser, not regular expressions. – TLama Dec 25 '13 at 19:27
  • Guys, I can't use an HTML parser now, I need to use regex. The problem is much more complex than what I posted here, I just tried to simplify it. – Patrick Nogueira Dec 25 '13 at 19:41
  • You need to use regex, but you are free to use any regex library out there? Sounds strange. – Arioch 'The Dec 26 '13 at 06:50
  • Guys, I bought a component that allows me to get the named groups. Unfortunately, the regex class built into Delphi does not allow getting them. – Patrick Nogueira Jul 15 '14 at 15:51

3 Answers

2

I think you made a mistake in your pattern: look at its last two characters. The closing parenthesis and the closing quote are in the wrong order, so the pattern is clearly unbalanced! It looks like the copy-paste from PHP went wrong ;-)

  • yours: <div (id="(?<id>[a-zA-Z0-9]+)"|name="(?<name>[a-zA-Z0-9]+))"
  • mine: <div (id="(?<id>[a-zA-Z0-9]+)"|name="(?<name>[a-zA-Z0-9]+)")

DI RegExp demo

Using pcre.org engine + interactive editor from http://www.yunqa.de/delphi/doku.php/products/regex/index


> I also tried this class: http://www.regular-expressions.info/delphi.html

That page immediately shows another interactive editor that could be used to debug your RegEx program: http://www.regexbuddy.com/test.html

I wonder why you didn't try using it...


Still, I think an HTML parser would be both faster and more reliable. Consider HTML fragments like

 <!-- <p><div name="bla-bla"> ... </div></p> -->

or like

 <img src="...." alt='Press to insert <div id="123"> to you sample text' />

or like

 <DIV ID="my cool id" />
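To make these pitfalls concrete, here is a minimal console sketch that runs the corrected pattern over the three fragments. It assumes a Delphi version that ships System.RegularExpressions; the program name and the cSamples constant are made up for illustration. By my reading of the pattern, only the img fragment matches, and that match is a false positive, because the div there is just text inside an attribute value. The "my cool id" div is missed because of the spaces in its id.

```delphi
program HtmlPitfalls;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

const
  // The corrected (balanced) pattern from this answer.
  cPattern = '<div (id="(?<id>[a-zA-Z0-9]+)"|name="(?<name>[a-zA-Z0-9]+)")';
  // Illustrative constant: the three HTML fragments quoted above.
  cSamples: array[0..2] of string = (
    '<!-- <p><div name="bla-bla"> ... </div></p> -->',
    '<img src="...." alt=''Press to insert <div id="123"> to you sample text'' />',
    '<DIV ID="my cool id" />');
var
  regEx: TRegEx;
  i: Integer;
begin
  regEx := TRegEx.Create(cPattern, [roIgnoreCase, roMultiLine]);
  // Print whether the pattern finds a match in each fragment.
  for i := Low(cSamples) to High(cSamples) do
    Writeln(regEx.IsMatch(cSamples[i]), ' : ', cSamples[i]);
end.
```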

The topic starter posted his own answer below, consisting mostly of questions to me.

> The problem is not the Regex,

Just count the quotes and brackets with pen and paper, noting the order in which they are opened and the order in which they are closed. Your pattern is ( ... " ... ) .... " and that is unbalanced!

> is the Delphi.

Delphi the language has nothing to do with regexps; the libraries and components do. So that claim makes no sense. You may argue that you tested broken libraries, but not the language itself.

> My regex with PHP works fine,

That should mean that either you have a different regex pattern in PHP (you did not copy the PHP source here) or "the problem is in PHP".

Actually, we have seen neither the Delphi source nor the PHP source.

array[0] = '<div (id="(?<id>[a-zA-Z0-9]+)"|name="(?<name>[a-zA-Z0-9]+))"'; is, I think, not a valid line in either language.

So I don't think the code and patterns in your PHP program and your Delphi program match each other. Quote the real code being used.

> the thing is that DELPHI doesn't return me

  1. Again, that just does not make sense. Delphi is just a language; it does not know a thing about regex.
  2. Just above you saw the screenshot of a Delphi-written program using the PCRE engine. Given the repaired pattern, it DOES return both the name and the value. So the claim is obviously wrong even in a vague sense: Delphi DOES return a <name, value> pair for it.

> Also, I can't change the whole system to use a HTML parser, the regex is already working

Then you need to adapt the regex to correctly handle the HTML snippets I showed above.

Arioch 'The
  • May I introduce the _Raabe Conjecture_: "For every problem exists at least one solution that is faster, more reliable and easier to read than a regular expression." – Uwe Raabe Dec 25 '13 at 10:01
  • Ok, let me explain. Today I have a browser extension that uses regex to get some values from the HTML, and I already have 60k users on that regex-based extension. I don't use one simple regex; I use more than 120 of them, and I need to scrape more than 30 websites. Now I'm creating an .exe extension that will support any browser. This .exe will download my regex list, apply it to the websites in the list, and send the values to me. Just to mention, this is a price comparison extension, not something illegal. Thanks – Patrick Nogueira Dec 27 '13 at 03:46
  • What I tried to tell you with "Delphi doesn't return me" is that the regex CLASS that Delphi uses does not return the names of the groups, only the values in an array. So I unfortunately lose the reference. Don't worry about the regex that I posted here. – Patrick Nogueira Dec 27 '13 at 03:49
  • As you can see, Delphi-made programs can get the ID/NAME too, if only you fix your regexp pattern. Both Uwe and I showed you Delphi programs that do report ID/NAME=Value. All that was needed for both of us was to debug the regex pattern and fix the obvious bug in it. On the other hand, there is a new development tool that makes a program consisting of JavaScript and an embedded Google Chrome. Maybe for "download web pages and parse them" that would be a more appropriate toolbox, and a free one. – Arioch 'The Dec 27 '13 at 08:54
1

TRegEx (from System.RegularExpressions) is a wrapper around TPerlRegEx (from System.RegularExpressionsCore), which is a wrapper around the open source PCRE library.

PCRE itself supports retrieving the names of groups, but neither wrapper exposes that.

Possible solutions:

  • Ask Embarcadero to fix it
  • Access PCRE directly (System.RegularExpressionsAPI)
  • Use one of the two wrappers, but for retrieving the names, hack into their private members to get access to the PCRE memory (pcre_fullinfo(TPerlRegEx.FPattern, ...))
  • Use a better wrapper, e.g. JclPCRE from the open source JEDI Code Library (JCL): Name1 := TJclRegEx.CaptureNames[1];
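
A further workaround, not among the options above and only a heuristic, is to read the group names out of the pattern text itself, since the pattern is known before matching. The sketch below assumes names are declared as (?<name>...) or (?P<name>...); ExtractGroupNames is a hypothetical helper, not part of any library. It can be fooled by an escaped parenthesis such as \(?<x> or by similar text inside a character class, so treat it as a starting point rather than a parser.

```delphi
program ListGroupNames;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.RegularExpressions;

// Hypothetical helper: collects the names declared in APattern as
// (?<name>...) or (?P<name>...), in order of first appearance.
procedure ExtractGroupNames(const APattern: string; ANames: TStrings);
var
  m: TMatch;
begin
  // A name must start with a letter or underscore, so the lookbehind
  // forms (?<= and (?<! are not mistaken for named groups.
  for m in TRegEx.Matches(APattern, '\(\?P?<([A-Za-z_][A-Za-z0-9_]*)>') do
    if ANames.IndexOf(m.Groups[1].Value) < 0 then
      ANames.Add(m.Groups[1].Value);
end;

var
  names: TStringList;
  i: Integer;
begin
  names := TStringList.Create;
  try
    ExtractGroupNames(
      '<div (id="(?<id>[a-zA-Z0-9]+)"|name="(?<name>[a-zA-Z0-9]+)")', names);
    // List every group name found in the pattern text.
    for i := 0 to names.Count - 1 do
      Writeln(names[i]);
  finally
    names.Free;
  end;
end.
```

The collected names can then be passed one by one to the wrapper's lookup by name (match.Groups[...]) to find out which of them took part in a given match.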
maf-soft
0

I am not sure about enumerating named groups, but you can access the group either by its index or by its name:

const
  cRegEx = '<div (id="(?<id>[a-zA-Z0-9]+)"|name="(?<name>[a-zA-Z0-9]+)")';
  cHtml = '<h1>bla bla bla</h1> <div id="home">';
var
  group: TGroup;
  match: TMatch;
  regEx: TRegEx;
begin
  regEx := TRegEx.Create(cRegEx, [roIgnoreCase,roMultiline]);
  match := regEx.Match(cHtml);
  if match.Success then begin
    group := match.Groups['id'];
    Assert(group.Value = 'home');
  end;
end;
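
Building on the lookup by name above, a query string can be assembled by probing a list of candidate names and keeping only the groups that took part in the match. This is a sketch under assumptions: the candidate names are known up front, TryGroupValue and BuildQuery are hypothetical helpers, values are not URL-encoded, and the try/except is there because a lookup of a group that did not participate may raise an exception in some RTL versions.

```delphi
program NamedGroupQuery;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

// Hypothetical helper: True only if the named group took part in the match.
function TryGroupValue(const AMatch: TMatch; const AName: string;
  out AValue: string): Boolean;
var
  grp: TGroup;
begin
  AValue := '';
  try
    grp := AMatch.Groups[AName];
    Result := grp.Success;
    if Result then
      AValue := grp.Value;
  except
    // Assumption: the lookup may raise for a group that did not participate.
    on Exception do
      Result := False;
  end;
end;

// Hypothetical helper: builds ?name1=value1&name2=value2 from matched groups.
function BuildQuery(const AMatch: TMatch; const ANames: array of string): string;
var
  i: Integer;
  value: string;
begin
  Result := '';
  for i := Low(ANames) to High(ANames) do
    if TryGroupValue(AMatch, ANames[i], value) then
    begin
      if Result = '' then
        Result := '?'
      else
        Result := Result + '&';
      Result := Result + ANames[i] + '=' + value;
    end;
end;

const
  cRegEx = '<div (id="(?<id>[a-zA-Z0-9]+)"|name="(?<name>[a-zA-Z0-9]+)")';
  cHtml = '<h1>bla bla bla</h1> <div id="home">';
var
  match: TMatch;
begin
  match := TRegEx.Match(cHtml, cRegEx, [roIgnoreCase, roMultiLine]);
  if match.Success then
    Writeln(BuildQuery(match, ['id', 'name']));
end.
```

For the sample HTML from the question, only the id group takes part in the match, so the result should contain id and its value but no name entry.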
Uwe Raabe
  • You're using my edited pattern too, while topic-starter insists on using his original pattern instead :-P – Arioch 'The Dec 26 '13 at 06:47
  • @Arioch'The, that is just because your pattern is correct. I could easily use the wrong pattern, but then I would have a hard time verifying the code inside the if-block – Uwe Raabe Dec 26 '13 at 11:36
  • I just wanted a confirmation from a 3rd uninterested party :-) – Arioch 'The Dec 26 '13 at 14:32