RegEx Parsing for HTML attributes - one specific string

Question

With Delphi Rio, I am using an HTML/DOM parser. I am traversing the various nodes, and the parser is returning attributes/tags. Normally these are not a problem, but for some attributes/tags, the string returned includes multiple attributes. I need to parse this string into some type of container, such as a StringList. The attribute string the parser returns already has the '<' and '>' removed.

Some examples of attribute strings are:

data-partnumber="BB3312" class=""
class="cb10"
account_number = "11432" model = "pay_plan"

The end result I want is a StringList with one or more name=value pairs. I have not used RegEx to any real degree, but I think it is the right tool here. Would this be a valid approach? For a RegEx pattern, I think the pattern I want is

\w\s?=\s?\"[^"]+"

To identify multiple matches within a string, I would use TRegex.Matches. Am I overlooking something here that will cause me issues later on?

*** ADDITIONAL INFO *** Several people have suggested using a decent parser. I am currently using the open-source HTML/DOM parser at https://github.com/sandbil/HTML-Parser and, in light of that, I am posting more info. Below is an HTML snippet I am parsing. Look at the line I have marked with ***** at the end. My parser is returning this as

Node.AttributeText= 'data-partnumber="B92024" data-model="pay_as_you_go" class=""  '

Would a different HTML DOM parser return this as 3 different elements/attributes? If so, can someone recommend a parser?

  <section class="cc02 cc02v0" data-trackas="cc02" data-ocomid="cc02">
    <div class="cc02w1">
      <div class="otable otable-scrolling">
        <div class="otable-w1">
          <table class="otable-w2">
            <thead>
              <tr>
                <th>Product</th>
                <th>Unit Price</th>
                <th>Metric</th>
              </tr>
            </thead>
            <tbody>         
              <tr>
                <td class="cb152title"><div>MySQL Database for HeatWave-Standard-E3</div></td>
                <td><div data-partnumber="B92024" data-model="pay_as_you_go" class="">$0.3536<span></span></div></td> *****
                <td><div>Node per hour</div></td>
              </tr>
              <tr data-partnumber="B92426">
                <td class="cb152title">MySQL Database—Storage</td>
                <td><span data-model="pay_as_you_go" class="">$0.04<span></span></span></td>
                <td>Gigabyte storage capacity per month</td>
              </tr>             
            </tbody>
          </table>
        </div>
      </div>
    </div>
  </section>

An alternative would be to replace the faulty HTML parser with a working one. — Andreas Rejbrand, Mar 24 '21 at 18:17
If you haven't already, take a look at https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg. @AndreasRejbrand's suggestion might be a better way to go. — MartynA, Mar 24 '21 at 19:28
https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Delphi Coder, Mar 24 '21 at 20:00
Any *decent* DOM parser should already be parsing the attributes for you, exposing them as separate nodes that you can easily enumerate through and access their name and value strings individually. If your parser is not, then either the parser is faulty, or the HTML is malformed in a way that prevents proper parsing. Hard to say, since you did not show the actual HTML, or specify which DOM parser you are using. — Remy Lebeau, Mar 24 '21 at 20:16
A DOM Parser should parse *properly formed* HTML. A RegEx could help interpret the strings you have provided, irrespective of how you get them. Note that this question is about using a RegEx to parse name value pairs, not to parse HTML or XML. — Rob Lambden, Mar 25 '21 at 11:03
@RemyLebeau - I added some additional info to my question, including HTML snippet. Since you mentioned using a different DOM parser, can you recommend one? — user1009073, Mar 25 '21 at 11:42

Remy Lebeau · Accepted Answer · 2021-03-25T16:38:50.633

1

The documentation for the parser you are using says TDomTreeNode has an AttributesText property that is a "string with all attributes", which you have shown examples of. But it also has an Attributes property that is "parsed attributes" provided as a TDictionary<string, string>. Have you tried looking into the values of that property yet? You should not need to use a RegEx at all, just enumerate the entries of the TDictionary instead, eg:

var 
  Attr: TPair<string, string>;

for Attr in Node.Attributes do begin
  // use Attr.Key and Attr.Value as needed...
end;
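
Since the question asks for a StringList of name=value pairs, the enumeration above can be extended as in the following minimal sketch. It assumes only what the parser's documentation states, namely that the attributes arrive as a TDictionary<string, string>. A hand-filled dictionary stands in for Node.Attributes here, and AttributesToStrings is a hypothetical helper name, not part of the parser.

```delphi
program AttrDictDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.Generics.Collections;

// Hypothetical helper: copies every key/value pair into ATarget as "name=value".
// Add is used instead of ATarget.Values[...] := ..., because assigning an
// empty string through Values removes the entry (class="" would be lost).
procedure AttributesToStrings(AAttrs: TDictionary<string, string>;
  ATarget: TStrings);
var
  Attr: TPair<string, string>;
begin
  for Attr in AAttrs do
    ATarget.Add(Attr.Key + '=' + Attr.Value);
end;

var
  Attrs: TDictionary<string, string>;
  List: TStringList;
  I: Integer;
begin
  // Stand-in for Node.Attributes (assumption: same type as the parser exposes).
  Attrs := TDictionary<string, string>.Create;
  try
    Attrs.Add('data-partnumber', 'B92024');
    Attrs.Add('data-model', 'pay_as_you_go');
    Attrs.Add('class', '');
    List := TStringList.Create;
    try
      AttributesToStrings(Attrs, List);
      // TDictionary does not guarantee enumeration order, so the order
      // of the lines printed here may differ from the order in the HTML.
      for I := 0 to List.Count - 1 do
        Writeln(List.Names[I], ' -> ', List.ValueFromIndex[I]);
    finally
      List.Free;
    end;
  finally
    Attrs.Free;
  end;
end.
```

Individual values can then be read with List.Values['data-model'], or tested for presence with List.IndexOfName('class').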

edited Mar 25 '21 at 16:38

answered Mar 25 '21 at 14:33


I saw that, but for whatever reason, it just did not sink in. Thank you! Regardless, I learned some more about RegEx. – user1009073 Mar 25 '21 at 16:33

score 0 · Answer 2 · edited Mar 26 '21 at 12:00

(As the OP asked about using a RegEx to parse attribute=value pairs, this answers the question directly, which other users may be looking for in the future.)

RegEx based answer

A RegEx is extremely powerful. From the data you have provided, you can extract the attribute name and value pairs using:

(\S+)\s*=\s*(\"?)([^"]*)(\2|\s|$)

This uses grouping and can be explained as follows:

The first result group is the attribute name (it matches non-whitespace characters)

The second result group is an enclosing " if present, otherwise an empty string

The third result group is the value of the attribute

A RegEx can be applied repeatedly to the same subject string: after the first Match, call MatchAgain to see if there's another match, and so read all of the attributes in a loop.

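// Requires System.Classes and System.RegularExpressionsCore in the uses clause.
procedure ParseAttributes(const AInput: string; ATarget: TStringList);
var
  pRegEx: TPerlRegEx;
  LMatched: Boolean;
begin
  pRegEx := TPerlRegEx.Create;
  try
    pRegEx.RegEx := '(\S+)\s*=\s*(\"?)([^"]*)(\2|\s|$)';
    pRegEx.Subject := AInput;
    LMatched := pRegEx.Match;
    while LMatched do
    begin
      // Groups[1] is the attribute name, Groups[3] the value without its quotes
      ATarget.Add(pRegEx.Groups[1] + '=' + '"' + pRegEx.Groups[3] + '"');
      LMatched := pRegEx.MatchAgain;
    end;
  finally
    pRegEx.Free;
  end;
end;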

Disclaimer: I haven't tried compiling that code, but hopefully it's enough to get you started!

Practical Point: With respect to the actual problem you posed with your DOM parser, this is a task with existing solutions, so the practical answer may well be to use a DOM parser that works! If you need a RegEx for whatever reason, this one should do the job for the data you have shown.
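
For readers who prefer TRegEx, which the question mentions, the same pattern can be sketched with TRegEx.Matches from System.RegularExpressions. This is an untested sketch under a few assumptions: ParseAttributesWithTRegEx is a hypothetical name, the values are stored without their surrounding quotes so that TStrings.Values returns the bare value, and the input uses double-quoted values as in the examples. With unquoted values the third group can run on past the next space, so the pattern would need adjusting for that case.

```delphi
program AttrRegExDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.RegularExpressions;

const
  // Same pattern as above: group 1 = name, group 2 = optional opening quote,
  // group 3 = value, group 4 = closing quote, whitespace, or end of input.
  AttrPattern = '(\S+)\s*=\s*(\"?)([^"]*)(\2|\s|$)';

// Hypothetical helper: adds one "name=value" entry per match to ATarget.
procedure ParseAttributesWithTRegEx(const AInput: string; ATarget: TStrings);
var
  M: TMatch;
begin
  for M in TRegEx.Matches(AInput, AttrPattern) do
    ATarget.Add(M.Groups[1].Value + '=' + M.Groups[3].Value);
end;

var
  List: TStringList;
  I: Integer;
begin
  List := TStringList.Create;
  try
    // Attribute string taken from the question.
    ParseAttributesWithTRegEx(
      'data-partnumber="B92024" data-model="pay_as_you_go" class=""  ', List);
    for I := 0 to List.Count - 1 do
      Writeln(List.Names[I], ' -> ', List.ValueFromIndex[I]);
  finally
    List.Free;
  end;
end.
```

TRegEx.Matches collects every match up front, so no explicit MatchAgain loop is needed. Underneath, it relies on the same TPerlRegEx engine.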

You have used PerlRegEx instead of the regular 'Delphi' TRegEx. Any particular reason? Also, I added some additional info to my question, including HTML snippet. Since you mentioned using a different DOM parser, can you recommend one? — user1009073, Mar 25 '21 at 11:40
TPerlRegEx is in unit System.RegularExpressionsCore, which is `used` by the System.RegularExpressions unit (which is where TRegEx is). Both of them use the same System.RegularExpressionsAPI unit. I've always used TPerlRegEx; to be honest, I hadn't noticed you'd mentioned TRegEx in particular. — Rob Lambden, Mar 25 '21 at 11:51
As for parsing the DOM - not all valid HTML is well formed. Without knowing more about what you need I can't really make a recommendation. If your HTML is in fact valid XML (not all HTML is) then any of the system provided XML parsers should be fine. If not I suggest googling to see what others use, I know there are quite a few out there but I've never had the need to parse a whole document into a DOM that's not valid XML. — Rob Lambden, Mar 25 '21 at 11:54
