
I'm trying to figure out how to parse the Rank, Title and URL of a Google search result using Delphi.

Mainly I need to get all the A links and TEXT from every H3 tag with the specific class name "r".

Here is the function to get the results section of the Google html:

function TForm1.ExtractContainer: TStringList;
var
    Doc : IHTMLDocument3;
    i: Integer;
    Download: IHTMLElement;
    tmp : String;

begin
    Result := TStringList.Create;
    Doc := EmbeddedWB1.Document as IHTMLDocument3;
    // 'center_col' is the div that holds the Google results
    Download := Doc.getElementById('center_col');
    if Download = nil then
        Exit;
    tmp := Download.innerHTML;
    // Break the line after each <h3 class="r"> so every result link starts a new line
    Result.Text := AnsiReplaceStr(tmp, '<h3 class="r">', '<h3 class="r">'#13#10);

    // Skip the first line: it precedes the first <h3 class="r">
    for i := 1 to Result.Count - 1 do
    begin
        tmp := ExtractTextBetween(Result[i], 'href="', '">');
        Memo1.Lines.Add(tmp);
    end;
end;

As you can see, the div with id center_col holds all the Google results. Now I need some kind of loop to get all the A links and TEXT from every H3 tag with the specific class name "r".

Hope that someone can help me!

HavelTheGreat
  • 1
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 **Have you tried using an XML parser instead?** – David Heffernan Feb 25 '15 at 22:14
  • @DavidHeffernan the main problem is that I can't get the Google response in XML format. Are you suggesting converting from HTML to XML? – Jose Luis Zanotti Feb 28 '15 at 04:15
  • In this case you have html, so you need an html parser. The principle remains the same. – David Heffernan Feb 28 '15 at 06:41
  • 1
    Thanks for your support. I'm using IEParser and it works great! Now I'm getting all the content from the h3 tags using if (ElementInfo.tagName = 'h3') or (ElementInfo.tagName = 'H3') then if ElementInfo.className = 'r' then begin .... end; Now I need to parse the result to get the href and the anchor text. What do you recommend for that? I was using the ExtractTextBetween function; maybe parse the result again? the string is anchor text – Jose Luis Zanotti Mar 01 '15 at 04:08

1 Answer

2

Per recommendations below, I've changed my answer:

The most efficient way to parse HTML is to use a DOM-based HTML parser. A quick search pulled up: http://www.yunqa.de/delphi/doku.php/products/htmlparser/index

From the main page: "HTML-Tags: HTML-Tags are readily parsed into Name, Attributes and Values. DIHtmlParser recognizes Start Tags, End Tags and Empty Element Tags. Example: ."

This product isn't the only one out there, but I've seen it mentioned on a few other SO posts.
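
For what it's worth, here is a rough sketch of the DOM-based idea using the MSHTML interfaces the question already relies on, without any third-party parser. ExtractResults is a hypothetical helper name, and the tab-separated output format is just an assumption for illustration. It also assumes each result heading has exactly the class "r" and that the first link inside it is the result link, which may not hold for every version of Google's markup.

```delphi
// Sketch only: meant for the implementation section of a unit.
uses
  MSHTML, Classes, SysUtils;

// Hypothetical helper: adds one "rank<TAB>title<TAB>url" line to Output
// for every <h3 class="r"> element that contains a link.
procedure ExtractResults(const Doc: IHTMLDocument3; Output: TStrings);
var
  Headings, Anchors: IHTMLElementCollection;
  Heading: IHTMLElement;
  Anchor: IHTMLAnchorElement;
  i, Rank: Integer;
begin
  Rank := 0;
  Headings := Doc.getElementsByTagName('h3');
  for i := 0 to Headings.length - 1 do
  begin
    Heading := Headings.item(i, 0) as IHTMLElement;
    // Assumption: the class attribute is exactly "r"
    if (Heading = nil) or not SameText(Heading.className, 'r') then
      Continue;

    // Look for links only inside this heading
    Anchors := (Heading as IHTMLElement2).getElementsByTagName('a');
    if Anchors.length = 0 then
      Continue;

    // Assumption: the first link in the heading is the result link
    Anchor := Anchors.item(0, 0) as IHTMLAnchorElement;
    Inc(Rank);
    Output.Add(Format('%d'#9'%s'#9'%s',
      [Rank, (Anchor as IHTMLElement).innerText, Anchor.href]));
  end;
end;
```

It could be called with something like ExtractResults(EmbeddedWB1.Document as IHTMLDocument3, Memo1.Lines); once the page has finished loading. The rank here is simply the order in which the headings appear in the document.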

Hope this helps

SmeTheWiz
  • 1
    Ugh! Parsing HTML or XML with a regular expression is a foolish thing to do when a DOM-based parser makes it so much easier, faster, and less error-prone. (I'm not downvoting, because you put obvious effort into writing the answer and it may actually work, but I can't upvote it because I think it's technically an awful solution to the question asked.) – Ken White Feb 25 '15 at 23:24
  • 3
    @Sme You simply have to read bobince's answer that I linked to above. – David Heffernan Feb 25 '15 at 23:27
  • @DavidHeffernan I got a kick out of that post. Also I've never used a DOM-based parser (I very rarely parse HTML in my job). I come from a Perl background, so I tend to jump to patterns when I see a problem like this. I'll definitely have to look into using the other tools available. – SmeTheWiz Feb 26 '15 at 13:54
  • I honestly believe that you should delete this answer (the advice it gives is bad), and replace it with an answer recommending an XML parser. – David Heffernan Feb 26 '15 at 14:00
  • No problem, thanks for going easy on me – SmeTheWiz Feb 26 '15 at 14:10
  • @SmeTheWiz I tried it, but I can't get it to parse what I want :/ – Jose Luis Zanotti Feb 28 '15 at 04:20