How can I extract the Name: and Value text from inside the tags below with DIHtmlParser? I tried TCLHtmlParser from Clever Components, but it failed. Second question: can DIHtmlParser parse an individual tag, for example by looping through its sub-tags? It's a total nightmare for such a simple problem.

<div class="tvRow tvFirst hasLabel tvFirst" title="example1">
  <label class="tvLabel">Name:</label>
  <span class="tvValue">Value</span>
<div class="clear"></div></div>

<div class="tvRow tvFirst hasLabel tvFirst" title="example2">
  <label class="tvLabel">Name:</label>
  <span class="tvValue">Value</span>
<div class="clear"></div></div>
  • Welcome to StackOverflow. AFAIK there's no standard way to convert HTML to JSON. Edit your question to be more precise and provide some examples of what you are trying to accomplish if you expect a useful answer, since in its current state your question is overly broad and a candidate for closure. – jachguate Jan 15 '13 at 22:51
  • It seems to me that once you have something *capable* of parsing an XHTML document such that it could be converted losslessly to a JSON document, you don't actually *need* JSON anymore. Just use whatever structure the XHTML interpreter has generated directly. At that point, you don't need an HTML-to-JSON converter; you just need an HTML library that lets you programmatically access the document. – Rob Kennedy Jan 15 '13 at 22:52
  • @RobKennedy Parsing JSON is faster than XML or HTML. Sorry to break your bubble. :) –  Jan 15 '13 at 22:54
  • @t0xic Converting HTML to JSON requires an HTML parser. – David Heffernan Jan 15 '13 at 22:56
  • Sorry to break *your* bubble, but if you're going to convert from HTML to JSON, you're going to have to parse HTML first. Only after the conversion is complete can you begin parsing JSON. Therefore, converting HTML to JSON and then parsing JSON will always be slower than parsing HTML by itself. – Rob Kennedy Jan 15 '13 at 22:56
  • @t0xic Use an HTML parser. Regex cannot be used to parse HTML. Not even Jon Skeet can parse HTML with regex etc etc. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – David Heffernan Jan 15 '13 at 23:03
  • I think the question you need to be asking is what is the fastest HTML parser that can be practically used from Delphi. – David Heffernan Jan 15 '13 at 23:10
  • Depending on input, and despite David's comment about Skeet's inability to parse HTML with Regex, you *might* be able to use Regex in order to extract Name=Value pairs. It all depends on what you want to do, how generic the solution needs to be, the variance of the HTML you'll be dealing with. Every time I needed to extract data from HTML I used RegEx with success, never had the need for a true parser. – Cosmin Prund Jan 15 '13 at 23:42
  • @CosminPrund Or extract the tag with Regex and parse the internals with an HTML parser. That would work too :) –  Jan 15 '13 at 23:48
  • @t0xic, extracting the tag itself is the difficult part: once you've got the tag you can easily extract the attributes. But don't kid yourself, you really can't parse HTML with RegEx: you'll have a hard time guessing if the tag you're seeing is within a comment, or maybe it's invalid because it never closes. That's why I'm saying it depends on INPUT. If you want to scrape a specific piece of information from a specific web site at a specific time, you'll be OK with RegEx. If it gets more complicated than that, you'll need your parser. – Cosmin Prund Jan 15 '13 at 23:54
  • @CosminPrund Yes, my question is how to do it. That's the problem. Not many parsers can extract a tag along with all its sub-tags. –  Jan 15 '13 at 23:57
  • If all the input is like what you show in the question, it looks like a task you can easily accomplish with regex. – jachguate Jan 16 '13 at 00:02
  • @jachguate The website is very dynamic; the HTML changes often. –  Jan 16 '13 at 00:06

3 Answers

You could use the IHTMLDocument2 DOM (MSHTML) to parse whatever elements you need from the HTML:

uses ActiveX, MSHTML;

const
  HTML =
  '<div class="tvRow tvFirst hasLabel tvFirst" title="example1">' +
  '<label class="tvLabel">Name:</label>' +
  '<span class="tvValue">Value</span>' +
  '<div class="clear"></div>' +
  '</div>';

procedure TForm1.Button1Click(Sender: TObject);
var
  doc: OleVariant;
  el: OleVariant;
  i: Integer;
begin
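  // Create an in-memory MSHTML document and feed it the markup (late-bound via OleVariant)
  doc := coHTMLDocument.Create as IHTMLDocument2;
  doc.write(HTML);
  doc.close;
  ShowMessage(doc.body.innerHTML);
  // Walk every element under <body>; MSHTML reports tag names in upper case
  for i := 0 to doc.body.all.length - 1 do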
  begin
    el := doc.body.all.item(i);
    if (el.tagName = 'LABEL') and (el.className = 'tvLabel') then
      ShowMessage(el.innerText);
    if (el.tagName = 'SPAN') and (el.className = 'tvValue') then
      ShowMessage(el.innerText);
  end;
end;
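
The same late-bound DOM can also cover the second part of the question, looping through a tag's sub-tags. The following is only a minimal sketch, assuming the row divs sit directly under the body as in the sample markup. `ShowRowChildren` is a hypothetical helper name; `children`, `title`, `tagName` and `innerText` are standard MSHTML element members reached through late binding, and `VarToStr` comes from the Variants unit.

```pascal
uses ActiveX, MSHTML, Variants, Dialogs;

// Hypothetical helper: lists the direct child elements of every top-level <div>.
procedure ShowRowChildren(const Html: string);
var
  doc, rows, row, kids, child: OleVariant;
  i, j: Integer;
begin
  doc := coHTMLDocument.Create as IHTMLDocument2;
  doc.write(Html);
  doc.close;
  rows := doc.body.children; // element children only, no text nodes
  for i := 0 to rows.length - 1 do
  begin
    row := rows.item(i);
    if row.tagName <> 'DIV' then
      Continue; // assumption: only <div> rows are of interest
    kids := row.children;
    for j := 0 to kids.length - 1 do
    begin
      child := kids.item(j);
      // VarToStr guards against Null values coming back from the DOM
      ShowMessage(VarToStr(row.title) + ': ' + VarToStr(child.tagName) +
        ' = ' + VarToStr(child.innerText));
    end;
  end;
end;
```

Called as `ShowRowChildren(HTML)` with the constant above, this should show one message per direct child element of each row, which includes the empty clear div.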

I also want to mention another very nice HTML parser I found today: htmlp (Delphi Dom HTML Parser and Converter). It is obviously not as flexible as IHTMLDocument2, but it is very easy to work with, fast, free, and supports Unicode in older Delphi versions.

Sample usage:

uses HtmlParser, DomCore;

function GetDocBody(HtmlDoc: TDocument): TElement;
var
  i: integer;
  node: TNode;
begin
  Result := nil;
  for i := 0 to HtmlDoc.documentElement.childNodes.length - 1 do
  begin
    node := HtmlDoc.documentElement.childNodes.item(i);
    if node.nodeName = 'body' then
    begin
      Result := node as TElement;
      Break;
    end;
  end;
end;

procedure THTMLForm.Button2Click(Sender: TObject);
var
  HtmlParser: THtmlParser;
  HtmlDoc: TDocument;
  i: Integer;
  body, el: TElement;
  node: TNode;
begin
  HtmlParser := THtmlParser.Create;
  try
    HtmlDoc := HtmlParser.parseString(HTML);
    try
      body := GetDocBody(HtmlDoc);
      if Assigned(body) then
        for i := 0 to body.childNodes.length - 1 do
        begin
          node := body.childNodes.item(i);
          if (node is TElement) then
          begin
            el := node as TElement;
            if (el.tagName = 'div') and (el.GetAttribute('class') = 'tvRow tvFirst hasLabel tvFirst') then
            begin
              // iterate el.childNodes here...
              ShowMessage(IntToStr(el.childNodes.length));
            end;
          end;
        end;
    finally
      HtmlDoc.Free;
    end;
  finally
    HtmlParser.Free;
  end;
end;
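
The "iterate el.childNodes here" step could look roughly like the sketch below. It reuses only the calls shown above (`childNodes`, `item`, `tagName`, `GetAttribute`) plus one assumption: that DomCore's `TNode` exposes a DOM-style `nodeValue` holding the text of a text node. `ShowRowPairs` is a hypothetical helper name, not part of htmlp.

```pascal
uses HtmlParser, DomCore, Dialogs;

// Hypothetical helper: shows the class and text of each <label>/<span>
// found directly under a row element.
procedure ShowRowPairs(row: TElement);
var
  i: Integer;
  node: TNode;
  child: TElement;
  value: string;
begin
  for i := 0 to row.childNodes.length - 1 do
  begin
    node := row.childNodes.item(i);
    if not (node is TElement) then
      Continue; // skip whitespace text nodes between the tags
    child := node as TElement;
    if (child.tagName <> 'label') and (child.tagName <> 'span') then
      Continue;
    value := '';
    // Assumption: the first child of <label>/<span> is its text node,
    // and nodeValue returns that node's text.
    if child.childNodes.length > 0 then
      value := child.childNodes.item(0).nodeValue;
    ShowMessage(child.GetAttribute('class') + ' = ' + value);
  end;
end;
```

It would be called as `ShowRowPairs(el)` at the point of the comment in `Button2Click`.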
kobik

Use an HTML parser to work on your HTML files.

Maybe DIHtmlParser will do the job.

RegEx is not a parser, and converting from HTML to JSON is not a wise option.

Sir Rufo
  • DIHtmlParser? It's a very complex parser, nothing like TCLHtmlParser from Clever Components. I tried to use it and it's a nightmare. Capability, sure, but usability? Horrible. –  Jan 15 '13 at 23:04

One can also combine the HTMLP parser with THtmlFormatter and OXml XPath parsing:

uses
  // RTL (SysUtils: Exception, Format, TEncoding; IOUtils: TFile)
  System.SysUtils,
  System.IOUtils,
  // Htmlp
  HtmlParser,
  DomCore,
  Formatter,
  // OXml
  OXmlPDOM,
  OXmlUtils;

function HtmlToXHtml(const Html: string): string;
var
  HtmlParser: THtmlParser;
  HtmlDoc: TDocument;
  Formatter: THtmlFormatter;
begin
  HtmlParser := THtmlParser.Create;
  try
    HtmlDoc := HtmlParser.ParseString(Html);
    try
      Formatter := THtmlFormatter.Create;
      try
        Result := Formatter.GetText(HtmlDoc);
      finally
        Formatter.Free;
      end;
    finally
      HtmlDoc.Free;
    end;
  finally
    HtmlParser.Free;
  end;
end;

type
  TCard = record
    Store: string;
    Quality: string;
    Quantity: string;
    Price: string;
  end;
  TCards = array of TCard;

function ParseCard(const Node: PXMLNode): TCard;
const
  StoreXPath = 'div[1]/ax';
  QualityXPath = 'div[3]';
  QuantityXPath = 'div[4]';
  PriceXPath = 'div[5]';
var
  CurrentNode: PXMLNode;
begin
  Result := Default(TCard);
  if Node.SelectNode(StoreXPath, CurrentNode) then
    Result.Store := CurrentNode.Text;
  if Node.SelectNode(QualityXPath, CurrentNode) then
    Result.Quality := CurrentNode.Text;
  if Node.SelectNode(QuantityXPath, CurrentNode) then
    Result.Quantity := CurrentNode.Text;
  if Node.SelectNode(PriceXPath, CurrentNode) then
    Result.Price := CurrentNode.Text;
end;

procedure THTMLForm.OpenButtonClick(Sender: TObject);
var
  Html: string;
  Xml: string;
  FXmlDocument: IXMLDocument;
  QueryNode: PXMLNode;
  XPath: string;
  NodeList: IXMLNodeList;
  i: Integer;
  Card: TCard;
begin
  Html := System.IOUtils.TFile.ReadAllText(FileNameEdit.Text, TEncoding.UTF8);
  Xml := HtmlToXHtml(Html);
  Memo.Lines.Text := Xml;

  // Parse with XPath
  FXMLDocument := CreateXMLDoc;
  FXMLDocument.WriterSettings.IndentType := itIndent;
  if not FXMLDocument.LoadFromXML(Xml) then
    raise Exception.Create('Source document is not valid');
  QueryNode := FXmlDocument.DocumentElement;
  XPath := '//div[@class="row pricetableline"]';
  NodeList := QueryNode.SelectNodes(XPath);
  for i := 0 to NodeList.Count - 1 do
  begin
    Card := ParseCard(NodeList[i]);
    Memo.Lines.Text := Memo.Lines.Text + sLineBreak +
      Format('%0:s %1:s %2:s %3:s', [Card.Store, Card.Quality, Card.Quantity, Card.Price]);
  end;

  Memo.SelStart := 0;
  Memo.SelLength := 0;
end;
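
Applied to the markup from the question, the same approach might look like the sketch below. It uses only the OXml calls already shown above (`CreateXMLDoc`, `LoadFromXML`, `SelectNodes`, `SelectNode`, `Text`). `ExtractRows` is a hypothetical helper name, and the input is assumed to be XHTML produced by `HtmlToXHtml`, with the class attribute matching the sample exactly.

```pascal
uses
  System.SysUtils, System.Classes, OXmlPDOM;

// Hypothetical helper: appends one "Name: Value" line to Output for each
// tvRow <div> found in the XHTML text.
procedure ExtractRows(const Xhtml: string; Output: TStrings);
const
  // Assumption: the class attribute is exactly as in the question's sample
  RowXPath = '//div[@class="tvRow tvFirst hasLabel tvFirst"]';
var
  Doc: IXMLDocument;
  Rows: IXMLNodeList;
  LabelNode, ValueNode: PXMLNode;
  i: Integer;
begin
  Doc := CreateXMLDoc;
  if not Doc.LoadFromXML(Xhtml) then
    raise Exception.Create('Source document is not valid');
  Rows := Doc.DocumentElement.SelectNodes(RowXPath);
  for i := 0 to Rows.Count - 1 do
    // Relative paths pick the <label> and <span> children of each row
    if Rows[i].SelectNode('label', LabelNode) and
       Rows[i].SelectNode('span', ValueNode) then
      Output.Add(LabelNode.Text + ' ' + ValueNode.Text);
end;
```

For example, `ExtractRows(HtmlToXHtml(Html), Memo.Lines)` would add the extracted pairs to the memo. Because the XPath matches the whole class string, it would need adjusting if the site changes its class names.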
Gad D Lord