I have an HTML file and need to read it and access some of its values:

myHtml = 'toto.html';
readFile = fileread(myHtml);

Now, to parse the HTML file: is it possible to convert the HTML to XML and then use XPath?

lola
  • I would use a Java-based HTML parser; you can run Java code directly from the Matlab command line. http://en.wikipedia.org/wiki/Comparison_of_HTML_parsers – Daniel Dec 12 '13 at 12:10
  • You mean XPath on the HTML file? To do that I would have to read the file with xmlread, which is not possible... – lola Dec 12 '13 at 13:29
  • Forget my previous comment; use Daniel's suggestion. – MZimmerman6 Dec 12 '13 at 13:42
  • I'm flagging this question as a duplicate as this was previously discussed in many questions, such as [this](http://stackoverflow.com/questions/6706980/extracting-data-from-within-xml-files-using-matlab), [this](http://stackoverflow.com/questions/14477122/counting-number-of-elements-from-xml-using-xpath) and [this](http://stackoverflow.com/questions/11548590/how-to-get-matlab-to-read-correct-amount-of-xml-nodes). [This one](http://stackoverflow.com/questions/14477122/counting-number-of-elements-from-xml-using-xpath) could be especially useful for you, as it also shows how to use XPath. – Eitan T Dec 12 '13 at 15:25
  • @EitanT: All the linked questions deal with XML, not HTML. – Daniel Dec 12 '13 at 17:02
  • It's just a subset of XML. What's preventing you from using the tools in the suggested answers? – Eitan T Dec 12 '13 at 19:26
  • HTML is not a subset of XML. – Prashant Kumar Dec 12 '13 at 19:27
  • Perhaps I confused it with XHTML. Either way, you could still use XPath to parse it in most cases. – Eitan T Dec 13 '13 at 13:11

1 Answer

I would not recommend attempting to convert HTML to XML. They are different formats, and you are likely to get burned. HTML parsers exist, so we can use those directly.

Also, just for completeness, don't try to parse HTML with regex. There are Stack Overflow questions about parsing HTML in Matlab in which the answers recommend regex. Do innocent kittens a favor and tune them out.

Unfortunately, it doesn't look like Matlab has an HTML parser as part of its library.

Fortunately, you can leverage Java code with ease in Matlab!
With that, Java HTML parsers are fair game. Look into jsoup or JTidy. Poke around this question.

Actually, looking at that question, plus the Comparison of HTML parsers Wikipedia article (thanks @Daniel R!), it looks like HtmlCleaner or JTidy might be able to clean HTML into XML. Again, I wouldn't bother; I would simply parse the HTML directly.
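
To make the Java route concrete, here is a minimal sketch of calling jsoup from Matlab. It assumes you have downloaded a jsoup jar yourself (the path below is a placeholder), and the `a[href]` query is only an illustration, not something taken from your file. Note that jsoup's `select` takes CSS selectors rather than XPath expressions.

```matlab
% Assumption: a jsoup jar is available locally; this path is a placeholder.
javaaddpath('/path/to/jsoup.jar');

myHtml = 'toto.html';
readFile = fileread(myHtml);

% Parse the HTML text into a jsoup Document.
doc = org.jsoup.Jsoup.parse(readFile);

% Select elements with a CSS selector.
% 'a[href]' (all links) is a hypothetical query; replace it with your own.
links = doc.select('a[href]');

% Java collections are zero-indexed, unlike Matlab arrays.
for k = 0:links.size()-1
    link = links.get(k);
    fprintf('%s -> %s\n', char(link.text()), char(link.attr('href')));
end
```

The same pattern should work for other values: change the selector, then read `text()` or `attr(...)` from each matched element.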

Prashant Kumar