
I'm very new to regex and haven't found an explanation descriptive enough to get me to a solution.

I use a script that scrapes HTML from Yahoo Finance to get financial options table data. Yahoo recently changed their HTML and the old algorithm no longer works. The old expression was the following:

Main_Pattern = '.*?</table><table[^>]*>(.*?)</table';
Tables = regexp(urlText, Main_Pattern, 'tokens');

Where Tables used to return data, it no longer does. Inspecting the HTML suggests to me that the data is no longer in <table>, but rather in <tbody>...

My question is "what does the Main_Pattern regex mean in layman's terms?" I'm trying to figure out how to modify that expression so that it is applicable to the current HTML.

horchler
  • In general it's [not a good idea](http://goo.gl/Mc5m) to use regexp for HTML. – Marcin Oct 27 '14 at 22:56
  • Related to @Marcin's comment – if you want to parse your webpages properly in a way that is much less likely to be fragile, you might [start with this question and answer](http://stackoverflow.com/questions/20542351/how-to-read-and-parse-the-html-file) and then [see here](http://stackoverflow.com/questions/238036/java-html-parsing). Unfortunately, it is one area of Matlab where you may need to do some lower-level work yourself. – horchler Oct 27 '14 at 23:38

1 Answer


While I agree with @Marcin, and regular expressions are best learned by doing and by leaning on the reference for your chosen tool, I'll try to break down what the pattern is doing.

  1. .*?</table>: Match as little as possible up to the first </table> literal that lets the rest of the expression match, i.e. the first </table> immediately followed by <table (this is a Lazy expression due to the ?).

  2. <table: Match this literal.

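  3. [^>]*>: Match as many characters as possible that aren't >, followed by a literal >. This consumes the rest of the opening <table ...> tag (its attributes, if any). The * is Greedy since there is no ? after it, but because the character class excludes >, the match still stops at the first > after the <table literal.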

  4. (.*?)</table: Lazily match and capture everything between the > from the previous part and the next </table literal; what was captured can be retrieved using the 'tokens' option of regexp (you can also get the entire string that was matched using the 'match' option).
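
To make the lazy/greedy distinction in the list above concrete, here is a minimal sketch on a made-up string (the variable names and the HTML snippet are illustrative assumptions, not taken from the Yahoo page):

```matlab
% Made-up HTML snippet used only to compare the quantifiers.
str = '<table class="a"><tr>x</tr></table><table id="b">y</table>';

% Lazy: .*? stops at the first > it can.
lazy = regexp(str, '<table.*?>', 'match', 'once');

% Greedy: .* runs on to the last > in the string.
greedy = regexp(str, '<table.*>', 'match', 'once');

% Negated class: [^>]* is greedy but can never cross a >,
% so it ends at the same place as the lazy version here.
negated = regexp(str, '<table[^>]*>', 'match', 'once');

disp(lazy);     % <table class="a">
disp(greedy);   % the entire input string
disp(negated);  % <table class="a">
```

The negated class is usually the safer way to say "the rest of this tag", since it cannot run past the closing > no matter what follows.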

While I broke it into pieces, I'd like to emphasize that the entire expression itself works as a whole, which is why some parts refer to the previous parts.
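
As a rough illustration of the whole expression at work, the sketch below runs Main_Pattern on a small hand-written string and then tries a variant aimed at <tbody>. The sample HTML, the variable tbodyText, and Body_Pattern are all assumptions for illustration; the actual Yahoo markup may well differ, so treat the second pattern as a starting point rather than a fix.

```matlab
% Hand-written stand-in for the downloaded page (not real Yahoo HTML).
urlText = ['<html><table id="t1"><tr><td>header</td></tr></table>' ...
           '<table class="opts"><tr><td>1.50</td></tr></table></html>'];

% Original pattern: skip to the first </table> that is directly followed
% by another <table ...>, then capture that second table's contents.
Main_Pattern = '.*?</table><table[^>]*>(.*?)</table';
Tables = regexp(urlText, Main_Pattern, 'tokens');
disp(Tables{1}{1});   % <tr><td>1.50</td></tr>

% Hypothetical variant: capture the contents of each <tbody> instead.
tbodyText = ['<table><thead><tr><th>Strike</th></tr></thead>' ...
             '<tbody class="x"><tr><td>1.50</td></tr></tbody></table>'];
Body_Pattern = '<tbody[^>]*>(.*?)</tbody>';
Bodies = regexp(tbodyText, Body_Pattern, 'tokens');
disp(Bodies{1}{1});   % <tr><td>1.50</td></tr>
```

With 'tokens', regexp returns a cell array with one cell per match, each holding the captured groups, which is why the result is indexed as Tables{1}{1}.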

Refer to the Operators and Characters section of the MATLAB documentation for more in-depth explanations of the above.



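For the future, a more robust option might be to use MATLAB's xmlread and the DOM object it returns to traverse the table nodes. Note that xmlread expects well-formed XML, so it may choke on real-world HTML that isn't. I do understand that this is another API to learn, but it may be more maintainable in the long run.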
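
A minimal sketch of that DOM approach, assuming the markup is well-formed XML (the snippet, the temporary file, and the variable names are made up for illustration; a real page would likely need cleaning up before xmlread accepts it):

```matlab
% Made-up, well-formed snippet standing in for the downloaded page.
xmlText = '<root><table><tr><td>1.50</td><td>2.75</td></tr></table></root>';

% xmlread takes a file name, so write the text to a temporary file first.
tmpFile = [tempname '.xml'];
fid = fopen(tmpFile, 'w');
fprintf(fid, '%s', xmlText);
fclose(fid);

doc = xmlread(tmpFile);   % returns a Java DOM Document object
delete(tmpFile);

% Collect the text of every <td> node; DOM node lists are zero-based.
cells = doc.getElementsByTagName('td');
values = cell(1, cells.getLength);
for k = 1:cells.getLength
    values{k} = char(cells.item(k-1).getTextContent);
end
disp(values);
```

The values come back as text, so something like str2double(values) would still be needed to turn them into numbers.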

TroyHaskin