Regular expression to pick a row in an html table containing desired text

Question

I'd like to use a regular expression (I would normally use a different tool, but I want to do the task within a Matlab function) to pick a single row containing desired keywords from an html table.

I am calling the Matlab function regexpi (the case-insensitive version of regexp), which is akin to PHP regex from what I can tell.

Ok, here's a snippet from such an html table to parse:

<tr><td><a href="blu">blu</a></td><td>value</td></tr><tr><td><a
href="bla">findme</a></td><td>value</td></tr><tr><td><a
href="ble">ble</a></td><td>value</td></tr>

The desired row to pick contains the word "findme".

(added:) The content of other cells and tags in the table could be anything (here "bla" is a dummy value). The important part is the presence of "findme" and that a single row (not more) is caught (or all rows containing "findme", although that situation is not expected). Any paired name/value table on a Wikipedia page is a good example.

I tinkered with https://regex101.com/ using whatever I could dig up in the Matlab documentation (lookahead and lookbehind assertions, combinations of :, > and ?), but failed to find a pattern that picks just the right row (or all rows that contain the keyword "findme"). The following pattern, for instance, picks the text but not the entire row: <tr[^>]*>[^>]*.*?(findme).*?<\/td .

Pattern <tr[^>]*>(.*?findme.*?)<\/tr[^>]*> picks the row but is too greedy and picks preceding rows.

Note that my original task was to capture entire tables and then parse them, but the Matlab regexp-powered function I found for this had trouble with nested tables (or I had trouble applying it).

The question is how to return a row containing desired keywords from an html table, programmatically, within a matlab function (without calling an external program)? Bonus question is how to solve the nested table issue, but maybe that's another question.

Your question is a bit confusing. Are you asking for a matlab solution or a PHP solution? If it's PHP, then there are better ways to parse HTML than regex (which [isn't well suited for parsing HTML](https://stackoverflow.com/a/1732454/2453432)). I would recommend using something like [DOMDocument](https://www.php.net/manual/en/class.domdocument.php) instead. But that's if you're looking for a PHP solution. If not, please remove the PHP tag. — M. Eriksson, Sep 11 '19 at 09:54
@MagnusEriksson I am looking for a matlab solution (m-file or matlab function). I include PHP only because it is my understanding that matlab regexp is PHP like and thought I might find more people with PHP than matlab regexp background. Also, https://regex101.com/ which I played with does not do matlab, only PHP. — Buck Thorn, Sep 11 '19 at 09:59
@MagnusEriksson I think a PHP pattern will work equally well within Matlab. — Buck Thorn, Sep 11 '19 at 10:01
*how to return a row containing desired keywords* the keyword being "bla" ? — Paolo, Sep 11 '19 at 10:08
@UnbearableLightness don't have that toolbox but thanks for the tip. The keyword example is "findme". Bla etc is just filler. — Buck Thorn, Sep 11 '19 at 10:10

Paolo · Accepted Answer · 2019-09-11T11:33:48.560

I suggest you split up the string with strsplit and use contains for the filtering, which is a lot more readable and maintainable than a regex pattern:

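% Sample table from the question. The trailing space after '<a' keeps the
% concatenated string reading '<a href=...' rather than '<ahref=...'.
htmlString = ['<tr><td><a href="blu">blu</a></td><td>value</td></tr><tr><td><a ',...
'href="bla">findme</a></td><td>value</td></tr><tr><td><a ',...
'href="ble">ble</a></td><td>value</td></tr>'];

keyword = 'findme';
% Split at every row opening tag; each piece is one row without its '<tr>'.
splitStrings = strsplit(htmlString,'<tr>');
% Keep the piece that contains the keyword and restore the '<tr>' prefix.
desiredRow = ['<tr>' splitStrings{contains(splitStrings,keyword)}]

The output is:

<tr><td><a href="bla">findme</a></td><td>value</td></tr>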

Alternatively you may also combine extractBetween and contains:

allRows = extractBetween(htmlString,'<tr>','</tr>');
desiredRow = ['<tr>' allRows{contains(allRows,keyword)} '</tr>']
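The brace indexing above assumes exactly one row contains the keyword. With no hit it yields an empty `<tr></tr>`, and with several hits it glues the rows together into one string. A minimal sketch that keeps each hit as a separate row, assuming R2016b or later for `extractBetween` and `contains`; the sample table and the names `hits` and `matchingRows` are illustrative and not from the question:

```matlab
% Hypothetical sample: two of the three rows contain the keyword.
htmlString = ['<tr><td>a</td><td>findme</td></tr>', ...
              '<tr><td>b</td><td>other</td></tr>', ...
              '<tr><td>c</td><td>FindMe too</td></tr>'];
keyword = 'findme';

% One cell per row, without the enclosing <tr>...</tr> tags.
allRows = extractBetween(htmlString, '<tr>', '</tr>');

% Parentheses (not braces) keep the result as a cell array, so zero,
% one or several hits are all handled the same way.
hits = allRows(contains(allRows, keyword, 'IgnoreCase', true));

% Put the row tags back around every hit.
matchingRows = cellfun(@(r) ['<tr>' r '</tr>'], hits, 'UniformOutput', false);

if isempty(matchingRows)
    disp('No row contains the keyword.');
else
    fprintf('%s\n', matchingRows{:});
end
```

The `'IgnoreCase'` option mirrors the case-insensitive behaviour of `regexpi` mentioned in the question; drop it if the match should be case-sensitive.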

If you must use regex:

regexp(htmlString,['<tr><td>[^>]+>' keyword '.*?<\/tr>'],'match')
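The pattern above expects the keyword to follow directly after `<tr><td>` and a single tag. The pattern in the question, `<tr[^>]*>(.*?findme.*?)<\/tr[^>]*>`, picks up preceding rows because the engine reports the leftmost match: starting at the first `<tr>`, the lazy `.*?` simply keeps expanding across row boundaries until it reaches the keyword. One possible way around this, shown here only as a sketch, is to forbid the filler from stepping onto another `<tr` tag by using a negative lookahead. The sample table with a `class` attribute and the names `withinRow` and `rows` are illustrative assumptions. Note that `\b` is not a word boundary in Matlab regular expressions, hence the `[\s>]` test.

```matlab
% Hypothetical sample: the keyword sits in the second cell of a row
% whose <tr> tag carries an attribute.
htmlString = ['<tr><td><a href="blu">blu</a></td><td>value</td></tr>', ...
              '<tr class="x"><td>value</td><td><a href="bla">FindMe</a></td></tr>', ...
              '<tr><td><a href="ble">ble</a></td><td>value</td></tr>'];
keyword = 'findme';

% Matches any single character, as few times as possible, but never
% steps onto the start of another '<tr' tag.
withinRow = '(?:(?!<tr[\s>]).)*?';

% regexptranslate escapes characters in the keyword that are special in
% a regular expression (e.g. '.' or '(').
pattern = ['<tr[^>]*>' withinRow regexptranslate('escape', keyword) ...
           withinRow '</tr>'];

% regexpi ignores case; 'match' returns the matched text in a cell array.
rows = regexpi(htmlString, pattern, 'match');
```

This returns every row that contains the keyword, one cell per row. It still stops at the first `</tr>` after the keyword, so it does not handle nested tables correctly.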

Thanks, true that with strsplit this is much simpler. — Buck Thorn, Sep 11 '19 at 13:54

score 0 · Answer 2 · answered Sep 11 '19 at 10:23

Try this

%<td><a href="bla">(.*?)</a>%sg

https://regex101.com/r/0Xq0mO/1

ZMBA

Sorry, should modify my question. Basically bla etc (content of other cells and tags in the table) could be anything - the important part is the presence of "findme" and that a single line is caught. – Buck Thorn Sep 11 '19 at 11:48
