I'd like to use regexp to pick a single row containing desired keywords from an HTML table. (I would normally use something else, but I want to do the task within a Matlab function.)

I am calling the Matlab function regexpi (the case-insensitive version of regexp), which is akin to PHP regex from what I can tell.

Here's a snippet from such an HTML table to parse:

<tr><td><a href="blu">blu</a></td><td>value</td></tr><tr><td><a
href="bla">findme</a></td><td>value</td></tr><tr><td><a
href="ble">ble</a></td><td>value</td></tr>

The desired row to pick contains the word "findme".

(added:) The content of other cells and tags in the table could be anything (here "bla" is a dummy value). The important part is the presence of "findme" and that a single row, not more, is caught (or all rows containing "findme", although I do not expect more than one). Any paired name/value table on a Wikipedia page is a good example.

I tinkered with https://regex101.com/ using whatever I could dig up in the Matlab documentation (lookahead/lookbehind, combinations of :, > and ?), but I have failed to identify a pattern that picks just the right row (or all rows that contain the keyword "findme"). For instance, the following pattern picks the text but not the entire row: <tr[^>]*>[^>]*.*?(findme).*?<\/td

The pattern <tr[^>]*>(.*?findme.*?)<\/tr[^>]*> picks the row but also captures the preceding rows, because the match starts at the first <tr> in the string.

Note that the original task I had set out to do was to capture entire tables and then parse them, but the Matlab regexp-powered function I found for the task had trouble with nested tables (or I had trouble implementing it for the task).

The question is: how can I return a row containing desired keywords from an HTML table, programmatically, within a Matlab function (without calling an external program)? A bonus question is how to solve the nested table issue, but maybe that's another question.

Buck Thorn
  • Your question is a bit confusing. Are you asking for a Matlab solution or a PHP solution? If it's PHP, then there are better ways to parse HTML than regex (which [isn't well suited for parsing HTML](https://stackoverflow.com/a/1732454/2453432)). I would recommend using something like [DOMDocument](https://www.php.net/manual/en/class.domdocument.php) instead. But that's if you're looking for a PHP solution. If not, please remove the PHP tag. – M. Eriksson Sep 11 '19 at 09:54
  • @MagnusEriksson I am looking for a matlab solution (m-file or matlab function). I include PHP only because it is my understanding that matlab regexp is PHP like and thought I might find more people with PHP than matlab regexp background. Also, https://regex101.com/ which I played with does not do matlab, only PHP. – Buck Thorn Sep 11 '19 at 09:59
  • @MagnusEriksson I think a PHP pattern will work equally well within Matlab. – Buck Thorn Sep 11 '19 at 10:01
  • Do you have access to the **Text Analytics Toolbox** ? – Paolo Sep 11 '19 at 10:06
  • *how to return a row containing desired keywords* the keyword being "bla" ? – Paolo Sep 11 '19 at 10:08
  • @UnbearableLightness don't have that toolbox but thanks for the tip. The keyword example is "findme". Bla etc is just filler. – Buck Thorn Sep 11 '19 at 10:10

2 Answers

I suggest you split up the string with strsplit and use contains for the filtering, which is a lot more readable and maintainable than a regex pattern:

htmlString = ['<tr><td><a href="blu">blu</a></td><td>value</td></tr><tr><td><a ',...
'href="bla">findme</a></td><td>value</td></tr><tr><td><a ',...
'href="ble">ble</a></td><td>value</td></tr>'];

keyword = 'findme';
% Split on the row opener; each piece after the first is one row without its '<tr>'.
splitStrings = strsplit(htmlString,'<tr>');
% Put back the '<tr>' that strsplit removed (assumes exactly one row matches).
desiredRow = ['<tr>' splitStrings{contains(splitStrings,keyword)}]

The output is:

<tr><td><a href="bla">findme</a></td><td>value</td></tr>

Alternatively you may also combine extractBetween and contains:

allRows = extractBetween(htmlString,'<tr>','</tr>');
desiredRow = ['<tr>' allRows{contains(allRows,keyword)} '</tr>']
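
Both snippets above assume that exactly one row contains the keyword; if several rows match, the brace indexing concatenates them into a single string. Here is a minimal sketch that returns every matching row as its own cell and ignores case (as regexpi would). It assumes R2016b or later for contains and extractBetween, and rows that open with a bare <tr>. The variable names isHit and matchingRows, and the second matching row in the sample string, are made up for illustration:

```matlab
% Sample input: the third row is altered so that two rows match.
htmlString = ['<tr><td><a href="blu">blu</a></td><td>value</td></tr>', ...
              '<tr><td><a href="bla">findme</a></td><td>value</td></tr>', ...
              '<tr><td><a href="ble">FindMe too</a></td><td>value</td></tr>'];
keyword = 'findme';

% Row contents without the surrounding <tr>...</tr> tags (cell array).
allRows = extractBetween(htmlString, '<tr>', '</tr>');

% Logical mask of rows that contain the keyword, ignoring case.
isHit = contains(allRows, keyword, 'IgnoreCase', true);

% Parenthesis indexing keeps a cell array; strcat re-wraps each row.
matchingRows = strcat('<tr>', allRows(isHit), '</tr>');
```

If only the first hit is wanted, matchingRows{1} can be taken after checking that matchingRows is not empty.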

If you must use regex:

regexp(htmlString,['<tr><td>[^>]+>' keyword '.*?<\/tr>'],'match')
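
The pattern above expects each row to start with <tr><td> followed by exactly one tag before the keyword. As a sketch of a more general variant (not part of the original answer), a negative lookahead can stop the match from crossing into another row, which is what the question's pattern <tr[^>]*>(.*?findme.*?)<\/tr[^>]*> was missing. This assumes the keyword should be matched literally and case-insensitively; rowBody, pattern and rows are illustrative names:

```matlab
htmlString = ['<tr><td><a href="blu">blu</a></td><td>value</td></tr>', ...
              '<tr><td><a href="bla">findme</a></td><td>value</td></tr>', ...
              '<tr><td><a href="ble">ble</a></td><td>value</td></tr>'];
keyword = 'findme';

% Consume one character at a time, but never step onto the start of
% another <tr ...> or </tr> tag, so a match stays inside a single row.
rowBody = '(?:(?!</?tr[\s>]).)*?';

% Escape the keyword in case it contains regex metacharacters.
pattern = ['<tr[^>]*>' rowBody regexptranslate('escape', keyword) rowBody '</tr>'];

% Cell array holding every row that contains the keyword.
rows = regexpi(htmlString, pattern, 'match');
```

Note that the keyword is also found inside attribute values such as href="findme". With nested tables this pattern would return the innermost row containing the keyword, which may or may not be the row that is wanted.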

Paolo

Try this

%<td><a href="bla">(.*?)</a>%sg

https://regex101.com/r/0Xq0mO/1

ZMBA
  • Sorry, should modify my question. Basically bla etc (content of other cells and tags in the table) could be anything - the important part is the presence of "findme" and that a single line is caught. – Buck Thorn Sep 11 '19 at 11:48