I have the following HTML string: <a href="http://www.nndc.bnl.gov/nsr/fastsrch_act2.jsp?aname=F.V.Adamian">F.V.Adamian</a>, <a href="http://www.nndc.bnl.gov/nsr/fastsrch_act2.jsp?aname=G.G.Akopian">G.G.Akopian</a>

I want to form a single plain text string with the author names so that it looks something like (I can fine tune the punctuation later):

F.V.Adamian, G.G.Akopian.

I'm trying to use 'regexp' in Matlab. When I do the following: regexpi(htmlstring,'">.*</a>','match')

I get:

">F.V.Adamian</a>, <a href="http://www.nndc.bnl.gov/nsr/fastsrch_act2.jsp?aname=G.G.Akopian">G.G.Akopian</a>,

Why? I'm trying to get it to output every match (hence I did not use the 'once' option), that is, all characters between "> and </a>, which is the author's name. It works fine for the first one but not for the second. I am happy to strip the "> and </a> with a regexprep(regexpstring,'','') later.

I see that regexprep(htmlstr, '<.*?>','') works and does what I want. But I don't get it...

Kent
  • Whenever I see `HTML` with `Regex`, only [this question](http://stackoverflow.com/q/1732348/1679863) comes to my mind. – Rohit Jain Jul 09 '13 at 19:51
  • Interesting... however, I think using regular expression parsing for sufficiently simple HTML strings is possible. Also, Matlab seems to have implemented some powerful features specifically for HTML. – Kent Jul 09 '13 at 19:59
  • As far as I know, the only proper way to parse arbitrary HTML to DOM in Matlab is to call Java classes. If the file is valid XML, i.e., [XHTML](http://en.wikipedia.org/wiki/XHTML), I think that [`xmlread`](http://www.mathworks.com/help/matlab/ref/xmlread.html) (which is based on the [SAX parser](http://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html)) could be made to work. – horchler Jul 09 '13 at 20:08

1 Answer

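In .*? the ? tells the .* to be lazy rather than greedy. By default, .* tries to match as much as it can, so your pattern runs from the first "> all the way to the last </a> in the string. When you add the ?, it instead matches as little as it can and stops at the first </a> it reaches.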
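To make the difference concrete, here is a minimal sketch using the two-author string from the question. The variable names (greedy, lazy, tok, names, authors, plain) are only illustrative, and strjoin is assumed to be available, which requires a reasonably recent Matlab release.

```matlab
% The two-anchor string from the question, stored as a char row vector.
htmlstring = ['<a href="http://www.nndc.bnl.gov/nsr/fastsrch_act2.jsp?aname=F.V.Adamian">F.V.Adamian</a>, ' ...
              '<a href="http://www.nndc.bnl.gov/nsr/fastsrch_act2.jsp?aname=G.G.Akopian">G.G.Akopian</a>'];

% Greedy: .* grabs as much as possible, so the match runs from the first ">
% to the LAST </a> and both anchors come back as a single match.
greedy = regexp(htmlstring, '">.*</a>', 'match');

% Lazy: .*? stops at the FIRST </a> it can reach, giving one match per anchor.
lazy = regexp(htmlstring, '">.*?</a>', 'match');

% Capture only the name: the parentheses define a token, and 'tokens'
% returns a cell array holding one inner cell per match.
tok = regexp(htmlstring, '">(.*?)</a>', 'tokens');
names = cellfun(@(c) c{1}, tok, 'UniformOutput', false);

% Join the captured names into one plain-text string.
authors = strjoin(names, ', ');

% The approach from the question: lazily strip every <...> tag.
plain = regexprep(htmlstring, '<.*?>', '');

disp(numel(greedy))   % number of greedy matches
disp(numel(lazy))     % number of lazy matches
disp(authors)
disp(plain)
```

Using a token avoids the second pass that would otherwise be needed to trim the "> and </a> delimiters from each match. As the comments under the question point out, this only holds up for simple, well-behaved HTML like the string here.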

source

Jean-Bernard Pellerin
  • Thanks! That helped a lot. The question mark as a greediness operator is blowing my mind! – Kent Jul 09 '13 at 20:03