
I have the following piece of HTML that I'm trying to match with a regex using the gregexpr function in R:

<div class=g-unit>
<div class=nwp style=display:inline>
<input type=hidden name=cid value="22144">
<input autocomplete=off class=id-fromdate type=text size=10 name=startdate value="Sep 6, 2013"> -
<input autocomplete=off class=id-todate type=text size=10 name=enddate value="Sep 5, 2014">
<input id=hfs type=submit value=Update style="height:1.9em; margin:0 0 0 0.3em;">
</div>
</div>
</div>
<div id=prices class="gf-table-wrapper sfe-break-bottom-16">
<table class="gf-table historical_price">
<tr class=bb>
<th class="bb lm lft">Date
<th class="rgt bb">Open
<th class="rgt bb">High
<th class="rgt bb">Low
<th class="rgt bb">Close
<th class="rgt bb rm">Volume
<tr>
...
...
</table>
</div>

I am trying to extract the table part from this HTML using the following regular expression:

<table\\s+class="gf-table historical_price">.+<

When I run the gregexpr function with perl=FALSE, it works fine and I get a result. However, if I run it with perl=TRUE, I get back nothing; the pattern doesn't seem to match.

Does anyone know why the results are different from just switching Perl on and off? Many thanks in advance!

Axeman
Taavi
  • [You should not parse HTML with regexes](http://stackoverflow.com/a/1732454/725418). Use a parser instead. – TLP Sep 05 '14 at 17:14
  • I cannot readily obtain the content as posted, but to build on @TLP's comment, something like this (using the XML package) should work: doc <- htmlTreeParse('your content URL', useInternal = TRUE) ; xpathSApply(doc, "//table[@class='gf-table historical_price']//th", xmlValue, trim = TRUE) – lawyeR Sep 05 '14 at 19:29

2 Answers


It seems that in R's default (extended) regex mode the dot is able to match newline characters, which is not the case in Perl mode. To make the pattern work with perl = TRUE, use the (?s) modifier so that the dot matches newline characters too:

> m <- gregexpr('(?s)<table\\s+class="gf-table historical_price">.+</table>', str, perl = TRUE)

In many regex flavors, the dot doesn't match newlines by default, probably to make line-by-line processing more convenient.

The s in the inline modifier (?s) stands for "singleline". In other words, the dot treats the whole string as a single line, even when it contains newline characters.
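The difference can be checked on a small input. The following is a minimal sketch in base R; the string `x` and the `<b>` pattern are made up for illustration and are not taken from the question:

```r
# Hypothetical two-line input; not the HTML from the question
x <- "<b>one\ntwo</b>"

# Default engine (perl = FALSE): the dot also matches the newline
default_hit <- grepl("<b>.+</b>", x)

# PCRE engine (perl = TRUE): the dot stops at the newline, so no match
perl_hit <- grepl("<b>.+</b>", x, perl = TRUE)

# PCRE engine with (?s): the dot matches the newline again
perl_s_hit <- grepl("(?s)<b>.+</b>", x, perl = TRUE)

print(c(default = default_hit, perl = perl_hit, perl_s = perl_s_hit))
```

Only the middle call should fail to match, which mirrors the behaviour described in the question.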

Casimir et Hippolyte

You need to use the inline (?s) modifier forcing the dot to match all characters, including line breaks.

The perl=TRUE argument switches to the PCRE library, which implements Perl-compatible regex pattern matching.

gregexpr('(?s)<table\\s+class="gf-table historical_price">.+</table>', x, perl = TRUE)
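gregexpr only returns match positions and lengths, so the matched text still has to be pulled out. A minimal sketch using base R's regmatches, assuming `x` holds the page source (the short string below is a hypothetical stand-in, not the real page):

```r
# Hypothetical stand-in for the page source held in x
x <- paste(
  "<div id=prices>",
  "<table class=\"gf-table historical_price\">",
  "<tr><th>Date<th>Open",
  "</table>",
  "</div>",
  sep = "\n"
)

pattern <- '(?s)<table\\s+class="gf-table historical_price">.+</table>'
m <- gregexpr(pattern, x, perl = TRUE)

# regmatches() returns a list with one character vector per input string
table_html <- regmatches(x, m)[[1]]
cat(table_html, "\n")
```

Because `.+` is greedy, a page containing several tables would be matched up to the last `</table>`, which is one more reason to prefer a parser.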

However, as stated in the comments, a parser is recommended for this. I would start out with the XML package.

library(XML)  # provides htmlParse(), xpathSApply() and xmlValue()
cat(paste(xpathSApply(htmlParse(html), '//table[@class="gf-table historical_price"]', xmlValue), collapse = "\n"))
hwnd