R regex: specifying output selections from wider string matches

Question

One for the regex enthusiasts. I have a vector of strings in the format:

<TEXTFORMAT LEADING="2"><P ALIGN="LEFT"><FONT FACE="Verdana" STYLE="font-size: 10px" size="10" COLOR="#FF0000" LETTERSPACING="0" KERNING="0">Desired output string containing any symbols</FONT></P></TEXTFORMAT>

I'm aware of the perils of parsing this sort of stuff with regex. It would however be useful to know how to efficiently extract an output sub-string of a larger string match - i.e. the contents of angle quotes >...< of the font tag. The best I can do is:

require(stringr)
strng = str_extract(strng, "<FONT.*FONT>") # select font statement
strng = str_extract(strng, ">.*<")         # select inside tags
strng = str_extract(strng, "[^/</>]+")     # remove angle quote symbols

What would be the simplest formula to achieve this in R?

Richie Cotton · Accepted Answer · 2013-10-22T11:13:11.717

Use str_match, not str_extract (or str_match_all if each string can contain several matches). Wrap the part that you want to match in parentheses; the capture group is returned in its own column, after the full match.

str_match(strng, "<FONT[^<>]*>([^<>]*)</FONT>")
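
To illustrate how the parentheses select the output, here is a minimal base R sketch of the same capture-group idea, using regexec and regmatches so that it runs without stringr. The sample strings and the variable name font_text are made up for illustration. With stringr, column 2 of the matrix returned by str_match holds the same values.

```r
# Hypothetical sample input: shortened versions of the strings in the question.
strng <- c(
  '<TEXTFORMAT LEADING="2"><P ALIGN="LEFT"><FONT FACE="Verdana" SIZE="10">Desired output</FONT></P></TEXTFORMAT>',
  '<P ALIGN="LEFT"><FONT COLOR="#FF0000">Symbols: 100% & more!</FONT></P>',
  '<P ALIGN="LEFT"><FONT COLOR="#FF0000">a > b</FONT></P>'
)

pattern <- "<FONT[^<>]*>([^<>]*)</FONT>"

# regexec() records the position of the full match and of each capture group;
# regmatches() turns those positions into character vectors, one per input string.
matches <- regmatches(strng, regexec(pattern, strng))

# Element 1 of each vector is the full match, element 2 is the first capture group.
# Strings with no match yield character(0), which is mapped to NA here.
font_text <- vapply(
  matches,
  function(m) if (length(m) >= 2) m[2] else NA_character_,
  character(1),
  USE.NAMES = FALSE
)

print(font_text)
```

The third string comes back as NA because its text contains a >, which [^<>]* refuses to match. That is the price of the stricter pattern: it will not run past a tag boundary, but it also rejects text containing angle brackets.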

Or parse the document and extract the contents that way.

library(XML)
doc <- htmlParse(strng)
fonts <- xpathSApply(doc, "//font")
sapply(fonts, function(x) as(xmlChildren(x)$text, "character"))

As agstudy mentioned, xpathSApply takes a function argument that makes things easier.

xpathSApply(doc, "//font", xmlValue)

edited Oct 22 '13 at 11:13

answered Oct 22 '13 at 09:46

Ah, the brackets specify the output. But in that case a simpler formula is `str_match(strng, "(.*)")[1,2]`. I did play around with the XML method but failed to get that working, so thanks for that too. – geotheory Oct 22 '13 at 10:08
and `...[,2]` for the vector. – geotheory Oct 22 '13 at 10:14
@geotheory There's a tradeoff in how specific you want to be in the match. More specific matches are usually faster and give fewer false positives, but you might get a false negative if your text contains unusual characters. – Richie Cotton Oct 22 '13 at 10:33
Noted. I wonder Richie if you could possibly advise on [this related question](http://stackoverflow.com/questions/19497652/reading-kml-files-into-r?noredirect=1#comment28928052_19497652) re your XML method. I'm working with a KML file but the XML structure is proving tricky to parse. – geotheory Oct 22 '13 at 10:49
@RichieCotton +1! You can make it shorter (omitting the last line) with something like `xpathSApply(doc, "//font",xmlChildren)$text` or `xpathSApply(doc, "//font",xmlValue)` – agstudy Oct 22 '13 at 11:06

Simon O'Hanlon · Answer 2 · 2013-10-22T09:58:31.950

You can also do it with gsub, but I think there are too many possible variations in your input vector that could cause this to break...

gsub( "^.*(?<=>)(.*)(?=</FONT>).*$" , "\\1" , x , perl = TRUE )
#[1] "Desired output string containing any symbols"

Explanation

^.* - match any characters from the start of the string
(?<=>) - positive lookbehind zero-width assertion: the subsequent match only succeeds if it is preceded by this, i.e. a >
(.*) - then match any characters (this is now a numbered capture group)...
(?=</FONT>) - ...until you match "</FONT>"
.*$ - then match any characters to the end of the string

In the replacement we substitute the entire match with numbered capture group \\1; there is only one capture group, which holds everything between > and </FONT>.
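
The behaviour described above can be checked with a small sketch. The sample strings and the helper name extract_font are hypothetical. The second pattern is an alternative without lookarounds, not part of the original answer, that anchors on the opening FONT tag instead of on any >.

```r
# Hypothetical sample input: one well-behaved string and one whose text contains ">".
x <- c(
  '<P ALIGN="LEFT"><FONT FACE="Verdana" SIZE="10">Desired output</FONT></P>',
  '<P ALIGN="LEFT"><FONT FACE="Verdana" SIZE="10">a > b</FONT></P>'
)

# Lookaround version from the answer: the greedy ^.* backtracks to the LAST ">"
# that still has "</FONT>" somewhere after it.
lookaround <- gsub("^.*(?<=>)(.*)(?=</FONT>).*$", "\\1", x, perl = TRUE)

# Alternative sketch (an assumption, not the answer's code): match the opening
# FONT tag explicitly, so a ">" inside the text no longer splits the capture.
extract_font <- function(s) {
  sub("^.*<FONT[^>]*>(.*)</FONT>.*$", "\\1", s, perl = TRUE)
}
anchored <- extract_font(x)

print(lookaround)  # "Desired output" " b"
print(anchored)    # "Desired output" "a > b"
```

Both sub and gsub return the input unchanged when the pattern does not match, so a result that still contains tags is a sign that the string did not have the expected shape.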

Use at your peril.

Thanks Simon. Yes I've heard that line: "you had a problem and tried regex. now you have two problems.." :) — geotheory, Oct 22 '13 at 10:11
