I am working in R to analyze a web page with a complex structure, and I want to extract the information contained in font tags. The problem is that the data inside tables is also wrapped in font tags.

XPath examples:

text/div/font
table/tbody/tr/td/div/font

Since the structure is very complex, I cannot predict the exact XPath, so I am using //font as the XPath to extract the relevant data. But because the information in tables is also contained in font tags, I am getting information that is not relevant to my analysis.

xpathCodefont <- "//font"
htmlCodeFonts <- xpathSApply(htmlCode,xpathCodefont,xmlValue)

Is there any syntax that lets me skip the font tags that come from a path containing a table? In other words, how can I exclude font tags that have a table as an ancestor (not necessarily as the direct parent)?

Thanks in advance,

MariPlaza

1 Answer

It would have been nice to include a reproducible example so we could test possible solutions, but I think you want

xpathCodefont <- "//font[not(ancestor::table)]"

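That should return only the font elements that have no table element anywhere among their ancestors, whether the table is the direct parent or several levels up.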
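
Since the question had no reproducible example, here is a minimal sketch of how the predicate could be checked. The HTML string and the variable names `html`, `allFonts`, and `nonTableFonts` are made up for illustration, and it assumes the XML package (the one that provides xpathSApply) is installed.

```r
library(XML)

# Hypothetical sample page (not from the question): one font tag under a
# plain div, and one font tag nested several levels below a table.
html <- '<html><body>
<div><font>keep me</font></div>
<table><tbody><tr><td><div><font>skip me</font></div></td></tr></tbody></table>
</body></html>'

# Parse the string as HTML
htmlCode <- htmlParse(html, asText = TRUE)

# Original query: matches every font element, including the one in the table
allFonts <- xpathSApply(htmlCode, "//font", xmlValue)

# Filtered query: drops any font element with a table among its ancestors
nonTableFonts <- xpathSApply(htmlCode, "//font[not(ancestor::table)]", xmlValue)

print(allFonts)
print(nonTableFonts)
```

On this sample, the first query picks up both font elements and the second only the one outside the table. The ancestor axis covers the parent as well as every level above it, so the same predicate works whether the table is directly above the font tag or further up.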

MrFlick