I'm trying to scrape a page that has some &nbsp; entities after the anchors:
</a>&nbsp;|
I can't seem to find a way to specify that text: I either trigger a processor error or fail to match the string at all. Everything after the &nbsp; causes the html-to-xml conversion to fail, since the XML is not well formed when those characters are included. So I need to remove everything after the &nbsp; (note that elsewhere in the doc a div tag or something else follows the &nbsp;).
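The well-formedness problem can be reproduced outside Web-Harvest. Below is a minimal Python sketch, assuming a made-up fragment shaped like the page described above (the `raw` string and the `parses` helper are hypothetical, not taken from the real site). It shows that a parser with no DTD rejects the literal &nbsp; entity, and that replacing or stripping it on the raw text first yields parseable markup.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the page described above.
raw = '<div><a href="/map/a/">A</a>&nbsp;|&nbsp;<a href="/map/b/">B</a></div>'


def parses(markup):
    """Return True if markup is well-formed XML for a parser with no DTD."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Option 1: swap the HTML-only entity for its numeric form, which XML accepts.
numeric = raw.replace("&nbsp;", "&#160;")

# Option 2: strip the entity entirely by matching its literal text.
stripped = re.sub(r"&nbsp;", "", raw)

print(parses(raw), parses(numeric), parses(stripped))
```

If the same holds inside Web-Harvest, the replacement would need to run on the raw response text before any XML parsing happens, but that is an assumption about the processor order and not something verified here.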
My code:
<xpath expression="/">
  <regexp replace="true">
    <regexp-pattern>(nbsp;)</regexp-pattern>
    <regexp-source>
      <html-to-xml omitcomments="true" advancedxmlescape="true" prunetags="head,script,meta,meta ,p,base,br,link,img,image,input,option,nbsp;">
        <http url="http://mysite.org/map/aindex/" method="get" />
      </html-to-xml>
    </regexp-source>
    <regexp-result>
      <template></template>
    </regexp-result>
  </regexp>
</xpath>
I think my problem is with the regexp-pattern. I've tried:
\& nbsp; (without the space in between; SO doesn't display that correctly)
\s+\|\s+
among other things. I even tried putting the expression in a CDATA element, but I can't get that to work either.
Any thoughts?