Web harvest -- remove unusual characters

Question

I'm trying to scrape a page that has some spaces after the anchors:

</a>&nbsp;&nbsp;|&nbsp;&nbsp;

I can't seem to find a way to specify the text: I either trigger a processor error or fail to match the string at all. Everything after the </a> causes the html-to-xml conversion to fail, since the XML is not well formed when those characters are included. So I need to remove everything after the </a> (note that elsewhere in the doc the </a> is followed by a div tag or something else).

My code:

<xpath expression="/">
    <regexp replace="true">
        <regexp-pattern>(nbsp;)</regexp-pattern>
        <regexp-source>
            <html-to-xml omitcomments="true" advancedxmlescape="true" prunetags="head,script,meta,meta ,p,base,br,link,img,image,input,option,nbsp;">
                <http url="http://mysite.org/map/aindex/" method="get" />
            </html-to-xml>
        </regexp-source>
        <regexp-result>
            <template></template>
        </regexp-result>
    </regexp>
</xpath>

I think my problem is with the regexp-pattern. I've tried:

    &nbsp;
    \& nbsp;  (without the space in between; SO doesn't display that correctly)
    \s+\|\s+

among other things. I even tried to put the expression in a CDATA element, but I can't get this to work either.

Any thoughts?

This looks like another good example of why regex-based web scraping is deficient. I hope you can figure out how to make it work. Here is a funny and classic Stack-O answer: http://stackoverflow.com/a/1732454/564406 — David, Oct 15 '12 at 14:26

score 2 · Answer 1 · answered Dec 08 '12 at 22:21

For &nbsp; in regexp-pattern, you can try \u00A0, the Unicode escape for the non-breaking space character.
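A minimal sketch of how that might look in the config from the question. It assumes that by the time the regexp runs, html-to-xml has already turned each &nbsp; entity into the literal U+00A0 character, which may depend on the Web-Harvest version and cleaner settings, and that the pattern is handed to Java's regex engine unchanged. The prunetags and other attributes from the question are left out here only to keep the fragment short.

```xml
<xpath expression="/">
    <regexp replace="true">
        <!-- \u00A0 is the Java regex escape for the non-breaking space.
             Assumption: the cleaned source contains the literal character,
             not the text "&nbsp;". -->
        <regexp-pattern>\u00A0+</regexp-pattern>
        <regexp-source>
            <html-to-xml>
                <http url="http://mysite.org/map/aindex/" method="get" />
            </html-to-xml>
        </regexp-source>
        <regexp-result>
            <!-- Empty template: each match is replaced with nothing. -->
            <template></template>
        </regexp-result>
    </regexp>
</xpath>
```

If the pipe separator should go as well, a pattern such as [\u00A0\s]*\|[\u00A0\s]* could be tried instead, though that would also strip any other | characters surrounded by whitespace in the page.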

Alexander
