I am using Web-Harvest (http://web-harvest.sourceforge.net/), the open source web scraping tool.

The regex I am trying to use contains "<" and ">" characters (because I am trying to strip out all incoming HTML tags). Since the configuration file is XML, this causes a parse error: the content of elements must consist of well-formed character data or markup.

I need to somehow escape the regex, but can't figure out how.

Any ideas?

kburns
  • HTML parsing is a solved problem. Consider whether you actually need to reinvent a solution using a regex. A mandatory SO link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – jasso Feb 10 '11 at 21:08

1 Answer

To make the regular expression well-formed XML, try replacing < with &lt; and > with &gt;. Similarly, if you have an & in your regular expression, you will need to replace it with &amp;.
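
As a rough illustration of why the escaping works, here is a minimal Python sketch. The pattern `<[^>]+>` is an assumed example, not the one from the question, and `<regexp-pattern>` is used only as a stand-in element name. The sketch escapes the pattern, embeds it in XML, and checks that an XML parser hands back the original characters:

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumed example pattern for matching tags; not taken from the question.
pattern = r"<[^>]+>"

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
escaped = escape(pattern)

# Stand-in element name, used here only to get a well-formed XML snippet.
xml_snippet = f"<regexp-pattern>{escaped}</regexp-pattern>"

# Parsing the XML turns the entities back into the original characters,
# so the regex engine sees the unescaped pattern.
recovered = ET.fromstring(xml_snippet).text

stripped = re.sub(recovered, "", "<p>Hello <b>world</b></p>")
print(escaped)
print(stripped)
```

The escaped form is what goes into the configuration file; the tool reading the file should see the plain `<` and `>` again after the XML is parsed.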

Also, I'd suggest using an HTML parser instead of a regular expression for this task.
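
To give a sense of the parser-based approach, here is a small sketch using `html.parser` from Python's standard library. It is not part of Web-Harvest, and the names `TextExtractor` and `strip_tags` are made up for this example. It collects only the text nodes and drops the tags:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper that keeps text nodes and ignores tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; entities are already decoded.
        self.chunks.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_tags("<p>Hello <b>world</b> &amp; friends</p>"))
```

Unlike a regex, this also decodes character references such as `&amp;` in the text it returns.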

Mark Byers