I need to strip some XML tags from a text while keeping the content between them.

Example

text text <tag>tag_value</tag> text text <a href="example.com">example.com</a>
->
text text tag_value text text example.com

So far I've used boost::regex_replace, but I can no longer use that library.

#include <string>
#include <boost/regex.hpp>

std::string src(text);
std::string fmt = "";
// Each alternative matches one opening or closing tag to delete.
std::string ex = "(<tag attribute=\"(.*?)\">)|(<a href(.*?)\">)|(</a>)|(<tag>)|(</tag>)";
boost::regex expr(ex);
std::string s2 = boost::regex_replace(src, expr, fmt, boost::match_default | boost::format_all);

How could I solve that problem? What library could help me do that? Thanks
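
One lightweight option, assuming a C++11 standard library with a working `<regex>` is available (the question does not say which compiler is in use), is to port the Boost call to `std::regex_replace`. The sketch below is only an illustration: `strip_tags` is a hypothetical helper name, and the pattern covers just the two tag names from the example. It makes no attempt to handle comments, CDATA sections, or a `>` inside an attribute value.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original post): deletes <tag ...>,
// </tag>, <a ...> and </a>, and keeps the text between them.
std::string strip_tags(const std::string& src) {
    // </? : optional slash for closing tags
    // (tag|a) : the only tag names this sketch removes
    // (\s[^>]*)? : optional attributes, stopping at the first '>'
    static const std::regex expr(R"(</?(tag|a)(\s[^>]*)?>)");
    // An empty format string removes every match.
    return std::regex_replace(src, expr, "");
}
```

Calling `strip_tags` on the example line from the question should yield `text text tag_value text text example.com`.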

user1112008
  • If you are not able to use Boost, what are the criteria for your library selection (e.g. which other libraries are unacceptable too)? – hate-engine Dec 21 '12 at 21:04
  • Only 'light-weight' libraries are acceptable – user1112008 Dec 21 '12 at 21:25
  • What is unacceptable about Boost? In the final executable, you will only have the parts of Boost that you actually use, which is all that you should really care about. You don't have to have any users download some 600+ MB file so that they have "all of Boost". – David Stone Dec 21 '12 at 21:53
  • To be fair, my workplace has also banned Boost because it is "not lightweight." When you distribute that much source in a monolithic bundle and it isn't easily separable, that's an easy reputation to get. – StilesCrisis Dec 22 '12 at 02:28

1 Answer

Never use regular expressions to parse XML!

See RegEx match open tags except XHTML self-contained tags

You need a real XML library like expat or libxml2.

StilesCrisis
  • Question was about *stripping* tags without any further processing. It's ok to use regexps here. – hate-engine Dec 21 '12 at 21:02
  • It's still relevant. Throw a `<![CDATA[` in there or a ` – StilesCrisis Dec 21 '12 at 21:03
  • Ok, right, there is no simple regexp solution for stripping; however, I still think it's overkill *here* to use a full-blown parser. – hate-engine Dec 21 '12 at 21:06
  • The OP did not say whether the XML comes from a known trusted source or not. So we have to assume that the XML could come from anywhere, and could contain things we don't expect. If you control the XML completely, from generation to parsing, and you're okay with not being fully XML compliant in your parser, then sure, you can compromise. But honestly, why recommend the half-assed approach? Let's do it right. – StilesCrisis Dec 21 '12 at 21:10
  • By the way, the Boost counterpart for this is boost::spirit – user1849534 Dec 21 '12 at 22:02
  • If anything you'd want a `boost::property_tree`. http://www.boost.org/doc/libs/1_41_0/doc/html/boost_propertytree/parsers.html#boost_propertytree.parsers.xml_parser – StilesCrisis Dec 22 '12 at 02:25
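
To make the edge cases from this discussion concrete, here is a rough sketch of a hand-written scanner that drops markup while treating comments and CDATA sections specially. It is not a substitute for expat or libxml2: `strip_markup` is a hypothetical name, and the code assumes trusted input, does not decode entities, and will be confused by a `>` inside an attribute value.

```cpp
#include <string>

// Hypothetical helper: removes markup and keeps character data.
// Limitations (assumed acceptable for this sketch): no entity decoding,
// no DOCTYPE handling, no support for '>' inside attribute values.
std::string strip_markup(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in.compare(i, 4, "<!--") == 0) {
            // Comment: skip everything up to and including "-->".
            std::size_t end = in.find("-->", i + 4);
            if (end == std::string::npos) break;  // unterminated comment
            i = end + 3;
        } else if (in.compare(i, 9, "<![CDATA[") == 0) {
            // CDATA: copy the payload verbatim, even if it contains '<'.
            std::size_t start = i + 9;
            std::size_t end = in.find("]]>", start);
            if (end == std::string::npos) {
                out.append(in, start, std::string::npos);
                break;
            }
            out.append(in, start, end - start);
            i = end + 3;
        } else if (in[i] == '<') {
            // Ordinary tag: skip to the next '>'.
            std::size_t end = in.find('>', i + 1);
            if (end == std::string::npos) break;  // unterminated tag
            i = end + 1;
        } else {
            out += in[i++];  // plain character data
        }
    }
    return out;
}
```

A regex that simply deletes `<...>` spans would mangle an input such as `x <![CDATA[1 < 2]]> y`, whereas this scanner keeps the payload `1 < 2`. Anything beyond these two cases is where a real parser earns its keep.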