3

I want to remove any tags, so that input such as

<p>hello <namespace:tag : a>hello</namespace:tag></p>

becomes

 <p> hello hello </p>

What is the best way to do this? If it is regex, for some reason mine is not working. Can anyone help?

(<|</)[:]{1,2}[^</>]>

Paul

4 Answers

3

Definitely use an XML parser. Regex should not be used to parse *ML.
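
As a rough illustration of the parser route, here is a minimal sketch using only the DOM parser bundled with the JDK. It assumes the input is well-formed XML, which the sample in the question is not (`<namespace:tag : a>` would be rejected), so the attribute is written as `a="1"` here. The class name `XmlText` and the method `textOf` are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: returns the concatenated text of a well-formed XML string.
final class XmlText {
    static String textOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted input cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // getTextContent() joins the text of all descendants and drops the tags.
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }
}
```

Calling `XmlText.textOf("<p xmlns:namespace=\"urn:example\">hello <namespace:tag a=\"1\">hello</namespace:tag></p>")` should give `hello hello`. For real-world HTML that is not well-formed, a lenient HTML parser is probably a better fit than a strict XML one.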

Bozho
  • Direct link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Louis Wasserman Feb 02 '12 at 22:40
  • @LouisWasserman: I was just adding that link. That answer is a standard. – RanRag Feb 02 '12 at 22:42
  • @sln Using terms like "such as" and "what is the best way" should indicate that he isn't looking for a specific solution but a more general solution. I'm flagging your comments as unconstructive. – The Real Baumann Feb 02 '12 at 23:07
  • @The Real Baumann - I would agree with you if the OP hadn't used explicit examples of what his problem was. It's not as if he is looking for the pros/cons of XML regex parsing in general. If you don't want my solutions on this board then unfortunately I will go elsewhere. –  Feb 02 '12 at 23:38
  • @sln I'd like to direct your attention to the question itself, which is "Java Regex or XML parser?" And the correct answer to that question is indeed "XML parser". – biziclop Feb 03 '12 at 00:19
3

You should not use regex for this purpose. Use a parser like lxml or BeautifulSoup instead:

>>> import lxml.html as lxht
>>> myString = '<p>hello <namespace:tag : a>hello</namespace:tag></p>'
>>> lxht.fromstring(myString).text_content()
'hello hello'

Here is a reason why you should not parse html/xml with regex.

RanRag
  • +1 - My mistake. I just want to see solutions instead of the standard "don't do xml parsing with regex"; yours had a solution, sorry! –  Feb 02 '12 at 23:45
  • I tried, but it wants you to edit it before it will reverse my vote to upvote. Just do a pseudo edit and my upvote will be enabled. I'll check back later. –  Feb 03 '12 at 00:43
2

If you're just trying to pull the plain text out of some simple XML, the best approach (fastest, smallest memory footprint) would be to just run a for loop over the data:

PSEUDOCODE BELOW

bool inMarkup = false;
string text = "";
for each character in data // (dunno what you're reading from)
{
    char c = current;
    if( c == '<' ) inMarkup = true;
    else if( c == '>') inMarkup = false;
    else if( !inMarkup ) text += c;
}

Note: This will break if you encounter things like CDATA, JavaScript, or CSS in your parsing.
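
For anyone who wants something they can paste into Java, the pseudocode above could be sketched roughly as follows. The class name `MarkupStripper` and the method `stripTags` are made up for this example, and it assumes the whole input is already in memory as a string rather than coming from a stream.

```java
// Hypothetical helper: a direct Java translation of the character loop above.
final class MarkupStripper {
    static String stripTags(CharSequence data) {
        StringBuilder text = new StringBuilder(data.length());
        boolean inMarkup = false;
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '<') {
                inMarkup = true;       // entering a tag: stop copying
            } else if (c == '>') {
                inMarkup = false;      // leaving a tag: resume copying
            } else if (!inMarkup) {
                text.append(c);        // plain text outside any tag
            }
        }
        return text.toString();
    }
}
```

With the sample from the question, `MarkupStripper.stripTags("<p>hello <namespace:tag : a>hello</namespace:tag></p>")` returns `hello hello`. The same caveats apply: a literal `<` or `>` inside text, CDATA, or script content will confuse it, and entities such as `&amp;` are left undecoded.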

So, to sum up... if it's simple, do something like the above rather than a regular expression. If it isn't that simple, listen to the other guys and use an advanced parser.

The Real Baumann
  • He didn't specify whether he was reading from a stream or just a string, or whether his content has CDATA or the like so that part of the answer varies. I was just providing a simple solution that covers a large subset of the problem domain. Thanks for the criticism though. – The Real Baumann Feb 02 '12 at 23:05
  • +1 - Sorry, my bad. Put up a pseudo-edit so my upvote can count. –  Feb 02 '12 at 23:46
0

This is a solution I personally used for a similar problem in Java. The library used for this is Jsoup: http://jsoup.org/.

In my particular case I had to unwrap tags that had an attribute with a particular value in them. You can see that reflected in this code. It's not the exact solution to this problem, but it could put you on your way.

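  // Imports assumed for this snippet: Validate and StringUtils are taken to be
  // Apache Commons Lang 3; the remaining types come from Jsoup.
  import org.apache.commons.lang3.StringUtils;
  import org.apache.commons.lang3.Validate;
  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;
  import org.jsoup.nodes.Document.OutputSettings;
  import org.jsoup.nodes.Element;
  import org.jsoup.select.Elements;

  // Unwraps every <tagName> element (keeping its children). If an attribute name is
  // given, only elements whose attribute value is blank after removing matchRegEx
  // are unwrapped.
  public static String unWrapTag(String html, String tagName, String attribute, String matchRegEx) {
    Validate.notNull(html, "html must be non null");
    Validate.isTrue(StringUtils.isNotBlank(tagName), "tagName must be non blank");
    if (StringUtils.isNotBlank(attribute)) {
      Validate.notNull(matchRegEx, "matchRegEx must be non null when an attribute is provided");
    }
    Document doc = Jsoup.parse(html);
    // Keep the original whitespace instead of letting Jsoup re-indent the output.
    OutputSettings outputSettings = doc.outputSettings();
    outputSettings.prettyPrint(false);
    Elements elements = doc.getElementsByTag(tagName);
    for (Element element : elements) {
      if (StringUtils.isBlank(attribute)) {
        // No attribute filter: unwrap every matching element.
        element.unwrap();
      } else {
        String attr = element.attr(attribute);
        if (StringUtils.isNotBlank(attr)) {
          // Unwrap only if nothing is left of the value once the pattern is removed.
          String newData = attr.replaceAll(matchRegEx, "");
          if (StringUtils.isBlank(newData)) {
            element.unwrap();
          }
        }
      }
    }
    // Note: this returns the whole parsed document, not just the original fragment.
    return doc.html();
  }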
kenny