2

I receive malformed XML text input like:

"<Tag>something</Tag> 8 > 3, 2 < 3, ... <Tag>something</Tag>"

I want to clean the input to get:

"<Tag>something</Tag> 8 &gt; 3, 2 &lt; 3, ... <Tag>something</Tag>"

That is, escape stray special symbols like < and > while keeping the valid tags (such as "<Tag>something</Tag>"; note that the case must be preserved).

Do you know of any Java library that does this? Perhaps an XML/HTML parser? (I don't really need a parser, though, simply a "clean" procedure.)

juanmirocks
  • Orphan '>' characters are not a problem. But how can you tell whether a particular '<' character starts a tag or is a less-than symbol? Do your XML documents follow a single DTD or XML Schema? Or is a stray '<' *always* followed by something, like a digit, that cannot start a `Name` in XML? – erickson Dec 13 '11 at 13:50
  • They are not my XML documents and there is no schema. Unfortunately, I found a case where a "<" symbol was followed by neither a space nor a digit... – juanmirocks Dec 13 '11 at 13:53

5 Answers

6

JTidy is "HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML"

But it can also be used with XML. Check the documentation. It's incredibly smart, so it will probably work for you.

Pablo Grisafi
2

I don't know of any library that would do that. Your input is malformed XML, and no proper XML parser would accept it. More important, it is not always possible to distinguish an actual tag from something that looks-like-a-tag-but-is-really-text. Therefore any heuristic-based attempt that you make to solve the problem will be fragile; i.e. it could occasionally produce malformed XML.

The best approach is to address the problem before you assemble the XML.

  • If you generate the XML by (for example) unparsing a DOM, then the unparser will take care of the escaping for you.
  • If you are generating the XML by templating or string bashing, then you need to call something like StringEscapeUtils.escapeXml on the relevant text chunks ... before the XML tags get incorporated.
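As a rough illustration of the first bullet, the sketch below builds the question's sample through a DOM and lets the JDK's built-in serializer do the escaping. The class name `DomAssembly`, the method `build`, and the wrapper element `Root` are hypothetical names chosen for this example; only standard `javax.xml` and `org.w3c.dom` APIs are assumed.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper: assembles XML through a DOM so the serializer escapes text nodes. */
final class DomAssembly {
    private DomAssembly() {}

    static String build(String tagText, String freeText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

            // "Root" is an assumed wrapper element; the question's snippet shows no parent.
            Element root = doc.createElement("Root");
            doc.appendChild(root);

            Element tag = doc.createElement("Tag");
            tag.setTextContent(tagText);               // raw text, no manual escaping
            root.appendChild(tag);
            root.appendChild(doc.createTextNode(freeText));

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Parser or transformer configuration problems are not recoverable here.
            throw new IllegalStateException("Could not serialize document", e);
        }
    }
}
```

Calling `DomAssembly.build("something", " 8 > 3, 2 < 3")` should yield a `<Tag>` element followed by text in which the stray `<` appears as `&lt;`, because the markup and the text never share one unescaped string.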

If you leave the problem until after the "XML" has been assembled, it cannot be properly fixed.

Stephen C
  • I don't assemble the XML. You're right, any heuristic-based attempt may eventually fail. Yet I think/hope that a solution like @gatkin 's would make it for the input I get. – juanmirocks Dec 13 '11 at 13:49
  • *"I don't assemble the XML"* - then the best solution is to *reject* the XML as being malformed. Use your favorite XML validator to provide documentary evidence. Interoperability standards are standards, and software that doesn't conform to them is **buggy** and should be fixed ... not compensated for. – Stephen C Dec 14 '11 at 04:58
  • I crawl and fetch some data given by a bioinformatics database and I must get that data. Still, I take your point. – juanmirocks Dec 14 '11 at 13:26
1

The best solution is to fix the program generating your text input. The easiest such fix would involve an escape utility like the other answers suggested. If that's not an option, I'd use a regular expression like

</?[a-zA-Z]+ */?>

to match the expected tags, and then split the string up into tags (which you want to pass through unchanged) and text between tags (against which you want to apply an escape method.)
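A minimal sketch of that split-and-escape idea, using only `java.util.regex`. The class `TagPreservingEscaper` and its methods are hypothetical names for this example, and the small escape routine is a stand-in for a library call such as `StringEscapeUtils.escapeXml`. It assumes the input contains no entities that are already escaped; an existing `&gt;` would be double-escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: copies tag-like tokens through unchanged and escapes the text between them. */
final class TagPreservingEscaper {
    // The pattern from the answer: '<', optional '/', letters, optional spaces, optional '/', '>'.
    private static final Pattern TAG = Pattern.compile("</?[a-zA-Z]+ */?>");

    private TagPreservingEscaper() {}

    static String clean(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        Matcher m = TAG.matcher(input);
        int last = 0;
        while (m.find()) {
            appendEscaped(out, input, last, m.start()); // text before this tag
            out.append(m.group());                      // the tag itself, case preserved
            last = m.end();
        }
        appendEscaped(out, input, last, input.length()); // trailing text after the last tag
        return out.toString();
    }

    // Stand-in escaper. Assumption: '&' in the input is always a bare ampersand,
    // never the start of an existing entity.
    private static void appendEscaped(StringBuilder out, String s, int from, int to) {
        for (int i = from; i < to; i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                default:  out.append(c);
            }
        }
    }
}
```

On the question's sample, `TagPreservingEscaper.clean(...)` returns the desired output. The heuristic still passes through anything that merely looks like a tag, so text such as `x <y> z` is left untouched, and tags with attributes are not matched by this pattern and would be escaped.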

I wouldn't count on an XML parser to be able to do it for you because what you're dealing with isn't valid XML. It is possible for the existing lack of escaping to produce ambiguities, so you might not be able to do a perfect job either.

gatkin
0

Check out Guava's XmlEscaper. It is in pre-release for version 11 but the code is available.

John B
  • No. At least from what I read in the code, it escapes everything, just like StringEscapeUtils.escapeXml, including the special symbols of proper tags. – juanmirocks Dec 13 '11 at 13:05
  • What do you mean by `special symbols of proper tags`? – John B Dec 13 '11 at 13:07
  • See, the special symbols of 'something' should not be escaped – juanmirocks Dec 13 '11 at 13:10
  • I don't get your point. Everything in XML is within SOME tag. In your example, `8 > 3, 2 < 3, ...` is in the contents of the parent tag of `Tag` just as `something` is the contents of the `Tag` tag. There is no distinction in XML parsing. Seems as though you are attempting to place an arbitrary distinction that no library would support. – John B Dec 13 '11 at 13:24
  • Right, it's wrong XML syntax, that's why I say I need to clean the input. The cleaner should understand that those special symbols in `8 > 3, 2 < 3` don't open a new tag and should be escaped. Maybe I'm wrong and this is actually HTML but that would be my use case. – juanmirocks Dec 13 '11 at 13:30
  • My point is there is no difference between `8 > 3, 2 < 3` and `something`. Both are contents of XML tags and therefore any escaper will treat them the same. You seem to be asking for the escaper to treat the contents of some tags differently from the contents of other tags. – John B Dec 13 '11 at 13:37
-1

Apache Commons Lang contains a class named StringEscapeUtils which does exactly what you want! The method you'd want to use is escapeXml, I presume.

stdll
  • No. It escapes everything, including the < and > symbols of the proper tags. This is because it doesn't understand the XML structure and simply replaces characters in the string. – juanmirocks Dec 13 '11 at 12:49
  • Hmm, if I think some more about it... depending on what you want to do with the input it would be best to parse it with an XML parser. I doubt that tools which operate solely on strings know the difference between < and > which are part of the tags and those which are part of simple text. Also, using a parser allows you to process the input further. I don't know your exact use case, though. – stdll Dec 13 '11 at 13:14