Java split a CSV ignoring HTML characters

Question

I need to split a string on semicolons, ignoring the semicolons that are part of HTML character references such as &#59;. For instance, given the string:

id=com.google.android;keywords=Android&#59;Operating System&#59;Phone;versions=Gingerbread&#59;ICS&#59;JB

I need to split it into:

id=com.google.android
keywords=Android&#59;Operating System&#59;Phone
versions=Gingerbread&#59;ICS&#59;JB

Any idea how to do this?

Can't you get the String with another separator (for example '|')? — Axel, Jan 18 '13 at 14:58
@RohitJain I tried messing around with lookarounds but with no results in sight... I did not post my attempts since they did not even compile as valid regex patterns — Miguel Ribeiro, Jan 18 '13 at 15:36
Would it be OK to replace the HTML entities with their respective Unicode characters first? Then you could just split on ';'. You could set up your own replacement table or use some existing library (also read http://stackoverflow.com/questions/994331/java-how-to-decode-html-character-entities-in-java-like-httputility-htmldecode). — Axel, Jan 18 '13 at 15:40
@Axel the problem with that in this case is that `&#59;` represents the semicolon character... — Ian Roberts, Jan 18 '13 at 15:54

Ian Roberts · Answer 1 · 2013-01-18T17:36:44.827

A regex like (?<!&#?[0-9a-zA-Z]+); would probably do it. This would prevent matching a semicolon that terminates an entity reference or character reference, though it also catches a few cases that are not technically either by the specs (e.g. it wouldn't match the semicolon at the end of &#foo; or &123;).

(?<!...) is a "negative lookbehind", so you can read this regex as matching a semicolon that is not preceded by a substring that matches &#?[0-9a-zA-Z]+ (i.e. ampersand, optional hash, and one or more alphanumerics). However, lookbehinds must have an upper bound on the number of characters they can match, which + doesn't, so you'll have to use a bounded repetition count like {1,5} rather than the unbounded +. The upper bound needs to be at least as long as the longest entity name you might see, and if your data might contain arbitrary entity references then you'll have to use something like the length of the string as the upper bound.

String[] keyValuePairs = theString.split(
    "(?<!&#?[0-9a-zA-Z]{1," + theString.length() + "});");

If you can specify a smaller bound then that would probably be more efficient.
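As a rough illustration of the smaller-bound variant, here is a minimal sketch that fixes the repetition at {1,10}. The class name, the method name and the bound of 10 are assumptions made for this example, not something taken from the question; pick a bound that covers the longest entity name in your own data.

```java
class BoundedLookbehindSplit {
    // Assumption: no entity name in the data is longer than 10 characters.
    // The lookbehind rejects any ';' that closes "&name" or "&#digits".
    private static final String SEPARATOR = "(?<!&#?[0-9a-zA-Z]{1,10});";

    static String[] split(String input) {
        return input.split(SEPARATOR);
    }

    public static void main(String[] args) {
        String input = "id=com.google.android;"
                + "keywords=Android&#59;Operating System&#59;Phone;"
                + "versions=Gingerbread&#59;ICS&#59;JB";
        // Prints one key=value pair per line.
        for (String part : split(input)) {
            System.out.println(part);
        }
    }
}
```

On a standard desktop JVM this should print the three key=value pairs from the question, with the &#59; references left untouched.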

Edit: Android apparently doesn't accept this lookbehind, even with bounded repetition, so you probably won't be able to do what you're after with a single regex and String.split. You'll have to do the looping yourself, e.g.

Pattern p = Pattern.compile("(?:&#?[0-9a-zA-Z]+)?;");
Matcher m = p.matcher(theString);
List<String> splits = new ArrayList<String>();
int lastEltStart = 0;
while(m.find()) {
  if(m.end() - m.start() > 1) {
    // this match was an entity/character reference so don't split here
    continue;
  }
  if(m.start() > lastEltStart) {
    // non-empty part
    splits.add(theString.substring(lastEltStart, m.start()));
  }
  lastEltStart = m.end();
}
if(lastEltStart < theString.length()) {
  // non-empty final part
  splits.add(theString.substring(lastEltStart));
}
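For anyone who wants to try the loop directly, the following sketch wraps the same logic in a self-contained class. The class and method names are hypothetical, chosen only for this example. One behavioural difference from String.split is worth knowing: because of the "non-empty part" checks, this loop drops every empty element, whereas String.split only drops trailing ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityAwareSplitter {
    // Matches either a bare ';' or a whole reference such as "&amp;" or "&#59;".
    private static final Pattern SEMICOLON_OR_REFERENCE =
            Pattern.compile("(?:&#?[0-9a-zA-Z]+)?;");

    static List<String> split(String input) {
        List<String> parts = new ArrayList<String>();
        Matcher m = SEMICOLON_OR_REFERENCE.matcher(input);
        int partStart = 0;
        while (m.find()) {
            if (m.end() - m.start() > 1) {
                continue; // a reference, not a separator: keep scanning
            }
            if (m.start() > partStart) {
                parts.add(input.substring(partStart, m.start()));
            }
            partStart = m.end();
        }
        if (partStart < input.length()) {
            parts.add(input.substring(partStart));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("id=com.google.android;keywords=Android&#59;Phone"));
        System.out.println(split("a;;b")); // the empty middle element is dropped
    }
}
```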

Thanks! I'm no pro with regex and was not able to come up with a solution like yours... I just found a glitch in the pattern. If I write it as you posted it, I get the exception "Exception in thread "main" java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 18 (?<!&#?[0-9a-zA-Z]+);". So, to make it work, I replaced + with {1, 99999999999}. I just cannot find a reason for this magic number to work... — Miguel Ribeiro, Jan 18 '13 at 15:51
@MiguelRibeiro Java has obviously inherited the Perl restriction that lookbehinds must have bounded length. I've edited the answer with some suggestions. — Ian Roberts, Jan 18 '13 at 16:03
Just checked... It seems that, even with the limit, the Android SDK throws the same exception... It uses Java 1.5 (and I tested with a spike solution using Java 1.6), so I don't know if there's some detail that I'm missing... — Miguel Ribeiro, Jan 18 '13 at 16:25
@MiguelRibeiro in that case you'll have to implement the looping logic yourself rather than doing it in one regex with `String.split` — Ian Roberts, Jan 18 '13 at 17:13

score 0 · Accepted Answer · answered Jan 18 '13 at 20:06

Since the HTML character references in my data have only two or three digits between the '&#' and the ';', I used the following regex:

(?<!&#\d{2,3});
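A minimal sketch of how this pattern might be used from Java follows; the class and method names are made up for the example. Note that the backslash has to be doubled inside a Java string literal, and that this pattern only protects numeric references with two or three digits, so a named entity such as &amp; would still be split at its semicolon.

```java
class NumericReferenceSplit {
    // "\\d" in Java source is the regex \d; the lookbehind is bounded at 3 digits.
    private static final String SEPARATOR = "(?<!&#\\d{2,3});";

    static String[] split(String input) {
        return input.split(SEPARATOR);
    }

    public static void main(String[] args) {
        String input = "id=com.google.android;"
                + "keywords=Android&#59;Operating System&#59;Phone;"
                + "versions=Gingerbread&#59;ICS&#59;JB";
        for (String part : split(input)) {
            System.out.println(part);
        }
        // Limitation: the ';' that ends a named entity is still a split point.
        System.out.println(split("a&amp;b;c").length); // 3 parts, not 2
    }
}
```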

Miguel Ribeiro
