This is about parsing inline CSS style properties of HTML. I'm using JSoup but so far as I'm able to ascertain JSoup has chosen not to help with this... I'm not sure why. It means that the users have to find out the rules for legal characters in keys and values, etc., i.e. what constitutes "properly formed" CSS style "attributes" (is this even the correct term? [later: no! style "properties", according to CSSParser]).

Anyway, what I want to do, in extracting each individual key-value pair, is to split them on the semicolon... but the trailing semicolon after the last pair is optional. So, allowing for white space, the last pair is followed either by a semicolon or directly by the end of the String.

So I tried this:

Pattern styleSubattrsPattern = Pattern.compile( "([A-Za-z0-9-]+)\\s*:\\s*([A-Za-z0-9-]+)\\s*[$;]");

... with [$;] intended to mean "either a semicolon OR the end of the String". But it doesn't work: the final key-value pair is not matched.
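
A minimal reproduction of the behaviour described above might look like the following. The sample style strings, the class name `DollarInClassDemo` and the helper `pairs` are made up for illustration; only the pattern comes from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DollarInClassDemo {
    // Pattern copied from the question. Inside [...] the $ is a literal dollar sign,
    // not the end-of-input anchor.
    static final Pattern STYLE = Pattern.compile(
            "([A-Za-z0-9-]+)\\s*:\\s*([A-Za-z0-9-]+)\\s*[$;]");

    // Hypothetical helper: collects every matched pair as "key=value".
    static List<String> pairs(String style) {
        List<String> found = new ArrayList<>();
        Matcher m = STYLE.matcher(style);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // No trailing semicolon: the last pair is lost.
        System.out.println(pairs("color: red; width: 10px"));  // [color=red]
        // With a trailing semicolon both pairs are found.
        System.out.println(pairs("color: red; width: 10px;")); // [color=red, width=10px]
        // A literal dollar sign is accepted as a terminator, which was never intended.
        System.out.println(pairs("color: red; width: 10px$")); // [color=red, width=10px]
    }
}
```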

later

The root problem was indeed solved by using CSSParser.

mike rodent
  • `(?![^;])` should do the trick if you don't want to match the `;` (it means *not followed by a character that is not a `;`*) – Casimir et Hippolyte Feb 18 '17 at 21:09
  • Interesting... that seems to work... Tx for the explanation... just trying to get my head around that! – mike rodent Feb 18 '17 at 21:12
  • In fact your phrase in italics sums up precisely what is needed... and presumably corresponds to the way the browser algorithms have to parse this ... so maybe you should make an answer with it? – mike rodent Feb 18 '17 at 21:24
  • It seems that you are [looking for a java CSS parser](http://stackoverflow.com/questions/1513587/looking-for-a-css-parser-in-java). Beware of regex limitations. – Patrick Parker Feb 18 '17 at 21:58
  • @PatrickParker yes, we're often encouraged by experienced SO users to say **why** we're posing a particular question, which is why I explained where I was coming from. Great link: do you recommend one in particular? – mike rodent Feb 18 '17 at 22:02

2 Answers

Using [$;] will match either a semicolon or a literal dollar sign: inside [] most special characters, including $, lose their special meaning and stand for the character itself (an exception is a ^ at the start, which negates the class).

What you probably want is this: ((;)|($))

Alternatively, you could also use the question mark to denote an optional character, if you expect an end-of-line after the semicolon: ;?$.
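
As a rough sketch of the alternation approach, here is the question's pattern with the character class swapped for (;|$), the shorter form the comments below settle on. The class name `SemicolonOrEndDemo`, the helper `pairs` and the sample strings are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SemicolonOrEndDemo {
    // Outside a character class, $ keeps its end-of-input meaning,
    // so (;|$) reads "a semicolon, or the end of the input".
    static final Pattern STYLE = Pattern.compile(
            "([A-Za-z0-9-]+)\\s*:\\s*([A-Za-z0-9-]+)\\s*(;|$)");

    // Hypothetical helper: collects every matched pair as "key=value".
    static List<String> pairs(String style) {
        List<String> found = new ArrayList<>();
        Matcher m = STYLE.matcher(style);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(pairs("color: red; width: 10px"));  // [color=red, width=10px]
        System.out.println(pairs("color: red; width: 10px;")); // [color=red, width=10px]
        // Missing separator: "color: red" is not accepted as a pair.
        System.out.println(pairs("color: red width: 10px"));   // [width=10px]
    }
}
```

Note that find() simply skips text it cannot match, so detecting malformed input would need an extra check that the matches cover the whole string.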

Tyzoid
  • Excellent... thanks for the explanation. And yes, the *final trailing* semicolon is of course optional: in fact I've now put `((;?)|($))`. – mike rodent Feb 18 '17 at 20:49
  • No problem @mikerodent, let me know how that works out for you. – Tyzoid Feb 18 '17 at 20:50
  • PS it *appears* that in fact you don't need the inner brackets: this appears to work OK: `(;?|$)`. – mike rodent Feb 18 '17 at 20:57
  • I think you want `(;|$)`, because `(;?|$)` can match anywhere – Patrick Parker Feb 18 '17 at 20:59
  • @mikerodent See Patrick's comment. I used the extra parentheses to ensure it matched properly, as I'm not as familiar with Java's regex engine. – Tyzoid Feb 18 '17 at 21:02
  • @PatrickParker I'm not sure that you're right about that: if you inadvertently miss out a semicolon in your HTML, for example, the match won't happen because after the end of your "value" characters, and any trailing white space, any subsequent characters other than ";" will cause the match not to happen. Certainly identifying "malformed" CSS style attributes is another question, but 1) I don't think `(;?|$)` will "match anywhere" and 2) you have to allow for the final attribute having a trailing ";"... (I wish JSoup would handle this dull stuff!). – mike rodent Feb 18 '17 at 21:06
  • @mikerodent `(;?|$)` will match: a semicolon, nothing, or end of line. Since it can match nothing, it can match anywhere, as a zero-length match is a match. `echo 'testing:hi' | grep -Eo '^.(;?|$)'` will match 't', even though what follows it is neither end-of-line nor a semicolon. – Tyzoid Feb 18 '17 at 21:12
  • Yes, but it doesn't matter if it matches "nothing"... it won't extend the match beyond the end of any trailing white space after the "value" characters... or am I going mad? I haven't even yet had a glass of wine so far this evening. – mike rodent Feb 18 '17 at 21:17
  • @mikerodent if that's what you want then you could just write `;?` because matching the end of string is redundant. (but I doubt that the separator is really optional.) – Patrick Parker Feb 18 '17 at 21:31
  • @PatrickParker OK now I have poured a first glass of wine... but I couldn't just write `;?` because that would mean that with `` the code would say "yup, 3 perfectly legitimate matches"... – mike rodent Feb 18 '17 at 21:37
  • @mikerodent obviously... that's why I said I think you want `(;|$)`. I think you are missing somehow that `(;?|$)` is equivalent to `;?` which was the entirety of my point. – Patrick Parker Feb 18 '17 at 21:47
  • @I concur... you're absolutely right. In fact it turns out to be fiendishly simple ;-) – mike rodent Feb 18 '17 at 21:57
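
The point argued in the comments above, that `(;?|$)` behaves like `;?`, can be checked with a small experiment. This is only a sketch: the class name `OptionalSeparatorDemo`, the helper `pairs` and the malformed sample string (a missing semicolon between two pairs) are assumptions, not taken from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalSeparatorDemo {
    static final String PAIR = "([A-Za-z0-9-]+)\\s*:\\s*([A-Za-z0-9-]+)\\s*";

    static final Pattern OPTIONAL_OR_END = Pattern.compile(PAIR + "(;?|$)");
    static final Pattern OPTIONAL_ONLY   = Pattern.compile(PAIR + ";?");
    static final Pattern STRICT          = Pattern.compile(PAIR + "(;|$)");

    // Hypothetical helper: collects every matched pair as "key=value".
    static List<String> pairs(Pattern p, String style) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(style);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String malformed = "color: red width: 10px"; // semicolon missing between pairs

        // ;? can match the empty string, so the $ branch never matters:
        System.out.println(pairs(OPTIONAL_OR_END, malformed)); // [color=red, width=10px]
        System.out.println(pairs(OPTIONAL_ONLY, malformed));   // [color=red, width=10px]
        // Requiring a real semicolon or the end of input rejects the first pair:
        System.out.println(pairs(STRICT, malformed));          // [width=10px]
    }
}
```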

You can do it using a negative lookahead assertion and a negated character class: (?![^;])

This handles the two cases:

  • if there's a character, this one can only be a ;
  • otherwise, only the end of the string (no characters) is allowed.

so:

Pattern styleSubattrsPattern = Pattern.compile( "([A-Za-z0-9-]+)\\s*:\\s*([A-Za-z0-9-]++)\\s*+(?![^;])");

(I added possessive quantifiers to forbid backtracking and avoid useless tests: * => *+ and + => ++)
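
To see the lookahead version in action, a small test harness along these lines could be used. The pattern is the one given above; the class name `LookaheadDemo`, the helper `pairs` and the sample strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // (?![^;]) succeeds only when the next character is a ';' or there is no next character.
    // Being a lookahead, it does not consume the semicolon.
    static final Pattern STYLE = Pattern.compile(
            "([A-Za-z0-9-]+)\\s*:\\s*([A-Za-z0-9-]++)\\s*+(?![^;])");

    // Hypothetical helper: collects every matched pair as "key=value".
    static List<String> pairs(String style) {
        List<String> found = new ArrayList<>();
        Matcher m = STYLE.matcher(style);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(pairs("color: red; width: 10px"));   // [color=red, width=10px]
        System.out.println(pairs("color: red ; width: 10px;")); // [color=red, width=10px]
        // Missing separator: "color: red" is followed by 'w', so it is not matched.
        System.out.println(pairs("color: red width: 10px"));    // [width=10px]
    }
}
```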

Casimir et Hippolyte