
I'm trying to "play around" with some REST APIs and Java code.

As I mainly work with German text, I already managed to get the Apache HTTP Client to work with UTF-8 encoding to make sure umlauts are handled correctly.

Still I can't get my regex to match my words correctly.

I am trying to find words or word combinations like "Büro_Licht" in a string like ..."type":"Büro_Licht"....

Using the regex ".*?type\":\"(\\w+).*?" returns "B" for me, as it doesn't recognize the "ü" as a word character. That makes sense, since \w is documented as [a-zA-Z_0-9]. For strings without special characters I do get the full "Office_Light".

So I tried another hint, mentioned here in a nearly identical question (which I could not comment on because I lack the reputation points).

Using the regex ".*?type\":\"(\\p{L}+).*?" returns "Büro" for me. But this time it stops at the underscore, for a reason I don't understand.

Is there a nice way to combine both expressions to get the "full" word including underscores and special characters?

Jason Aller
renegade2k
  • Just a comment here, if you're trying to parse JSON with regex, just **don't**. Use a JSON parser. – Mena Mar 29 '18 at 12:12
  • I tried JSON first, but to my mind it was easier to handle the REST requests with the HTTP Client just as plain text and put/get commands over URL. For this I built myself some handy methods to work with the plain text, and regex is part of it ;) – renegade2k Mar 29 '18 at 12:14
  • You are building your own padded cell, that is. Trust me on that one. – Mena Mar 29 '18 at 12:18
  • Why don't you just capture any character until the next following `"`? (ie. `.*?type\":\"(.+?)(?<!\\)\"`, but, don't parse json with regexes, anyway) – guido Mar 29 '18 at 12:40
  • Worth a try. Seems legit ^^ – renegade2k Mar 29 '18 at 12:48
  • @ᴳᵁᴵᴰᴼ Will that work with `"type":"C:\\Program Files\\"`? – VGR Mar 29 '18 at 19:32

1 Answer


If you have to keep using regex, which is not a great tool for parsing JSON, try the character class [\p{L}_], which accepts both letters and the underscore. In your case it would be:

String regex = ".*?type\":\"([\\p{L}_]+)\"";  // group 1 captures the value

Online example: https://regex101.com/r/57oFD5/2

\p{L} matches any kind of letter from any language

_ matches the underscore character literally
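To make the difference between the three patterns concrete, here is a minimal sketch using only java.util.regex. The class name `TypeExtractor`, the helper `extract`, and the sample input are assumptions made for illustration and are not from the question. The third pattern uses the embedded flag `(?U)` (UNICODE_CHARACTER_CLASS), which is an alternative to the character class above rather than part of this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class TypeExtractor {
    // Plain \w is ASCII-only by default, so it stops at the first umlaut.
    static final Pattern ASCII_WORD =
            Pattern.compile("\"type\":\"(\\w+)");

    // Letters from any script plus the underscore, up to the closing quote.
    static final Pattern LETTERS_OR_UNDERSCORE =
            Pattern.compile("\"type\":\"([\\p{L}_]+)\"");

    // (?U) enables UNICODE_CHARACTER_CLASS, so \w also covers non-ASCII letters.
    static final Pattern UNICODE_WORD =
            Pattern.compile("(?U)\"type\":\"(\\w+)\"");

    // Returns capture group 1 of the first match, or null if nothing matches.
    static String extract(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The escape \u00fc is the letter ü, written this way to avoid source-encoding issues.
        String json = "{\"name\":\"x\",\"type\":\"B\u00fcro_Licht\"}";
        System.out.println(extract(ASCII_WORD, json));            // B
        System.out.println(extract(LETTERS_OR_UNDERSCORE, json)); // Büro_Licht
        System.out.println(extract(UNICODE_WORD, json));          // Büro_Licht
    }
}
```

Note that `[\p{L}_]+` does not accept digits, while `(?U)\w+` does, so the two are not interchangeable if values such as "Licht_2" can occur.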

This will get hectic if you need to support other languages, whitespace, and various other Unicode code points. For example, do you need to support an arbitrary amount of whitespace around the colon? Take a look at this answer on removing emojis; there are many corner cases.
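As one example of such a corner case, whitespace around the colon can be tolerated by adding `\s*` on both sides. This is only a sketch: the class name `LenientTypeExtractor`, the method `extractType`, and the sample inputs are assumptions for illustration, and the pattern still rejects values containing spaces, digits, or escaped quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class LenientTypeExtractor {
    // \s* on both sides of the colon allows optional whitespace.
    static final Pattern TYPE_VALUE =
            Pattern.compile("\"type\"\\s*:\\s*\"([\\p{L}_]+)\"");

    // Returns the captured value, or null if the input does not match.
    static String extractType(String input) {
        Matcher m = TYPE_VALUE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The escape \u00fc is the letter ü.
        System.out.println(extractType("{\"type\" :  \"B\u00fcro_Licht\"}")); // Büro_Licht
        System.out.println(extractType("{\"type\":\"Office_Light\"}"));      // Office_Light
        System.out.println(extractType("{\"type\":\"Room 1\"}"));            // null
    }
}
```

The last call shows the limitation: a value with a space or a digit falls outside the character class, which is one reason a JSON parser is the more robust choice.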

Karol Dowbecki
  • Thanks for the hint. I tried the (updated) code, but for some reason my script is hanging now. Maybe for some other reason, maybe not ... I will find out ;) – renegade2k Mar 29 '18 at 12:42