I have a text file that contains a sequence of escaped Unicode character values like:

"{"\u0985\u0982\u09b6\u0998\u099f\u09bf\u09a4","\u0985\u0982\u09b6\u09be\u0982\u09b6\u09bf","\u0985\u0982\u09b6\u09be\u0999\u09cd\u0995\u09bf\u09a4","\u0985\u0982\u09b6\u09be\u09a6\u09bf","\u0985\u0982\u09b6\u09be\u09a8\u09cb"}"

I am trying to match and group the values inside the quotes using the Pattern class in Java, as shown below, but I cannot find any match.

Pattern p = Pattern.compile("\"(\\[u]{1}\\w+)+\"");

Example

What I actually want to find out is where the technical error in my regexp is.

Rakib
  • possible duplicate of [Matching (e.g.) a Unicode letter with Java regexps](http://stackoverflow.com/questions/5315330/matching-e-g-a-unicode-letter-with-java-regexps) – ryekayo Mar 18 '15 at 20:53
  • check unicode matching http://www.regular-expressions.info/unicode.html – Federico Piazza Mar 18 '15 at 21:14
  • I took the whole given portion as a single string, and from there I wanted to capture the quoted portions. Do the rules for 'unicode matching' really apply here? All I intend to do is extract the characters inside each pair of quotes, along with the quotes! – Rakib Mar 18 '15 at 21:27
  • have you tried checking your regexp in [regexr](http://www.regexr.com)? – ha9u63a7 Mar 18 '15 at 21:43
  • The normal regexp works just fine, added the link at the bottom. I guess, there is something going wrong while preparing the string. – Rakib Mar 18 '15 at 21:58

2 Answers

Try something more like this:

Pattern p = Pattern.compile("\"(\\\\u[0-9a-f]{4})+\"");

In order to match the string \u you need the regex \\u, and to express that regex as a Java string literal means \\\\u. Following the u there must be exactly four hex digits.
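To see the escaping in action, here is a minimal sketch built around the pattern above. The class name `EscapedUnicodeMatcher`, the helper `extractQuoted`, and the shortened sample input are assumptions made for illustration. It also wraps the repetition in one extra capturing group, which is my own variation: a repeated group on its own only keeps its last repetition, so the outer group is what returns the whole quoted value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
class EscapedUnicodeMatcher {
    // Four backslashes in the Java literal become two in the regex text,
    // which the regex engine reads as one literal backslash before the u.
    // The (?: ... ) part repeats per escape; the outer group keeps the full run.
    private static final Pattern QUOTED_ESCAPES =
            Pattern.compile("\"((?:\\\\u[0-9a-f]{4})+)\"");

    // Returns the content of every quoted run of escape sequences, without the quotes.
    static List<String> extractQuoted(String input) {
        List<String> results = new ArrayList<>();
        Matcher m = QUOTED_ESCAPES.matcher(input);
        while (m.find()) {
            results.add(m.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        // Shortened sample: each escape is spelled out as text
        // (backslash, u, four hex digits), as it would be in the file.
        String input = "{\"\\u0985\\u0982\",\"\\u0985\\u09b6\"}";
        for (String value : extractQuoted(input)) {
            System.out.println(value);
        }
    }
}
```

Note that `[0-9a-f]` only accepts lowercase hex digits, which is what the sample data uses. If the file may contain uppercase digits, `[0-9a-fA-F]` would be the safer choice.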

Ian Roberts

First, [u]{1} means "match one character from the set [u], exactly once", so you can replace it with a plain u.

Once that is done, your regex tries to match a quote, a backslash, a u, another backslash, one or more literal w characters, and then the closing quote. It matches literal w's instead of word characters because there are too many backslashes before the w.

Happy coding!

Edit
Try replacing the \\ before the u with \\\\. A bare \u is not valid in some regex flavors, so when the pattern is written as a Java string, the \\ probably reaches the regex engine as \u and breaks the regex.
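One way to check what the regex engine actually receives is to print each pattern string and try it against a sample. The sketch below is only an illustration: the class `EscapeLayers` and its helpers `compiles` and `finds` are hypothetical names, and the sample text is shortened. As far as I can tell, in the pattern from the question the `\\[` reaches the engine as `\[`, a literal opening bracket, so that pattern looks for the text `[u]` followed by word characters rather than for a backslash followed by u.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class for comparing the escaping levels.
class EscapeLayers {
    // True if java.util.regex accepts the regex text at all.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // True if the regex compiles and matches somewhere in the input.
    static boolean finds(String regex, String input) {
        return compiles(regex) && Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Sample text: a quote, two escapes spelled out as plain text, a quote.
        String sample = "\"\\u0985\\u0982\"";

        String original = "\"(\\[u]{1}\\w+)+\"";          // pattern from the question
        String twoSlashes = "\"(\\u[0-9a-f]{4})+\"";      // only two backslashes before u
        String fourSlashes = "\"(\\\\u[0-9a-f]{4})+\"";   // four backslashes before u

        // Printing a pattern string shows the regex text after Java unescaping.
        System.out.println(original + " finds sample: " + finds(original, sample));
        System.out.println(original + " finds \"[u]abc\": " + finds(original, "\"[u]abc\""));
        System.out.println(twoSlashes + " compiles: " + compiles(twoSlashes));
        System.out.println(fourSlashes + " finds sample: " + finds(fourSlashes, sample));
    }
}
```

The two-backslash variant is rejected by `Pattern.compile` because the engine sees a backslash-u escape that is not followed by four hex digits. Only the four-backslash form asks for a literal backslash followed by u.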

Blue0500
  • Two backslashes before 'w' are necessary, because the first '\' tells the string literal to take the next '\' literally, which leaves '\w' once the string has been interpreted. – Rakib Mar 18 '15 at 21:48
  • On regex101, it works when I remove one of the slashes. I think there is an escaping problem somewhere, then. – Blue0500 Mar 18 '15 at 21:51