I created two patterns in RegexBuddy to parse ls -l output, but on Android Pattern.compile throws an error for both. In Java 7 they compiled fine.

The raw patterns are

  1. (^[l,d,-][-,r,w,x]{9})[\t,\s]{1,}([[a-z_][a-z0-9_]{0,30}|[0-9]{1,})[\t,\s]{1,}([[a-z_][a-z0-9_]{0,30}|[0-9]{1,})[\t,\s]{1,}([0-9]{0,})[\t,\s]{1,}([0-9]{4}-[0-9]{1,2}-[0-9]{1,2}\s[0-9]{2}:[0-9]{2})[\t,\s]{1,}(.{1,})
  2. (^[l,d,-][-,r,w,x]{9})[\t,\s]{1,}[0-9]{1,}[\t,\s]{1,}([[a-z_][a-z0-9_]{0,30}|[0-9]{1,})[\t,\s]{1,}([[a-z_][a-z0-9_]{0,30}|[0-9]{1,})[\t,\s]{1,}([0-9]{0,})[\t,\s]{1,}(\w{3}\s[0-9]{1,2}[\t,\s]{1,}([0-9]{1,2}:[0-9]{2}|[0-9]{4}))[\t,\s]{1,}(.{1,})

The first is for matching

-rwxr-xr-x  1 doctor users    399 2011-11-11 13:33 shot.s

or

-rwxr-xr-x  1 100 100    399 2011-11-11 13:33 file.txt

The second is for matching

-rwxr-xr-x  1 doctor users    399 Nov 22  2011 shot.s

or

-rwxr-xr-x  1 100 100    399 Nov 22 13:33 shot.s

In code:

  1. private static final Pattern LS_L =
        Pattern.compile("(^[l,d,-][-,r,w,x]{9})[\\t,\\s]{1,}([[a-z_][a-z0-9_]{0,30}|[0-9]{1,})[\\t,\\s]{1,}([[a-z_][a-z0-9_]{0,30}|[0-9]{1,})[\\t,\\s]{1,}([0-9]{0,})[\\t,\\s]{1,}([0-9]{4}-[0-9]{1,2}-[0-9]{1,2}\\s[0-9]{2}:[0-9]{2})[\\t,\\s]{1,}(.{1,})");
    
  2. private static final Pattern LS_L_1 =
        Pattern.compile("(^[l,d,-][-,r,w,x]{9})[\\t,\\s]{1,}[0-9]{1,}[\\t,\\s]{1,}([[a-z_][a-z0-9_]{0,30}|[0-9]{1,})[\\t,\\s]{1,}([[a-z_][a-z0-9_]{0,30}|[0-9]{1,})[\\t,\\s]{1,}([0-9]{0,})[\\t,\\s]{1,}(\\w{3}\\s[0-9]{1,2}[\\t,\\s]{1,}([0-9]{1,2}:[0-9]{2}|[0-9]{4}))[\\t,\\s]{1,}(.{1,})");
    

The first one throws

02-24 21:14:21.854: E/AndroidRuntime(3072): Caused by: java.util.regex.PatternSyntaxException: Missing closing bracket in character class near index 219:
02-24 21:14:21.854: E/AndroidRuntime(3072): (^[l,d,-][-,r,w,x]{9})[\t,\s]{1,}([[a-z_][a-z0-9_]{0,30}|[0-9]{1,})[\t,\s]{1,}([[a-z_][a-z0-9_]{0,30}|[0-9]{1,})[\t,\s]{1,}([0-9]{0,})[\t,\s]{1,}([0-9]{4}-[0-9]{1,2}-[0-9]{1,2}\s[0-9]{2}:[0-9]{2})[\t,\s]{1,}(.{1,})
02-24 21:14:21.854: E/AndroidRuntime(3072):                                                                                                                                                                                                                            ^
02-24 21:14:21.854: E/AndroidRuntime(3072):     at java.util.regex.Pattern.compileImpl(Native Method)
02-24 21:14:21.854: E/AndroidRuntime(3072):     at java.util.regex.Pattern.compile(Pattern.java:400)
02-24 21:14:21.854: E/AndroidRuntime(3072):     at java.util.regex.Pattern.<init>(Pattern.java:383)
02-24 21:14:21.854: E/AndroidRuntime(3072):     at java.util.regex.Pattern.compile(Pattern.java:374)

The second one gives me

02-24 21:00:24.166: E/AndroidRuntime(1366): Caused by: java.util.regex.PatternSyntaxException: Missing closing bracket in character class near index 250:
02-24 21:00:24.166: E/AndroidRuntime(1366): (^[l,d,-][-,r,w,x]{9})[\t,\s]{1,}[0-9]{1,}[\t,\s]{1,}([[a-z_][a-z0-9_]{0,30}|[0-9]{1,})[\t,\s]{1,}([[a-z_][a-z0-9_]{0,30}|[0-9]{1,})[\t,\s]{1,}([0-9]{0,})[\t,\s]{1,}(\w{3}\s[0-9]{1,2}[\t,\s]{1,}([0-9]{1,2}:[0-9]{2}|[0-9]{4}))[\t,\s]{1,}(.{1,})
Yaroslav Mytkalyk
  • Good lord. I think the compiler is just sympathetic to anyone that ever has to look at that. – Dave Newton Feb 24 '13 at 19:35
  • What are you trying to match? Given the regex, I doubt it could be done entirely in regex unless you specify the pattern you are actually trying to match. – Anirudha Feb 24 '13 at 19:41
  • @Some1.Kill.The.DJ I said I was matching ls -l output. It can be and must be done with regex. It works perfectly in RegexBuddy. Updated the question with examples of output. – Yaroslav Mytkalyk Feb 24 '13 at 19:46

2 Answers

For me, the error was removed by escaping the [ in the [[a-z_] character classes - two in each regex.

 [\\[a-z_]

Some regex implementations do not require [ to be escaped inside a character class, but Java does, because "character classes may appear within other character classes". See Character class subtraction and docs.
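To make the nesting rule concrete, here is a minimal sketch against the standard java.util.regex (not verified on Android's engine). The class name NestedClassDemo and the helper matches are made up for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the question's code.
class NestedClassDemo {
    // Compiles the regex and checks whether it matches the whole input.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // An unescaped '[' inside a class opens a nested class:
        // this is the union of a-c and x-z.
        System.out.println(matches("[a-c[x-z]]", "y"));  // true

        // Escaped, '[' is just one more literal member of the class.
        System.out.println(matches("[\\[a-z_]", "["));   // true
        System.out.println(matches("[\\[a-z_]", "5"));   // false
    }
}
```

So in [[a-z_] the second [ starts a nested class instead of being read as a literal, which leaves the outer class without its closing bracket.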

Incidentally, you could shorten your regex by replacing all the [0-9] with \\d and by removing the \\t from all the [\\t,\\s], as \\s also matches tabs, and by removing all the commas from your character classes e.g. [-,r,w,x] should be [-rwx].

And if you don't mind the match also accepting uppercase letters, you could replace all the [a-z0-9_] with \\w.

Edit

Looking again, there seems to be no reason to have the [ in the character classes anyway, so [[a-z_] should just be [a-z_].
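As a rough sketch of where those suggestions lead, here is the second pattern with the stray [ removed, the commas dropped and [\\t,\\s] reduced to \\s. [0-9] and {1,} are left as in the question. The class name LsLongPattern and the parse helper are invented for this example, and it has only been reasoned through against the standard java.util.regex, so treat the Android behaviour as an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; only the regex itself comes from the question.
class LsLongPattern {
    // Second pattern from the question, cleaned up as suggested above.
    static final Pattern LS_L_1 = Pattern.compile(
            "(^[ld-][-rwx]{9})\\s{1,}[0-9]{1,}\\s{1,}"          // permissions, link count
            + "([a-z_][a-z0-9_]{0,30}|[0-9]{1,})\\s{1,}"        // owner name or UID
            + "([a-z_][a-z0-9_]{0,30}|[0-9]{1,})\\s{1,}"        // group name or GID
            + "([0-9]{0,})\\s{1,}"                              // size
            + "(\\w{3}\\s[0-9]{1,2}\\s{1,}([0-9]{1,2}:[0-9]{2}|[0-9]{4}))\\s{1,}" // date
            + "(.{1,})");                                       // file name

    // Returns {permissions, owner, group, size, date, name},
    // or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LS_L_1.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // Group 6 is the inner time-or-year alternative, so the name is group 7.
        return new String[] {
            m.group(1), m.group(2), m.group(3), m.group(4), m.group(5), m.group(7)
        };
    }

    public static void main(String[] args) {
        String[] f = parse("-rwxr-xr-x  1 doctor users    399 Nov 22  2011 shot.s");
        System.out.println(f[1] + " " + f[2] + " " + f[5]); // doctor users shot.s
    }
}
```

Note that dropping the commas is a small behaviour change: the original classes also accepted a literal comma, which was almost certainly unintended.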

MikeM
  • +1 my first thought was that the useless `,` with multiple occurrences could be a problem for a regex parser that is a bit picky. – Ingo Feb 24 '13 at 22:07
  • Thanks. The [[ really caused the problem. And I used [0-9] instead of \\d because the documentation encourages it: http://developer.android.com/reference/java/util/regex/Pattern.html – Yaroslav Mytkalyk Feb 25 '13 at 08:55
  • @DoctororDrive. Thanks for the reference: _"if you mean 0-9 use [0-9] rather than \d, which would also include Gurmukhi digits and so forth. "_ I didn't know that other digits would be included. That Android documentation seems pretty good. – MikeM Feb 25 '13 at 09:28
In addition to what @Mike said:

  1. replace {1,} with +
  2. [a-z_][a-z0-9_]{0,30} what would you match with that? Certainly not a UNIX filename, for this would rather be something like [^\0/]+
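For the first point, a small sketch showing that {1,} and + give the same verdicts on the owner/group field from the question. QuantifierShorthand and its members are names made up for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two spellings of "one or more".
class QuantifierShorthand {
    static final Pattern LONG_FORM =
            Pattern.compile("([a-z_][a-z0-9_]{0,30}|[0-9]{1,})");
    static final Pattern SHORT_FORM =
            Pattern.compile("([a-z_][a-z0-9_]{0,30}|[0-9]+)");

    // True when the field is a name starting with a letter or underscore,
    // or a purely numeric ID.
    static boolean isOwnerField(String field) {
        return SHORT_FORM.matcher(field).matches();
    }

    // True when both spellings agree on the given field.
    static boolean sameVerdict(String field) {
        return LONG_FORM.matcher(field).matches() == isOwnerField(field);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"doctor", "100", "123user", ""}) {
            System.out.println("'" + s + "' -> " + isOwnerField(s)
                    + ", same as {1,}: " + sameVerdict(s));
        }
    }
}
```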
Ingo
  • [a-z_][a-z0-9_]{0,30} is a Unix username or group name – Yaroslav Mytkalyk Feb 25 '13 at 08:39
  • Yes, but why {0,30}? I can imagine having a user `u`, but not one without a name. You could just (\S+) here. – Ingo Feb 25 '13 at 12:27
  • [a-z_] means one letter or underscore (no digit), followed by zero to 30 characters where digits are also allowed: [a-z0-9_]{0,30} – Yaroslav Mytkalyk Feb 25 '13 at 13:05
  • But it just could be a number, or something like `Müller` – Ingo Feb 25 '13 at 13:44
  • Unix user name can't be a number since it could be treated as UID http://www.linuxquestions.org/questions/linux-newbie-8/user-name-restrictions-312024/ – Yaroslav Mytkalyk Feb 25 '13 at 14:08
  • Yes, and the field we are talking about holds either a user name or a UID. Now, if you do something like ([0-9]+|[a-zA-Z][a-zA-Z0-9]+) it is again duplication. [a-zA-Z0-9]+ is just fine (but still doesn't match all valid usernames) – Ingo Feb 25 '13 at 14:37
  • not a duplication. [a-zA-Z0-9]+ will match "123user" which is not a valid name. It should be either a number (UID) or start with a letter. – Yaroslav Mytkalyk Feb 25 '13 at 16:05
  • But an invalid user name will simply not appear in the output of ls -l. So why check for it? – Ingo Feb 25 '13 at 16:52