2

Let's say you are given an input that looks like this: (identifier1 identifier_2 23 4).

I want to add a # symbol after every identifier. An identifier must start with a letter, which may be followed by any combination of letters, digits and underscores. My approach was something like this:

input.replaceAll("[A-Za-z0-9_]+", "$0#");

However, this also puts a # symbol after plain numbers such as 23 and 4, which I want to exclude. The result should be (identifier1# identifier_2# 23 4). Is it possible to solve this problem with regex?

Tim Lâm
  • Doesn't it [work as expected](https://regex101.com/r/bL7kJ8/1)? You mean, no `#` should appear after `23` and `4`? – Wiktor Stribiżew Feb 23 '16 at 13:15
  • I only want to put `#` symbols after identifiers but not after digits. So it should be (identifier1# identifier_2# 23 4) – Tim Lâm Feb 23 '16 at 13:17
  • I think `\b(?!\d+\b)[A-Za-z0-9_]+\b` can help. But it will not exclude strings like `_____`. To exclude those, you can further restrict with `\b(?!_+\b|\d+\b)[A-Za-z0-9_]+\b`. Or even [`\b(?!\d+\b)[A-Za-z0-9]+(?:_[A-Za-z0-9])*\b`](https://regex101.com/r/bL7kJ8/2). – Wiktor Stribiżew Feb 23 '16 at 13:20
  • I assume that the parentheses are not part of your input - if they are, then additional rules are needed. – tucuxi Feb 23 '16 at 13:25

3 Answers

6

UPDATE 2

The Incremental Java says:

  • Each identifier must have at least one character.
  • The first character must be picked from: alpha, underscore, or dollar sign. The first character can not be a digit.
  • The rest of the characters (besides the first) can be from: alpha, digit, underscore, or dollar sign. In other words, it can be any valid identifier character.

    Put simply, an identifier is one or more characters selected from alpha, digit, underscore, or dollar sign. The only restriction is the first character can't be a digit.

So, you'd better use

String pattern = "(?:\\b[_a-zA-Z]|\\B\\$)[_$a-zA-Z0-9]*+";

See the regex demo
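To see how this pattern behaves, here is a minimal sketch. The class name `DollarIdentifierDemo`, the helper `tag`, and the sample input are assumptions made for illustration only; they are not part of the original question.

```java
import java.util.regex.Pattern;

class DollarIdentifierDemo {
    // Pattern from the answer above: an identifier starts either with a letter or
    // underscore at a word boundary, or with a '$' that is not at a word boundary,
    // and continues with letters, digits, '_' or '$' (possessive quantifier).
    private static final Pattern IDENTIFIER =
            Pattern.compile("(?:\\b[_a-zA-Z]|\\B\\$)[_$a-zA-Z0-9]*+");

    // Appends '#' to every match; "$0" refers to the whole match.
    static String tag(String input) {
        return IDENTIFIER.matcher(input).replaceAll("$0#");
    }

    public static void main(String[] args) {
        System.out.println(tag("(identifier1 identifier_2 23 4) $tmp _x 9lives"));
        // prints: (identifier1# identifier_2# 23 4) $tmp# _x# 9lives
    }
}
```

In this sample, `9lives` is left alone because there is no word boundary between `9` and `l`, so the pattern cannot start in the middle of the token.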

UPDATE

According to "Representing identifiers using Regular Expression", the identifier regex is [_a-zA-Z][_a-zA-Z0-9]*.

So, you may use

String pattern = "\\b[_a-zA-Z][_a-zA-Z0-9]*\\b";

NOTE that it allows _______.

To avoid that, you can use

String p = "\\b_*[a-zA-Z][_a-zA-Z0-9]*\\b";

See IDEONE demo.

String s = "(identifier1 identifier_2 23 4) ____ 33"; 
String p = "\\b_*[a-zA-Z][_a-zA-Z0-9]*\\b";
System.out.println(s.replaceAll(p, "$0#"));

Output: (identifier1# identifier_2# 23 4) ____ 33
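The difference between the two patterns in this update can be checked side by side. The following is only a sketch: the class name `UnderscoreOnlyDemo`, the constant names, and the sample input are made up for this example.

```java
class UnderscoreOnlyDemo {
    // Allows a token made of underscores only.
    static final String LOOSE = "\\b[_a-zA-Z][_a-zA-Z0-9]*\\b";
    // Requires at least one letter after any leading underscores.
    static final String STRICT = "\\b_*[a-zA-Z][_a-zA-Z0-9]*\\b";

    // Appends '#' to every match of the given regex.
    static String tag(String input, String regex) {
        return input.replaceAll(regex, "$0#");
    }

    public static void main(String[] args) {
        String s = "____ _tmp 33";
        System.out.println(tag(s, LOOSE));  // prints: ____# _tmp# 33
        System.out.println(tag(s, STRICT)); // prints: ____ _tmp# 33
    }
}
```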

OLD ANSWER

You can use the following pattern:

String p = "\\b(?!\\d+\\b)[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*\\b";

Or (if a _ can appear at the end):

String p = "\\b(?!\\d+\\b)[A-Za-z0-9]+(?:_[A-Za-z0-9]*)*\\b";

See the regex demo

The pattern is enclosed in word boundaries (\b), so it matches whole words only. The negative lookahead (?!\d+\b) rejects words that consist solely of digits. The unrolled part [A-Za-z0-9]+(?:_[A-Za-z0-9]+)* matches a chunk of letters and digits followed by zero or more groups, each made of an underscore and another chunk of letters and digits.

IDEONE demo:

String s = "(identifier1 identifier_2 23 4) ____ 33"; 
String p = "\\b(?!\\d+\\b)[A-Za-z0-9]+(?:_[A-Za-z0-9]*)*\\b";
System.out.println(s.replaceAll(p, "$0#")); 

Output: (identifier1# identifier_2# 23 4) ____ 33
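One consequence of these two older patterns is worth checking: the lookahead only excludes tokens made entirely of digits, so a token that merely starts with a digit, such as `1abc`, still matches. The sketch below shows this along with the trailing-underscore difference; the class name `OldPatternDemo`, the constant names, and the sample input are hypothetical.

```java
class OldPatternDemo {
    // First variant: every underscore must be followed by a letter or digit.
    static final String NO_TRAILING =
            "\\b(?!\\d+\\b)[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*\\b";
    // Second variant: an underscore may end the token.
    static final String TRAILING_OK =
            "\\b(?!\\d+\\b)[A-Za-z0-9]+(?:_[A-Za-z0-9]*)*\\b";

    // Appends '#' to every match of the given regex.
    static String tag(String input, String regex) {
        return input.replaceAll(regex, "$0#");
    }

    public static void main(String[] args) {
        String s = "1abc 23 x_1 a_ ____";
        System.out.println(tag(s, NO_TRAILING)); // prints: 1abc# 23 x_1# a_ ____
        System.out.println(tag(s, TRAILING_OK)); // prints: 1abc# 23 x_1# a_# ____
    }
}
```

If identifiers must start with a letter, as the question states, the patterns from the updates above are the closer fit.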

Wiktor Stribiżew
  • That expression seems rather complex for what is required. `(?i)[a-z_]\w*` should be sufficient (or equivalents as posted in other answers/comments). – Thomas Feb 23 '16 at 13:34
  • No idea what exactly is required. I suggest alternatives. Should the input end with `_` or not? I suggest a solution to that, too. – Wiktor Stribiżew Feb 23 '16 at 13:35
  • However, if two regular expressions are equivalent, then the shortest, or most understandable, should be preferred. – tucuxi Feb 23 '16 at 13:36
  • They are not equivalent. – Wiktor Stribiżew Feb 23 '16 at 13:36
  • No, they are not, but they probably achieve the same goal. Why check if it's a number if the requirement for `[a-zA-Z]` at the start already rules out that option? – Thomas Feb 23 '16 at 13:39
  • I have no test cases to check. If you prove these regexps work the same, I will agree it is useless. Right now, as I said, it is an **alternative**. – Wiktor Stribiżew Feb 23 '16 at 13:44
  • The OP did not mention "valid Java identifiers" as a goal. If that were the goal, then the question should be edited. – tucuxi Feb 23 '16 at 14:14
4

Your current regex says

one or more upper or lower-case letters, digits, or underscores, in whatever order.

According to that regex, 54 is a valid identifier.

You actually wanted to write

a letter, followed by any number of letters, digits or underscores, in whatever order

That would be written in code as:

input.replaceAll("[A-Za-z][A-Za-z0-9_]*", "$0#");

Wiktor notes that this regex will still match "identifiers" that are inside something that is not identifier-ish. To solve this, you could use the following variation:

input.replaceAll("\\b([A-Za-z][A-Za-z0-9_]*)\\b", "$1#");

This rejects 123ab123 as a valid identifier, but accepts ab123 in 123 ab123.
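The effect of the word boundaries can be seen by running both variations on the same input. This is a minimal sketch; the class name `BoundaryDemo`, the method names, and the sample input are assumptions for illustration.

```java
class BoundaryDemo {
    // Without boundaries: may match in the middle of a longer token.
    static final String UNBOUNDED = "[A-Za-z][A-Za-z0-9_]*";
    // With boundaries: only matches whole tokens.
    static final String BOUNDED = "\\b([A-Za-z][A-Za-z0-9_]*)\\b";

    static String tagUnbounded(String input) {
        return input.replaceAll(UNBOUNDED, "$0#");
    }

    static String tagBounded(String input) {
        // "$1" refers to the captured identifier.
        return input.replaceAll(BOUNDED, "$1#");
    }

    public static void main(String[] args) {
        String s = "123ab123 ab123";
        System.out.println(tagUnbounded(s)); // prints: 123ab123# ab123#
        System.out.println(tagBounded(s));   // prints: 123ab123 ab123#
    }
}
```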

tucuxi
0

If you want to use Java to read Java, Java's got you covered: "\\b\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*\\b"
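Applied to the input from the question, that pattern could be used as in the sketch below. The class name `JavaIdentifierDemo` and the helper `tag` are made up for this example. The two `\p{java...}` classes correspond to `Character.isJavaIdentifierStart` and `Character.isJavaIdentifierPart`, so they accept more than ASCII letters, digits and underscores.

```java
import java.util.regex.Pattern;

class JavaIdentifierDemo {
    // Character classes backed by Character.isJavaIdentifierStart/Part.
    private static final Pattern JAVA_IDENTIFIER = Pattern.compile(
            "\\b\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*\\b");

    // Appends '#' to every match; "$0" refers to the whole match.
    static String tag(String input) {
        return JAVA_IDENTIFIER.matcher(input).replaceAll("$0#");
    }

    public static void main(String[] args) {
        System.out.println(tag("(identifier1 identifier_2 23 4)"));
        // prints: (identifier1# identifier_2# 23 4)
    }
}
```

One caveat: because of the leading \b, a token such as `$tmp` that follows a space is matched only from the `t` onward, since there is no word boundary between the space and the `$`.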

Not Saying