12

I need to convert any arbitrary string:

  • string with spaces
  • 100stringsstartswithnumber
  • string€with%special†characters/\!
  • [empty string]

to a valid Java identifier:

  • string_with_spaces
  • _100stringsstartswithnumber
  • string_with_special_characters___
  • _

Is there an existing tool for this task?

With so many Java source refactoring/generating frameworks, one would think this should be quite a common task.

parxier
  • Are you looking to do this dynamically at runtime? If so, that's not going to work. You'll need a `Map` or something to do this. – corsiKa Sep 16 '11 at 06:12
  • @glowcoder Yes, I need a utility method to do this at runtime. Please clarify why I would need to use `Map`? – parxier Sep 16 '11 at 06:14
  • Because there is no dynamic mapping to variables at runtime. The best you could do is use reflection. And honestly if there was it is indicative of a poor design decision. Why would you want to do something like this? Typically there is a much better (and safer!) way to do it. – corsiKa Sep 16 '11 at 06:15
  • @glowcoder Maybe I didn't describe what I'm trying to do clearly enough. I don't want to do any "dynamic mapping to variables at runtime". All I want to do is convert a string into another string that is guaranteed to comply with Java identifier naming rules (no spaces, doesn't start with a number, etc.). Eclipse's 'Extract to constant' refactoring option does just that for strings and numbers. I'm looking for a utility method I can use in my app to do a similar thing. – parxier Sep 16 '11 at 06:21
  • 1
    I guess the hibernate class ImprovedNamingStrategy does something similar. – Scorpion Sep 16 '11 at 06:47

4 Answers

12

This simple method will convert any input string into a valid Java identifier:

public static String getIdentifier(String str) {
    try {
        return Arrays.toString(str.getBytes("UTF-8")).replaceAll("\\D+", "_");
    } catch (UnsupportedEncodingException e) {
        // UTF-8 is always supported, but this catch is required by compiler
        return null;
    }
}

Example:

"%^&*\n()" --> "_37_94_38_42_10_40_41_"

Any input characters whatsoever will work - foreign language chars, linefeeds, anything!
In addition, this algorithm is:

  • reproducible
  • unique - i.e. it will always and only produce the same result if str1.equals(str2)
  • reversible

Thanks to Joachim Sauer for the UTF-8 suggestion
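
A minimal sketch of the same idea with two assumed changes that are not in the answer above: `StandardCharsets.UTF_8` (Java 7+) removes the need for the try-catch, and each byte is masked with `& 0xFF`. The mask matters because `byte` is signed in Java: for non-ASCII input `Arrays.toString` prints negative values and `\D+` swallows the minus sign, so `"€"` (bytes -30, -126, -84) and `"\u001e~T"` (bytes 30, 126, 84) would both come out as `_30_126_84_`. The class and method names are hypothetical, and `decode` is only there to illustrate the "reversible" property.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class ByteIdentifier {

    // Writes every UTF-8 byte as an unsigned decimal (0..255), wrapped in '_'.
    static String encode(String str) {
        StringBuilder sb = new StringBuilder("_");
        for (byte b : str.getBytes(StandardCharsets.UTF_8)) {
            sb.append(b & 0xFF).append('_');
        }
        return sb.toString();
    }

    // Reverses encode(): splits on '_' and turns the numbers back into bytes.
    static String decode(String id) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String part : id.split("_")) {
            if (!part.isEmpty()) {
                out.write(Integer.parseInt(part));
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

For pure ASCII input this gives the same output as the method above, e.g. `encode("hallo")` returns `_104_97_108_108_111_`.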


If collisions are OK (where it is possible for two input strings to produce the same result), this code produces a readable output:

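public static String getIdentifier(String str) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        if ((i == 0 && Character.isJavaIdentifierStart(c)) || (i > 0 && Character.isJavaIdentifierPart(c))) {
            sb.append(c); // valid at this position: keep as-is
        } else {
            if (i == 0)
                sb.append('_'); // a decimal code must not be the first character
            sb.append((int) c);
        }
    }
    // an empty input would otherwise yield an empty, invalid identifier
    return sb.length() == 0 ? "_" : sb.toString();
}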

It preserves characters that are valid in an identifier, converting only those that are invalid to their decimal code values.

thSoft
Bohemian
  • 2
    If this should be reproducible and stable, then `getBytes()` should take an argument (I suggest `"UTF-8"`). – Joachim Sauer Sep 16 '11 at 06:26
  • 1
    Although it is not code from an existing lib, it's the most elegant piece of code of those provided. I'm accepting this as the answer. – parxier Sep 19 '11 at 06:23
  • My source code text file encoding is UTF-8, then, the solution is NOT working (Java 1.8.0.45): getIdentifier("hallo") produces "_104_97_108_108_111_" – Hartmut Pfarr Oct 02 '15 at 13:49
  • @Hartmut Impossible. `Arrays.toString()` outputs a `[` as the first character, which matches `\D`, so you'll always get a `_` as the first char. I ran `getIdentifier("hallo")` on Java 1.8.0.45 too and got `_104_97_108_108_111_` (with the leading underscore) which is a valid identifier. – Bohemian Oct 02 '15 at 15:53
  • @Bohemian sorry, my fault, I did a wrong paste. It produces `_104_97_108_108_111_` (same as you got). On the one hand, this is truly a valid Java identifier. Nevertheless, I'd like to get "hallo" as output, since no special characters occur. – Hartmut Pfarr Oct 04 '15 at 12:55
  • @Hartmut the problem with leaving characters untouched is uniqueness. If we allowed valid identifier chars to be left as-is and only converted invalid identifier chars to `"_n_"` where `n` is the code point, consider the input strings `"hallo\nworld"` and `"hallo_10_world"` - both would produce output of `"hallo_10_world"`. Unless we convert *all* chars, it will always be possible to have two different inputs produce the same result, which fails OP's requirements IMHO. That said, I have added code that should be to your liking :) – Bohemian Oct 04 '15 at 15:48
  • 3
    From Java 7 you may use [nio.charset.StandardCharsets](http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html) to avoid the _try-catch_ block. `Arrays.toString(str.getBytes(StandardCharsets.UTF_8)).replaceAll("\\D+", "_");` – chrisjleu May 12 '16 at 11:50
3

I don't know of a tool for that purpose, but one can easily be created using the Character class.

Did you know that string€with_special_characters___ is a legal Java identifier?

import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public class Conv {
    public static void main(String[] args) {
        String[] idents = { "string with spaces", "100stringsstartswithnumber",
                "string€with%special†characters/\\!", "" };
        for (String ident : idents) {
            System.out.println(convert(ident));
        }
    }

    private static String convert(String ident) {
        if (ident.length() == 0) {
            return "_";
        }
        CharacterIterator ci = new StringCharacterIterator(ident);
        StringBuilder sb = new StringBuilder();
        for (char c = ci.first(); c != CharacterIterator.DONE; c = ci.next()) {
            if (c == ' ')
                c = '_';
            if (sb.length() == 0) {
                if (Character.isJavaIdentifierStart(c)) {
                    sb.append(c);
                    continue;
                } else
                    sb.append('_');
            }
            if (Character.isJavaIdentifierPart(c)) {
                sb.append(c);
            } else {
                sb.append('_');
            }
        }
        return sb.toString();
    }
}

Prints

string_with_spaces
_100stringsstartswithnumber
string€with_special_characters___
_
stacker
  • 1
    This is awesome. It allows `þ` as an identifier. This means combined with a conditional operator, you can do things like `$? 8:þ` Smiley faces in code are always a good thing, right? Right?! – corsiKa Sep 16 '11 at 06:59
2

If you are doing this for autogenerated code (i.e. don't care much about readability) one of my favorites is just to Base64 it. No need to play language lawyer over what characters are valid in what encodings, and it's a pretty common way of "protecting" arbitrary byte data.
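
A minimal sketch of how that could look, with some assumptions that are not spelled out above: the standard Base64 alphabet contains `+`, `/` and `=`, none of which are allowed in a Java identifier, and the output may start with a digit. So this sketch uses the URL-safe encoder from `java.util.Base64` (Java 8+) without padding, swaps the remaining `-` for `$`, and prefixes `_`. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of any library.
final class Base64Identifier {

    // URL-safe Base64 uses A-Z, a-z, 0-9, '-' and '_'; only '-' has to be replaced.
    static String encode(String str) {
        String b64 = Base64.getUrlEncoder().withoutPadding()
                .encodeToString(str.getBytes(StandardCharsets.UTF_8));
        // The leading '_' guards against a digit in first position and an empty input.
        return "_" + b64.replace('-', '$');
    }

    // Reverses encode(): drops the '_' prefix and restores '-'.
    static String decode(String id) {
        String b64 = id.substring(1).replace('$', '-');
        return new String(Base64.getUrlDecoder().decode(b64), StandardCharsets.UTF_8);
    }
}
```

For example, `encode("hi")` returns `_aGk`, and `decode` gets the original string back.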

Steven Schlansker
0

With so many Java source refactoring/generating frameworks, one would think this should be quite a common task.

Actually it is not.

  • A code refactoring framework will start with existing valid Java identifiers, and will be able to generate a new identifier by concatenating them with some additional characters for disambiguation purposes.

  • A typical code generation framework will start out with "names" taken from a restricted character set. It won't have to deal with arbitrary characters.


I presume that the aim of your converter is to produce identifiers that resemble the input strings if this is possible. If that's the case, I would do the conversion by mapping all legal identifier characters as-is, and replacing illegal identifier characters with "$xxxx" where "xxxx" is a 4-digit hex encoding of the Java 16-bit character.

Your scheme works too, but replacing all illegal characters with '_' is more likely to result in identifier collisions; i.e. where two input strings map to the same identifier.

This is straightforward to code, so I'll leave it for you to do.
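
For illustration, a minimal sketch of the "$xxxx" mapping described above. It makes a few assumptions of its own: `$` is itself a legal identifier character, so the sketch escapes it as well to keep the escapes unambiguous; a character that is legal only after the first position (such as a digit) is escaped when it comes first; and the empty string maps to `_` as in the question, which collides with the literal input `"_"`. The class and method names are hypothetical.

```java
// Hypothetical helper, not part of any library.
final class HexEscapeIdentifier {

    static String toIdentifier(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean legal = (i == 0)
                    ? Character.isJavaIdentifierStart(c)
                    : Character.isJavaIdentifierPart(c);
            if (legal && c != '$') {
                sb.append(c); // legal here: keep as-is
            } else {
                // '$' followed by the 16-bit char value as 4 hex digits
                sb.append('$').append(String.format("%04x", (int) c));
            }
        }
        return sb.length() == 0 ? "_" : sb.toString();
    }
}
```

With this sketch, `toIdentifier("string with spaces")` returns `string$0020with$0020spaces`, and `toIdentifier("100x")` returns `$003100x`.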

Stephen C
  • Eclipse's 'Extract to constant' refactoring option has to deal with arbitrary characters as it generates constant names out of arbitrary strings and numbers. – parxier Sep 16 '11 at 07:34
  • 1
    @parxier - 1) that's an unusual case. 2) why don't you look at the Eclipse code base to see if you can reuse the code – Stephen C Sep 16 '11 at 23:22
  • 2) I would if I knew where to start, it's a huge project. – parxier Sep 19 '11 at 03:55
  • @parxier - if you need to work on your skills in understanding large codebases, this would be a good example to start with. [ Or to put it another way, I ain't going to do your research for you :-) ] – Stephen C Sep 19 '11 at 06:09