How do you tell if unicode letters are sequential in Java?

Question

The general requirement is that I need to implement a method for passwords that does not accept three sequential letters or numbers - so no 'abc123' passwords.

I need a way to see if three letters are sequentially after each other - obviously with any single language this is fairly simple, but a general purpose code for every unicode language seems to escape me.

I assume first I would need a method of figuring out if the three characters are in the same language, and then figure out if they are sequentially after each other. In unicode, there are also languages that are not ordered in any particular way - so there would need to be a way to tell if we were in a language that had order or not.

Is this as complicated as I'm imagining, or are there Java libraries / inherent patterns within unicode that allow something like this?

If I were to reduce the requirements, so that I would just numerically compare the unicode numbers to each other, are there any real world scenarios that I would run into trouble with? i.e. is it likely that someone would choose a password that contained the two ending letters of one language and the first of the next, in a valid way?

I'm still brainstorming. The last paragraph I can write in a minute, but I think it's too simple for what I need. And I don't want to spend hours on this requirement, I'd rather go to the stakeholders and let them decide whether it's worth a week's worth of effort for this silly requirement. — RankWeis, Nov 10 '13 at 22:16
'sequential'? There are at least a dozen variations on 'the letter a' in Unicode. How will you decide which sequence to enforce? — bmargulies, Nov 10 '13 at 22:20
I'm half hoping for a 'simple' solution, but I'm also hoping that this post can be used by myself and future people to explain why this isn't possible. — RankWeis, Nov 10 '13 at 22:22
@Christian I don't think "trying stuff" would be helpful. The difficulty in answering his question is that it requires an in-depth knowledge of the Unicode numbering system and of various languages. Very, very few of us have that. Without this knowledge, it's difficult to validate the merit and effectiveness of anything one would "try". — goat, Nov 10 '13 at 22:23
Have a look at [this question](http://stackoverflow.com/q/3200292/1324631). — AJMansfield, Nov 10 '13 at 22:54
the requirement is probably nonsensical. "cat", 1337 are not consecutive, but very weak. — ZhongYu, Nov 11 '13 at 01:07
It's one of the many requirements, neither of those would pass either. — RankWeis, Nov 11 '13 at 02:31
IIRC, a similar *requirement* was one of the *weaknesses* that made it easier to crack the Enigma machines. — Raedwald, Nov 11 '13 at 14:52

score 2 · Answer 1 · answered Nov 10 '13 at 22:18

If I were you, I would get the Unicode code point of each character and check whether the next character's code point is the previous one + 1. This should work for all languages, since Unicode code points should be sorted.
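A minimal sketch of that idea, assuming Java 8 or later. The class and method names (`NaiveSequenceCheck`, `hasAscendingRun`) and the `run` parameter are hypothetical, chosen only for illustration. It compares raw code points and nothing else, so it also flags runs of punctuation that happen to be adjacent in the code charts.

```java
final class NaiveSequenceCheck {

    // Returns true if s contains at least `run` code points in a row,
    // each exactly one greater than the one before it.
    // Assumes run >= 2; no notion of script, category, or case.
    static boolean hasAscendingRun(String s, int run) {
        int[] cps = s.codePoints().toArray(); // handles surrogate pairs
        int n = 1; // length of the current ascending run
        for (int i = 1; i < cps.length; i++) {
            n = (cps[i] == cps[i - 1] + 1) ? n + 1 : 1;
            if (n >= run) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasAscendingRun("abc123", 3)); // true
        System.out.println(hasAscendingRun("z{|", 3));    // true: U+007A, U+007B, U+007C
        System.out.println(hasAscendingRun("acegik", 3)); // false
    }
}
```

The second call shows why a plain numeric comparison is probably not enough on its own: it rejects input that contains no letter sequence at all.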

Florian Loch

U+007A is the letter 'z', U+007B is {, and U+007C is |, but the character sequence z{| should be allowed. – RankWeis Nov 10 '13 at 22:20
Then you could limit the range of this check to western characters - that would be the digits 0-9 and the letters a-z and A-Z. That should cover a lot of languages. – Florian Loch Nov 10 '13 at 22:26
Thanks, this was my thinking as well - I was going to suggest to do western and Japanese as this would cover the vast majority of our users. However, my company cares a lot about the user experience in even the smallest of our locales, and convincing them to change the experience for them is difficult and requires talking to people high up the ladder. – RankWeis Nov 10 '13 at 22:31
@RankWeis Of course, the requirement only makes sense (if at all) if we restrict it to characters in the same category, for example "letters" or "digits", but not to graphical characters, so "{|}" is allowed, but not "äöü" – Ingo Nov 11 '13 at 14:35
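The restricted variant discussed in these comments (only 0-9, a-z and A-Z) could be sketched roughly as follows. The names `AsciiSequenceCheck`, `isAsciiLetterOrDigit` and `hasAsciiRun` are hypothetical, and the sketch is case-sensitive, so a mixed-case sequence like "aBc" is not caught. Because the three ranges are not adjacent in the code charts, a run can never cross from one range into another.

```java
final class AsciiSequenceCheck {

    // True only for 0-9, a-z and A-Z.
    static boolean isAsciiLetterOrDigit(int cp) {
        return (cp >= '0' && cp <= '9')
            || (cp >= 'a' && cp <= 'z')
            || (cp >= 'A' && cp <= 'Z');
    }

    // Returns true if s contains `run` sequential ASCII letters or digits.
    // Assumes run >= 2. Any other character resets the run.
    static boolean hasAsciiRun(String s, int run) {
        int n = 0;     // length of the current run
        int prev = -2; // previous code point; -2 can never precede a real one
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (!isAsciiLetterOrDigit(cp)) {
                n = 0;
            } else if (n > 0 && cp == prev + 1) {
                n++;
            } else {
                n = 1;
            }
            prev = cp;
            if (n >= run) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasAsciiRun("abc123", 3)); // true
        System.out.println(hasAsciiRun("z{|", 3));    // false
        System.out.println(hasAsciiRun("89:", 3));    // false
    }
}
```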

score 1 · Answer 2 · answered Nov 10 '13 at 22:31

Probably Character.isLetter(c) fits your needs. The following unit test passes.

package snippets;

import static org.junit.Assert.*;

import org.junit.Test;

public class LetterTest {

    @Test
    public void test3Uni() throws Exception {
        String s = "汉语漢語";
        for (char c : s.toCharArray()) {
            assertTrue(Character.isLetter(c));
        }
    }

}

There is a Character.isDigit(d) too.

Niklaus Bucher

Or you can use a regex `Pattern p = Pattern.compile(".*([0-9]{3}|[a-zA-Z]{3}).*");` but only if your characters are between a-z or A-Z – Niklaus Bucher Nov 10 '13 at 22:40

Joop Eggen · Answer 3 · 2013-11-10T22:52:53.280

You could search for three consecutive code points that are in the same Unicode block, with the extra condition Character.isLetterOrDigit(cp).

Character.UnicodeBlock oldBlock = null; // no previous code point yet
int oldCp = 0;
int n = 0; // length of the current run of sequential letters/digits
for (int i = 0; i < s.length(); ) {
    int cp = s.codePointAt(i);
    i += Character.charCount(cp); // advance by 1 or 2 chars (surrogate pairs)
    Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
    if (n != 0 && block == oldBlock && cp == oldCp + 1 && Character.isLetterOrDigit(cp)) {
        ++n;
        oldCp = cp;
        if (n >= 3) {
            return false; // three sequential code points found: reject
        }
    } else {
        n = Character.isLetterOrDigit(cp) ? 1 : 0;
        oldCp = cp;
        oldBlock = block;
    }
}
return true;
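To try the block-based idea on sample input, the loop can be wrapped in a method. The following is only a sketch: the class name `BlockSequenceCheck`, the method name `isAcceptable` and the `maxRun` parameter are hypothetical, and the non-ASCII test strings are written as escapes to avoid source-encoding issues.

```java
final class BlockSequenceCheck {

    // Returns false if s contains `maxRun` letters or digits with consecutive
    // code points inside the same Unicode block, true otherwise.
    static boolean isAcceptable(String s, int maxRun) {
        Character.UnicodeBlock oldBlock = null;
        int oldCp = 0;
        int n = 0; // length of the current run
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            boolean counted = Character.isLetterOrDigit(cp);
            if (n != 0 && counted && block == oldBlock && cp == oldCp + 1) {
                n++;
                if (n >= maxRun) {
                    return false;
                }
            } else {
                n = counted ? 1 : 0;
            }
            oldCp = cp;
            oldBlock = block;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("abc123", 3));             // false
        System.out.println(isAcceptable("z{|", 3));                // true
        System.out.println(isAcceptable("\u03B1\u03B2\u03B3", 3)); // false: Greek alpha, beta, gamma
        System.out.println(isAcceptable("\u3042\u3044\u3046", 3)); // true: hiragana a, i, u
    }
}
```

The last call illustrates a limit of any code point comparison: hiragana a, i, u sit at U+3042, U+3044 and U+3046, with the small kana forms in between, so a run that a Japanese speaker would read as sequential is not detected.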

score 0 · Answer 4 · answered Nov 11 '13 at 14:25

This is not a meaningful requirement.

First off, even if it were possible to define an absolute sequence of every code point, Unicode is a moving target. New code points are added into the unassigned gaps with every release.

From the Unicode Collation Algorithm Introduction:

"Collation varies according to language and culture: Germans, French and Swedes sort the same characters differently."

Unicode defines a default sort order, but it may defy user expectations. An English speaker would consider stu to be a consecutive sequence. But consider U+00DF sharp s ß. If you include this in a string and sort using English locale Java collation rules you will end up with sßtu.
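One way to see the gap between code point order and collation order is to sort the same four letters both ways. This is a sketch using `java.text.Collator`; the class name `CollationDemo` and its two helper methods are hypothetical, and the exact collation result depends on the rules shipped with the JDK in use, so no output is asserted for the collated list.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class CollationDemo {

    // Sorts a copy of the list with the English-locale collator.
    static List<String> sortEnglish(List<String> items) {
        List<String> copy = new ArrayList<>(items);
        copy.sort(Collator.getInstance(Locale.ENGLISH));
        return copy;
    }

    // Sorts a copy of the list by String's natural order, which for these
    // single-character BMP strings is the same as code point order.
    static List<String> sortByCodePoint(List<String> items) {
        List<String> copy = new ArrayList<>(items);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // "\u00DF" is the sharp s
        List<String> letters = Arrays.asList("u", "\u00DF", "t", "s");
        System.out.println("collated:   " + sortEnglish(letters));
        System.out.println("code point: " + sortByCodePoint(letters));
    }
}
```

In code point order the sharp s (U+00DF) lands after u, far away from s (U+0073), so a check based on code point + 1 and a check based on collation order would disagree about what "sequential" means for these letters.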

The introduction goes on to say:

"For scripts and characters not used in a particular language, explicit rules may not exist. For example, Swedish and French have clearly specified, distinct rules for sorting ä (either after z or as an accented character with a secondary difference from a), but neither defines the ordering of characters such as Ж, ש, ♫, ∞, ◊, or ⌂."

You cannot expect a single ordering to be meaningful to every user because of i18n concerns. The best you can do is build a few heuristics for individual languages.
