Checking if a character is part of the Latin alphabet?

Question

I need to test whether a character is a letter or a space before moving on with further processing. So, I

    for (char c : take.toCharArray()) {
        // Skip anything that is neither a letter nor a space character
        if (!(Character.isLetter(c) || Character.isSpaceChar(c)))
            continue;

        data.append(c);
    }
Once I examined the data, I saw that it contains characters that look like Unicode representations of characters from outside the Latin alphabet. How can I modify the above code to tighten my conditions so that it only accepts letters in the ranges [a-z] and [A-Z]?

Is a regex the way to go, or is there a better (faster) way?

Wait, why do you consider "é" to not be a letter? Usually people are looking for ways to make their code handle international input *better*, not *worse*... — Borealid, Feb 06 '12 at 02:11
@Borealid, In my case the control character is an oddity, which i am currently further investigating. `é` certainly is a valid character, which for the purposes of my program should not be there. — James Raitsev, Feb 06 '12 at 02:13
The regex to do this is to check against the Latin script property with `\p{sc=Latin}`. — tchrist, Feb 06 '12 at 02:51
Related: [*Identify if a Unicode code point represents a character from a certain script such as the Latin script?*](https://stackoverflow.com/q/62109781/642706) — Basil Bourque, May 31 '20 at 04:53

score 18 · Answer 1 · edited Feb 06 '12 at 02:18

If you specifically want to handle only those 52 characters, then just handle them:

    public static boolean isLatinLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }
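As a minimal sketch of how this check might slot into the loop from the question: the class name `LatinFilter` and the helper `keepLettersAndSpaces` are hypothetical, and comparing against `' '` is an assumption that only the plain ASCII space should survive (`Character.isSpaceChar` also accepts other Unicode separators such as the no-break space).

```java
final class LatinFilter {
    // Same range check as in the answer: only the 52 ASCII letters pass.
    static boolean isLatinLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    // Hypothetical helper: keeps ASCII letters and the plain space (U+0020),
    // and drops every other character.
    static String keepLettersAndSpaces(String take) {
        StringBuilder data = new StringBuilder(take.length());
        for (char c : take.toCharArray()) {
            if (isLatinLetter(c) || c == ' ') {
                data.append(c);
            }
        }
        return data.toString();
    }
}
```

Calling `LatinFilter.keepLettersAndSpaces(take)` would then replace the loop and the `data` builder in the question.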

edited Feb 06 '12 at 02:18 by Louis Wasserman
answered Feb 06 '12 at 02:14 by Ernest Friedman-Hill

score 4 · Accepted Answer · answered Feb 06 '12 at 02:32

If you just want to strip out every character that is not an ASCII letter, a quick approach is to use String.replaceAll() with a regex:

    s.replaceAll("[^a-zA-Z]", "")

Can't say anything about performance vs. a character by character scan and append to StringBuilder, though.
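Note that the character class `[^a-zA-Z]` also removes spaces, which the question wants to keep. A hedged variant, assuming only the plain ASCII space should be preserved, adds a space to the class; the class and method names below are hypothetical, and the `+` quantifier is a choice made here so that a run of unwanted characters is removed as one match.

```java
final class RegexStrip {
    // Hypothetical helper: removes every run of characters that are neither
    // ASCII letters nor plain spaces.
    static String stripToLettersAndSpaces(String s) {
        return s.replaceAll("[^a-zA-Z ]+", "");
    }
}
```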

answered Feb 06 '12 at 02:32 by Alistair A. Israel

It appears in my testing that going 1 character at a time is about 30% faster. But certainly a valid suggestion and approach. Thank you – James Raitsev Feb 06 '12 at 02:41
I'd be curious to see the results with `s.replaceAll("[^a-zA-Z]+", "")` and `s.replaceAll("[^a-zA-Z]*", "")`. – Samuel Edwin Ward Feb 06 '12 at 02:54
@SamuelEdwinWard Wow. Twice as fast as `[^a-zA-Z]+`. Faster than the one by characters – James Raitsev Feb 06 '12 at 03:26

score 1 · Answer 3 · answered Feb 06 '12 at 02:19

I'd use the regular expression you specified for this. It's easy to read and should be quite speedy (especially if you allocate it statically).
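One way to read "allocate it statically" is to compile the pattern once into a `static final` field and reuse it for every call. The sketch below assumes that reading; the class name, the field name, and the exact pattern (ASCII letters plus the plain space) are illustrative choices, not something the answer specifies.

```java
final class PrecompiledLatinFilter {
    // Compiled once when the class is loaded, then shared by all calls.
    // The fully qualified name is used so the snippet needs no import line.
    private static final java.util.regex.Pattern NOT_LETTER_OR_SPACE =
            java.util.regex.Pattern.compile("[^a-zA-Z ]+");

    // Removes every run of characters outside [a-zA-Z ] from the input.
    static String clean(String input) {
        return NOT_LETTER_OR_SPACE.matcher(input).replaceAll("");
    }
}
```

`String.replaceAll` compiles its regex argument on each call, so holding a compiled `Pattern` avoids that repeated work; whether the difference matters in practice is something to measure.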

answered Feb 06 '12 at 02:19 by Samuel Edwin Ward

Could you provide an example to do it the right way? I'd like to see what's faster. – James Raitsev Feb 06 '12 at 02:27
It's getting rather late in the day in this locality, so I'm afraid you'll have to wait for code, particularly if you want it to compile :) – Samuel Edwin Ward Feb 06 '12 at 02:50
But, as an aside, you might be overly concerned with speed at this time. Surely this isn't the slowest operation you're performing? It might be more efficient to optimize the time that a future developer (who might be you!) spends trying to understand this bit of code. – Samuel Edwin Ward Feb 06 '12 at 02:52
