7

I have strings containing Unicode symbols (Cyrillic):

myString1 = 'Австрия'
myString2 = 'AustriЯ'

I want to check if all the elements in the string are English (ASCII). Now I'm using a loop:

for char in myString1:
    if ord(char) not in range(65,91):
        break

So I break out of the loop as soon as I find the first non-English character. But as the example shows, a string can contain many English characters with the Unicode ones only at the end, in which case I end up checking the whole string. Furthermore, if the whole string is in English, I still check every character.

Is there any more efficient way to do this? I'm thinking about something like:

if any(ord(char) not in range(65,91) for char in myString1):
Mikhail_Sam
  • 6
    Possible duplicate of [How to check if a string in Python is in ASCII?](https://stackoverflow.com/questions/196345/how-to-check-if-a-string-in-python-is-in-ascii) – YGouddi Dec 26 '17 at 09:06
  • 1
    Hmm, `range(65,91)` is quite small: only uppercase letters! `'Austria'` would be rejected because it contains lowercase letters, and `ord('a')` is already 97. Not to mention punctuation and digits, which are indeed in the ASCII charset. What are you trying to do exactly? – Serge Ballesta Dec 26 '17 at 09:12
  • @SergeBallesta you are right. Actually I use two intervals: `range(65,91)` and `range(48,58)`. So I use double condition in `if` statement. – Mikhail_Sam Dec 26 '17 at 09:15
  • Shouldn't the "any" in the title be "all"? – timgeb Dec 26 '17 at 09:20
  • Would your strings _only_ have letters or could there be spaces and periods too? – cs95 Dec 26 '17 at 09:34
  • @cᴏʟᴅsᴘᴇᴇᴅ they can contain anything. Digits, punctuation, spaces and unicode – Mikhail_Sam Dec 26 '17 at 09:36
  • @Mikhail_Sam Ah, so none of these answers, nor your original solution would work, unless you account for them all. – cs95 Dec 26 '17 at 09:37

5 Answers

13

You can speed up the check by using a set, whose membership test is O(1). This pays off especially when you check multiple strings against the same range, since building the set itself requires one iteration. You can then use `all` for the early-breaking iteration pattern, which fits better than `any` here:

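import string

# Note: the name `ascii` shadows the built-in ascii() function
ascii = set(string.ascii_uppercase)  # uppercase letters only
ascii_all = set(string.ascii_uppercase + string.ascii_lowercase)  # all ASCII letters

if all(x in ascii for x in my_string1):
    print('my_string1 is all ascii')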

Of course, any `all` construct can be rewritten as an `any` via De Morgan's laws:

if not any(x not in ascii for x in my_string1):
    print('my_string1 is all ascii')

Update:

A good, purely set-based approach that does not require a complete iteration, as pointed out by Artyer:

if ascii.issuperset(my_string1):
    print('my_string1 is all ascii')
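
For reference, the three variants above can be wrapped in small functions and run against the strings from the question. This is only a sketch: the names `LETTERS`, `all_letters_all`, `all_letters_any`, and `all_letters_superset` are made up for illustration, and the set covers ASCII letters only, so digits, spaces, and punctuation are rejected.

```python
import string

# Hypothetical names for illustration; the set holds ASCII letters only.
LETTERS = set(string.ascii_letters)

def all_letters_all(s):
    # all() stops at the first character that is not in the set
    return all(ch in LETTERS for ch in s)

def all_letters_any(s):
    # Equivalent form via De Morgan: no character lies outside the set
    return not any(ch not in LETTERS for ch in s)

def all_letters_superset(s):
    # issuperset() accepts any iterable, so the string can be passed directly
    return LETTERS.issuperset(s)

for s in ('Австрия', 'AustriЯ', 'Austria'):
    print(s, all_letters_all(s), all_letters_any(s), all_letters_superset(s))
```

All three functions should agree on every input; which one is fastest likely depends on the data and the Python version.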
user2390182
  • That's a clever solution, I like it! – user1767754 Dec 26 '17 at 09:18
  • Nice and elegant. +1 – Sohaib Farooqi Dec 26 '17 at 09:19
  • Interesting solution. Can you give me one more piece of advice: what if I need not only ascii_uppercase but all the ASCII characters? Can I just use `string.ascii`? – Mikhail_Sam Dec 26 '17 at 09:24
  • @Mikhail_Sam I added that option: you have to combine `ascii_lowercase` and `ascii_uppercase` – user2390182 Dec 26 '17 at 09:25
  • 1
    Or we can use `ascii_letters` :) Thank you! I have two more interesting questions. First, which method do you think would be faster: yours or https://stackoverflow.com/a/47976402/4960953 ? Second, what do you think about this solution: https://stackoverflow.com/a/196391/4960953 – Mikhail_Sam Dec 26 '17 at 09:32
  • 2
    @Mikhail_Sam Algorithmically, my solution should be better because [Daniel Sanchez'](https://stackoverflow.com/users/1695172/daniel-sanchez) set conversion of the string will always iterate the entire string while mine will break on the first non-ascii char. Whether this truly matters or the C-optimization of the set operations prevails depends a lot on your data, I guess. – user2390182 Dec 26 '17 at 09:35
  • @schwobaseggl Why not get the best of both? `ascii = set(string.ascii_letters); if ascii.issuperset(my_string):` – Artyer Dec 26 '17 at 13:22
  • @Artyer Very good point and rather consequent given the discussion :) I added that. – user2390182 Dec 26 '17 at 13:29
2

Another way, similar to what @schwobaseggl suggests, but using only set methods:

import string
ascii = string.ascii_uppercase + string.ascii_lowercase
if set(my_string).issubset(ascii):
    print('my_string is ascii')
Netwave
  • 1
    Should be faster: `set(my_string).issubset(string.ascii_uppercase + string.ascii_lowercase)` – cs95 Dec 26 '17 at 09:31
1

There's no way to avoid iterating. However, you can certainly make it more efficient by testing `not 65 <= ord(s) <= 90` rather than checking membership in a range.
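
A minimal sketch of the chained comparison, assuming the goal is the same uppercase-only check as `range(65, 91)`; the function name `is_upper_ascii` is hypothetical. Since `range(65, 91)` excludes 91, the inclusive upper bound is 90, i.e. `ord('Z')`.

```python
def is_upper_ascii(s):
    # Hypothetical helper: True if every character is an ASCII uppercase letter.
    # The chained comparison checks 65 <= code point <= 90 without building a range.
    return all(65 <= ord(ch) <= 90 for ch in s)

print(is_upper_ascii('AUSTRIA'))   # True
print(is_upper_ascii('Austria'))   # False, lowercase letters start at 97
print(is_upper_ascii('AUSTRIЯ'))   # False, 'Я' is outside ASCII
```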

Daniel Roseman
  • 1
    OP has not tagged Python3, but it should probably be mentioned that checking membership in a `range` should not be noticeably slower in Python3 than comparing against ints. – timgeb Dec 26 '17 at 09:16
1

re appears to be quite fast:

import re

# to check whether any outside ranges (->MatchObject) / all in ranges (->None)
nonletter = re.compile('[^a-zA-Z]').search

# to check whether any in ranges (->MatchObject) / all outside ranges (->None)
letter = re.compile('[a-zA-Z]').search

bool(nonletter(myString1))
# True

bool(nonletter(myString2))
# True

bool(nonletter(myString2[:-1]))
# False
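
If the strings may also contain digits, spaces, and punctuation, as mentioned in the comments on the question, the same idea can be widened to the whole ASCII range. A sketch under that assumption; the names `non_ascii` and `is_ascii` are hypothetical:

```python
import re

# Hypothetical name; matches the first character outside the 7-bit ASCII range
non_ascii = re.compile(r'[^\x00-\x7f]').search

def is_ascii(s):
    # search() returns None when no non-ASCII character is found
    return non_ascii(s) is None

print(is_ascii('Austria, 2017!'))  # True
print(is_ascii('AustriЯ'))         # False
```

On Python 3.7 and later, the built-in `str.isascii()` performs the same check without a regular expression.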

Benchmarks for OP's two examples and a positive one (set is @schwobaseggl's, setset is @DanielSanchez's):

Австрия
re               0.48832818 ± 0.09022105 µs
set              0.58745548 ± 0.01759877 µs
setset           0.81759223 ± 0.03595184 µs
AustriЯ
re               0.51960442 ± 0.01881561 µs
set              1.03043942 ± 0.02453405 µs
setset           0.54060076 ± 0.01505265 µs
tralala
re               0.27832978 ± 0.01462306 µs
set              0.88285526 ± 0.03792728 µs
setset           0.43238688 ± 0.01847240 µs

Benchmark code:

import types
from timeit import timeit
import re
import string
import numpy as np

def mnsd(trials):
    return '{:1.8f} \u00b1 {:10.8f} \u00b5s'.format(np.mean(trials), np.std(trials))

nonletter = re.compile('[^a-zA-Z]').search
letterset = set(string.ascii_letters)

def f_re(stri):
    return not nonletter(stri)

def f_set(stri):
    return all(x in letterset for x in stri)

def f_setset(stri):
    return set(stri).issubset(letterset)

for stri in ('Австрия', 'AustriЯ', 'tralala'):
    ref = f_re(stri)
    print(stri)
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        try:
            assert ref == func(stri)
            print("{:16s}".format(name[2:]), mnsd([timeit(
                'f(stri)', globals={'f':func, 'stri':stri}, number=1000) * 1000 for i in range(1000)]))

        except Exception:
            print("{:16s} apparently failed".format(name[2:]))
Paul Panzer
  • Thank you for the interesting solution! It's not obvious that re is faster than the other methods! – Mikhail_Sam Dec 26 '17 at 11:51
  • 1
    @Mikhail_Sam It does something similar to the loop / any / all solutions, but my guess is that the looping happens in compiled code, so it is faster. From the three examples it seems that constructing the `MatchObject` is responsible for a significant part of the cost, because if there is no match (last example) we are much faster. – Paul Panzer Dec 26 '17 at 12:02
  • I got it! But I thought regular expressions were slow by themselves and always avoided them when possible. Now I see I was wrong :) – Mikhail_Sam Dec 27 '17 at 07:42
0

Here is a non elegant way to accomplish your task. I'm a beginner, so go ahead and tear this apart. But it works! :)

def english_char(string):
    range_0 = [45, 95]                # '-' and '_'
    range_1 = list(range(65, 91))     # 'A'-'Z' (the end of a range is exclusive)
    range_2 = list(range(97, 123))    # 'a'-'z'
    range_3 = list(range(48, 58))     # '0'-'9'
    is_ascii = range_0 + range_1 + range_2 + range_3
    for character in string:
        if ord(character) not in is_ascii:
            return False
    return True

test_1 = 'Abc123_#'
print(english_char(test_1))  # False, because '#' is not in the allowed set
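
As a possible alternative to hand-written code-point ranges, the allowed characters can be taken from the `string` module, which sidesteps off-by-one mistakes at the range ends. A sketch, assuming the intended set is ASCII letters, digits, `-`, and `_`; the names `ALLOWED` and `english_char_set` are hypothetical:

```python
import string

# Hypothetical names; assumed allowed set: ASCII letters, digits, '-' and '_'
ALLOWED = set(string.ascii_letters + string.digits + '-_')

def english_char_set(text):
    # True only if every character of text is in ALLOWED
    return ALLOWED.issuperset(text)

print(english_char_set('Abc123_#'))  # False, '#' is not in the allowed set
print(english_char_set('Abc123_'))   # True
```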
awault