How to express [a-zA-Z0-9] for one char?

Question

I need to convert this C++ function into Python:

bool validText(const char *text)
{
    bool result = false;
    while (*text)
    {
        if ((*text) & 0x80)
        {
            result = true;
            text++; 
        }
        else if ((*text >= 'a' && *text <= 'z' || *text >= 'A' && *text <= 'Z') ||
                 ((*text) >= '0' && (*text) <= '9'))
        {
            result = true;
        }
        text++;
    }
    return result;
}

My understanding is that it checks whether the string consists of chars whose integer value is >= 128 or that fall within the range [a-zA-Z0-9].

My python version looks like:

    def validText(text):
        valid = False
        for s in text:
            c = ord(s)
            if c >= 128:
                valid = True
                break
            elif( (c>='a' and c<='z') or (c>='A' and c<='Z') (c>='0' and c<='9') ):
                valid = True
                break

        return valid

I have two questions about this:

Is my understanding of the C++ right?
Is the Python version right?

I can't find a proper string to test this with, so I'm not sure whether I am doing the right thing.

Can't you just call `text.isalnum()`? It checks whether the given string consists only of alphanumeric characters. — Take_Care_, Feb 03 '22 at 19:12
@Take_Care_, No. You have to also check all other kinds of characters with 0x80 and above. — marlon, Feb 03 '22 at 19:13
Looks like it does what you say... I don't know C++ so can't comment on your understanding of it. — Eli Harold, Feb 03 '22 at 19:15
@marlon why would `.isalnum()` not work? you just want to know if any single char is alnum right? so why not `for s in text:` `if s.isalnum(): return True` — Eli Harold, Feb 03 '22 at 19:17
It's actually looking to see if *any* character in the string is an alphanumeric ASCII character or has its high bit set. `result` is initialized to false and irrevocably changed to true the first time you see such a character. As such, it's not clear why it doesn't simply return true immediately, rather than continuing through the string, if it can never return false again. — chepner, Feb 03 '22 at 19:17
(There's also the weird--yet irrelevant--assumption that any character following 0x80 should be ignored.) — chepner, Feb 03 '22 at 19:18
`any('a' <= x.lower() <= 'z' or ord(x) >= 128 or '0' <= x <= '9' for x in text)` would be accurate, but it's not clear that the C++ code is doing what is intended in the first place. — chepner, Feb 03 '22 at 19:19
@chepner what do you mean by "continuing through the string?" doesn't it break the loop immediately and return `True`? or are you talking about the C++ code? — Eli Harold, Feb 03 '22 at 19:21
@chepner "any character following 0x80" means "(*text) & 0x80". I am not familiar with c++ either. — marlon, Feb 03 '22 at 19:22
@EliHarold Your Python breaks; the C does not. There's an extra `text++` when the character is greater than 127. — chepner, Feb 03 '22 at 19:39
@chepner it's not my code... but thanks for the clarification. — Eli Harold, Feb 03 '22 at 20:44

Bemis · Answer 1 · 2022-02-03T20:24:55.853

The first check, `(*text) & 0x80`, simply tests whether the char is non-ASCII (https://www.asciitable.com/), so a greater-than-or-equal-to-128 comparison would work.
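
As a rough illustration of why the two tests line up: the C++ loop sees bytes, and, assuming the input is UTF-8 encoded (the question does not say), every byte of a non-ASCII character has its high bit set while ASCII bytes do not. The helper name `high_bit_flags` is made up for this sketch.

```python
def high_bit_flags(text):
    """Return, per UTF-8 byte, whether the 0x80 bit is set (assumes UTF-8 input)."""
    return [bool(b & 0x80) for b in text.encode("utf-8")]

print(high_bit_flags("a"))  # a single ASCII byte, high bit clear
print(high_bit_flags("Ω"))  # two bytes (0xCE 0xA9), high bit set on both
```

On a Python `str`, `ord(ch) >= 128` expresses the same idea per character rather than per byte.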

The program itself appears to check whether the string contains at least one character that is either non-ASCII or alphanumeric (`isalnum` on the whole string will not work!). The C++ program, however, does not short-circuit the loop: it always iterates through the entire null-terminated input, even after it has found a character that meets the condition. It is odd that the program skips the following character when the first criterion (non-ASCII) is met (the extra `text++`). If you want a matching Python port, you can modify how you iterate:

def valid_text(text):
    valid = False
    i = 0
    while i < len(text):
        s = text[i]
        c = ord(s)
        if c >= 128:
            # Non-ASCII: mirror the extra text++ in the C++ code,
            # which also skips the next character.
            valid = True
            i += 1
        elif ord('a') <= c <= ord('z') or ord('A') <= c <= ord('Z') or ord('0') <= c <= ord('9'):
            # ASCII letter or digit
            valid = True
        i += 1
    return valid

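Logically your approach should work, but you need to be sure to do ord-to-ord comparison (an int `c` cannot be compared with a string such as `'a'`), and your condition is missing an `or` before the digit range.

Here are some example runs using Python 3.6: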

>>> valid_text('////')
False
>>> valid_text('////a')
True
>>> valid_text('/a///')
True
>>> valid_text('/Ω///')
True

Short circuit example:

    def valid_text(text):
        i = 0
        while i < len(text):
            s = text[i]
            c = ord(s)
            if c >= 128:
                # Non-ASCII character found: stop right away
                return True
            elif ord('a') <= c <= ord('z') or ord('A') <= c <= ord('Z') or ord('0') <= c <= ord('9'):
                # ASCII letter or digit found: stop right away
                return True
            i += 1
        return False

Example without `ord`:

    def valid_text(text):
        i = 0
        while i < len(text):
            c = text[i]
            if c >= '\x80':
                return True
            elif 'a' <= c <= 'z' or 'A' <= c <= 'Z' or '0' <= c <= '9':
                return True
            i += 1
        return False
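
Since the title asks how to express `[a-zA-Z0-9]` for one character, the same test can also be sketched with the standard `re` module. This assumes the short-circuit semantics above (one ASCII alphanumeric character, or one code point >= 128, is enough); the names `ONE_VALID_CHAR` and `valid_text_re` are made up for this example.

```python
import re

# One character that is an ASCII letter or digit, or any code point >= 0x80.
ONE_VALID_CHAR = re.compile(r"[a-zA-Z0-9\u0080-\U0010FFFF]")

def valid_text_re(text):
    """Return True if at least one character of text matches the class above."""
    return ONE_VALID_CHAR.search(text) is not None

print(valid_text_re("////"), valid_text_re("/a///"), valid_text_re("/Ω///"))
```

To test exactly one character against the ASCII class only, `re.fullmatch(r"[a-zA-Z0-9]", ch)` is an option.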

Why don't you return True immediately in the Python code? Continuing seems pointless. — marlon, Feb 03 '22 at 19:58
This was to show that a direct port can be done. If you want a logical port, you can "short circuit" (return right away on the first True). — Bemis, Feb 03 '22 at 20:11
"/" is ascii! However, it does not meet the alpha-numeric criteria of the "else-if" — Bemis, Feb 03 '22 at 20:32

score 0 · Answer 2 · answered Feb 03 '22 at 19:14


You can use text.isalnum()

'asadslk adçs'.isalnum()
False

'azAZ90'.isalnum()
True
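
One caveat, sketched below with the string from the example above: `str.isalnum()` on the whole string requires every character to be alphanumeric, whereas the C++ function is satisfied by a single matching character. Applying `isalnum()` per character with `any()` is closer, though Python's `isalnum()` is Unicode-aware rather than limited to `[a-zA-Z0-9]`.

```python
text = "asadslk adçs"

# Whole-string check: fails because of the space.
whole = text.isalnum()

# Per-character check: true as soon as one character is alphanumeric.
any_char = any(ch.isalnum() for ch in text)

print(whole, any_char)  # False True
```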

answered Feb 03 '22 at 19:14

Daniel Lopes


This will ignore the first part in c++: "(*text) & 0x80" – marlon Feb 03 '22 at 19:16
But I guess you know what this `(*text) & 0x80` in the C++ does, do you? It is a check for skipping UTF-8 follow (continuation) bytes: https://lemire.me/blog/2018/05/09/how-quickly-can-you-check-that-a-string-is-valid-unicode-utf-8/ And because in Python you already have the data decoded into a `str`, you don't need this check, I guess. – Take_Care_ Feb 03 '22 at 19:38
@Take_Care_ If it's intended to skip an entire UTF-8 sequence, it's doing it poorly. If it's a 3-character sequence, it's going to skip 4 bytes total. – chepner Feb 03 '22 at 19:41
@chepner I am not an author of this c++ code, I am just trying to guess what intention of this 0x80 check was. – Take_Care_ Feb 03 '22 at 19:46
I guess my take is that you really don't want to port this function as written, because it's almost certainly incorrect. – chepner Feb 03 '22 at 19:48

score 0 · Accepted Answer · answered Feb 03 '22 at 19:46

First, you should realize that any character that has its highest bit set (i.e. `(c & 0x80) == 0x80`) is non-ASCII.

To answer your questions:

Your understanding seems correct; to be precise, the C++ code checks whether the string contains any character that is alphanumeric or non-ASCII. However, the C++ code probably contains a bug: for any non-ASCII character seen, it advances the pointer once, and then advances it again at the end of the iteration. If the last byte of the string is non-ASCII, the pointer steps over the null terminator and the loop keeps reading past the end of the string.
The logic of the Python code matches your understanding, although `c` is an int and has to be compared against `ord('a')`, `ord('z')`, and so on rather than against the characters themselves. A much simpler solution would be to replace the code with (assuming Python >= 3.7, where `str.isascii()` was added):

def validText(text):
    return bool(text and (not text.isascii() or any(t.isalnum() for t in text)))

This does what appears to be the intent of the C++ code:

for ex in [
    "",
    "\n",
    "\na",
    "a\n",
    "1\n",
    "\n1",
    "\n漢",
    "漢\n"
]:
    print(f"'{ex}': {validText(ex)}")
    print("----")

Output

'': False
----
'
': False
----
'
a': True
----
'a
': True
----
'1
': True
----
'
1': True
----
'
漢': True
----
'漢
': True
----
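
The pointer over-advance described in this answer can be simulated in Python by walking the UTF-8 bytes with an index the way the C++ walks its pointer. This is only a sketch under the assumption that the input is UTF-8; `final_index` is a hypothetical helper, and Python cannot reproduce the actual out-of-bounds read.

```python
def final_index(text):
    """Mimic the C++ pointer arithmetic on UTF-8 bytes plus a trailing NUL.

    Returns (index where the walk ends, buffer length including the NUL).
    """
    data = text.encode("utf-8") + b"\x00"
    i = 0
    while i < len(data) and data[i] != 0:
        if data[i] & 0x80:
            i += 1  # the extra text++ for a high-bit byte
        i += 1      # the unconditional text++
    return i, len(data)

print(final_index("a\n"))  # (2, 3): stops on the terminator
print(final_index("漢"))   # (4, 4): the index has stepped past the terminator
```

When the returned index equals the buffer length, the C++ loop would have skipped the terminator and dereferenced memory beyond the string.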

The issue is that the function should return `false` if `text` is empty. However, `text and (bool)`, when `text` is empty, returns `''`, which probably isn't desired. — wLui155, Feb 03 '22 at 20:20
'' and True is '', not False. Why doesn't a logical operator produce a logic result? — marlon, Feb 03 '22 at 20:25
That is correct. The behavior comes from how `and` and `or` deal with truthy values; the answer [here](https://stackoverflow.com/questions/47007680/how-do-and-and-or-act-with-non-boolean-values#:~:text=The%20only%20operator%20returning%20a%20boolean%20value%20regardless,%28args%29%20-%20min%20%28args%29%20%22%2C%20otherwise%20return%200.) explains it. — wLui155, Feb 03 '22 at 20:59
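
A quick sketch of the behavior discussed in these comments: `and` and `or` return one of their operands rather than a strict boolean, which is why wrapping the expression in `bool()` matters when `text` is empty.

```python
empty = ""

raw = empty and True             # `and` returns its first falsy operand
coerced = bool(empty and True)   # bool() turns that operand into a real False

print(repr(raw), coerced)  # '' False
```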
