-1

I have a list with a lot of words, so I don't want to write a nested loop, 'cause it will take a lot of time for the program to run. So maybe there is a way to check whether the word consists of punctuation, something like function any(map(str.isdigit, s1)) isdigits when we have to check numbers?

jamesss
  • 87
  • 1
  • 6

2 Answers2

1

Unless the list is very large, or your CPU is low-performance, it is not going to take much time to process a list of words. Consider the example below, which has 1 million 20-character strings.

import random
import string

In [16]: s = [''.join(random.choices(string.ascii_letters + string.punctuation, k=20)) for _ in range(1000000)]

In [17]: %%timeit -n 3 -r 3
    ...: [any(map(str.isdigit, s1)) for s1 in s]
    ...: 
    ...: 
1.23 s ± 2.53 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)

In [18]: %%timeit -n 3 -r 3
    ...: [any([s2 in string.punctuation for s2 in s1]) for s1 in s]
    ...: 
    ...: 
1.72 s ± 18.1 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)

You could speed it up with a regular expression

import re
import string

In [16]: s = [''.join(random.choices(string.ascii_letters + string.punctuation, k=20)) for _ in range(1000000)]

In [17]: patt = re.compile('[%s]' % re.escape(string.punctuation))

In [18]: %%timeit -n 3 -r 3
[bool(re.match(patt, s1)) for s1 in s]

1.03 s ± 3.23 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
Eric Truett
  • 2,970
  • 1
  • 16
  • 21
0

It may depend on what you define as "punctuation". The module string defines string.punctuation as '!"#$%&\'()*+,-./:;<=>?@[\\]^_``{|}~'. You may also define it as "what isn't alphanumeric" (a-zA-Z0-9), or "what isn't alpha" (a-zA-Z).

Here I define a very long string of alphanumeric characters, and the same with an added dot ., shuffled.

import numpy as np
import string

mystr_no_punct = np.random.choice(list(string.ascii_letters) + 
                                  list(string.digits), 1e8)
mystr_withpunct = np.append(mystr_no_punct, '.')
np.random.shuffle(mystr_no_punct)
mystr_withpunct = "".join(mystr_withpunct)
mystr_no_punct = "".join(mystr_no_punct)

Below is an implementation of the naive iteration with a for loop, and some possible answers, according to what you look for, with time comparisons

def naive(mystr):
    for x in mystr_no_punct:
        if x in string.punctuation:
            return False
    return True

# naive solution
%timeit naive(mystr_withpunct)
%timeit naive(mystr_no_punct)

# check if string is only alnum
%timeit str.isalnum(mystr_withpunct) 
%timeit str.isalnum(mystr_no_punct)

# reduce to a set of the present characters, compare with the set of punctuation characters
%timeit len(set(mystr_withpunct).intersection(set(string.punctuation))) > 0
%timeit len(set(mystr_no_punct).intersection(set(string.punctuation))) > 0

# use regex
import re

%timeit len(re.findall(rf"[{re.escape(string.punctuation)}]+", mystr_withpunct)) > 0
%timeit len(re.findall(rf"[{re.escape(string.punctuation)}]+", mystr_no_punct)) > 0

With the following results

# naive
53.9 ms ± 928 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
53.1 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# str.isalnum
4.17 ms ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.47 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# sets intersection
8.26 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.2 ms ± 48.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# regex
8.43 ms ± 84 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.51 ms ± 60.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So using the built-in isalnum is clearly the fastest. But if you have specific needs, regex or sets intersection seem a good fit.

Battleman
  • 392
  • 2
  • 12