Regex inside findall vs regex inside count

Question

This is a follow-up to "How to count characters in a string?" and "Find out how many times a regex matches in a string in Python".

I want to count all alphabetic characters in the string:

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

The str.count() method allows for counting a specific letter. How would one do that for counting any letter in the entire alphabet in a string, using the count method?

I am trying to use a regex inside the count method, but it returns 0 instead of 83. The code I am using is:

import re

spam_data['text'][0].count((r'[a-zA-Z]'))

When I use:

len(re.findall((r'[a-zA-Z]'), spam_data['text'][0]))

it returns 83.

Why does count return 0 here?

`count()` doesn't accept a regex; it treats its argument as a plain string. – BladeMight, Oct 18 '18 at 21:52

Abhi · Accepted Answer · 2018-10-18T22:05:26.590


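Use the pandas .str.count accessor on the Series rather than Python's built-in str.count on a single string. The accessor treats its argument as a regex pattern:

spam_data['text'].str.count(r'\w')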

0    83
Name: text, dtype: int64

To access the first value use:

spam_data['text'].str.count(r'\w')[0]
83
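
One caveat worth keeping in mind: \w matches digits and underscores as well as letters, so it only equals the letter count when the text contains neither, as happens to be the case for the sample sentence. A minimal sketch with the standard re module, using a made-up sample string, shows the difference:

```python
import re

# Hypothetical sample containing letters, an underscore, and digits.
sample = "abc_123"

# \w matches letters, digits, and the underscore.
word_chars = len(re.findall(r"\w", sample))

# [a-zA-Z] matches ASCII letters only.
ascii_letters = len(re.findall(r"[a-zA-Z]", sample))

print(word_chars, ascii_letters)
```

If only letters should be counted, passing r'[a-zA-Z]' to .str.count should behave the same way, since the accessor interprets its argument as a regex.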

edited Oct 18 '18 at 22:05

answered Oct 18 '18 at 21:58

Abhi


Do you know why `.str.count('\w')` works for `spam_data['text'].str.count('\w')` (i.e. a dataframe column), but not for an indexed Series created from spam_data['text']? – ZakS Oct 27 '18 at 18:28
It's not clear what you meant here. Could you add some example code that shows the issue? – Abhi Oct 28 '18 at 06:02
Hi @Abhi, if it's possible to look here I'd be grateful! https://stackoverflow.com/questions/53026049/when-does-str-count-w-work-and-when-doesnt-it?noredirect=1#comment92953894_53026049 – ZakS Oct 28 '18 at 10:33

deadvoid · Answer 2 · 2018-10-18T23:05:10.427

How would one do that for counting any letter in the entire alphabet in a string, using the count method?

>>> wrd = 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
>>> count = sum([''.join({_ for _ in wrd if _.isalpha()}).count(w) for w in wrd])
>>> count
83

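Explanation: build the set of unique letters in wrd, join them into a string, and for each character of wrd count its occurrences in that string (1 for a letter, 0 for anything else). The sum is the number of letters.
An equivalent loop, which instead adds up wrd.count(w) for each unique letter: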

count = []
set_w = set()
for w in wrd:
    if w.isalpha():
        set_w.add(w)

for w in set_w:
    count.append(wrd.count(w))

print(sum(count))
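
If the count method is not a hard requirement, a shorter route (a sketch, not part of the approach above) is to test each character with str.isalpha directly. Note that isalpha also accepts non-ASCII letters such as 'é', which [a-zA-Z] would not match.

```python
wrd = ('Go until jurong point, crazy.. Available only in bugis n great '
       'world la e buffet... Cine there got amore wat...')

# isalpha() returns a bool; each True counts as 1 when summed.
letter_count = sum(ch.isalpha() for ch in wrd)
print(letter_count)
```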

Willem Van Onsem · Answer 3 · 2018-10-19T12:11:54.877

Short answer: you did not use a regex but a raw string literal, so count looks for occurrences of the literal string '[a-zA-Z]'.

A string of the form r'..' is not a regex; it is a raw string literal. If you write r'\n', you get a string of two characters, a backslash and an n, not a newline. Raw strings are useful in the context of regexes because regexes rely heavily on backslash escapes as well.

For example:

>>> r'\n'
'\\n'
>>> type(r'\n')
<class 'str'>
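
To make the distinction concrete, here is a small sketch (the variable names are illustrative): the r prefix only changes how backslashes in the literal are read, and the result is still a plain str. It is treated as a regex only once it is handed to the re module.

```python
import re

raw = r'[a-zA-Z]'
plain = '[a-zA-Z]'

# No backslashes involved, so the r prefix changes nothing here.
same_string = (raw == plain)
is_str = type(raw) is str

# With a backslash the two literal forms differ.
raw_len = len(r'\n')      # two characters: backslash and n
escaped_len = len('\n')   # one character: a newline

# Only the re module interprets the string as a pattern.
letters = re.compile(raw).findall('a1b')
print(same_string, is_str, raw_len, escaped_len, letters)
```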

So here you count the number of times the string '[a-zA-Z]' occurs, and unless your spam_data['text'][0] literally contains a square bracket [ followed by a, etc., the count will be zero. As the documentation of str.count [Python-doc] puts it:

str.count(sub[, start[, end]])

Return the number of non-overlapping occurrences of substring sub in the range [start, end]. Optional arguments start and end are interpreted as in slice notation.
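
A quick sketch of that behaviour, using made-up strings: str.count searches for the exact characters [a-zA-Z], so it only reports a hit when they appear verbatim in the text.

```python
text = 'Go until jurong point'

# The text has no literal "[a-zA-Z]" in it, so the count is zero.
no_match = text.count(r'[a-zA-Z]')

# Here the characters "[a-zA-Z]" do appear verbatim, once.
literal_match = 'letters: [a-zA-Z]'.count(r'[a-zA-Z]')

# Counting one specific letter works as usual.
n_count = text.count('n')
print(no_match, literal_match, n_count)
```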

In case the string is rather large, and you do not want to construct a list of matches, you can count the number of elements with:

sum(1 for _ in re.finditer('[a-zA-Z]', 'mystring'))

It is, however, typically faster to simply use re.findall(..) and then take the length of the result.
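
Both approaches can be compared side by side on the sentence from the question (a sketch; the variable names are illustrative):

```python
import re

text = ('Go until jurong point, crazy.. Available only in bugis n great '
        'world la e buffet... Cine there got amore wat...')
pattern = re.compile(r'[a-zA-Z]')

# findall builds the full list of matches; its length is the count.
via_findall = len(pattern.findall(text))

# finditer yields match objects lazily, so no intermediate list is stored.
via_finditer = sum(1 for _ in pattern.finditer(text))
print(via_findall, via_finditer)
```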

BladeMight · Answer 4 · 2018-10-18T21:47:48.147


In this one:

spam_data['text'][0].count((r'[a-zA-Z]'))

str.count takes a plain substring, not a regex. Your text does not contain the literal characters '[a-zA-Z]', which is why it returns 0.

Use your second example (re.findall) instead.

edited Oct 18 '18 at 21:47

answered Oct 18 '18 at 21:33

BladeMight



Then why does it return `1` for `'[a]'.count(r'[a]')`? – Willem Van Onsem Oct 18 '18 at 21:42
but now your answer seems to suggest that because the item to count is not a regex, it will always return `0`. – Willem Van Onsem Oct 18 '18 at 21:47

Or if the text literally contains the string passed as count's first parameter, e.g. `'a in [a]'.count(r'[a]')` => `1` – BladeMight Oct 18 '18 at 21:51
