38

I'm trying to create a new column in a DataFrame that contains the word count for the respective row. I'm looking for the total number of words, not frequencies of each distinct word. I assumed there would be a simple/quick way to do this common task, but after googling around and reading a handful of SO posts (1, 2, 3, 4) I'm stuck. I've tried the solutions put forward in the linked SO posts, but got lots of attribute errors back.

words = df['col'].split()
df['totalwords'] = len(words)

results in

AttributeError: 'Series' object has no attribute 'split'

and

f = lambda x: len(x["col"].split()) -1
df['totalwords'] = df.apply(f, axis=1)

results in

AttributeError: ("'list' object has no attribute 'split'", 'occurred at index 0')
LMGagne

6 Answers

62

str.split + str.len

str.len works on any object column, whether its elements are strings or lists, so it can be chained after str.split.

df['totalwords'] = df['col'].str.split().str.len()
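
For example (a minimal sketch with made-up data), str.split first turns each string into a list of words, and str.len then returns the length of each list:

>>> df = pd.DataFrame({'col': ['This is one sentence', 'and another']})
>>> df['col'].str.split()
0    [This, is, one, sentence]
1               [and, another]
dtype: object
>>> df['col'].str.split().str.len()
0    4
1    2
dtype: int64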

str.count

If your words are separated by single spaces, you can simply count the spaces and add 1.

df['totalwords'] = df['col'].str.count(' ') + 1
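
The single-space assumption matters: with consecutive spaces this overcounts, while split() collapses runs of whitespace (a quick illustration with made-up data):

>>> df = pd.DataFrame({'col': ['two  spaces here']})
>>> df['col'].str.count(' ') + 1
0    4
dtype: int64
>>> df['col'].str.split().str.len()
0    3
dtype: int64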

List Comprehension

This is faster than you think!

df['totalwords'] = [len(x.split()) for x in df['col'].tolist()]
cs95
  • @lucid_dreamer it splits each string on whitespace into a list of words, then returns the length of each list. – cs95 Jun 16 '19 at 20:30
  • but why not just `df['totalwords'] = df['col'].str.split().len()`? – lucid_dreamer Jun 16 '19 at 21:42
  • @lucid_dreamer because that's not correct? There's no len() function defined on Series. – cs95 Jun 16 '19 at 22:55
  • I'm confused. First str.split() returns a Series where each element is a list, right? After that I'm lost. – lucid_dreamer Jun 16 '19 at 23:45
  • oh! I get it now: str.len can operate on arbitrary lists, it's not a string-specific function. Confusing. – lucid_dreamer Jun 16 '19 at 23:48
  • 2
    @lucid_dreamer Take a look at [Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html), pandas has a suite of str methods defined on object dtype columns, some of them (such as `str.len()` and `str.count()`) work for arbitrary containers. – cs95 Jun 17 '19 at 00:04
  • If you have things like commas, this will count them as words. See [this answer](https://stackoverflow.com/questions/50444346/fast-punctuation-removal-with-pandas) for how to remove the punctuation before getting the word count. – user2739472 Oct 25 '20 at 06:10
  • 1
    List comprehension is much faster. It took me 1 min, vs. 3 min with the split+len method. – Ferro Nov 18 '21 at 14:22
17

Here is a way using .apply():

df['number_of_words'] = df.col.apply(lambda x: len(x.split()))

example

Given this df:

>>> df
                    col
0  This is one sentence
1           and another

After applying .apply():

df['number_of_words'] = df.col.apply(lambda x: len(x.split()))

>>> df
                    col  number_of_words
0  This is one sentence                4
1           and another                2

Note: As pointed out in the comments, and in this answer, .apply is not necessarily the fastest method. If speed is important, better to go with one of @cᴏʟᴅsᴘᴇᴇᴅ's methods.

sacuL
  • 2
    apply = slower version of a loop. This is a good idea, but something like `[len(x.split()) for x in df['col']]` would be prime. It's in my answer but feel free to add it to yours as well. – cs95 Apr 23 '18 at 15:47
8

This is one way using pd.Series.str.split and pd.Series.map:

df['word_count'] = df['col'].str.split().map(len)

The above assumes that df['col'] is a series of strings.

Example:

df = pd.DataFrame({'col': ['This is an example', 'This is another', 'A third']})

df['word_count'] = df['col'].str.split().map(len)

print(df)

#                   col  word_count
# 0  This is an example           4
# 1     This is another           3
# 2             A third           2
jpp
4

With list and map, using the data from @cᴏʟᴅsᴘᴇᴇᴅ's answer:

list(map(lambda x: len(x.split()), df.col))
Out[343]: [4, 3, 2]
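
To store the result as a new column (assuming df['col'] contains plain strings), assign the list back directly:

df['totalwords'] = list(map(lambda x: len(x.split()), df.col))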
BENY
0

You could also map the split and len methods over the strings in the DataFrame column:

df['word_count'] = [*map(len, map(str.split, df['col'].tolist()))]
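
As a small illustration (made-up data): the inner map splits each string into a list of words, the outer map takes each list's length, and [*...] unpacks the result into a list:

>>> [*map(len, map(str.split, ['one apple', 'box of oranges']))]
[2, 3]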

Here's a preliminary benchmark of the answers given here; map seems to do well on very large Series:

df = pd.DataFrame(['one apple','banana','box of oranges','pile of fruits outside', 
                   'one banana', 'fruits']*100000, 
                  columns=['col'])
>>> df.shape
(600000, 1)

>>> %timeit df['word_count'] = df['col'].str.split().str.len()
761 ms ± 43.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df['word_count'] = df['col'].str.count(' ').add(1)
691 ms ± 71.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df['word_count'] = [len(x.split()) for x in df['col'].tolist()]
405 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df['word_count'] = df['col'].apply(lambda x: len(x.split()))
450 ms ± 22.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df['word_count'] = df['col'].str.split().map(len)
657 ms ± 27.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df['word_count'] = list(map(lambda x : len(x.split()), df['col'].tolist()))
435 ms ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df['word_count'] = [*map(len, map(str.split, df['col'].tolist()))]
329 ms ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
0

You can use a simple regular expression with pandas' built-in str.count() method:

df['total_words'] = df['col'].str.count(r'\w+')
  • The \w character class matches any word character: a letter, digit, or underscore. For ASCII text this is equivalent to the character range [A-Za-z0-9_]; with Unicode strings it also matches word characters from other alphabets.

  • The + quantifier means one or more repetitions.

Or use the following regex if you would like words consisting of alphabetic symbols only:

df['total_words'] = df['col'].str.count(r'[A-Za-z]+')
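
As a quick illustration (made-up data), the regex-based count ignores stray punctuation tokens, whereas a plain whitespace split counts them as words:

>>> df = pd.DataFrame({'col': ['Hello , world !']})
>>> df['col'].str.count(r'\w+')
0    2
dtype: int64
>>> df['col'].str.split().str.len()
0    4
dtype: int64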
George Shimanovsky