You could always remove the whitespace before taking the length:
>>> nlp_df['count_of_tokens'] = nlp_df['headline_text'].str.replace(r'\s', '', regex=True).str.len()
>>> nlp_df
   publish_date                                    headline_text  count_of_tokens
0      20030219      aba decides against community broadcasting               38
1      20030219  act fire witnesses must be aware of defamation               39
2      20030219  a g calls for infrastructure protection summit               40
3      20030219        air nz staff in aust strike for pay rise               32
4      20030219   air nz strike to affect australian travellers               39
Or subtract the number of whitespace characters from the total length:
>>> nlp_df['count_of_tokens'] = nlp_df['headline_text'].str.len() - nlp_df['headline_text'].str.count(r'\s')
>>> nlp_df
   publish_date                                    headline_text  count_of_tokens
0      20030219      aba decides against community broadcasting               38
1      20030219  act fire witnesses must be aware of defamation               39
2      20030219  a g calls for infrastructure protection summit               40
3      20030219        air nz staff in aust strike for pay rise               32
4      20030219   air nz strike to affect australian travellers               39
\s is the regex class that matches any whitespace character. See the doc:

Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.
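For instance, \s matches a non-breaking space by default, but not when the ASCII flag is set (a quick standalone check with the re module; the example string is made up):

>>> import re
>>> re.findall(r'\s', 'foo\u00a0bar')             # non-breaking space counts as whitespace
['\xa0']
>>> re.findall(r'\s', 'foo\u00a0bar', re.ASCII)   # but not in ASCII mode
[]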
Of course you can also use the non-whitespace class \S, as suggested by @mozway, which is even simpler:
>>> nlp_df['count_of_tokens'] = nlp_df['headline_text'].str.count(r'\S')
>>> nlp_df
   publish_date                                    headline_text  count_of_tokens
0      20030219      aba decides against community broadcasting               38
1      20030219  act fire witnesses must be aware of defamation               39
2      20030219  a g calls for infrastructure protection summit               40
3      20030219        air nz staff in aust strike for pay rise               32
4      20030219   air nz strike to affect australian travellers               39
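If you ever want the ASCII-only behaviour described in the quoted doc, str.count accepts the same flags argument as re. Just a sketch; for plain ASCII headlines like these it makes no difference:

>>> import re
>>> nlp_df['count_of_tokens'] = nlp_df['headline_text'].str.count(r'\S', flags=re.ASCII)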