I have the following DataFrame in pandas:

    publish_date    headline_text
    20030219        aba decides against community broadcasting 
    20030219        act fire witnesses must be aware of defamation
    20030219        a g calls for infrastructure protection summit
    20030219        air nz staff in aust strike for pay rise
    20030219        air nz strike to affect australian travellers

I want to add one more column that shows the count of tokens, excluding whitespace.

This is what I am doing, but the count it gives me includes the whitespace:

    nlp_df['count_of_tokens'] = nlp_df['headline_text'].str.len()

Neil

2 Answers

You could always remove the whitespace before taking the length:

>>> nlp_df['count_of_tokens'] = nlp_df['headline_text'].str.replace(r'\s', '', regex=True).str.len()
>>> nlp_df
   publish_date                                   headline_text  count_of_tokens
0      20030219      aba decides against community broadcasting               38
1      20030219  act fire witnesses must be aware of defamation               39
2      20030219  a g calls for infrastructure protection summit               40
3      20030219        air nz staff in aust strike for pay rise               32
4      20030219   air nz strike to affect australian travellers               39

Or subtract the number of whitespace characters from the total length:

>>> nlp_df['count_of_tokens'] = nlp_df['headline_text'].str.len() - nlp_df['headline_text'].str.count(r'\s')
>>> nlp_df
   publish_date                                   headline_text  count_of_tokens
0      20030219      aba decides against community broadcasting               38
1      20030219  act fire witnesses must be aware of defamation               39
2      20030219  a g calls for infrastructure protection summit               40
3      20030219        air nz staff in aust strike for pay rise               32
4      20030219   air nz strike to affect australian travellers               39

\s is the regex character class that matches any whitespace character. From the Python re documentation:

Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.
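
As a quick check of the quoted behaviour, here is a small sketch using Python's re module directly. The sample string is made up for illustration: it contains one regular space and one non-breaking space (U+00A0), which \s matches by default but not under the re.ASCII flag.

```python
import re

# Hypothetical sample: a regular space, then a non-breaking space (U+00A0).
text = "air nz\u00a0strike"

# By default, \s in a str pattern matches Unicode whitespace.
unicode_matches = re.findall(r"\s", text)

# With re.ASCII, only [ \t\n\r\f\v] are matched.
ascii_matches = re.findall(r"\s", text, flags=re.ASCII)

print(len(unicode_matches))  # 2
print(len(ascii_matches))    # 1
```

The pandas string methods used here rely on the same regex engine, so a headline containing non-breaking spaces would have them treated as whitespace as well.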

Of course, you can also count the non-whitespace class \S directly, as suggested by @mozway, which is even simpler:

>>> nlp_df['count_of_tokens'] = nlp_df['headline_text'].str.count(r'\S')
>>> nlp_df
   publish_date                                   headline_text  count_of_tokens
0      20030219      aba decides against community broadcasting               38
1      20030219  act fire witnesses must be aware of defamation               39
2      20030219  a g calls for infrastructure protection summit               40
3      20030219        air nz staff in aust strike for pay rise               32
4      20030219   air nz strike to affect australian travellers               39
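
For anyone who wants to reproduce this, here is a self-contained sketch. The DataFrame is rebuilt by hand from the five rows in the question, and the names no_ws, minus_ws and non_ws are only for illustration. It checks that the three approaches give the same counts on this data.

```python
import pandas as pd

# Sample data copied from the question.
nlp_df = pd.DataFrame({
    "publish_date": [20030219] * 5,
    "headline_text": [
        "aba decides against community broadcasting",
        "act fire witnesses must be aware of defamation",
        "a g calls for infrastructure protection summit",
        "air nz staff in aust strike for pay rise",
        "air nz strike to affect australian travellers",
    ],
})

text = nlp_df["headline_text"]

# 1. Strip whitespace, then take the length.
no_ws = text.str.replace(r"\s", "", regex=True).str.len()
# 2. Total length minus the number of whitespace characters.
minus_ws = text.str.len() - text.str.count(r"\s")
# 3. Count the non-whitespace characters directly.
non_ws = text.str.count(r"\S")

# All three should agree row by row.
assert (no_ws == minus_ws).all() and (minus_ws == non_ws).all()

nlp_df["count_of_tokens"] = non_ws
print(nlp_df["count_of_tokens"].tolist())  # [38, 39, 40, 32, 39]
```
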
Cimbali

If I understand correctly, you want to count either the words or the non-whitespace characters?

counting the words:

You can count the runs of non-whitespace characters:

df['words'] = df['headline_text'].str.count(r'\S+')

or, split the string on whitespace and get the list length:

df['words'] = df['headline_text'].str.split(r'\s+').apply(len)

or, count the separators and add 1 (both this and the split above assume there is no leading or trailing whitespace):

df['words'] = df['headline_text'].str.count(r'\s+').add(1)

counting the letters (every non-whitespace character):

df['letters'] = df['headline_text'].str.count(r'\S')

output:

   publish_date                                   headline_text  words  letters
0      20030219      aba decides against community broadcasting      5       38
1      20030219  act fire witnesses must be aware of defamation      8       39
2      20030219  a g calls for infrastructure protection summit      7       40
3      20030219        air nz staff in aust strike for pay rise      9       32
4      20030219   air nz strike to affect australian travellers      7       39
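
One thing worth knowing: the word-counting variants only agree when a headline has no leading or trailing whitespace. Below is a minimal sketch using made-up headlines padded with spaces (these are not from the original data, and the variable names are only illustrative). It also includes str.split() with no pattern, which splits on whitespace the way Python's own str.split() does.

```python
import pandas as pd

# Hypothetical headlines; the first has stray leading/trailing whitespace.
df = pd.DataFrame({"headline_text": ["  air nz strike ", "pay rise"]})
text = df["headline_text"]

# Runs of non-whitespace: unaffected by padding.
by_runs = text.str.count(r"\S+")
# Regex split keeps empty strings at both ends, so padding inflates the count.
by_regex_split = text.str.split(r"\s+").str.len()
# Separators + 1 also counts the leading and trailing runs.
by_separators = text.str.count(r"\s+").add(1)
# Split with no pattern drops the empty leading/trailing pieces.
by_plain_split = text.str.split().str.len()

print(by_runs.tolist())         # [3, 2]
print(by_regex_split.tolist())  # [5, 2]
print(by_separators.tolist())   # [5, 2]
print(by_plain_split.tolist())  # [3, 2]
```

So if the text might be padded, either count \S+ or call .str.strip() before using the other two.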

mozway