I would like to merge two columns into one as a list of words/tokens. Currently my dataset looks like:

A_Col   B_Col             C_Col
home    my house          I have a new house
paper   research paper    my mobile phone is broken
NaN     NaN               zoe zaczek who
NaN     NaN               two per cent

NaN marks an empty field.

What I would like to do is keep column A_Col but merge B_Col and C_Col, to get something like this:

A_Col   BC_Col            
home    ['my', 'house', 'I', 'have', 'a', 'new', 'house']
paper   ['research', 'paper', 'my', 'mobile', 'phone', 'is', 'broken']
NaN     ['zoe', 'zaczek', 'who']
NaN     ['two', 'per', 'cent']

Looking at the problem, the steps required should be:

  • tokenize B_Col;
  • tokenize C_Col;
  • merge the results;
  • remove NaN values, wherever they are.

For the first two points I am using the following:

df['B_Col'] = df.apply(lambda row: nltk.word_tokenize(row['B_Col']))
df['C_Col'] = df.apply(lambda row: nltk.word_tokenize(row['C_Col']))

For merging the results:

df['BC_Col'] = df['B_Col'] + df['C_Col']

Then I should remove NaN values.

However, something in my code does not work: I am not getting the tokenisation for B_Col and C_Col. I hope you can help me understand my error. Thanks.

still_learning
  • Does this answer your question? [How to apply NLTK word\_tokenize library on a Pandas dataframe for Twitter data?](https://stackoverflow.com/questions/44173624/how-to-apply-nltk-word-tokenize-library-on-a-pandas-dataframe-for-twitter-data) – Trenton McKinney Aug 15 '20 at 20:41
  • [nltk how to tokenize pandas columns](https://www.google.com/search?q=nltk+how+to+tokenize+pandas+columns+site:stackoverflow.com&sxsrf=ALeKk01kN4Px92rwcYq6RU306Nr-JUrZAg:1597524036958&sa=X&ved=2ahUKEwjSxvjliJ7rAhVcIjQIHboNDbwQrQIoBHoECAcQBQ&biw=1920&bih=977) – Trenton McKinney Aug 15 '20 at 20:42

1 Answer

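You could replace the NaN values with empty strings, split both columns on whitespace, and concatenate the resulting lists: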

df['BC_Col'] = df['B_Col'].fillna('').str.split() + df['C_Col'].fillna('').str.split()
df
    A_Col   B_Col            C_Col                       BC_Col
0   home    my house         I have a new house          [my, house, I, have, a, new, house]
1   paper   research paper   my mobile phone is broken   [research, paper, my, mobile, phone, is, broken]
2   NaN     NaN              zoe zaczek who              [zoe, zaczek, who]
3   NaN     NaN              two per cent                [two, per, cent]
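
As for why the code in the question does not tokenize anything: `df.apply(...)` without `axis=1` passes each whole column to the lambda, not each row, so `row['B_Col']` is not a single cell. Applying the function to one column at a time avoids that. Below is a minimal sketch of that approach, assuming the sample data from the question; the helper name `tokenize_or_empty` and its `tokenizer` parameter are made up for illustration, and plain `str.split` stands in for the tokenizer so the example runs without NLTK.

```python
import pandas as pd

# Sample data from the question; None stands for the empty (NaN) fields.
df = pd.DataFrame({
    "A_Col": ["home", "paper", None, None],
    "B_Col": ["my house", "research paper", None, None],
    "C_Col": ["I have a new house", "my mobile phone is broken",
              "zoe zaczek who", "two per cent"],
})


def tokenize_or_empty(value, tokenizer=str.split):
    """Hypothetical helper: tokenize a cell, or return [] if it is missing."""
    if pd.isna(value):
        return []
    return tokenizer(value)


# Series.apply calls the function once per cell of that column.
b_tokens = df["B_Col"].apply(tokenize_or_empty)
c_tokens = df["C_Col"].apply(tokenize_or_empty)

# Adding two Series of lists concatenates the lists row by row.
df["BC_Col"] = b_tokens + c_tokens

print(df[["A_Col", "BC_Col"]])
```

If NLTK and its tokenizer data are installed, passing `tokenizer=nltk.word_tokenize` as a keyword argument to `apply` should give NLTK tokens instead of whitespace-split ones; missing cells still become empty lists, so no separate NaN clean-up step is needed afterwards.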
Ayoub ZAROU