
I'm looking for a way to clean the following data:

enter image description here

I would like to output something like this:

enter image description here

with the tokenized words in the first column and their associated labels in the second.

Is there a particular strategy with Pandas and NLTK to obtain this type of output in one go?

Thank you in advance for your help or advice.

Lter
  • Use [this](https://stackoverflow.com/a/57122617/2901002) solution, not the accepted answer below or in the dupe. – jezrael Dec 07 '20 at 12:45

1 Answer


Given the first table, it is simply a matter of splitting the first column into tokens and repeating the second column's label for each token:

import pandas as pd

data = [['foo bar', 'O'], ['George B', 'PERSON'], ['President', 'TITLE']]
df1 = pd.DataFrame(data, columns=['col1', 'col2'])

print(df1)

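# For each row, build a Series indexed by the tokens of col1,
# with every value set to that row's col2 label.
df2 = pd.concat([pd.Series(row['col2'], row['col1'].split(' '))
                 for _, row in df1.iterrows()]).reset_index()
# reset_index() moves the tokens into a column named 'index'; the labels are in column 0.
df2 = df2.rename(columns={'index': 'col1', 0: 'col2'})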
print(df2)

The output:

        col1    col2
0    foo bar       O
1   George B  PERSON
2  President   TITLE

        col1    col2
0        foo       O
1        bar       O
2     George  PERSON
3          B  PERSON
4  President   TITLE

To split the first column on something other than a single space, look at the Series.str.split method, which accepts a regular expression and should let you handle delimiters that vary by language: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html
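As a rough sketch of that idea, the split can be combined with DataFrame.explode so that each token gets its own row and the label is repeated automatically. The delimiter pattern below (whitespace, commas, semicolons) is only an assumption for illustration and would need adjusting to the real data; the frame and column names are the sample ones from above.

```python
import pandas as pd

# Same sample frame as above.
df1 = pd.DataFrame(
    [["foo bar", "O"], ["George B", "PERSON"], ["President", "TITLE"]],
    columns=["col1", "col2"],
)

# Assumed delimiter pattern: one or more whitespace, comma or semicolon characters.
# A multi-character pattern is interpreted as a regular expression by str.split.
tokens = df1["col1"].str.split(r"[\s,;]+")

# Replace col1 with the token lists, then give each token its own row;
# explode() repeats the col2 label for every token of the original row.
df2 = df1.assign(col1=tokens).explode("col1").reset_index(drop=True)

print(df2)
```

This avoids iterating over rows with iterrows, which may matter on larger frames. Note that leading or trailing delimiters in a cell would produce empty-string tokens, so the column may need stripping first.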

If the first table is not given, there is no way to do this in one go with pandas alone, since pandas has no built-in NLP capabilities.

Max
  • Yes, unfortunately the accepted answer is not always the best answer, like this solution copied from the dupe :( – jezrael Dec 07 '20 at 12:46