
I want to check, row by row, whether the 'tokenizing' and 'lemmatization' columns hold the same values, as in the table below. But it gives me an error.


tokenizing                                     lemmatization                         check
[pergi, untuk, melakukan, penanganan, banjir]  [pergi, untuk, laku, tangan, banjir]  False
[baca, buku, itu, asik]                        [baca, buku, itu, asik]               True

from spacy.lang.id import Indonesian
import pandas as pd

nlp = Indonesian()
nlp.add_pipe('lemmatizer')
nlp.initialize()

data = [
    'pergi untuk melakukan penanganan banjir',
    'baca buku itu asik'
]

df = pd.DataFrame({'text': data})

#Tokenization
def tokenizer(words):
    return [token for token in nlp(words)]


#Lemmatization
def lemmatizer(token):
    return [lem.lemma_ for lem in token]


df['tokenizing'] = df['text'].apply(tokenizer)
df['lemmatization'] = df['tokenizing'].apply(lemmatizer)

#Check similarity
df.to_clipboard(sep='\s\s+')
df['check'] = df['tokenizing'].eq(df['lemmatization'])
df

How can I compare them? This is the DataFrame just before the error, copied with df.to_clipboard():

                                      text                                     tokenizing                         lemmatization
0  pergi untuk melakukan penanganan banjir  [pergi, untuk, melakukan, penanganan, banjir]  [pergi, untuk, laku, tangan, banjir]
1                       baca buku itu asik                        [baca, buku, itu, asik]               [baca, buku, itu, asik]

Update

The error is fixed; it was caused by a typo. But after fixing the typo, every value in the check column is False. What I want is the result shown in the table above.

caeruleum
  • Can you paste here what the dataframe looks like right before the error is called? just above the error, you can do a `df.to_clipboard()` and that'll copy it to your clipboard and you can paste in here for us. I copied the table you do have here, which I assume is your desired output, and I was able to run your df['check'] = df['tokenizing'].eq(df['lemmatization']) and it worked exactly fine for me. so i imagine you've got some other issue in that the dataframe doesn't look how you want before it gets there. – scotscotmcc Nov 30 '21 at 00:57
  • before error [imgur](https://imgur.com/a/DNabQR9) – caeruleum Nov 30 '21 at 01:34
  • the df being posted as an image is only of limited use to others. better is to post the actual text. you can edit your original post to include it. There is a great post on Stack Overflow on [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – scotscotmcc Nov 30 '21 at 01:38
  • thanks for the advice that's what i needed and have updated the post. – caeruleum Nov 30 '21 at 01:49

1 Answer


Based on your code, you left out an i in df['lemmatizaton'].

So change

df['lemmatizaton'] = df['tokenizing'].apply(lemmatizer)

to

df['lemmatization'] = df['tokenizing'].apply(lemmatizer)

Then it may work.
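
As for the all-False result mentioned in the question's update: a likely cause is that tokenizer() returns spaCy Token objects while lemmatizer() returns plain strings, so the two lists never compare equal even when they print the same. The sketch below is a pandas-only illustration, not tested against the spaCy pipeline. It assumes both columns hold lists of strings (copied here from the question's table) and compares them row by row. With spaCy, that would mean storing token.text in the 'tokenizing' column; the lemmatizer would then need the Token objects from somewhere else, for example a separate column.

```python
import pandas as pd

# String lists copied from the question's table. With spaCy, the tokenizer
# would need to return token.text for each token, e.g.
# [token.text for token in nlp(words)], to produce lists like these.
df = pd.DataFrame({
    'tokenizing': [
        ['pergi', 'untuk', 'melakukan', 'penanganan', 'banjir'],
        ['baca', 'buku', 'itu', 'asik'],
    ],
    'lemmatization': [
        ['pergi', 'untuk', 'laku', 'tangan', 'banjir'],
        ['baca', 'buku', 'itu', 'asik'],
    ],
})

# Compare the two lists in each row; list equality checks the length
# and every element in order.
df['check'] = [tok == lem for tok, lem in zip(df['tokenizing'], df['lemmatization'])]

print(df['check'].tolist())  # [False, True]
```

The list comprehension is used instead of .eq() only so that the per-row comparison is explicit; it gives one boolean per row, matching the check column in the desired table.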

AfterFray