Check whether two columns are the same with Pandas

Question

I want to check, row by row, whether the columns 'tokenizing' and 'lemmatization' are the same, as in the table below. But my code gives me an error.

tokenizing	lemmatization	check
[pergi, untuk, melakukan, penanganan, banjir]	[pergi, untuk, laku, tangan, banjir]	False
[baca, buku, itu, asik]	[baca, buku, itu, asik]	True

from spacy.lang.id import Indonesian
import pandas as pd

nlp = Indonesian()
nlp.add_pipe('lemmatizer')
nlp.initialize()

data = [
    'pergi untuk melakukan penanganan banjir',
    'baca buku itu asik'
]

df = pd.DataFrame({'text': data})

#Tokenization
def tokenizer(words):
    return [token for token in nlp(words)]


#Lemmatization
def lemmatizer(token):
    return [lem.lemma_ for lem in token]


df['tokenizing'] = df['text'].apply(tokenizer)
df['lemmatization'] = df['tokenizing'].apply(lemmatizer)

#Check similarity
df.to_clipboard(sep='\s\s+')
df['check'] = df['tokenizing'].eq(df['lemmatization'])
df

How can I compare them? This is the result of df.to_clipboard() just before the error:

                                      text                                     tokenizing                         lemmatization
0  pergi untuk melakukan penanganan banjir  [pergi, untuk, melakukan, penanganan, banjir]  [pergi, untuk, laku, tangan, banjir]
1                       baca buku itu asik                        [baca, buku, itu, asik]               [baca, buku, itu, asik]

Update

The error is fixed. It was caused by a typo. But after fixing the typo, the check column is False for every row. What I want is the result shown in the table above.

Can you paste here what the dataframe looks like right before the error is called? just above the error, you can do a `df.to_clipboard()` and that'll copy it to your clipboard and you can paste in here for us. I copied the table you do have here, which I assume is your desired output, and I was able to run your df['check'] = df['tokenizing'].eq(df['lemmatization']) and it worked exactly fine for me. so i imagine you've got some other issue in that the dataframe doesn't look how you want before it gets there. — scotscotmcc, Nov 30 '21 at 00:57
the df being posted as an image is only of limited use to others. better is to post the actual text. you can edit your original post to include it. There is a great post on Stack Overflow on [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — scotscotmcc, Nov 30 '21 at 01:38
thanks for the advice that's what i needed and have updated the post. — caeruleum, Nov 30 '21 at 01:49
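
The check described in the first comment can be reproduced from the pasted table alone. This is a minimal sketch, assuming every cell is a list of plain Python strings, which is what the printed table suggests but not necessarily what the real DataFrame holds:

```python
import pandas as pd

# Values copied from the table in the question, as plain strings.
df = pd.DataFrame({
    "tokenizing": [
        ["pergi", "untuk", "melakukan", "penanganan", "banjir"],
        ["baca", "buku", "itu", "asik"],
    ],
    "lemmatization": [
        ["pergi", "untuk", "laku", "tangan", "banjir"],
        ["baca", "buku", "itu", "asik"],
    ],
})

# Row-wise comparison: each cell is a list, and two lists are equal
# only when they hold equal items in the same order.
df["check"] = df["tokenizing"].eq(df["lemmatization"])
print(df["check"].tolist())  # [False, True]
```

With string cells the comparison already gives the desired table, so a different result on the real data points at what the cells contain rather than at `eq` itself.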

score 1 · Answer 1 · answered Nov 30 '21 at 01:29

Based on your code, you left out an i in df['lemmatizaton'].

So change

df['lemmatizaton'] = df['tokenizing'].apply(lemmatizer)

to

df['lemmatization'] = df['tokenizing'].apply(lemmatizer)

Then it may work.
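
To illustrate why the typo led to an error, here is a small sketch with made-up data (the two-row frame is an assumption for illustration only). Assigning to a misspelled name quietly creates a new column, and the failure only shows up later, when the correctly spelled column is read:

```python
import pandas as pd

df = pd.DataFrame({"tokenizing": [["baca", "buku"], ["itu", "asik"]]})

# Assigning to a misspelled name does not fail; it adds a new column.
df["lemmatizaton"] = df["tokenizing"]

# The correctly spelled column was never created, so reading it fails.
try:
    df["lemmatization"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(list(df.columns))  # ['tokenizing', 'lemmatizaton']
print(lookup_failed)     # True
```

Printing `df.columns` right before the failing line is a quick way to spot this kind of mismatch.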

AfterFray


Ahh typo, my bad. But, the result is all False and actually not same – caeruleum Nov 30 '21 at 01:37
@teapartyyyy with your given example for `df['tokenizing']` and `df['lemmatization']`, it returns index 0 as False, and index 1 as True. What is expected result? Do you want to compare similarity? – AfterFray Nov 30 '21 at 01:41
Yes, I want to compare similarity. But, after fixed my typo the result is all False [result](https://imgur.com/a/733UW4w) what I want is like my table on my post. – caeruleum Nov 30 '21 at 01:54
@teapartyyyy what is `type(df['lemmatization'][1])` and `type(df['tokenizing'][1])`? – AfterFray Nov 30 '21 at 02:26
both are `<class 'list'>` – caeruleum Nov 30 '21 at 06:15
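
A likely explanation for the all-False result: the `type(...)` checks asked for above look at the whole cell, and a cell is a list in both columns. The elements can still differ. In the question's code, `tokenizer` returns the token objects themselves, while `lemmatizer` returns strings. The sketch below uses `FakeToken`, a hypothetical stand-in class that is not part of spaCy, so that it runs without a language pipeline:

```python
import pandas as pd

class FakeToken:
    """Hypothetical stand-in for a token object; not a spaCy class."""
    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return self.text  # prints like a plain word, as in the question

texts = ["pergi untuk melakukan penanganan banjir", "baca buku itu asik"]
df = pd.DataFrame({
    "tokenizing": [[FakeToken(w) for w in t.split()] for t in texts],
    "lemmatization": [
        ["pergi", "untuk", "laku", "tangan", "banjir"],
        ["baca", "buku", "itu", "asik"],
    ],
})

# Both cells are lists, so checking the cell type hides the difference.
cell_types = (type(df["tokenizing"][1]), type(df["lemmatization"][1]))
# The element types differ: FakeToken versus str.
element_types = (type(df["tokenizing"][1][0]), type(df["lemmatization"][1][0]))

# A list of objects never equals a list of strings, so every row is False.
raw_check = df["tokenizing"].eq(df["lemmatization"]).tolist()

# Convert each token to its text first, then compare.
df["tokenizing_text"] = df["tokenizing"].apply(lambda toks: [t.text for t in toks])
df["check"] = df["tokenizing_text"].eq(df["lemmatization"])

print(raw_check)             # [False, False]
print(df["check"].tolist())  # [False, True]
```

If this is the cause, returning `token.text` instead of `token` inside the question's `tokenizer` should have the same effect, since spaCy tokens expose their string form through the `text` attribute.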
