Using pandas drop duplicates but doesn't correctly drop the duplicates

Question

First of all, I'm not sure whether this is drop_duplicates()'s fault or not.

What I want to do:
Import a file from CSV and run re.search on every row. If the row matches, keep it in one dictionary; if it doesn't match, keep it in another dictionary. Then make a graph from the lengths of the dictionary values.

The problem
I have 1000 rows in the CSV, but the result comes out as 1200.

My code

import pandas as pd
import re

# import data
filename = 'sample.csv'

# save data as data
data = pd.read_csv(filename, encoding='utf-8')

# dictionaries of texts grouped by label (True/False):
# wordNT/wordNF hold texts without the keyword, kaiT/kaiF hold texts with it
wordNT = {}
wordNF = {}
kaiT = {}
kaiF = {}

# put text into one of the four dictionaries based on match and label
def word_in_text(word,text,label):
    match = re.search(word,text)

    if match and label == True:
        kaiT.setdefault('text', []).append(text)
    elif match and label == False:
        kaiF.setdefault('text', []).append(text)
    elif label == True and not match:
        wordNT.setdefault('text', []).append(text)
    elif label == False and not match:
        wordNF.setdefault('text', []).append(text)

# iterate every text in data
for index, row in data.iterrows():
    word_in_text('foo', row['text'], row['label'])
    word_in_text('bar', row['text'], row['label'])

# make pandas data frame out of dict
wordTDf = pd.DataFrame.from_dict(wordNT)
wordFDf = pd.DataFrame.from_dict(wordNF)
kaiTDf = pd.DataFrame.from_dict(kaiT)
kaiFDf = pd.DataFrame.from_dict(kaiF)

# drop duplicates
wordTDf = wordTDf.drop_duplicates()
wordFDf = wordFDf.drop_duplicates()
kaiTDf = kaiTDf.drop_duplicates()
kaiFDf = kaiFDf.drop_duplicates()

# count how many 
wordTrueCount = len(wordTDf.index)
wordFalseCount = len(wordFDf.index)
kaiTrueCount = len(kaiTDf.index)
kaiFalseCount = len(kaiFDf.index)

print(wordTrueCount + wordFalseCount + kaiTrueCount + kaiFalseCount)

When I remove the line

word_in_text('bar', row['text'], row['label'])

and keep only

word_in_text('foo', row['text'], row['label'])

print(wordTrueCount + wordFalseCount + kaiTrueCount + kaiFalseCount) correctly returns 1000, and the same is true the other way around. But when I keep both calls, it returns 1200 when it should only be 1000.

CSV INPUT sample
text,label
"hey", TRUE
"halo", FALSE
"How are you?", TRUE

EXPECTED OUTPUT
1000

OUTPUT
1200

Adjust your expected output to match the example in input please. — user3471881, Jan 13 '19 at 14:21
Look here on how to write a *good* pandas question: https://stackoverflow.com/a/20159305/3471881 — user3471881, Jan 13 '19 at 14:23

score 0 · Accepted Answer · answered Jan 14 '19 at 02:51

In the function word_in_text, you update the four dicts: wordNT, wordNF, kaiT and kaiF.

And you call word_in_text twice for every row while iterating over the DataFrame:

# iterate every text in data
for index, row in data.iterrows():
    word_in_text('foo', row['text'], row['label'])
    word_in_text('bar', row['text'], row['label'])

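So the search results are a mix of the results for 'foo' and the results for 'bar'. Every row is classified twice, and a row that matches only one of the two keywords ends up in both a match dict and a no-match dict. drop_duplicates() only removes repeats within a single DataFrame, so that row is still counted twice in the total.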
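To illustrate the double counting, here is a minimal sketch that uses only the standard library. The three sample rows and the single `buckets` dict (standing in for the four global dicts in the question) are assumptions made up for this example, not the asker's data:

```python
import re

# Hypothetical sample rows: (text, label)
rows = [("foo one", True), ("bar two", False), ("neither", True)]

# One dict of lists standing in for the four global dicts in the question
buckets = {"kaiT": [], "kaiF": [], "wordNT": [], "wordNF": []}

def word_in_text(word, text, label):
    # Same branching as the question: match/no match crossed with label
    match = re.search(word, text)
    if match:
        key = "kaiT" if label else "kaiF"
    else:
        key = "wordNT" if label else "wordNF"
    buckets[key].append(text)

for text, label in rows:
    word_in_text("foo", text, label)
    word_in_text("bar", text, label)

# Deduplicate inside each bucket only, like drop_duplicates() per DataFrame
total = sum(len(set(texts)) for texts in buckets.values())
print(total)
```

With these three rows the printed total is 5, not 3: "foo one" sits in both kaiT and wordNT, and "bar two" sits in both kaiF and wordNF. Only "neither", which lands in wordNT twice, is collapsed by the per-bucket deduplication.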

Instead, you should reset the four dicts before starting a new search:

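def search(word):
    # word_in_text appends to the module-level dicts, so reset those;
    # without the global statement these would only be unused locals
    global wordNT, wordNF, kaiT, kaiF
    wordNT = {}
    wordNF = {}
    kaiT = {}
    kaiF = {}

    # iterate every text in data
    for index, row in data.iterrows():
        word_in_text(word, row['text'], row['label'])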

    # make pandas data frame out of dict
    wordTDf = pd.DataFrame.from_dict(wordNT)
    wordFDf = pd.DataFrame.from_dict(wordNF)
    kaiTDf = pd.DataFrame.from_dict(kaiT)
    kaiFDf = pd.DataFrame.from_dict(kaiF)

    # drop duplicates
    wordTDf = wordTDf.drop_duplicates()
    wordFDf = wordFDf.drop_duplicates()
    kaiTDf = kaiTDf.drop_duplicates()
    kaiFDf = kaiFDf.drop_duplicates()

    # count how many 
    wordTrueCount = len(wordTDf.index)
    wordFalseCount = len(wordFDf.index)
    kaiTrueCount = len(kaiTDf.index)
    kaiFalseCount = len(kaiFDf.index)

    print(wordTrueCount + wordFalseCount + kaiTrueCount + kaiFalseCount)

search('foo')
search('bar')
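As an alternative sketch (not part of the original answer), the same per-keyword counts can be computed without global dicts or iterrows(), using `Series.str.contains` and `groupby`. The helper name `count_buckets` and the sample data are hypothetical, and the sketch assumes that `label` is already boolean and that `text` has no missing values:

```python
import pandas as pd

def count_buckets(data, word):
    """Count unique texts per (matched, label) combination for one keyword."""
    # Deduplicating on (text, label) is equivalent to deduplicating each
    # bucket, because the match result depends only on the text
    unique = data.drop_duplicates(subset=["text", "label"])
    matched = unique["text"].str.contains(word, regex=True).rename("matched")
    return unique.groupby([matched, "label"]).size()

# Hypothetical sample data, including one exact duplicate row
data = pd.DataFrame({
    "text": ["foo one", "bar two", "neither", "neither"],
    "label": [True, False, True, True],
})

foo_counts = count_buckets(data, "foo")
bar_counts = count_buckets(data, "bar")
print(foo_counts)
print(bar_counts)
```

Each keyword gets its own result, so the counts for one keyword always sum to the number of unique rows, and the two results can be plotted separately.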

Yeah, I actually realized that when I tried to answer @user3471881's comment. I need the counts later to make a graph, so what I did instead was add `def word_in_text(word1, word2, text, label):` like this — Sae, Jan 14 '19 at 04:44
Hi @Sae if this or any answer has solved your question please consider [accepting it](https://meta.stackexchange.com/q/5234/179419) by clicking the check-mark. — keineahnung2345, Jan 15 '19 at 09:16
