
First of all, I'm not sure whether this is drop_duplicates()'s fault or not.


What I want to do:
Import a CSV file and run re.search on every row. If a row matches, keep it in one dictionary; if it doesn't match, keep it in another dictionary. Then make a graph from the lengths of the dictionary values.


The problem
My CSV has 1000 rows, but the total count comes out as 1200.


My code

import pandas as pd
import re

# import data
filename = 'sample.csv'

# save data as data
data = pd.read_csv(filename, encoding='utf-8')

# create new dictionary for word that is true and false 
# but doesn't have the keyword in items
wordNT = {}
wordNF = {}
kaiT = {}
kaiF = {}

# file the text into one of the four dicts, based on (match, label)
def word_in_text(word,text,label):
    match = re.search(word,text)

    if match and label == True:
        kaiT.setdefault('text', []).append(text)
    elif match and label == False:
        kaiF.setdefault('text', []).append(text)
    elif label == True and not match:
        wordNT.setdefault('text', []).append(text)
    elif label == False and not match:
        wordNF.setdefault('text', []).append(text)

# iterate every text in data
for index, row in data.iterrows():
    word_in_text('foo', row['text'], row['label'])
    word_in_text('bar', row['text'], row['label'])

# make pandas data frame out of dict
wordTDf = pd.DataFrame.from_dict(wordNT)
wordFDf = pd.DataFrame.from_dict(wordNF)
kaiTDf = pd.DataFrame.from_dict(kaiT)
kaiFDf = pd.DataFrame.from_dict(kaiF)

# drop duplicates
wordTDf = wordTDf.drop_duplicates()
wordFDf = wordFDf.drop_duplicates()
kaiTDf = kaiTDf.drop_duplicates()
kaiFDf = kaiFDf.drop_duplicates()

# count how many 
wordTrueCount = len(wordTDf.index)
wordFalseCount = len(wordFDf.index)
kaiTrueCount = len(kaiTDf.index)
kaiFalseCount = len(kaiFDf.index)

print(wordTrueCount + wordFalseCount + kaiTrueCount + kaiFalseCount)


When I removed the line

word_in_text('bar', row['text'], row['label'])

and only keep

word_in_text('foo', row['text'], row['label'])


print(wordTrueCount + wordFalseCount + kaiTrueCount + kaiFalseCount) correctly prints 1000, and the same happens if I keep only the 'bar' call instead. But with both calls in place it prints 1200, when it should only be 1000. Why?


CSV INPUT sample
text,label
"hey", TRUE
"halo", FALSE
"How are you?", TRUE


EXPECTED OUTPUT
1000


OUTPUT
1200

Sae

1 Answer


In the function word_in_text, you update the four dicts: wordNT, wordNF, kaiT and kaiF.

And you call word_in_text twice while iterating the dataframe:

# iterate every text in data
for index, row in data.iterrows():
    word_in_text('foo', row['text'], row['label'])
    word_in_text('bar', row['text'], row['label'])

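So each dict ends up holding a mix of the results for 'foo' and the results for 'bar'. A row that matches only one of the two keywords is appended to a kai dict on one call and to a word dict on the other. drop_duplicates() only removes repeats within a single DataFrame, so such a row is counted twice, which is consistent with a total above 1000.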
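To make the double counting concrete, here is a minimal sketch using only the standard library. The three sample texts are made up for illustration, the labels are left out for brevity, and the two sets stand in for the kai and word DataFrames after drop_duplicates():

```python
import re

# Hypothetical three-row sample; the real data comes from sample.csv.
texts = ["foo only", "foo and bar", "neither"]

matched = set()    # stands in for kaiT / kaiF after drop_duplicates()
unmatched = set()  # stands in for wordNT / wordNF after drop_duplicates()

# Same shape as the original loop: every text is tested once per keyword.
for text in texts:
    for word in ("foo", "bar"):
        if re.search(word, text):
            matched.add(text)
        else:
            unmatched.add(text)

total = len(matched) + len(unmatched)
print(len(texts), total)        # 3 rows in, 4 counted
print(matched & unmatched)      # {'foo only'}: the row that was counted twice
```

Only the text that matches exactly one keyword shows up in both sets; texts that match both keywords or neither are counted once.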

Instead, you should reset the four dicts before starting each new search:

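def search(text):
    # word_in_text() appends to the module-level dicts, so reset those
    # rather than creating local dicts that it would never see
    global wordNT, wordNF, kaiT, kaiF
    wordNT = {}
    wordNF = {}
    kaiT = {}
    kaiF = {}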

    # iterate every text in data
    for index, row in data.iterrows():
        word_in_text(text, row['text'], row['label'])

    # make pandas data frame out of dict
    wordTDf = pd.DataFrame.from_dict(wordNT)
    wordFDf = pd.DataFrame.from_dict(wordNF)
    kaiTDf = pd.DataFrame.from_dict(kaiT)
    kaiFDf = pd.DataFrame.from_dict(kaiF)

    # drop duplicates
    wordTDf = wordTDf.drop_duplicates()
    wordFDf = wordFDf.drop_duplicates()
    kaiTDf = kaiTDf.drop_duplicates()
    kaiFDf = kaiFDf.drop_duplicates()

    # count how many 
    wordTrueCount = len(wordTDf.index)
    wordFalseCount = len(wordFDf.index)
    kaiTrueCount = len(kaiTDf.index)
    kaiFalseCount = len(kaiFDf.index)

    print(wordTrueCount + wordFalseCount + kaiTrueCount + kaiFalseCount)

search('foo')
search('bar')
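If what you need is a single set of four counts covering both keywords (for the graph), another option is to test each row exactly once against a combined pattern. The following is only a sketch: the inline DataFrame is a made-up stand-in for pd.read_csv('sample.csv'), the `match` column name is my own, and it assumes the `label` column is read as booleans.

```python
import pandas as pd

# Hypothetical stand-in for: data = pd.read_csv('sample.csv', encoding='utf-8')
data = pd.DataFrame({
    "text": ["foo only", "foo and bar", "neither", "bar here"],
    "label": [True, True, False, False],
})

# One regex covering both keywords, so every row is tested exactly once.
data["match"] = data["text"].str.contains("foo|bar", regex=True)

# Number of rows for each (match, label) combination.
# Every row falls into exactly one group, so the counts sum to len(data).
counts = data.groupby(["match", "label"]).size()

print(counts)
print(counts.sum())  # 4, the same as len(data)
```

Because each row lands in exactly one group, the total can never exceed the number of rows. If duplicate texts should be ignored as in the original code, `data = data.drop_duplicates()` can be applied before counting.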
keineahnung2345
  • Yeah, I actually realized that when I tried to answer @user3471881's comment. I need the count later to make a graph, so what I did instead was add `def word_in_text(word1, word2, text, label):` like this – Sae Jan 14 '19 at 04:44
  • Hi @Sae if this or any answer has solved your question please consider [accepting it](https://meta.stackexchange.com/q/5234/179419) by clicking the check-mark. – keineahnung2345 Jan 15 '19 at 09:16