
Here is what I am trying to do. I have a CSV file with column 1 containing people's names (e.g. "Michael Jordan", "Anderson Silva", "Muhammad Ali") and column 2 containing their ethnicity (e.g. English, French, Chinese).

In my code, I create a pandas data frame from all the data, then create two additional data frames: one with only Chinese names and one with only non-Chinese names. From these I create separate name lists.

The three_split function extracts features from each name by splitting it into three-character substrings (after replacing spaces with underscores). For example, "Katy Perry" becomes "kat", "aty", "ty_", "y_p", and so on.

Then I train with Naive Bayes and finally test the results.

There aren't any errors when running my code, but when I use non-Chinese names taken directly from the dataset and expect the program to return False (not Chinese), it returns True (Chinese) for every name I test. Any idea?

import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.classify import PositiveNaiveBayesClassifier

# Get csv file into data frame
data = pd.read_csv("C:\Users\KubiK\Dropbox\Python exercises_KW\_Scraping\BeautifulSoup\FamilySearch.org\FamSearch_Analysis\OddNames_sampleData3.csv", 
    encoding="utf-8")
df = DataFrame(data)
df.columns = ["name", "ethnicity"]

# Recategorize different ethnicities into 1) Chinese or 2) non-Chinese; and then create separate lists
df_chinese = df[(df["ethnicity"] == "chinese") | (df["ethnicity"] == "Chinese")]
chinese_names = list(df_chinese["name"])

df_nonchinese = df[(df["ethnicity"] != "chinese") & (df["ethnicity"] != "Chinese") & (df["ethnicity"].notnull())]
nonchinese_names = list(df_nonchinese["name"])

# Function to split word string into three-character substrings
def three_split(word):
    word = str(word).lower().replace(" ", "_")
    split = 3
    return dict(("contains(%s)" % word[start:start+split], True) 
        for start in range(0, len(word)-2))
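# Example: three_split("Katy Perry") returns
# {"contains(kat)": True, "contains(aty)": True, "contains(ty_)": True,
#  "contains(y_p)": True, ...}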

# Train the Naive Bayes classifier
positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, unlabeled_featuresets)

# Testing results
name = "Hubert Gillies" # A non-Chinese name from the dataset
print(classifier.classify(three_split(name)))
>>> True # Wrong output
KubiK888

1 Answer


There could be many reasons why you don't get the desired results; most often it's one of the following:

  • Features are not strong enough
  • Not enough training data
  • Wrong classifier
  • Code bugs in NLTK classifiers

For the first three reasons, there's no way to verify/resolve this unless you post a link to your dataset so we can take a look at how to fix it. As for the last reason, there shouldn't be any for the basic NaiveBayes and PositiveNaiveBayes classifiers.

So the questions to ask are:

  • How many training data instances (i.e. rows) do you have?
  • Why didn't you normalize your labels (i.e. chinese|Chinese -> chinese) after reading the dataset and before extracting the features? (See the sketch after this list.)
  • What other features could you consider?
  • Have you considered using NaiveBayes instead of PositiveNaiveBayes?
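On the normalization and NaiveBayes points, here is a minimal sketch. The DataFrame contents are made up stand-ins for the real CSV, and three_split is the same feature extractor as in the question:

import pandas as pd
from nltk.classify import NaiveBayesClassifier

# Made-up stand-in for the CSV data.
df = pd.DataFrame({"name": ["Chun Li", "Jackie Chan", "Hubert Gillies"],
                   "ethnicity": ["Chinese", "chinese", "English"]})

# Normalize labels once, right after reading, so "Chinese" and "chinese"
# collapse into a single class.
df["ethnicity"] = df["ethnicity"].str.lower()
df["is_chinese"] = df["ethnicity"] == "chinese"

# Same feature extractor as in the question.
def three_split(word):
    word = str(word).lower().replace(" ", "_")
    return dict(("contains(%s)" % word[i:i + 3], True)
                for i in range(len(word) - 2))

# Plain NaiveBayes trains on labeled featuresets from BOTH classes, so the
# class prior is estimated from the data instead of assumed.
labeled_featuresets = [(three_split(name), label)
                       for name, label in zip(df["name"], df["is_chinese"])]
classifier = NaiveBayesClassifier.train(labeled_featuresets)
print(classifier.classify(three_split("Hubert Gillies")))

Note also that, if memory serves, PositiveNaiveBayesClassifier.train accepts a positive_prob_prior keyword that defaults to 0.5, i.e. it assumes half of all names are positive; on a heavily imbalanced dataset that default alone can pull predictions toward True.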
alvas
  • Thanks for the reply. I am very confused about why it is not giving me the "expected" outcome. I am relatively new to Python, so I built this in multiple stages. First I tested by simply splitting the entire name into "first name" and "last name" and using the classifier; it gave the expected results. Then I moved on to the three_split function using only a short list I typed in myself, e.g. chinese_names = ["Chun li", "Jackie Chan", ...]. It also gave the expected results, distinguishing Chinese names from non-Chinese names. – KubiK888 Apr 04 '15 at 02:27
  • Then I made this final version using a small sample of the full data. This small sample has >1000 names and corresponding ethnicities. There are about 50 Chinese names and the rest are non-Chinese, so probabilistically it should lean toward False, especially when I test the non-Chinese names, yet it still outputs True. I even printed positive_featuresets and unlabeled_featuresets, and the substring feature sets appear to have been created successfully. – KubiK888 Apr 04 '15 at 02:32
  • To answer your other questions, I didn't normalize the labels because I am doing initial testing, and I think the way I filter the ethnicities before making the lists is fine for now. I will consider other features, but this 3-letter-substring feature should be powerful enough to distinguish Chinese from non-Chinese names. I will consider NaiveBayes. – KubiK888 Apr 04 '15 at 02:34
  • As an update, I have tried using the full dataset (million+) and it seems more accurate (70-80%?). I am still confused, though, as to why, with the smaller sample, the results tended to be True (Chinese) even though the sample has about a 10:1 non-Chinese ratio and the names I tested are non-Chinese. – KubiK888 Apr 04 '15 at 04:45
  • And also, I have tried to find more references to learn how to use the PositiveNaiveBayesClassifier, but there is not much I can find. Any idea how I can learn to use this classifier to split the original full dataset 50:50 into training and test sets, and how I can perform accuracy validation (like sensitivity and specificity)? (See the sketch after these comments.) Thanks. – KubiK888 Apr 04 '15 at 04:47
  • Have you gone through this: https://www.coursera.org/course/ml ? This is helpful in understanding how basic machine learning works. It took me months to get through the materials as I repeated some lectures quite a few times but it's all worth it. – alvas Apr 04 '15 at 09:40
  • Thanks, I think I have gone through it once (not in depth), but what I find hard is actually executing machine learning in Python, as there are many modules, ways to pass parameters, etc. – KubiK888 Apr 04 '15 at 16:20
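Regarding the 50:50 split and accuracy validation asked about in the comments, here is a minimal sketch. It uses the plain NaiveBayesClassifier (which fits labeled data from both classes) rather than PositiveNaiveBayesClassifier, and labeled_names is a made-up stand-in for (name, is_chinese) pairs built from the real CSV:

import random
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# Made-up stand-in for (name, is_chinese) pairs built from the full dataset.
labeled_names = [("Chun Li", True), ("Jackie Chan", True),
                 ("Hubert Gillies", False), ("Anderson Silva", False)]

# Same feature extractor as in the question.
def three_split(word):
    word = str(word).lower().replace(" ", "_")
    return dict(("contains(%s)" % word[i:i + 3], True)
                for i in range(len(word) - 2))

featuresets = [(three_split(name), label) for name, label in labeled_names]
random.shuffle(featuresets)

# 50:50 split into training and test sets.
half = len(featuresets) // 2
train_set, test_set = featuresets[:half], featuresets[half:]

classifier = NaiveBayesClassifier.train(train_set)
print("accuracy:", accuracy(classifier, test_set))

# Sensitivity and specificity from per-class counts on the test set.
tp = fn = tn = fp = 0
for feats, label in test_set:
    guess = classifier.classify(feats)
    if label and guess:
        tp += 1
    elif label and not guess:
        fn += 1
    elif not label and not guess:
        tn += 1
    else:
        fp += 1
print("sensitivity:", tp / float(tp + fn) if (tp + fn) else None)
print("specificity:", tn / float(tn + fp) if (tn + fp) else None)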