I am working through a homework question that I am having considerable trouble with. Conceptually, I want to take a text file with ten categorical columns across 5000 records and binarize every unique feature in that dataset (which comes out to 232 unique features). I then want to build a NumPy array from this: a 5000 x 232 matrix containing only 0s and 1s. Next, I want to apply the same mapping to a test dataset of 1000 records. However, this test dataset has a few feature values (not columns) that were not in the original, and I cannot create the new 1000 x 232 binary matrix because I cannot come up with a way to ignore the unseen values.
The dataset looks like this:
45, Federal-gov, Bachelors, Married-civ-spouse, Adm-clerical, White, Male, 45, United-States, <=50K
33, Private, 5th-6th, Married-spouse-absent, Transport-moving, Other, Male, 20, El-Salvador, <=50K
19, Private, Some-college, Never-married, Transport-moving, White, Male, 40, United-States, <=50K
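To make the goal concrete, here is a small sketch of the binarization I am after, on two toy rows rather than the real file (the names toy_rows and feat_to_col are mine, not from my actual script):

```python
import numpy as np

# Two toy rows standing in for the real 5000-record file.
toy_rows = [
    ["45", "Federal-gov", "Bachelors"],
    ["33", "Private", "5th-6th"],
]

# Map each (column, value) pair to a unique column index.
feat_to_col = {}
indexed = []
for row in toy_rows:
    idx = []
    for j, x in enumerate(row):
        key = (j, x)
        if key not in feat_to_col:
            feat_to_col[key] = len(feat_to_col)  # next free column index
        idx.append(feat_to_col[key])
    indexed.append(idx)

# Build the 0/1 matrix: one row per record, one column per unique feature.
binary = np.zeros((len(toy_rows), len(feat_to_col)), dtype=int)
for i, idx in enumerate(indexed):
    binary[i, idx] = 1  # set the columns for this record's features
```

On the real data this would give the 5000 x 232 matrix, with exactly ten 1s per row (one per column of the original file).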
So far, I have read both files in and converted them to map objects, then created a dictionary whose keys are (column, value) pairs. This was applied to the training dataset to produce an all-numeric dataset where each record has a list of ten numbers associated with it. However, trying to do the same thing with the second dataset did not work.
This is the code I have used so far:
import numpy as np

train = map(lambda s: s.strip().split(", "), open('income.train.txt').readlines())
dev = map(lambda s: s.strip().split(", "), open('income.dev.txt').readlines())

mapping = {}
new_train = []
for row in train:
    new_row = []
    for j, x in enumerate(row):
        feature = (j, x)
        if feature not in mapping:
            mapping[feature] = len(mapping)  # assign the next free index
        new_row.append(mapping[feature])
    new_train.append(new_row)
Everything up to this point works; the next bit is what throws an error.
new_dev = []
for row in dev:
    new_row = []
    for j, x in enumerate(row):
        feature = (j, x)
        if feature not in mapping:
            feature = (None, None)  # substitute a placeholder for unseen values
        new_row.append(mapping[feature])  # KeyError: (None, None) raised here
    new_dev.append(new_row)
This works until it hits a value that was not in the mapping dictionary. (I know this because the partial new_dev dataset contains the first 125 records, so the 126th record is the first one with a new value.) The error is always raised on the lookup after the feature = (None, None) line: KeyError: (None, None). I have tried a few other placeholder values (a single None, (0, 0), etc.), all to no avail.
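For what it's worth, the failure is easy to reproduce in isolation (toy_mapping below is a stand-in for my real dictionary, with made-up entries): swapping in a placeholder key only moves the problem, because the placeholder was never inserted into the dictionary either.

```python
# Toy stand-in for the real mapping dictionary (hypothetical entries).
toy_mapping = {(0, "Private"): 0, (1, "Bachelors"): 1}

# Looking up (None, None) fails exactly like the original unseen key,
# since (None, None) was never added to the dictionary.
try:
    toy_mapping[(None, None)]
    lookup_failed = False
except KeyError:
    lookup_failed = True
```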
This has been very difficult for me to grasp, and I am sure I am missing something simple. Any help is very much appreciated.