I am working through a homework question that I am having considerable trouble with. Conceptually, I want to take a text file with ten categorical columns across 5000 records and binarize every unique feature in that dataset (which comes out to 232 unique features). I then want to build a NumPy array from this: a 5000 x 232 matrix containing only 0s and 1s. Next, I want to apply the same mapping to a test dataset of 1000 records. However, this test dataset has a few feature values (not columns) that were not in the original, and I cannot create the new 1000 x 232 binary matrix because I cannot come up with a way to ignore the unseen values.
The dataset looks like this:
45, Federal-gov, Bachelors, Married-civ-spouse, Adm-clerical, White, Male, 45, United-States, <=50K
33, Private, 5th-6th, Married-spouse-absent, Transport-moving, Other, Male, 20, El-Salvador, <=50K
19, Private, Some-college, Never-married, Transport-moving, White, Male, 40, United-States, <=50K
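To make the goal concrete, here is a small sketch of the binarization I am after, on two toy rows rather than the real file (the names toy_rows and feat_to_col are mine, not from my actual script):

```python
import numpy as np

# Two toy rows standing in for the real 5000-record file.
toy_rows = [
    ["45", "Federal-gov", "Bachelors"],
    ["33", "Private", "5th-6th"],
]

# Map each (column, value) pair to a unique column index.
feat_to_col = {}
indexed = []
for row in toy_rows:
    idx = []
    for j, x in enumerate(row):
        key = (j, x)
        if key not in feat_to_col:
            feat_to_col[key] = len(feat_to_col)  # next free column index
        idx.append(feat_to_col[key])
    indexed.append(idx)

# Build the 0/1 matrix: one row per record, one column per unique feature.
binary = np.zeros((len(toy_rows), len(feat_to_col)), dtype=int)
for i, idx in enumerate(indexed):
    binary[i, idx] = 1  # set the columns for this record's features
```

On the real data this would give the 5000 x 232 matrix, with exactly ten 1s per row (one per column of the original file).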
So far, I have read both files in and converted them to map objects, then created a dictionary whose keys are (column, value) pairs. This was applied to the training dataset to produce an all-numeric dataset where each record has a list of ten numbers associated with it. However, trying to do the same thing with the second dataset did not work.
This is the code I have used so far:
import numpy as np

train = map(lambda s: s.strip().split(", "), open('income.train.txt').readlines())
dev = map(lambda s: s.strip().split(", "), open('income.dev.txt').readlines())

mapping = {}
new_train = []
for row in train:
    new_row = []
    for j, x in enumerate(row):
        feature = (j, x)
        if feature not in mapping:
            mapping[feature] = len(mapping)  # assign the next free index
        new_row.append(mapping[feature])
    new_train.append(new_row)
Everything up to this point works; the next bit is what throws an error.
new_dev = []
for row in dev:
    new_row = []
    for j, x in enumerate(row):
        feature = (j, x)
        if feature not in mapping:
            feature = (None, None)  # substitute a placeholder for unseen values
        new_row.append(mapping[feature])  # KeyError: (None, None) raised here
    new_dev.append(new_row)
This works until it hits a value that was not in the mapping dictionary. (I know this because the partial new_dev dataset contains the first 125 records, so the 126th record is the first one with a new value.) The error is always raised on the lookup after the feature = (None, None) line: KeyError: (None, None). I have tried a few other placeholder values (a single None, (0, 0), etc.), all to no avail.
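For what it's worth, the failure is easy to reproduce in isolation (toy_mapping below is a stand-in for my real dictionary, with made-up entries): swapping in a placeholder key only moves the problem, because the placeholder was never inserted into the dictionary either.

```python
# Toy stand-in for the real mapping dictionary (hypothetical entries).
toy_mapping = {(0, "Private"): 0, (1, "Bachelors"): 1}

# Looking up (None, None) fails exactly like the original unseen key,
# since (None, None) was never added to the dictionary.
try:
    toy_mapping[(None, None)]
    lookup_failed = False
except KeyError:
    lookup_failed = True
```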
This has been very difficult for me to grasp, and I am sure I am missing something simple. Any help is very much appreciated.