one-hot encoding, access list elements

Question

I have a .csv file and want to transform some of its columns to one-hot encoding. The problem occurs in the second-to-last line, where the one-hot index (e.g. the 1st feature) gets set in every row instead of only the row I am currently processing. It seems to be a problem with how I access the 2D list. Any suggestions? Thank you.

def one_hot_encode(data_list, column):
    one_hot_list = [[]]
    different_elements = []

    for row in data_list[1:]:                  # count different elements
        if row[column] not in different_elements:
            different_elements.append(row[column])

    for i in range(len(different_elements)):   # set variable names
        one_hot_list[0].append(different_elements[i])

    vector = []                              # create list shape with zeroes
    for i in range(len(different_elements)):
        vector.append(0)
    for i in range(1460):
        one_hot_list.append(vector)

    ind_row = 1                                # encode 1 for each sample
    for row in data_list[1:]:
        index = different_elements.index(row[column])
        one_hot_list[ind_row][index] = 1     # mistake!! sets all rows to 1
        ind_row += 1

There's still an indent error after the first `if` statement. — strubbly, May 11 '17 at 10:55
Hi, if any answer below has solved your question please consider [accepting it](https://meta.stackexchange.com/q/5234/179419) by clicking the check-mark next to it. This indicates to the wider community that you've found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do this. — Lafexlos, May 11 '17 at 11:46
I removed the word "(solved)" from the question title. (Most future users who have issues are not going to search for their problem using a term of "solved"). The correct way of saying that your problem is solved is by accepting one of the answers - see the link in Lafexlos' comment. — YowE3K, May 11 '17 at 20:06

score 0 · Accepted Answer · edited May 23 '17 at 12:17

Your problem stems from the vector object you're creating to do the one-hot encoding; you've created one object, and then built a one_hot_list that contains 1460 references to the same object. When you make a change in one of the rows, it will be reflected in all of the rows.

A quick solution is to create a separate copy of the vector for each row (see "How to clone or copy a list?"):

one_hot_list.append(vector[:])
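
To make the difference concrete, here is a minimal sketch of the aliasing problem and the slice-copy fix. The names `shared` and `copied` and the 3-row size are made up for illustration; they are not from the question.

```python
# Illustrative names only: `shared` and `copied` are not from the question.
vector = [0, 0, 0]

# Storing the same object three times gives three references to one list.
shared = [vector for _ in range(3)]
shared[0][1] = 1          # the write is visible through every "row"

# vector[:] builds a new (shallow) list each time, so rows are independent.
vector = [0, 0, 0]
copied = [vector[:] for _ in range(3)]
copied[0][1] = 1          # only the first row changes

print(shared)
print(copied)
```

A shallow copy is enough here because the rows only hold integers; nested mutable values would need `copy.deepcopy`.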

Some of the other things you're doing in your function are a bit slow or roundabout. I'd suggest a few changes:

def one_hot_encode(data_list, column):
    one_hot_list = [[]]

    # count different elements
    different_elements = set(row[column] for row in data_list[1:])

    # convert different_elements to a list with a canonical order,
    # store in the first element of one_hot_list
    one_hot_list[0] = sorted(different_elements)

    vector = [0] * len(different_elements)   # create list shape with zeroes
    one_hot_list.extend([vector[:] for _ in range(1460)])

    # build a mapping of different_element values to indices into
    # one_hot_list[0]
    index_lookup = dict((e,i) for (i,e) in enumerate(one_hot_list[0]))
    # encode 1 for each sample
    for rindex, row in enumerate(data_list[1:], 1):
        cindex = index_lookup[row[column]]
        one_hot_list[rindex][cindex] = 1

    return one_hot_list

This builds different_elements in linear time by using the set data type. sorted() turns that set into the list stored in one_hot_list[0] (the element values being one-hot encoded), list multiplication creates the zero vector, and a list comprehension produces one_hot_list[1:] (the actual one-hot-encoded matrix) by copying the vector once per row. There's also a dict called index_lookup that lets you quickly map element values onto their integer index, instead of searching for them over and over again. Finally, your row index into the one_hot_list matrix can be managed for you by enumerate.
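
As a rough, self-contained sketch of the same idea, the variant below derives the number of rows from the input instead of hard-coding 1460 and returns the result. The function name `one_hot_encode_sketch` and the sample data are hypothetical, chosen only so the example can be run without the original .csv file.

```python
def one_hot_encode_sketch(data_list, column):
    """Hypothetical variant: returns the header row plus one row per sample."""
    rows = data_list[1:]                       # skip the CSV header row
    categories = sorted({row[column] for row in rows})
    index_lookup = {value: i for i, value in enumerate(categories)}

    one_hot_list = [categories]
    for row in rows:
        vector = [0] * len(categories)         # a fresh list for every row
        vector[index_lookup[row[column]]] = 1
        one_hot_list.append(vector)
    return one_hot_list


# Made-up sample data standing in for the .csv contents.
sample = [
    ["id", "colour"],
    ["1", "red"],
    ["2", "green"],
    ["3", "red"],
]
encoded = one_hot_encode_sketch(sample, 1)
print(encoded)
```

Building each vector inside the loop means there is never a shared zero vector to copy, which sidesteps the aliasing issue entirely.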

score 0 · Answer 2 · answered May 11 '17 at 11:00

I'm not 100% sure of what you are trying to do but the problem you are seeing is in these lines:

for i in range(1460):
    one_hot_list.append(vector)

These lines fill one_hot_list with 1460 references to the same vector of zeros, whereas I think you want a new vector each time. A direct fix would be to copy it each time:

for i in range(1460):
    one_hot_list.append(vector[:])

But a more Pythonic approach would be to create the list with a comprehension. Perhaps something like this:

vector_size = len(different_elements)
one_hot_list = [[0] * vector_size for i in range(1460)]
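
A small sketch of why the comprehension matters, assuming stand-in sizes (`vector_size = 3`, `n_rows = 4`) rather than the real data. It contrasts the comprehension with multiplying the outer list, which looks similar but reintroduces the shared-reference problem.

```python
vector_size = 3   # stand-in for len(different_elements)
n_rows = 4        # stand-in for the 1460 rows in the question

# The comprehension evaluates [0] * vector_size once per iteration,
# so every row is a separate list.
independent = [[0] * vector_size for _ in range(n_rows)]

# Multiplying the outer list repeats a reference to one inner list.
aliased = [[0] * vector_size] * n_rows

independent[0][0] = 1
aliased[0][0] = 1

# Count how many distinct list objects each structure really holds.
distinct_rows = len({id(row) for row in independent})
aliased_rows = len({id(row) for row in aliased})
print(distinct_rows, aliased_rows)
```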

score 0 · Answer 3 · answered May 11 '17 at 11:00

You can use set() to collect the unique items in the list:

 different_elements = list(set(data[1:]))

Tobias

`different_elements` just keeps track of the values of a specific column, not the whole row. You want `different_elements = list(set(row[column] for row in data[1:]))`. – wildwilhelm May 11 '17 at 11:19
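
A short sketch of the point made in the comment, using made-up sample data. Assuming each row is a list, as in the question, passing the rows themselves to `set()` fails because lists are not hashable, while collecting the values of one column works. The `sorted` call is an addition here so that the order is reproducible.

```python
# Made-up sample data; the first row is the header.
data = [
    ["id", "colour"],
    ["1", "red"],
    ["2", "green"],
    ["3", "red"],
]
column = 1

# Whole rows are lists, which cannot be placed in a set.
try:
    set(data[1:])
    rows_hashable = True
except TypeError:
    rows_hashable = False

# Collect the distinct values of a single column instead.
different_elements = sorted(set(row[column] for row in data[1:]))
print(rows_hashable, different_elements)
```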

score 0 · Answer 4 · answered May 11 '17 at 11:49

I suggest you save yourself the hassle of re-implementing this in plain Python. You can use pandas.get_dummies for this:

First some test data (test.csv):

A
Foo
Bar
Baz

Then in Python:

import pandas as pd

df = pd.read_csv('test.csv')
# convert column 'A' to one-hot encoding
pd.get_dummies(df['A'])

You can retrieve the underlying numpy array using:

pd.get_dummies(df['A']).values

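Which results in the following (pandas 2.0 and later default to a bool dtype, so pass dtype='uint8' to get_dummies to reproduce this exactly):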

array([[0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]], dtype=uint8)
