I have a .csv file with data, and I want to transform some of its columns to one-hot encoding. The problem occurs in the second-to-last line, where the 1 for the matching feature (e.g. the 1st feature) is set in all rows instead of just the row I am currently in. It seems to be a problem with how I access the 2D list. Any suggestions? Thank you.

def one_hot_encode(data_list, column):
    one_hot_list = [[]]
    different_elements = []

    for row in data_list[1:]:                  # count different elements
        if row[column] not in different_elements:
            different_elements.append(row[column])

    for i in range(len(different_elements)):   # set variable names
        one_hot_list[0].append(different_elements[i])

    vector = []                              # create list shape with zeroes
    for i in range(len(different_elements)):
        vector.append(0)
    for i in range(1460):
        one_hot_list.append(vector)

    ind_row = 1                                # encode 1 for each sample
    for row in data_list[1:]:
        index = different_elements.index(row[column])
        one_hot_list[ind_row][index] = 1     # mistake!! sets all rows to 1
        ind_row += 1
Martijn Pieters
  • There's still an indent error after the first `if` statement. – strubbly May 11 '17 at 10:55

4 Answers

Your problem stems from the vector object you're creating to do the one-hot encoding; you've created one object, and then built a one_hot_list that contains 1460 references to the same object. When you make a change in one of the rows, it will be reflected in all of the rows.

A quick solution is to create a separate copy of the vector for each row (see "How to clone or copy a list?"):

one_hot_list.append(vector[:])
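
The aliasing is easy to see in isolation. Here is a minimal sketch, using a made-up three-row list rather than the 1460 rows from the question:

```python
vector = [0, 0, 0]

# Appending the same object three times stores three references to one list.
shared = []
for _ in range(3):
    shared.append(vector)

shared[0][1] = 1
print(shared)                  # every row shows the change
print(shared[0] is shared[2])  # the rows are the very same object

# Appending a shallow copy gives each row its own list.
vector = [0, 0, 0]
independent = []
for _ in range(3):
    independent.append(vector[:])

independent[0][1] = 1
print(independent)             # only the first row changes
```

A shallow copy is enough here because the rows hold only integers; nested mutable values would need `copy.deepcopy`.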

Some of the other things you're doing in your function are a bit slow or roundabout. I'd suggest a few changes:

def one_hot_encode(data_list, column):
    one_hot_list = [[]]

    # count different elements
    different_elements = set(row[column] for row in data_list[1:])

    # convert different_elements to a list with a canonical order,
    # store in the first element of one_hot_list
    one_hot_list[0] = sorted(different_elements)

    vector = [0] * len(different_elements)   # create list shape with zeroes
    one_hot_list.extend([vector[:] for _ in range(1460)])

    # build a mapping of different_element values to indices into
    # one_hot_list[0]
    index_lookup = dict((e,i) for (i,e) in enumerate(one_hot_list[0]))
    # encode 1 for each sample
    for rindex, row in enumerate(data_list[1:], 1):
        cindex = index_lookup[row[column]]
        one_hot_list[rindex][cindex] = 1

    return one_hot_list

This builds different_elements in linear time by using the set data type, and sorted gives the values for one_hot_list[0] (the list of element values being one-hot encoded) a canonical order. The zero vector is built with list multiplication, and a list comprehension produces one_hot_list[1:] (the actual one-hot-encoded matrix). There is also a dict called index_lookup that lets you quickly map element values onto their integer index, instead of searching for them over and over again. Finally, your row index into the one_hot_list matrix can be managed for you by enumerate.
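
To try the approach on a small input, here is a condensed sketch of the same idea. Two things in it are my assumptions rather than part of the question: it sizes the matrix from the data instead of hard-coding 1460, and the sample `data` is made up. It returns the result so it can be inspected.

```python
def one_hot_encode(data_list, column):
    # Distinct values in the column, skipping the header row, in sorted order.
    categories = sorted(set(row[column] for row in data_list[1:]))
    index_lookup = {value: i for i, value in enumerate(categories)}

    # Header row followed by one fresh zero vector per data row.
    one_hot_list = [categories]
    one_hot_list.extend([0] * len(categories) for _ in data_list[1:])

    # Set the 1 for each sample; rindex starts at 1 to skip the header row.
    for rindex, row in enumerate(data_list[1:], 1):
        one_hot_list[rindex][index_lookup[row[column]]] = 1
    return one_hot_list


data = [
    ["id", "colour"],   # header row
    ["1", "red"],
    ["2", "blue"],
    ["3", "red"],
]
encoded = one_hot_encode(data, 1)
print(encoded)
```

The first row of `encoded` holds the category names, and each later row has a single 1 in the position of that sample's category.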

wildwilhelm

I'm not 100% sure of what you are trying to do but the problem you are seeing is in these lines:

for i in range(1460):
    one_hot_list.append(vector)

These lines build one_hot_list as 1460 references to the same vector of zeros, whereas I think you want a new vector each time. A direct fix would be to copy it each time:

for i in range(1460):
    one_hot_list.append(vector[:])

But a more Pythonic approach would be to create the list with a comprehension. Perhaps something like this:

vector_size = len(different_elements)
one_hot_list = [[0] * vector_size for i in range(1460)]
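
One caveat with this style: the comprehension evaluates `[0] * vector_size` once per row, so every row is a new list, whereas multiplying the outer list repeats a reference to a single inner list. A small sketch with made-up sizes:

```python
vector_size = 3
n_rows = 4

# The comprehension builds a new inner list on every iteration.
good = [[0] * vector_size for _ in range(n_rows)]
good[0][0] = 1

# Multiplying the outer list copies the reference, not the inner list.
bad = [[0] * vector_size] * n_rows
bad[0][0] = 1

print(good)  # only the first row is modified
print(bad)   # all rows are modified
```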
strubbly

You can use set() to collect the unique items in the list:

 different_elements = list(set(data[1:]))
Tobias
  • `different_elements` just keeps track of the values of a specific column, not the whole row. You want `different_elements = list(set(row[column] for row in data[1:]))`. – wildwilhelm May 11 '17 at 11:19
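
To illustrate the point in the comment above, here is a short sketch on made-up data. Rows read from a CSV are lists, which are unhashable, so they cannot go into a set directly, but the values of one column can. `dict.fromkeys` is shown as an alternative (my suggestion, not from the answer) that keeps first-seen order on Python 3.7+.

```python
data = [
    ["id", "colour"],   # header row
    ["1", "red"],
    ["2", "blue"],
    ["3", "red"],
]
column = 1

# set() over whole rows fails because lists are unhashable.
try:
    set(data[1:])
except TypeError as exc:
    print("TypeError:", exc)

# Collect only the chosen column; set order is arbitrary, so sort it.
different_elements = sorted(set(row[column] for row in data[1:]))
print(different_elements)

# Alternative: dict keys are unique and keep insertion order (Python 3.7+).
in_order = list(dict.fromkeys(row[column] for row in data[1:]))
print(in_order)
```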

I suggest you save yourself the hassle of re-implementing this in plain Python. You can use pandas.get_dummies for this:

First some test data (test.csv):

A
Foo
Bar
Baz

Then in Python:

import pandas as pd

df = pd.read_csv('test.csv')
# convert column 'A' to one-hot encoding
pd.get_dummies(df['A'])

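This produces a table with one indicator column per distinct value (Bar, Baz, Foo) and one row per sample.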

You can retrieve the underlying numpy array using:

pd.get_dummies(df['A']).values

Which results in:

array([[0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]], dtype=uint8)
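
For a version that runs without a CSV file, here is a sketch using an in-memory DataFrame with the same three values. Passing `dtype=int` is my addition: the default dtype of `get_dummies` differs between pandas versions (newer releases return booleans rather than `uint8`), and an explicit dtype keeps the result predictable.

```python
import pandas as pd

# Same three values as test.csv, built in memory.
df = pd.DataFrame({"A": ["Foo", "Bar", "Baz"]})

# One indicator column per distinct value; columns come out sorted.
dummies = pd.get_dummies(df["A"], dtype=int)
print(list(dummies.columns))
print(dummies.to_numpy().tolist())
```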
Matt