I have data of the following format:

Col1   Col2       Col3
1,    1424549456, "3 4"
2,    1424549457, "2 3 4 5"

I have successfully read it into pandas.

How can I turn Col3 into a NumPy matrix of the following form:

# each value needs to become a 1 at the column index it names
# i.e. in the above example 3 maps to the 4th position, thus
# it is [0 0 0 1]  [columns are 0-indexed]
mtx = [0 0 0 1 1 0    # corresponds to first row
       0 0 1 1 1 1];  # corresponds to second row

Thanks for any help you can provide!

bge0

2 Answers

Since pandas 0.13.1 there's str.get_dummies:

In [11]: s = pd.Series(["3 4", "2 3 4 5"])

In [12]: s.str.get_dummies(sep=" ")
Out[12]:
   2  3  4  5
0  0  1  1  0
1  1  1  1  1

You have to ensure the columns are integers (rather than strings) and reindex:

In [13]: df = s.str.get_dummies(sep=" ")

In [14]: df.columns = df.columns.map(int)

In [15]: df.reindex(columns=np.arange(6), fill_value=0)
Out[15]:
   0  1  2  3  4  5
0  0  0  0  1  1  0
1  0  0  1  1  1  1

To get the NumPy values use .values (or .to_numpy() in pandas 0.24 and later):

In [16]: df.reindex(columns=np.arange(6), fill_value=0).values
Out[16]:
array([[0, 0, 0, 1, 1, 0],
       [0, 0, 1, 1, 1, 1]])
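
Putting the steps above together, here is a minimal sketch that starts from a DataFrame rather than a bare Series. The frame contents and the name `n_cols` are assumptions made for illustration; the width of 6 is taken from the question's expected output, not derived from the data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the sample data in the question.
df = pd.DataFrame({
    "Col1": [1, 2],
    "Col2": [1424549456, 1424549457],
    "Col3": ["3 4", "2 3 4 5"],
})

# One indicator column per distinct token found in Col3.
dummies = df["Col3"].str.get_dummies(sep=" ")

# Column labels come back as strings; convert them to integers.
dummies.columns = dummies.columns.map(int)

# Assumed width: the question's example has columns 0..5.
n_cols = 6

# Add the missing columns as zeros and convert to a NumPy array.
mtx = dummies.reindex(columns=np.arange(n_cols), fill_value=0).to_numpy()
print(mtx)
```

Because reindex orders the columns explicitly, it does not matter that get_dummies sorts the original labels as strings.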
Andy Hayden

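If there's not a lot of data, you can do something like

import numpy as np

res = []
def f(v):
    # one row of zeros, wide enough for column indices 0..5
    r = np.zeros(6, dtype=int)  # np.int was removed from NumPy; use int
    # on Python 3, map returns an iterator, so wrap it in list() to index with it
    r[list(map(int, v.split()))] = 1
    res.append(r)
df.Col3.apply(f)
mat = np.array(res)

# if you really want it to be a matrix, you can do
mat = np.matrix(res)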
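
As a variation on the same idea, the rows can be written straight into a preallocated array instead of being collected in a global list. This is only a sketch: `col3` is a hypothetical stand-in for `df.Col3`, and `n_cols` is assumed to be 6 as in the question's example.

```python
import numpy as np

# Hypothetical stand-in for df.Col3 from the question.
col3 = ["3 4", "2 3 4 5"]

# Assumed width: the question's example has columns 0..5.
n_cols = 6

# Start from all zeros, then set a 1 at every index named in each row.
mat = np.zeros((len(col3), n_cols), dtype=int)
for row, text in enumerate(col3):
    mat[row, [int(tok) for tok in text.split()]] = 1

print(mat)
```

This avoids relying on side effects inside apply, at the cost of an explicit Python loop over the rows.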

check out this link for more info

acushner