I have a 2D NumPy array called my_data. Each row holds the information for one data point, and each column is a different attribute of that data point.

I have a function called processRow. It takes a row, processes it, and returns the modified row. The returned row is longer than the input row (the function basically expands some categorical data into one-hot vectors).

How can I build a NumPy array in which every row has been processed by this function?

I tried

answer = np.array([])
for row in my_data:
    answer = np.append(answer,processRow(row))

but at the end, answer is just a single really long row rather than a 2D grid.

quantumbutterfly

2 Answers

You can use np.vstack instead of np.append: without an axis argument, np.append flattens its inputs, which is why you end up with a single long row. You also need to give answer an explicit 2D shape up front:

In [11]: my_data = np.array([[1, 2], [3, 4]])
    ...: process_row = lambda x: x  # do nothing

In [12]: answer = np.empty((0, 2), dtype='int64')
    ...: for row in my_data:
    ...:     answer = np.vstack([answer, process_row(row)])
    ...:

In [13]: answer
Out[13]:
array([[ 1,  2],
       [ 3,  4]])
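
To see the difference between the two calls side by side, here is a minimal sketch on a small made-up array (the names rows, flat and stacked are only for illustration). It assumes nothing about your real process_row and simply appends the rows unchanged:

```python
import numpy as np

rows = np.array([[1, 2], [3, 4]])  # made-up sample data

# Without an axis argument, np.append flattens its inputs,
# so the accumulated result stays 1-D.
flat = np.empty(0, dtype=rows.dtype)
for row in rows:
    flat = np.append(flat, row)

# np.vstack treats each 1-D row as one row of a 2-D result.
stacked = np.empty((0, 2), dtype=rows.dtype)
for row in rows:
    stacked = np.vstack([stacked, row])

print(flat.shape)     # (4,)
print(stacked.shape)  # (2, 2)
```

Both loops copy the accumulated array on every iteration, so neither scales well to many rows.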

However, you're probably better off building the rows with a list comprehension and passing the result to np.array afterwards, which avoids copying answer on every iteration:

In [21]: np.array([process_row(row) for row in my_data])
Out[21]:
array([[1, 2],
       [3, 4]])
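
Since the real process_row isn't shown, here is a rough sketch of the same pattern with a row that actually grows. The process_row below is a hypothetical stand-in: it assumes the last column is an integer category index and that there are N_CATEGORIES = 3 categories, neither of which comes from the question.

```python
import numpy as np

N_CATEGORIES = 3  # assumed number of categories, for illustration only


def process_row(row):
    """Hypothetical stand-in: replace the last column (a category index)
    with a one-hot vector and keep the other columns unchanged."""
    one_hot = np.zeros(N_CATEGORIES, dtype=row.dtype)
    one_hot[int(row[-1])] = 1
    return np.concatenate([row[:-1], one_hot])


my_data = np.array([[10, 0], [20, 2], [30, 1]])  # made-up sample data

# Collect every processed row in a list, then convert to an array once.
answer = np.array([process_row(row) for row in my_data])

print(answer.shape)  # (3, 4)
print(answer)
```

This only gives a regular 2D array if every processed row has the same length, which holds here because each row gains the same number of one-hot columns.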
Andy Hayden
  • You may want to do this in cython, or pandas, to be more performant, but it's unclear what the best strategy is without more information on process_row. – Andy Hayden May 06 '18 at 21:52
  • I tried this but it was way too slow. It ran for 10 minutes before I quit it. I think `answer = np.vstack([answer, process_row(row)])` takes linear time because it copies `answer` each time, making it take O(n^2) time overall. @Aklys's solution runs much faster, so I'm accepting his solution – quantumbutterfly May 06 '18 at 22:32
  • @quantumbutterfly yes, his answer is the same as the one at the end of mine ("you're probably better off doing a list comprehension"), only more verbose. – Andy Hayden May 06 '18 at 22:35
  • Oh whoops, overlooked that part. Sorry about that – quantumbutterfly May 06 '18 at 22:38

I'm not sure I fully understood what you're after without seeing a sample of the data, but hopefully this helps you get to the result you want. I simplified the concept: the function adds one to each value in the row passed to it, then appends the sum of the first two results (just to make the returned row longer than the input). Of course, you can adjust the processing to whatever you need.

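import numpy as np

def funky(x):
    # Add 1 to every value in the row.
    temp = []
    for value in x:
        value += 1
        temp.append(value)
    # Append the sum of the first two results so the returned row is longer.
    temp.append(temp[0] + temp[1])
    return np.array(temp)

my_data = np.array([[1, 1], [2, 2]])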

answer = np.apply_along_axis(funky, 1, my_data)
print("This is the original data:\n{}".format(my_data))
print("This is the adjusted data:\n{}".format(answer))

Below is the before and after of the array modification:

This is the original data:
[[1 1]
 [2 2]]
This is the adjusted data:
[[2 2 4]
 [3 3 6]]
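
As a quick sanity check, the sketch below compares np.apply_along_axis with a plain list comprehension over the rows, using the same funky function (rewritten slightly more compactly; the names via_apply and via_list are only for illustration). On this small example the two should give identical arrays:

```python
import numpy as np


def funky(x):
    # Same behaviour as the function above: add 1 to each value,
    # then append the sum of the first two results.
    temp = [value + 1 for value in x]
    temp.append(temp[0] + temp[1])
    return np.array(temp)


my_data = np.array([[1, 1], [2, 2]])

# Apply funky to each row (axis 1) in two different ways.
via_apply = np.apply_along_axis(funky, 1, my_data)
via_list = np.array([funky(row) for row in my_data])

print(np.array_equal(via_apply, via_list))  # True
```

np.apply_along_axis still calls the function once per row in Python, so it is mainly a convenience rather than a vectorized speed-up.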
Aklys