Let's say I have an array

import numpy as np
arr = np.array([['a ', 'b ', 'c'], ['d ', 'e ', 'f']])

and I want to turn it into

[['a b c'], ['d e f']]

using only vectorized operations. What's the most efficient way of doing it?

Kliker
  • You can try a combination of `for` loops and `join()`. – Marcelo Paco Apr 14 '23 at 16:53
  • If you're asking for the most efficient way, which approaches have you tried? – CristiFati Apr 14 '23 at 17:02
  • Also, *NumPy* is fast for numbers; I don't know about strings, so it might not be suitable here. Anyway, check https://stackoverflow.com/q/49788027/4788546. – CristiFati Apr 14 '23 at 17:09
  • `numpy` doesn't add anything to a pure list and string operation: `[''.join(i) for i in arr.tolist()]`. Feel free to do your own timings on realistically sized arrays. `np.char` has some functions that apply string methods to the elements of arrays, but I haven't found anything that works for this. Plus, those functions don't offer any speed benefits. – hpaulj Apr 14 '23 at 18:15
  • I had done a simple for loop: `for triple in str_triples: triple = ' '.join(triple); pre_processed.append(triple)`. I only tried it on a small amount of data (about 30k rows), but I need it to work on potentially millions of rows, so I was a bit wary of just using a for loop for this use case. – Kliker Apr 14 '23 at 18:28
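
As a rough sketch of the list-comprehension approach suggested in the comments above: the helper name `join_rows` and the size of the synthetic test array are illustrative assumptions, not something from the original posts. The timing call is only there so you can measure on your own machine.

```python
import timeit

import numpy as np

arr = np.array([['a ', 'b ', 'c'], ['d ', 'e ', 'f']])


def join_rows(a, sep=''):
    """Join each row of a 2-D string array into one Python string."""
    # tolist() converts NumPy strings to plain Python str objects first,
    # so str.join works on ordinary lists.
    return [sep.join(row) for row in a.tolist()]


joined = join_rows(arr)
print(joined)  # ['a b c', 'd e f']

# Rough timing on a larger synthetic array (size chosen arbitrarily).
big = np.tile(arr, (100_000, 1))  # shape (200000, 3)
print(timeit.timeit(lambda: join_rows(big), number=3))
```

If a column-shaped array is needed, as in the question, `np.array(joined)[:, None]` turns the flat list into shape `(n, 1)`.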

1 Answer

Cast the array to Python's `object` dtype so that `np.sum` reduces each row with Python's string `+`, i.e. concatenation:

arr = np.sum(arr.astype(object), axis=1, keepdims=True)

[['a b c']
 ['d e f']]
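
For reference, a self-contained sketch of the same idea, assuming the example array from the question. The final `astype(str)` step is an optional addition, only needed if a fixed-width NumPy string dtype is wanted instead of `object`.

```python
import numpy as np

arr = np.array([['a ', 'b ', 'c'], ['d ', 'e ', 'f']])

# With dtype=object the elements are Python str objects, so the
# reduction along axis 1 concatenates them with str's "+".
joined = np.sum(arr.astype(object), axis=1, keepdims=True)
print(joined.shape, joined.dtype)  # (2, 1) object

# Optional: convert back to a fixed-width NumPy string dtype.
joined_str = joined.astype(str)
print(joined_str.tolist())  # [['a b c'], ['d e f']]
```

Note that the reduction still calls Python's string addition once per element pair, so it is worth timing against a plain `join` on realistically sized data.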
RomanPerekhrest
  • What a great solution, thank you! Could something like this be used here? `arr_triples = train.triples; column1 = np.char.add(arr_triples[:, :2], ' '); column2 = arr_triples[:, 2:]; str_triples = np.concatenate((column1, column2), axis=1).astype('str'); del arr_triples; del column1; del column2`. Here `arr_triples` has shape (30000, 3), but the strings in the first two columns have no trailing space (which is why I use `char.add()`). Ideally, I would like to avoid creating one variable for the first two columns and another for the third, and then concatenating them. – Kliker Apr 14 '23 at 18:36
  • @Kliker, welcome. From the comment you left, it's not clear what the actual values of `arr_triples` and the final expected result are; that could be moved to a separate question. – RomanPerekhrest Apr 14 '23 at 19:02
  • If the data is actually in a `pandas` DataFrame, look at `pandas`' own string handlers. For a start, it uses Python strings, not NumPy ones. I don't know if it's faster, but NumPy's handling of strings is not fast. – hpaulj Apr 14 '23 at 20:03
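
For the follow-up case in the comments above, where the strings have no trailing spaces, here is a minimal sketch of two options. The array `triples` is a made-up stand-in for `train.triples`, whose real contents are not shown in the thread.

```python
import numpy as np

# Hypothetical stand-in for train.triples: no trailing spaces.
triples = np.array([['a', 'b', 'c'], ['d', 'e', 'f']])

# Option 1: np.char.add, applied element-wise.
# Append a space to the first two columns, then chain the columns.
with_sep = np.char.add(triples[:, :2], ' ')  # shape (n, 2)
joined = np.char.add(
    np.char.add(with_sep[:, 0], with_sep[:, 1]),
    triples[:, 2],
)
print(joined.tolist())  # ['a b c', 'd e f']

# Option 2: plain Python join, which inserts the separator itself.
joined_py = [' '.join(row) for row in triples.tolist()]
print(joined_py)  # ['a b c', 'd e f']
```

Both give the same strings; which one is faster on millions of rows is something to measure rather than assume.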