I have a few functions for string manipulation; some of them also use libraries beyond Python's built-ins (for example, spacy).
Profiling my code shows that for loops consume the most time, and vectorizing is the commonly recommended fix.
I am asking this question as a primer for my exploration, so rather than dumping the whole code here I will use a simple string-concatenation example, and my question is how to vectorize it.
This post gave me a quick explanation of vectorization. I then tried to apply it to strings, but performance actually got worse:
import numpy as np
from timeit import Timer

li = list(range(50000))
li = [str(i) for i in li]
nump_arr = np.char.array(li)

def python_for():
    return [num + 'x' for num in li]

def numpy_vec():
    return nump_arr + 'x'

print("python_for", min(Timer(python_for).repeat(10, 10)))
print("numpy_vec", min(Timer(numpy_vec).repeat(10, 10)))
Results:
python_for 0.048397099948488176
numpy_vec 0.4274819999700412
The Python for loop is roughly 8x faster than the NumPy version.
As can be seen, numpy arrays are significantly slower than Python for-loops for strings, while the opposite holds for numbers.
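For contrast, here is the numeric counterpart of the same benchmark (a quick sketch I put together, not part of my original code), where the vectorized version wins by a wide margin:

```python
import numpy as np
from timeit import Timer

nums = list(range(50000))
num_arr = np.array(nums)

def python_for_int():
    # element-wise add in pure Python
    return [n + 1 for n in nums]

def numpy_vec_int():
    # same operation as a single vectorized ufunc call
    return num_arr + 1

print("python_for_int", min(Timer(python_for_int).repeat(10, 10)))
print("numpy_vec_int", min(Timer(numpy_vec_int).repeat(10, 10)))
```

On my understanding, the difference is that numeric dtypes have fast C ufunc loops, while string dtypes do not.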
I haven't used a plain numpy.array because it throws an error for the code below: "ufunc 'add' did not contain a loop with signature matching types (dtype('<U5'), dtype('<U1')) -> None"
li = list(range(50000))
li = [str(i) for i in li]
nump_arr = np.array(li)
nump_arr + 's'
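One workaround I found for the ufunc error (an assumption on my part, not something the linked post covers) is np.char.add, which concatenates element-wise on a plain array, though it is not necessarily faster than the list comprehension:

```python
import numpy as np

li = [str(i) for i in range(50000)]
nump_arr = np.array(li)  # plain array, dtype '<U5'

# np.char.add provides the string concatenation that the `+` ufunc lacks
res = np.char.add(nump_arr, 's')
print(res[:3])  # → ['0s' '1s' '2s']
```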
np.char.array was recommended in this post
Question:
- How can I speed up my string manipulations?
- Is numpy array not recommended for string manipulations?
Using numpy v1.23.1.