Parsing a very large array with list comprehension is slow

Question

I have a very large list of numerical values in numpy.float64 format. I want to convert each inf value to 0.0 and convert the remaining elements to plain Python floats.

This is my code, which works perfectly:

import numpy as np

# Values in numpy.float64 format.
original_values = [np.float64("Inf"), np.float64(0.02345), np.float64(0.2334)]

# Convert them
parsed_values = [0.0 if x == float("inf") else float(x) for x in original_values]

But this is slow. Is there any way to speed up this code, perhaps with some magic from map or numpy (I have no experience with these)?

Does this answer your question? [Replace -inf with zero value](https://stackoverflow.com/questions/21049920/replace-inf-with-zero-value) — Michael Szczesny, Jun 10 '22 at 09:30
why are you using a `list` of `numpy.float64` objects??? Is that really what you have? Please provide a [mcve]. What is `type(original_values)`? — juanpa.arrivillaga, Jun 10 '22 at 09:31
@KlausD. That array needs to be sent to a MongoDB object, which doesn't accept `numpy.float64` values, and that's why I need to convert them all to float. — Avión, Jun 10 '22 at 09:36
@juanpa.arrivillaga That's the input I get from a file. I can't modify it. `original_values` is a list of `numpy.float64` values, as you can read in the post, and it holds more than `60.000` values. I have to do this operation for several arrays and it's slow. — Avión, Jun 10 '22 at 09:36
@Avión that doesn't make any sense. What is in this file, exactly? How do you create a list of `numpy.float64` objects from this file? — juanpa.arrivillaga, Jun 10 '22 at 09:37
Almost certainly, you have a `numpy.ndarray` object at some point, and you really should be just using that. If at the end you need a `list` to pass to mongo, just use `my_array.tolist()` — juanpa.arrivillaga, Jun 10 '22 at 09:38
Don't care about the sense. I can't modify it. It's something that is given to me; it's not on my side. — Avión, Jun 10 '22 at 09:39
@Avión that doesn't make any sense. You aren't "given" a `list` of `numpy.float64` objects from a file, **you** had to create that list somehow. How did you parse the file? In any case, the fastest way would be to use the arrays, and `~np.isfinite` as the answer below seems to indicate. Maybe just convert your list back to a `numpy.ndarray`, although again, almost certainly, you *had an array to begin with*. If *speed is a concern*, you shouldn't be creating a list to begin with. — juanpa.arrivillaga, Jun 10 '22 at 09:40

Mark · Accepted Answer · 2022-06-10T11:32:00.463

3

Hey~ You're probably asking how to do this faster with numpy. The quick answer is to turn the list into a numpy array and do the replacement the numpy way:

import numpy as np

original_values = [np.float64("Inf"), ..., np.float64(0.2334)]
arr = np.array(original_values)
arr[arr == np.inf] = 0

Here arr == np.inf returns a boolean array that looks like array([ True, ..., False]). It can be used as a mask to select the matching positions in arr, as shown above.
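To make the masking step concrete, here is a minimal sketch on a three-element array. The input values are only illustrative, and the intermediate name mask is introduced here for readability; it is not part of the code above.

```python
import numpy as np

# Small illustrative input (values chosen for this example only).
arr = np.array([np.inf, 0.02345, 0.2334])

# Element-wise comparison: True where the element is +inf.
mask = arr == np.inf
print(mask.tolist())  # [True, False, False]

# Boolean-mask assignment only touches the positions where mask is True.
arr[mask] = 0.0
print(arr.tolist())   # [0.0, 0.02345, 0.2334]
```

Note that arr == np.inf only matches positive infinity, like the comparison in the question. If -inf should be zeroed as well, np.isinf(arr) could be used as the mask instead.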

Hope it helps.

I tested a bit, and it should be fast enough:

import timeit

import numpy as np

# Create a huge array (10**9 float64 values, about 8 GB of memory)
arr = np.random.random(1000000000)
idx = np.random.randint(0, high=1000000000, size=1000000)
arr[idx] = np.inf

# Time the replacement
def replace_inf_with_0(arr=arr):
    arr[arr == np.inf] = 0

timeit.Timer(replace_inf_with_0).timeit(number=1)

The output says it takes 1.5 seconds to turn all 1,000,000 infs into 0s in a 1,000,000,000-element array.
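For a size closer to the 60,000+ values mentioned in the comments, a rough comparison of the two approaches might look like the sketch below. The test data, the share of inf values, and the function names are assumptions made for this example. The numpy version deliberately includes the list-to-array conversion, since the input is said to arrive as a list. Timings depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np

# Assumed test data: 60,000 np.float64 objects in a list, about 1% of them inf.
rng = np.random.default_rng(0)
data = rng.random(60_000)
data[rng.integers(0, data.size, size=600)] = np.inf
original_values = list(data)  # list of np.float64, like the question's input


def with_comprehension():
    # The approach from the question.
    return [0.0 if x == float("inf") else float(x) for x in original_values]


def with_numpy():
    arr = np.array(original_values)  # includes the list-to-array conversion cost
    arr[arr == np.inf] = 0
    return arr.tolist()              # back to a list of built-in floats


# Both approaches should produce the same list.
assert with_comprehension() == with_numpy()

for fn in (with_comprehension, with_numpy):
    seconds = min(timeit.repeat(fn, number=10, repeat=3))
    print(f"{fn.__name__}: {seconds:.4f} s for 10 runs")
```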

@Avión used arr.tolist() in the end to convert the array back to a list for MongoDB, which should be the common way. I tried it with the billion-element array: the conversion took about 30 seconds, while creating the array took less than 10 seconds. So feel free to recommend more efficient methods.
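Putting the pieces together, the whole conversion could be wrapped in a small helper like the sketch below. The name to_plain_floats is hypothetical. It assumes the input is a list of numpy.float64 values as in the question, and that only positive infinity needs replacing.

```python
import numpy as np


def to_plain_floats(values):
    """Hypothetical helper: replace +inf with 0.0 and return built-in floats."""
    arr = np.array(values, dtype=np.float64)  # copies, so the input list is untouched
    arr[arr == np.inf] = 0.0                  # same replacement as in the answer
    return arr.tolist()                       # list of Python float objects


original_values = [np.float64("inf"), np.float64(0.02345), np.float64(0.2334)]
parsed_values = to_plain_floats(original_values)

# Same result as the list comprehension from the question.
expected = [0.0 if x == float("inf") else float(x) for x in original_values]
assert parsed_values == expected

# tolist() yields built-in floats rather than numpy.float64 objects.
assert all(type(x) is float for x in parsed_values)
```

Because tolist() returns built-in float objects, the result should no longer contain the numpy types that the encoder rejected in the comments.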

edited Jun 10 '22 at 11:32

answered Jun 10 '22 at 09:51


Thank you. It works, but this outputs an `ndarray` and I need a normal list, as it's going to be inserted into MongoDB. I get this error if an `ndarray` is used: `bson.errors.InvalidDocument: cannot encode object: array([0.00000000e+00, 1.90734868e-01, 3.81469735e-01, ..., 9.76505301e+03, 9.76524375e+03, 9.76543448e+03]), of type: ` – Avión Jun 10 '22 at 09:53

It seems to work if I add `tolist()` to the first line: `arr = np.array(original_values).tolist()` – Avión Jun 10 '22 at 09:58

Btw there was a typo: it’s 1.5 seconds, not 15. – Mark Jun 10 '22 at 10:07

@Avión `tolist()` looks good to me as well. I can’t think of another solution at the moment – Mark Jun 10 '22 at 10:11
