
I have a very large list of numerical values in numpy.float64 format. I want to replace every inf value with 0.0 and convert the rest of the elements to plain Python floats.

This is my code, which works perfectly:

import numpy as np

# Values in numpy.float64 format.
original_values = [np.float64("Inf"), np.float64(0.02345), np.float64(0.2334)]

# Convert them: inf becomes 0.0, everything else becomes a plain Python float.
parsed_values = [0.0 if x == float("inf") else float(x) for x in original_values]

But this is slow. Is there any way to speed up this code, perhaps with map or numpy? (I have no experience with these libraries.)

Avión
  • Any reason you are not using a numpy array? – Klaus D. Jun 10 '22 at 09:29
  • Does this answer your question? [Replace -inf with zero value](https://stackoverflow.com/questions/21049920/replace-inf-with-zero-value) – Michael Szczesny Jun 10 '22 at 09:30
  • why are you using a `list` of `numpy.float64` objects??? Is that really what you have? Please provide a [mcve]. What is `type(original_values)`? – juanpa.arrivillaga Jun 10 '22 at 09:31
  • @KlausD. That array needs to be sent to a MongoDB object which doesn't accept `numpy.float64` values, and that's why I need to convert them all to float. – Avión Jun 10 '22 at 09:36
  • @juanpa.arrivillaga That's the input I get from a file. I can't modify it. `original_values` is a list of `numpy.float64` values, as you can read in the post, and it has more than `60.000` values. I have to do this operation for several arrays and it's slow. – Avión Jun 10 '22 at 09:36
  • @Avión that doesn't make any sense. What is in this file, exactly? How do you create a list of `numpy.float64` objects from this file? – juanpa.arrivillaga Jun 10 '22 at 09:37
  • Almost certainly, you have a `numpy.ndarray` object at some point, and you really should be just using that. If at the end you need a `list` to pass to mongo, just use `my_array.tolist()` – juanpa.arrivillaga Jun 10 '22 at 09:38
  • Never mind whether it makes sense. I can't modify it. It's something that is given to me; it's not on my side. – Avión Jun 10 '22 at 09:39
  • @Avión that doesn't make any sense. You aren't "given" a `list` of `numpy.float64` objects from a file, **you** had to create that list somehow. How did you parse the file? In any case, the fastest way would be to use the arrays, and `~np.isfinite` as the answer below seems to indicate. Maybe just convert your list back to a `numpy.ndarray`, although again, almost certainly, you *had an array to begin with*. If *speed is a concern*, you shouldn't be creating a list to begin with – juanpa.arrivillaga Jun 10 '22 at 09:40

1 Answer

Hey~ you are probably asking how to do this faster with numpy. The quick answer is to turn the list into a numpy array and do the replacement the numpy way:

import numpy as np

original_values = [np.float64("Inf"), ..., np.float64(0.2334)]
arr = np.array(original_values)
arr[arr == np.inf] = 0

where arr == np.inf returns another array that looks like array([ True, ..., False]) and can be used to select indices in arr in the way I showed.
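To make the masking step concrete, here is a minimal sketch on a tiny array. The four sample values and the name `mask` are made up for illustration; only `arr == np.inf` and the masked assignment come from the snippet above.

```python
import numpy as np

# Small illustrative array (values chosen for this example only).
arr = np.array([np.inf, 0.02345, 0.2334, np.inf])

# Element-wise comparison produces a boolean array of the same shape.
mask = arr == np.inf
print(mask.tolist())  # [True, False, False, True]

# Assigning through the mask only touches positions where it is True.
arr[mask] = 0
print(arr.tolist())   # [0.0, 0.02345, 0.2334, 0.0]
```

Note that `arr == np.inf` matches positive infinity only, which is the same behaviour as the `x == float("inf")` test in the question.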

Hope it helps.

I tested a bit, and it should be fast enough:

import timeit

import numpy as np

# Create a huge array (a billion float64 values take about 8 GB of memory)
arr = np.random.random(1000000000)
idx = np.random.randint(0, high=1000000000, size=1000000)
arr[idx] = np.inf

# Time the replacement
def replace_inf_with_0(arr=arr):
    arr[arr == np.inf] = 0

timeit.Timer(replace_inf_with_0).timeit(number=1)

The output says it takes 1.5 seconds to turn the roughly 1,000,000 infs (the random indices can repeat) into 0s in a 1,000,000,000-element array.


In the end, @Avión used arr.tolist() to convert the result back to a list for MongoDB, which should be the common way. I tried it with the billion-element array: the conversion took about 30 seconds, while creating the array took less than 10 seconds. So, feel free to recommend more efficient methods.
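Putting the pieces together, the whole round trip (list in, list of plain floats out) might look like the sketch below. The helper name `to_plain_floats` is hypothetical, and it uses `np.isinf`, which is a deliberate variation: unlike `arr == np.inf`, it also matches -inf. Swap the original comparison back in if only positive infinity should be replaced. The timing part at the end is just a rough comparison on a 60,000-element list; actual numbers will depend on the machine.

```python
import timeit

import numpy as np


def to_plain_floats(values):
    """Hypothetical helper: replace +/-inf with 0.0 and return Python floats."""
    # np.array copies the data, so the input list is left untouched.
    arr = np.array(values, dtype=np.float64)
    # np.isinf is True for both +inf and -inf; NaN values are left as they are.
    arr[np.isinf(arr)] = 0.0
    # tolist() returns built-in float objects rather than numpy.float64.
    return arr.tolist()


original_values = [np.float64("inf"), np.float64(0.02345), np.float64(0.2334)]
parsed_values = to_plain_floats(original_values)
print(parsed_values)                                  # [0.0, 0.02345, 0.2334]
print(all(type(x) is float for x in parsed_values))   # True

# Rough timing on a list of 60,000 numpy.float64 values (no fixed result expected).
big = [np.float64(x) for x in np.random.random(60_000)]
big[0] = np.float64("inf")
t_loop = timeit.timeit(
    lambda: [0.0 if x == float("inf") else float(x) for x in big], number=10
)
t_numpy = timeit.timeit(lambda: to_plain_floats(big), number=10)
print(f"list comprehension: {t_loop:.4f}s, numpy round trip: {t_numpy:.4f}s")
```

Because `tolist()` yields built-in floats, the result should avoid the `numpy.float64` encoding problem described in the comments.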

Mark
  • Thank you. It works, but this outputs an `ndarray` and I need a normal array, as it's going to be inserted into MongoDB. I get this error if an `ndarray` is used: `bson.errors.InvalidDocument: cannot encode object: array([0.00000000e+00, 1.90734868e-01, 3.81469735e-01, ..., 9.76505301e+03, 9.76524375e+03, 9.76543448e+03]), of type: ` – Avión Jun 10 '22 at 09:53
  • It seems to work if I add `tolist()` to the first line: `arr = np.array(original_values).tolist()` – Avión Jun 10 '22 at 09:58
  • Btw there was a typo: it’s 1.5 seconds, not 15. – Mark Jun 10 '22 at 10:07
  • @Avión `tolist()` looks good to me as well. I can’t think of another solution at the moment – Mark Jun 10 '22 at 10:11