3

I want to keep only the numbers from a NumPy array of strings, some of which are not valid numbers. My code looks like the following:

age = train['age'].to_numpy() # 200k values
set(age)
# {'1', '2', '3', '7-11', np.nan...} 

age = np.array(['1', '2', '3', '7-11', np.nan])

Desired output: np.array([1, 2, 3]). Ideally, '7-11' would become 7; however, that's not simple, and losing that value is tolerable.

np.isfinite(age) fails with "ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''"

x = [num for num in age if isinstance(num, (int, float))] returns []
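
For context, both attempts seem to fail for the same reason, at least for the literal array shown above: NumPy stores it with a string dtype, so the NaN becomes the text 'nan' and no element is an int or float. A minimal sketch that inspects the array (nothing here beyond the question's own data):

```python
import numpy as np

age = np.array(['1', '2', '3', '7-11', np.nan])

# Mixing strings with a float NaN makes NumPy pick a Unicode string dtype,
# so the NaN is stored as text rather than as a float.
print(age.dtype.kind)   # 'U' stands for a Unicode string dtype
print(repr(age[-1]))    # the former NaN, now a string scalar

# Every element is a NumPy string scalar, never an int or float,
# which is why the isinstance filter keeps nothing.
print([num for num in age if isinstance(num, (int, float))])
```

An array taken directly from a pandas column may instead have object dtype, in which case the details differ.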

Ali Pardhan
  • This is in essence a follow-up to the following question, https://stackoverflow.com/q/11620914/5728614 – Ali Pardhan Jan 21 '22 at 02:11
  • Please provide the desired output, which I assume is an array with just [1, 2] in it? Should return any integers and floating point numbers, but what about a string that is a number, e.g. '1'. Should that be included as an integer type? A little more context about edge cases, etc would be helpful – frederick-douglas-pearce Jan 21 '22 at 02:26
  • Maybe this [SO answer](https://stackoverflow.com/questions/1277914/is-there-a-way-to-output-the-numbers-only-from-a-python-list) is helpful? If you are going to have to evaluate `x` like shown, with mixed types including strings, then the output `x` array will all be strings, and you can use `isnumeric` to filter the array for numbers, [see docs](https://numpy.org/doc/stable/reference/generated/numpy.char.isnumeric.html) – frederick-douglas-pearce Jan 21 '22 at 02:29
  • @frederick-douglas-pearce I have made a large edit and corrected the data types. I attempted one of the solutions to the answer you linked. I will give `isnumeric` a try – Ali Pardhan Jan 21 '22 at 02:58

2 Answers

2

Here's an option that will split strings on '-' first, and only take the first value, so '7-11' is converted to 7:

age = np.array(['1', '2', '3', '7-11', np.nan])
age_int = np.array([int(x[0]) for x in np.char.split(age, sep='-') if x[0].isdecimal()])

Output: array([1, 2, 3, 7])

There is a more efficient way to do this if you don't care about cases like '7-11':

age_int2 = age[np.char.isdecimal(age)].astype(int)

Output2: array([1, 2, 3])
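
Both snippets above rely on `age` having a string dtype, which is what the literal `np.array([...])` call produces. If the array comes straight from `train['age'].to_numpy()`, it may well be an object array holding a real float NaN, and the `np.char` functions may not accept that. A hedged sketch, assuming such an object array (the name `age_obj` is made up for illustration), casts to strings first:

```python
import numpy as np

# Assumption: an object array similar to what pandas' to_numpy() might return,
# holding Python strings plus a real float NaN.
age_obj = np.array(['1', '2', '3', '7-11', np.nan], dtype=object)

# Cast to a fixed-width string dtype; the NaN becomes the string 'nan'.
age_str = age_obj.astype(str)

# Keep only the entries made up entirely of decimal digits, then convert.
age_clean = age_str[np.char.isdecimal(age_str)].astype(int)
print(age_clean)
```

This drops '7-11' like the one-liner does; the split-based version would be needed to keep the 7.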

1

You could do something like the following:

for pos, val in enumerate(age):
    try:
        new_val = int(val)
    except (ValueError, TypeError):  # value is not a plain integer string
        new_val = np.nan
    age[pos] = new_val

age = age[age!="nan"].astype(int)

print(age)
> [1 2 3]
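
The same try/except idea can also be written as a small helper that builds a new array instead of overwriting `age` in place. This is only a sketch; `to_int_or_none` is a hypothetical name, not something from NumPy:

```python
import numpy as np

def to_int_or_none(value):
    """Return int(value) if it parses as an integer, otherwise None (hypothetical helper)."""
    try:
        return int(value)
    except (ValueError, TypeError):
        return None

age = np.array(['1', '2', '3', '7-11', np.nan])

# Parse every entry, then drop the ones that could not be converted.
parsed = [to_int_or_none(v) for v in age]
age_int = np.array([v for v in parsed if v is not None])
print(age_int)
```

Catching only ValueError and TypeError, rather than using a bare except, keeps unrelated errors visible.
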
BoomBoxBoy
  • After a small change, which I edited in, this code solves my problem. Thank you! I am going to wait a few days before accepting this, in case there is a nice one-liner or 'better' solution. – Ali Pardhan Jan 21 '22 at 03:45
  • @frederick-douglas-pearce has a great one-liner! – BoomBoxBoy Jan 21 '22 at 15:46