Some regex doesn't work in numpy arrays

Question

I'm starting with:

vals 
Out[205]: array([['NA\xa0[1] (16.0\xa0to\xa0N/A)', '12.0\xa0[2]', 'NA\xa0[1]']], dtype=object)

then:

v = vals.astype('str')

v = np.char.replace(v,'\xa0' , ' ')                

v
Out[210]: 
array([['NA [1] (16.0 to N/A)', '12.0 [2]', 'NA [1]']],

and then:

v = np.char.replace(v,'\s*\[[0-9]+\]\s*' , '').tolist()

v
Out[212]: [['NA [1] (16.0 to N/A)', '12.0 [2]', 'NA [1]']]

The issue is that the second replacement doesn't work while the first one does. The regex seems to be OK: it should remove markers such as [1].

I found out later that it has something to do with regex in Python and square brackets: a literal [2] works, but [0-9] doesn't. How do I deal with this?

Ah, so my approach is wrong? It only replaces literal characters? What should I use in this case? — Peter.k, Nov 20 '17 at 21:53
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.core.defchararray.replace.html#numpy.core.defchararray.replace — Grzegorz Oledzki, Nov 20 '17 at 21:54
If you're using numpy, take this a step forward and use pandas, which extensively supports regex. — cs95, Nov 20 '17 at 21:54
Not a perfect tip, but relevant: https://stackoverflow.com/questions/26541279/numpy-array-regex-sub (closed and something a tiny bit different) — Grzegorz Oledzki, Nov 20 '17 at 21:55
If you are working with variable-length strings, you should reconsider using `numpy` arrays. They aren't really built with that use-case in mind. Either maybe `pandas` or even just plain `list` objects would work better. — juanpa.arrivillaga, Nov 20 '17 at 21:58
The `np.char` functions apply regular `str` methods to elements of an array (usually str dtype). While convenient, they don't offer any real speed advantage over explicit iteration. And as you must have found out, they don't work with object dtype. — hpaulj, Nov 20 '17 at 22:01
Thanks for the very good suggestions, especially the one about moving to pandas. I've found a provisional solution for now. — Peter.k, Nov 20 '17 at 22:16

score 0 · Answer 1 · answered Nov 20 '17 at 22:19

I found a temporary solution to use until I switch to pandas: apply re.sub to each string in a list comprehension.

[re.sub(r'\s*\[[0-9]+\]\s*', '', x) for x in v[0]]
Out[240]: ['NA(16.0 to N/A)', '12.0', 'NA']

easy-peasy
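
For reference, here is a minimal sketch of why the original call did nothing and how the same substitution can keep the array shape. It assumes the sample data from the question. As the comments point out, np.char.replace applies the ordinary str.replace method to each element, so the pattern is searched for as literal text and never matches. The names `footnote`, `cleaned` and `keep_space` are illustrative only and do not come from the original post.

```python
import re

import numpy as np

# Sample data from the question, with the non-breaking spaces already replaced.
v = np.array([['NA [1] (16.0 to N/A)', '12.0 [2]', 'NA [1]']])

# np.char.replace does literal substring replacement (like str.replace),
# so the regex text is looked for verbatim and nothing changes.
literal = np.char.replace(v, r'\s*\[[0-9]+\]\s*', '')

# Compile the pattern once; the raw string keeps the backslashes intact.
footnote = re.compile(r'\s*\[[0-9]+\]\s*')

# Apply re.sub to every element, then restore the original 2-D shape.
cleaned = np.array([footnote.sub('', s) for s in v.ravel()]).reshape(v.shape)

# Variant (an assumption about the desired output): drop only the whitespace
# before the marker, so the space before "(16.0" survives.
keep_space = np.array(
    [re.sub(r'\s*\[[0-9]+\]', '', s) for s in v.ravel()]
).reshape(v.shape)

print(literal.tolist())
print(cleaned.tolist())
print(keep_space.tolist())
```

The first print shows the unchanged strings, the second matches the output above, and the third gives 'NA (16.0 to N/A)' for the first element. This is still a plain Python loop, so it is no faster than the list comprehension. It just avoids hard-coding v[0].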

Peter.k
