3

How can I extract only the numbers from

a = ['1 2 3', '4 5 6', 'invalid']

I have tried:

mynewlist = [s for s in a if s.isdigit()]
print mynewlist

and

for strn in a:
    values = map(float, strn.split())
print values

Both failed because there is a space between the numbers.

Note: I am trying to achieve output as:

[1, 2, 3, 4, 5, 6]
Eugene Yarmash
labmat
  • Are numbers separated by spaces only or do you expect there might be something else, like a comma or something? – Dunno Oct 28 '16 at 15:02
  • What are you getting when you use `values = map(float, strn.split())`? – levi Oct 28 '16 at 15:02
  • @Dunno: yes, they are separated by spaces, not commas – labmat Oct 28 '16 at 15:05
  • @levi I get ValueError: could not convert string to float: invalid – labmat Oct 28 '16 at 15:06
  • 1
    It doesn't matter much in this case because why your code didn't work is obvious, but you should get in the habit of including full stack traces and detailed descriptions of how your code is failing in your SO questions. When you ask SO questions about more complex problems in the future, saying your code "failed" is probably going to result in a closed question because its not nearly detailed enough. – skrrgwasme Oct 28 '16 at 15:11

7 Answers

9

I think you need to split each item in the list on whitespace and then check each resulting piece.

a = ['1 2 3', '4 5 6', 'invalid']
numbers = []
for item in a:
    for subitem in item.split():
        if subitem.isdigit():
            numbers.append(subitem)
print(numbers)

['1', '2', '3', '4', '5', '6']

Or in a neat and tidy comprehension:

[subitem for item in a for subitem in item.split() if subitem.isdigit()]
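
The result above is a list of strings. If integers are wanted, as in the question's expected output, one possible variation is to convert each token as it is collected. This is only a sketch, and the helper name `extract_ints` is made up for illustration:

```python
def extract_ints(strings):
    """Return every whitespace-separated token that is all digits, as an int."""
    # split() with no argument splits on any run of whitespace
    return [int(token) for s in strings for token in s.split() if token.isdigit()]


a = ['1 2 3', '4 5 6', 'invalid']
print(extract_ints(a))  # [1, 2, 3, 4, 5, 6]
```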
paleolimbot
3

This should work for your particular case, since each element of the list is itself a string. Iterating over a string yields its characters, so you need to flatten it:

new_list = [int(item) for sublist in a for item in sublist if item.isdigit()]
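
One caveat: because the inner loop walks over characters, the line above treats every digit as its own number. A small sketch of the difference, using a made-up input with multi-digit numbers (the list `b` is hypothetical, not from the question):

```python
b = ['12 3', '45 6', 'invalid']

# Character by character: '12' becomes 1 and 2
per_char = [int(ch) for s in b for ch in s if ch.isdigit()]

# Token by token: split on whitespace first, so '12' stays 12
per_token = [int(tok) for s in b for tok in s.split() if tok.isdigit()]

print(per_char)   # [1, 2, 3, 4, 5, 6]
print(per_token)  # [12, 3, 45, 6]
```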
SuperRafek
3

Assuming the list is just strings:

[int(word) for sublist in map(str.split, a) for word in sublist if word.isdigit()]
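
`str.isdigit()` only accepts unsigned runs of digits, so tokens such as `-7` or `2.5` are skipped. If such values can occur (an assumption; the question only shows non-negative integers), one option is to attempt the conversion and ignore the tokens that fail. The helper name `extract_floats` is hypothetical:

```python
def extract_floats(strings):
    """Collect every whitespace-separated token that float() can parse."""
    numbers = []
    for s in strings:
        for token in s.split():
            try:
                numbers.append(float(token))
            except ValueError:
                # Not a number (e.g. 'invalid'); skip it
                pass
    return numbers


print(extract_floats(['1 2 3', '-7 2.5', 'invalid']))  # [1.0, 2.0, 3.0, -7.0, 2.5]
```

Note that `float()` also accepts strings such as `'nan'` and `'inf'`, so this is looser than the `isdigit()` filter.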
Patrick Haugh
2

With the help of sets you can do:

>>> a = ['1 2 3', '4 5 6', 'invalid']
>>> valid = set(" 0123456789")
>>> [int(y) for x in a if set(x) <= valid for y in x.split()]
[1, 2, 3, 4, 5, 6]

This includes the numbers from a string only if every character in that string belongs to the valid set.
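
The check is applied to the whole string, not to individual tokens, so a string that mixes numbers with other characters is dropped entirely. A short sketch contrasting this with a per-token check (the list `mixed` is a made-up example):

```python
valid = set(" 0123456789")
mixed = ['1 2 3', '4 x 6', 'invalid']

# Whole-string check: '4 x 6' contains 'x', so none of it is kept
whole_string = [int(y) for x in mixed if set(x) <= valid for y in x.split()]

# Per-token check: the 4 and 6 survive, only 'x' is skipped
per_token = [int(y) for x in mixed for y in x.split() if y.isdigit()]

print(whole_string)  # [1, 2, 3]
print(per_token)     # [1, 2, 3, 4, 6]
```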

Eugene Yarmash
0
mynewlist = [s for s in a if s.isdigit()]
print mynewlist

doesn't work because you are iterating over the contents of the array, which consists of three strings:

  1. '1 2 3'
  2. '4 5 6'
  3. 'invalid'

That means you have to iterate again over each of those strings.

You can try something like:

mynewlist = []
for s in a:
    mynewlist += [digit for digit in s if digit.isdigit()]
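
To see why the first attempt returns an empty list, and what the loop above produces, here is a small check. Note that the loop collects the digits as one-character strings, so an `int()` call is still needed to get the output asked for in the question:

```python
a = ['1 2 3', '4 5 6', 'invalid']

# isdigit() is False for '1 2 3' because of the spaces, so nothing passes
print([s for s in a if s.isdigit()])  # []

mynewlist = []
for s in a:
    # Iterating over a string yields its characters one at a time
    mynewlist += [digit for digit in s if digit.isdigit()]

print(mynewlist)                      # ['1', '2', '3', '4', '5', '6']
print([int(d) for d in mynewlist])    # [1, 2, 3, 4, 5, 6]
```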
a.costa
  • You should avoid using the word "array" in variable names that have lists. Arrays and lists are different data structures with different attributes and capabilities. – skrrgwasme Oct 28 '16 at 15:14
  • Was there an early revision that used it? Because it's not there now or in the question revision history. – skrrgwasme Oct 28 '16 at 15:19
0

One-liner solution:

new_list = [int(m) for n in a for m in n if m in '0123456789']
jtitusj
0

There are many ways to extract numbers from a list of strings.

The tests below assume a more general list of strings:

input_list = ['abc.123def45, ghi67 890 12, jk345', '123, 456 78, 90', 'abc def, ghi'] * 10000

If conversion to integers is not required:

import itertools
import re

def test_as_str(input_list):
    output_list = []
    
    for string in input_list:
        output_list += re.findall(r'\d+', string)
    
    return output_list

%timeit -n 10 -r 7 test_as_str(input_list)
> 37.6 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def test_as_str(input_list):
    output_list = []
    
    [output_list.extend(re.findall(r'\d+', string)) for string in input_list]
    
    return output_list

%timeit -n 10 -r 7 test_as_str(input_list)
> 39.5 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def test_as_str(input_list):
    return list(itertools.chain(*[re.findall(r'\d+', string) for string in input_list]))

%timeit -n 10 -r 7 test_as_str(input_list)
> 40.4 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def test_as_str(input_list):
    return list(filter(None, [item for string in input_list for item in re.split(r'[^\d]+', string)]))

%timeit -n 10 -r 7 test_as_str(input_list)
> 42.8 ms ± 372 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Conversion to integers can also be included:

def test_as_int(input_list):
    output_list = []
    
    for string in input_list:
        output_list += re.findall(r'\d+', string)
    
    return list(map(int, output_list))

%timeit -n 10 -r 7 test_as_int(input_list)
> 44.7 ms ± 232 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def test_as_int(input_list):
    output_list = []
    
    for string in input_list:
        output_list += re.findall(r'\d+', string)
    
    return [int(item) for item in output_list]

%timeit -n 10 -r 7 test_as_int(input_list)
> 47.8 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def test_as_int(input_list):
    return [int(item) for string in input_list for item in re.findall(r'\d+', string)]

%timeit -n 10 -r 7 test_as_int(input_list)
> 48.3 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def test_as_int(input_list):
    return [int(item) for string in input_list for item in re.split(r'[^\d]+', string) if item]

%timeit -n 10 -r 7 test_as_int(input_list)
> 51.4 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def test_as_int(input_list):
    return [int(item) for string in input_list for item in re.split(r'[^\d]+', string) if item.isdigit()]

%timeit -n 10 -r 7 test_as_int(input_list)
> 54.9 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def test_as_int(input_list):
    return [int(item) for string in input_list for item in re.split(r'[^\d]+', string) if len(item)]

%timeit -n 10 -r 7 test_as_int(input_list)
> 55.5 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The performance tests, which do not show much difference between the approaches, were run on Windows in a Python 3.8.8 virtual environment.
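
The `%timeit` lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard `timeit` module. The function names `findall_ints` and `split_ints` are made up here, the input is shortened to keep the run quick, and the script reports the best time rather than the mean, so the figures are not directly comparable with those above:

```python
import re
import timeit

input_list = ['abc.123def45, ghi67 890 12, jk345', '123, 456 78, 90', 'abc def, ghi'] * 1000


def findall_ints(strings):
    # Find every run of digits and convert it
    return [int(item) for s in strings for item in re.findall(r'\d+', s)]


def split_ints(strings):
    # Split on runs of non-digits; drop the empty pieces split() leaves behind
    return [int(item) for s in strings for item in re.split(r'\D+', s) if item]


# Both approaches should agree before their speed is compared
assert findall_ints(input_list) == split_ints(input_list)

for func in (findall_ints, split_ints):
    # repeat=7, number=10 mirrors "-r 7 -n 10"; report the best per-call time
    best = min(timeit.repeat(lambda: func(input_list), repeat=7, number=10)) / 10
    print(f"{func.__name__}: {best * 1000:.2f} ms per call")
```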

J. Choi