
I'm trying to match all numbers in a given body of text using re.findall() and convert them to integers. I know that something like [0-9]+ or [\d]+ should match any number in the string; however, my output splits the numbers into individual digits (e.g. '125' becomes '1', '2', '5').

Here's what I have:

import re

regex_list = []

sample = "Here are a bunch of numbers 7746 and 12 and 1929 and 8827 and 7 and 8837 and 128 now convert them"

for line in sample:
    line = line.strip()
    if re.findall('([0-9]+)', line):
        regex_list.append(int(line))
print(regex_list)

Output:

[7, 7, 4, 6, 1, 2, 1, 9, 2, 9, 8, 8, 2, 7, 7, 8, 8, 3, 7, 1, 2, 8]

Desired Output:

[7746, 12, 1929, 8827, 7, 8837, 128]
David
  • The problem isn't the regex, the problem is your `for` loop. Take a look at the value of `line`... (That should've been one of the first things to do to debug this problem, by the way.) – Aran-Fey Mar 26 '18 at 18:12
  • Okay, thanks for clarifying. I was not aware that using a for loop would have this effect. – David Mar 26 '18 at 20:33

3 Answers

Your issue is that you are looping through the string character by character, when you can simply apply the regex to the entire string.

>>> import re
>>> s = "Here are a bunch of numbers 7746 and 12 and 1929 and 8827 and 7 and 8837 and 128 now convert them"
>>> [int(j) for j in re.findall(r'[0-9]+', s)]
[7746, 12, 1929, 8827, 7, 8837, 128]
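
The same idea can be wrapped in a small helper for reuse. This is only a sketch: extract_ints is a hypothetical name (not part of the re module), and it assumes only non-negative integers written with ASCII digits are wanted.

```python
import re

def extract_ints(text):
    """Return every run of ASCII digits in text as an int (hypothetical helper)."""
    # re.findall scans the whole string, so no loop over characters or words is needed.
    return [int(match) for match in re.findall(r'[0-9]+', text)]

sample = "Here are a bunch of numbers 7746 and 12 and 1929 and 8827 and 7 and 8837 and 128 now convert them"
numbers = extract_ints(sample)
print(numbers)
```

Because the pattern is applied to the full string, the helper should also work on text that spans several lines.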
user3483203

Have a look at @chrisz's answer for a better solution.

But, if you want to know what's wrong with yours:

Iterating over a string using a for loop gives you single characters, and not words as you thought. To get the words, you'll have to use split().

import re

regex_list = []

sample = "Here are a bunch of numbers 7746 and 12 and 1929 and 8827 and 7 and 8837 and 128 now convert them"

for line in sample.split():
    line = line.strip()
    if re.findall('([0-9]+)', line):
        regex_list.append(int(line))

print(regex_list)
# [7746, 12, 1929, 8827, 7, 8837, 128]

But, since you are getting the words individually, there's no need to use regex. You can directly use isdigit().

for line in sample.split():
    line = line.strip()
    if line.isdigit():
        regex_list.append(int(line))

Or, simply using a list comprehension:

num_list = [int(word) for word in sample.split() if word.isdigit()]
print(num_list)
# [7746, 12, 1929, 8827, 7, 8837, 128]
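
One caveat worth checking: split() with isdigit() only finds numbers that stand alone as whitespace-separated words. The sketch below compares the two approaches on a made-up sentence (the text value is an assumption for illustration, not from the question) in which some numbers have punctuation attached.

```python
import re

# Hypothetical input: "12," and "7." carry trailing punctuation.
text = "Totals: 12, 1929 and 7."

# Splitting on whitespace leaves the punctuation attached, so "12," and "7."
# fail isdigit() and are skipped.
by_split = [int(word) for word in text.split() if word.isdigit()]

# The regex picks out digit runs regardless of the surrounding characters.
by_regex = [int(match) for match in re.findall(r'[0-9]+', text)]

print(by_split)
print(by_regex)
```

On the sample string from the question both approaches give the same list, since every number there is surrounded by spaces.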
Keyur Potdar

for line in sample stores a single character in line on each iteration, unless your sample is a list of lines
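
A quick way to see the difference, as a sketch with a made-up two-line string (not the one from the question):

```python
# Hypothetical two-line input used only to illustrate the iteration behaviour.
sample = "12 and 7\n128 now"

# Iterating over a str yields one character at a time.
chars = [item for item in sample]

# Iterating over the list returned by splitlines() yields whole lines.
lines = [item for item in sample.splitlines()]

print(chars[:3])
print(lines)
```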

pratik mankar