regex findall for numbers

Question

I'm trying to extract 6-digit numbers embedded within texts. The numbers always start with a zero, are always 6 digits long separated by a period after the 4th digit, like so:

0 0133.02[text] in location [texttext](text) numbers 1 0121.08[text] in location [texttext](text) numbers ...

I run the following:

import re filtered = re.findall("0\d\d\d[.]\d\d", str(df['col']))

There are 478 rows to be parsed, and each row contains the said number. However, the filtered result only ever outputs 60, even if I change the regex format. Interestingly, filtered seems to be comprised mostly of numbers from first and last few rows of the 478 rows, but not from the middle?

EDIT: I extracted the rows that work vs don't work, and found that the ones that DO work are the first & last 30 rows (0-29, 448-477).

Here's a sample of the rows that do not work (446, 447): 446 0005.00 [CT] in Vancouver [CMA] (B.C.) 44160 447 0170.05 [CT] in Vancouver [CMA] (B.C.) 44006

And a sample of the rows that do work (448, 449): 448 0050.04 [CT] in Vancouver [CMA] (B.C.) 43995 449 0067.01 [CT] in Vancouver [CMA] (B.C.) 43989

Hard to see what's wrong without seeing the data. Your regex should be working. — L3viathan, Dec 30 '17 at 23:46
Is your question related to this one? https://stackoverflow.com/questions/15325182/how-to-filter-rows-in-pandas-by-regex — Mehdi, Dec 30 '17 at 23:49
try `'\d+\.?\d*'` and see the count of results. Possibly some of your elements are not matching your expected format — Moinuddin Quadri, Dec 30 '17 at 23:49
You need to share more data for this to work. We can't guess what is wrong. How about extracting the rows that doesn't work? — Anton vBR, Dec 30 '17 at 23:55

Anton vBR · Accepted Answer · 2017-12-31T00:07:54.937

1

These are some things to help you solve this. I will remove this as it is not an answer.

import re
import pandas as pd

data = dict(col=['texttexttext 0036.01 texttext','texttexttext 0006.21 texttext'])
df = pd.DataFrame(data)

re.findall("0\d{3}\.\d{2}", str(df['col'])) #Simplified your regex

Effectively creates:

['0036.01', '0006.21']

How about trying this:

re.findall("0\d{3}\.\d{2}",' '.join(df['col'].tolist()))

And if middle rows are not working, extract a sample, e.g., and share that with us:

print('\n'.join(df['col'][200:220].tolist()))

edited Dec 31 '17 at 00:07

answered Dec 30 '17 at 23:59

Anton vBR

18,287
5
40
46

thanks for the suggestion, i edited the post to include sample of the rows that don't work – eh2699 Dec 31 '17 at 00:42
@eh2699 strange.. what if you do the ' '.join() function instead? – Anton vBR Dec 31 '17 at 00:43
that worked!! I'm sorry but I'm new to python, can you take a moment? - what does join() do to make it work that not using that didn't? – eh2699 Dec 31 '17 at 00:56
1

@eh2699 tbh it was more of a guess. `str.join()` glues a list of strings together to one long string. My guess is that str(df['col']) didn't return all values. – Anton vBR Dec 31 '17 at 00:58

regex findall for numbers

1 Answers1