I'm trying to extract 6-digit numbers embedded in text. Each number starts with a zero and is six digits long, with a period after the 4th digit, like so:
0 0133.02[text] in location [texttext](text) numbers
1 0121.08[text] in location [texttext](text) numbers
...
I run the following:
import re
filtered = re.findall(r"0\d\d\d[.]\d\d", str(df['col']))
There are 478 rows to parse, and every row contains such a number. However, filtered
only ever contains 60 matches, even if I change the regex. Interestingly, filtered
seems to consist mostly of numbers from the first and last few rows of the 478, with none from the middle.
EDIT: I separated the rows that work from the ones that don't, and found that the ones that DO work are the first and last 30 rows (0-29 and 448-477).
Here's a sample of the rows that do not work (446, 447):
446 0005.00 [CT] in Vancouver [CMA] (B.C.) 44160
447 0170.05 [CT] in Vancouver [CMA] (B.C.) 44006
And a sample of the rows that do work (448, 449):
448 0050.04 [CT] in Vancouver [CMA] (B.C.) 43995
449 0067.01 [CT] in Vancouver [CMA] (B.C.) 43989
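A possible explanation for the 30 + 30 pattern, offered as a guess rather than a confirmed diagnosis: str(df['col']) returns the printed form of the column, and pandas abbreviates long output, keeping only rows from the head and tail (governed by display options such as display.max_rows, which defaults to 60; the exact behaviour depends on the pandas version). If that is what is happening, the regex never sees the middle rows. The sketch below sidesteps the printed form by matching each row's text individually. It uses only the standard library; the helper name extract_codes and the sample list are made up for illustration, and with a real DataFrame the list of strings would presumably come from something like df['col'].astype(str).tolist().

```python
import re

# Same pattern as in the question, written as a raw string:
# a zero, three digits, a period, two digits.
PATTERN = re.compile(r"0\d{3}\.\d{2}")


def extract_codes(rows):
    """Return the first matching code in each row, or None if a row has none.

    `rows` is any iterable of strings, one string per row, so the result
    always has exactly one entry per input row.
    """
    codes = []
    for text in rows:
        match = PATTERN.search(text)
        codes.append(match.group(0) if match else None)
    return codes


# Sample rows copied from the question (two that failed, two that worked).
sample_rows = [
    "0005.00 [CT] in Vancouver [CMA] (B.C.) 44160",
    "0170.05 [CT] in Vancouver [CMA] (B.C.) 44006",
    "0050.04 [CT] in Vancouver [CMA] (B.C.) 43995",
    "0067.01 [CT] in Vancouver [CMA] (B.C.) 43989",
]

codes = extract_codes(sample_rows)
print(codes)
```

Working row by row also makes it easy to check that the number of results equals the number of rows (478 here), which the single-string approach cannot guarantee.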