-1

I'm trying to extract 6-digit numbers embedded within texts. The numbers always start with a zero, are always 6 digits long separated by a period after the 4th digit, like so:

0 0133.02[text] in location [texttext](text) numbers 1 0121.08[text] in location [texttext](text) numbers ...

I run the following:

import re filtered = re.findall("0\d\d\d[.]\d\d", str(df['col']))

There are 478 rows to be parsed, and each row contains the said number. However, the filtered result only ever outputs 60, even if I change the regex format. Interestingly, filtered seems to be comprised mostly of numbers from first and last few rows of the 478 rows, but not from the middle?

EDIT: I extracted the rows that work vs don't work, and found that the ones that DO work are the first & last 30 rows (0-29, 448-477).

Here's a sample of the rows that do not work (446, 447): 446 0005.00 [CT] in Vancouver [CMA] (B.C.) 44160 447 0170.05 [CT] in Vancouver [CMA] (B.C.) 44006

And a sample of the rows that do work (448, 449): 448 0050.04 [CT] in Vancouver [CMA] (B.C.) 43995 449 0067.01 [CT] in Vancouver [CMA] (B.C.) 43989

eh2699
  • 127
  • 1
  • 4
  • 11
  • 1
    Hard to see what's wrong without seeing the data. Your regex should be working. – L3viathan Dec 30 '17 at 23:46
  • Is your question related to this one? https://stackoverflow.com/questions/15325182/how-to-filter-rows-in-pandas-by-regex – Mehdi Dec 30 '17 at 23:49
  • try `'\d+\.?\d*'` and see the count of results. Possibly some of your elements are not matching your expected format – Moinuddin Quadri Dec 30 '17 at 23:49
  • 1
    You need to share more data for this to work. We can't guess what is wrong. How about extracting the rows that doesn't work? – Anton vBR Dec 30 '17 at 23:55

1 Answers1

1

These are some things to help you solve this. I will remove this as it is not an answer.

import re
import pandas as pd

data = dict(col=['texttexttext 0036.01 texttext','texttexttext 0006.21 texttext'])
df = pd.DataFrame(data)

re.findall("0\d{3}\.\d{2}", str(df['col'])) #Simplified your regex

Effectively creates:

['0036.01', '0006.21']

How about trying this:

re.findall("0\d{3}\.\d{2}",' '.join(df['col'].tolist()))

And if middle rows are not working, extract a sample, e.g., and share that with us:

print('\n'.join(df['col'][200:220].tolist()))
Anton vBR
  • 18,287
  • 5
  • 40
  • 46
  • thanks for the suggestion, i edited the post to include sample of the rows that don't work – eh2699 Dec 31 '17 at 00:42
  • @eh2699 strange.. what if you do the ' '.join() function instead? – Anton vBR Dec 31 '17 at 00:43
  • that worked!! I'm sorry but I'm new to python, can you take a moment? - what does join() do to make it work that not using that didn't? – eh2699 Dec 31 '17 at 00:56
  • 1
    @eh2699 tbh it was more of a guess. `str.join()` glues a list of strings together to one long string. My guess is that str(df['col']) didn't return all values. – Anton vBR Dec 31 '17 at 00:58