I have a pandas dataframe of descriptions like this:

df['description']

22CI003294 PARCEL 32 
22CI400040 NORFOLK ESTATES 
12CI400952 & 13CI403261
22CI400628 GARDEN ACRES
9CI00208 FERNHAVEN SEC
22CI400675 CECIL AVE SUB
22CI400721 124.69' SS
BOLLING AVE SS

I want to extract the first alphanumeric token that is at least 6 characters long. It has to start with a digit, which can be followed by any number of digits or letters. Expected results for the data above:

22CI003294
22CI400040
12CI400952
22CI400628
9CI00208
22CI400675
22CI400721
None

What I have tried:

df['results'] = df['description'].str.extract(r'(\d*\w+\d+\w*){6,}')

When I added in {6,} at the end I suddenly get no matches. Please advise.

amnesic
  • Maybe you need `^([^\W_]{6,})`? Or, `^(?=[^\W\d_]*\d)[^\W_]{6,}`? Or even `^(?=\d*[^\W\d_])(?=[^\W\d_]*\d)[^\W_]{6,}`? – Wiktor Stribiżew Jul 18 '22 at 11:13
  • I just need to filter out by character numbers once it's matched. Why doesn't my regex work after I added in `{6,}`? – amnesic Jul 18 '22 at 11:17
  • It makes no sense here: you quantified the whole sequence, and with `str.extract` you would have extracted only the last capture. – Wiktor Stribiżew Jul 18 '22 at 11:18
  • So, what is it that you are after? To extract the alphanumeric string with at least one digit and at least one letter at the start of the string if there are 6 or more chars? – Wiktor Stribiżew Jul 18 '22 at 11:19
  • Basically, `^([^\W_]{6,})` as you suggested will work. I want to extract any alphanumeric characters that are greater than 6 characters. My regex of `(\d*\w+\d+\w*)` matches the numbers just fine. But I have no idea why I couldn't add `{6,}` at the end to limit character number. – amnesic Jul 18 '22 at 11:24
  • And yes they have to start with a digit and then can repeat any amount of digit or letters. – amnesic Jul 18 '22 at 11:26

1 Answer

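Your way of limiting the character length is not correct: the {6,} quantifier repeats the whole preceding group six or more times, it does not require the match to be six characters long. See why at Restricting character length in a regular expression.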
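To illustrate the difference, here is a minimal sketch using Python's `re` module on one of the sample rows. The `per_char` pattern is only an illustration of quantifying a single character class instead of a group (note that `\w` also allows underscores); it is not the final suggestion.

```python
import re

# Pattern from the question: {6,} repeats the whole group six or more times.
grouped = re.compile(r'(\d*\w+\d+\w*){6,}')
# Illustrative alternative: the quantifier applies to one character class.
per_char = re.compile(r'\d\w{5,}')

sample = "22CI003294 PARCEL 32"

# Each repetition of the group consumes at least two word characters
# (\w+ followed by \d+), so six repetitions need a run of 12 or more.
# The longest run of word characters here is 10, hence no match.
print(grouped.search(sample))            # None
print(per_char.search(sample).group())   # 22CI003294
```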

You can use

df['results'] = df['description'].str.extract(r'^(\d[^\W_]{5,})')

See the regex demo.

Details:

  • `^` - start of string
  • `(\d[^\W_]{5,})` - Group 1:
    • `\d` - a digit
    • `[^\W_]{5,}` - five or more letters or digits.
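As a quick check, the pattern can be run against the sample rows from the question. This is a sketch that assumes pandas is installed; `expand=False` is used here only so that the single capture group comes back as a Series. Rows without a match come back as NaN rather than None.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"description": [
    "22CI003294 PARCEL 32 ",
    "22CI400040 NORFOLK ESTATES ",
    "12CI400952 & 13CI403261",
    "22CI400628 GARDEN ACRES",
    "9CI00208 FERNHAVEN SEC",
    "22CI400675 CECIL AVE SUB",
    "22CI400721 124.69' SS",
    "BOLLING AVE SS",
]})

# A digit followed by five or more letters/digits, anchored at the start.
pattern = r'^(\d[^\W_]{5,})'

# With one capture group and expand=False, str.extract returns a Series.
df["results"] = df["description"].str.extract(pattern, expand=False)
print(df["results"].tolist())
```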

If the match is not always expected at the start of the string, replace the `^` anchor with a numeric boundary (`(?<!\d)`) or a word boundary (`\b`).
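The two unanchored variants can behave differently when the code is glued to preceding letters. Below is a small sketch with Python's `re`; the helper `first_code` and the test strings are made up for illustration and are not from the question's data.

```python
import re

# Numeric boundary: the match must not be preceded by another digit.
numeric_boundary = re.compile(r'(?<!\d)(\d[^\W_]{5,})')
# Word boundary: the match must start at a word boundary.
word_boundary = re.compile(r'\b(\d[^\W_]{5,})')

def first_code(pattern, text):
    """Return the first captured code in text, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

# Code in the middle of the string, separated by spaces: both variants find it.
print(first_code(numeric_boundary, "LOT 22CI400040 NORFOLK"))  # 22CI400040
print(first_code(word_boundary, "LOT 22CI400040 NORFOLK"))     # 22CI400040

# Code glued to preceding letters: only the numeric boundary matches,
# because there is no word boundary between "B" and "2".
print(first_code(numeric_boundary, "AB22CI400040"))  # 22CI400040
print(first_code(word_boundary, "AB22CI400040"))     # None
```

Which boundary fits depends on whether a code attached to leading letters should count as a match.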

Wiktor Stribiżew