Pandas str.extract() to limit number of alphanumeric characters

Question

I have a pandas dataframe of descriptions like this:

df['description']

22CI003294 PARCEL 32 
22CI400040 NORFOLK ESTATES 
12CI400952 & 13CI403261
22CI400628 GARDEN ACRES
9CI00208 FERNHAVEN SEC
22CI400675 CECIL AVE SUB
22CI400721 124.69' SS
BOLLING AVE SS

I want to extract the first alphanumeric token that is at least 6 characters long. It has to start with a digit, followed by any number of digits or letters. So, expected results from above:

22CI003294
22CI400040
12CI400952
22CI400628
9CI00208
22CI400675
22CI400721
None

What I have tried:

df['results'] = df['description'].str.extract(r'(\d*\w+\d+\w*){6,}')

When I added in {6,} at the end I suddenly get no matches. Please advise.

Maybe you need `^([^\W_]{6,})`? Or, `^(?=[^\W\d_]*\d)[^\W_]{6,}`? Or even `^(?=\d*[^\W\d_])(?=[^\W\d_]*\d)[^\W_]{6,}`? — Wiktor Stribiżew, Jul 18 '22 at 11:13
I just need to filter out by character numbers once it's matched. Why doesn't my regex work after I added in `{6,}`? — amnesic, Jul 18 '22 at 11:17
It makes no sense here, you quantified the whole sequence and using it with `str.extract`, you just would have extracted the last capture. — Wiktor Stribiżew, Jul 18 '22 at 11:18
So, what is that you are after? Extract the alphanumeric string with at least one digit and at least one letter at the start of string if there are 6 or more chars? — Wiktor Stribiżew, Jul 18 '22 at 11:19
Basically, `^([^\W_]{6,})` as you suggested will work. I want to extract any alphanumeric characters that are greater than 6 characters. My regex of `(\d*\w+\d+\w*)` matches the numbers just fine. But I have no idea why I couldn't add `{6,}` at the end to limit character number. — amnesic, Jul 18 '22 at 11:24
And yes they have to start with a digit and then can repeat any amount of digit or letters. — amnesic, Jul 18 '22 at 11:26

score 0 · Answer 1 · answered Jul 18 '22 at 11:29

Your way of limiting the character length is not correct; see why at "Restricting character length in a regular expression".
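To see why, note that a quantifier placed after a group repeats the whole group, not the characters it matched. Each pass through `(\d*\w+\d+\w*)` consumes at least two word characters, so `{6,}` asks for six or more back-to-back passes, that is, an unbroken run of at least 12 word characters. A minimal sketch using only Python's `re` module follows; the first two sample strings come from the question, while the 12-digit string is made up purely for illustration:

```python
import re

# The attempt from the question: {6,} repeats the whole group six or more times.
attempt = re.compile(r"(\d*\w+\d+\w*){6,}")

# Rows from the question: the longest run of word characters is 10 long,
# which is too short for six passes of at least two characters each.
samples = ["22CI003294 PARCEL 32", "9CI00208 FERNHAVEN SEC"]
attempt_hits = [attempt.search(s) for s in samples]

# Hypothetical 12-digit input (not in the question): six passes of
# "one word character followed by one digit" fit exactly.
long_hit = attempt.search("121212121212")

print(attempt_hits)        # both entries are None
print(long_hit.group(0))   # the whole 12-digit string
```

Even when such a pattern does match, the capture group only keeps the text from its last repetition, so it would still not return the full code.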

You can use

df['results'] = df['description'].str.extract(r'^(\d[^\W_]{5,})')

See the regex demo.

Details:

`^` - start of string
`(\d[^\W_]{5,})` - Group 1:
- `\d` - a digit
- `[^\W_]{5,}` - five or more letters or digits (so six or more characters in total, counting the leading digit).

If the match is not always expected at the start of the string, replace the `^` anchor with a numeric boundary, `(?<!\d)`, or a word boundary, `\b`.
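As a sketch of how the suggested pattern behaves on the data from the question, the snippet below rebuilds the `description` column and applies `str.extract`. The `expand=False` argument (so that a Series rather than a one-column DataFrame comes back) and the two rows in `moved` are assumptions added for illustration; they are not part of the original post.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({"description": [
    "22CI003294 PARCEL 32",
    "22CI400040 NORFOLK ESTATES",
    "12CI400952 & 13CI403261",
    "22CI400628 GARDEN ACRES",
    "9CI00208 FERNHAVEN SEC",
    "22CI400675 CECIL AVE SUB",
    "22CI400721 124.69' SS",
    "BOLLING AVE SS",
]})

# One digit followed by five or more letters/digits, anchored at the start.
df["results"] = df["description"].str.extract(r"^(\d[^\W_]{5,})", expand=False)

# Hypothetical rows where the code is not at the start of the string.
moved = pd.Series(["LOT 22CI400628 GARDEN ACRES", "AB22CI400628 GARDEN ACRES"])
by_word_boundary = moved.str.extract(r"\b(\d[^\W_]{5,})", expand=False)
by_digit_boundary = moved.str.extract(r"(?<!\d)(\d[^\W_]{5,})", expand=False)

print(df)
print(by_word_boundary.tolist())
print(by_digit_boundary.tolist())
```

Rows without a match come back as missing values, which corresponds to the `None` in the expected output. In the `moved` example, `\b` finds nothing in `AB22CI400628` because there is no word boundary between `B` and `2`, whereas `(?<!\d)` only requires that the preceding character is not a digit.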

answered Jul 18 '22 at 11:29

Wiktor Stribiżew

I added an additional example at the end so that it won't match pure letter words. – amnesic Jul 18 '22 at 11:31
@amnesic Yes, `^(\d[^\W_]{5,})`, `(?<!\d)(\d[^\W_]{5,})` and `\b(\d[^\W_]{5,})` do not match `BOLLING AVE SS`. The current solution still stands. – Wiktor Stribiżew Jul 18 '22 at 11:32
