Regex extract full capture group

Question

I'm trying to extract URLs, but I only get the last portion (e.g. "com") instead of the full "amazon.com" or "google.com". I'm using the following regex:

import re
import pandas as pd

data = [['website is amazon.com'], ['url is google.com']]
reviews = pd.DataFrame(data, columns=['ALL_TEXT'])
reviews['regex_match'] = reviews['ALL_TEXT'].str.extract(r'[^@A-Z][-A-Z0-9:%_\+~#=]+\.(CO|COM|NET|ORG|GOV)\b', flags=re.IGNORECASE)

I tried to use a capture group around the full regex

reviews['regex_match'] = reviews['ALL_TEXT'].str.extract(r'([^@A-Z][-A-Z0-9:%_\+~#=]+\.(CO|COM|NET|ORG|GOV)\b)', flags=re.IGNORECASE)

but I get the error

Wrong number of items passed 2, placement implies 1

Wiktor Stribiżew · Accepted Answer · 2021-01-21T22:09:43.017

The error means that you are assigning the result of Series.str.extract to a single column (reviews['regex_match']), but your regex contains two capturing groups. Series.str.extract returns one column per capturing group, so you are telling it to return two columns.
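
As a quick check of that explanation, the capturing groups in the two patterns from the question can be counted with the standard re module. This is a minimal sketch; the variable names are illustrative and not part of the original post.

```python
import re

# Patterns copied from the question.
first_attempt = r'[^@A-Z][-A-Z0-9:%_\+~#=]+\.(CO|COM|NET|ORG|GOV)\b'
second_attempt = r'([^@A-Z][-A-Z0-9:%_\+~#=]+\.(CO|COM|NET|ORG|GOV)\b)'

# Pattern.groups reports how many capturing groups a compiled pattern has.
first_groups = re.compile(first_attempt, flags=re.IGNORECASE).groups
second_groups = re.compile(second_attempt, flags=re.IGNORECASE).groups

# With the first pattern, the only captured text is the extension.
only_extension = re.search(
    first_attempt, 'website is amazon.com', flags=re.IGNORECASE
).group(1)

print(first_groups)    # 1
print(second_groups)   # 2
print(only_extension)  # com
```

One group yields one column (holding just the extension), while two groups yield two columns, which cannot be assigned to the single reviews['regex_match'] column.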

You can use

>>> reviews['ALL_TEXT'].str.extract(r'(?<![@A-Z])([-A-Z0-9:%_+~#=]+\.(?:COM?|NET|ORG|GOV))\b', flags=re.I)
            0
0  amazon.com
1  google.com

Details:

(?<![@A-Z]) - a negative lookbehind that fails the match if there is a @ or an ASCII letter immediately to the left of the current location
([-A-Z0-9:%_+~#=]+\.(?:COM?|NET|ORG|GOV)) - Capturing group 1 (this will be returned by Series.str.extract):
- [-A-Z0-9:%_+~#=]+ - one or more ASCII letters/digits, -, :, %, _, +, ~, # or = chars
- \. - a . char
- (?:COM?|NET|ORG|GOV) - a non-capturing group matching co, com, net, org or gov
\b - a word boundary.
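
The breakdown above can be tried outside pandas with plain re.search. This is a sketch under the assumption that the behaviour on single strings mirrors what Series.str.extract does per row; the third sample string is made up to exercise the lookbehind.

```python
import re

# Pattern from the answer: one capturing group, extensions in a non-capturing group.
pattern = re.compile(
    r'(?<![@A-Z])([-A-Z0-9:%_+~#=]+\.(?:COM?|NET|ORG|GOV))\b',
    flags=re.IGNORECASE,
)

samples = [
    'website is amazon.com',
    'url is google.com',
    'mail me at user@amazon.com',  # hypothetical input, not from the question
]

results = []
for text in samples:
    match = pattern.search(text)
    # group(1) is the single capturing group; None when nothing matches.
    results.append(match.group(1) if match else None)

print(results)  # ['amazon.com', 'google.com', None]
```

The last sample yields no match because every candidate start position inside the email address is preceded by either @ or a letter, which the lookbehind rejects.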

So, you only use a single capturing group to return the value for the single column you define to the left of the = operator, and if you need to group any two or more patterns, you just use non-capturing groups.

The fourth bird · Answer 2 · 2021-01-21T22:40:00.857

You get that error because the pattern contains 2 capture groups. You can use a non-capturing group, (?:...), for the extensions and a single capture group around the full pattern.

([^@A-Z][-A-Z0-9:%_+~#=]+\.(?:COM?|NET|ORG|GOV))\b
                           |__________________|
                             Non capture group
|______________________________________________|
                  Capture group

The updated code could look like

reviews['regex_match'] = reviews['ALL_TEXT'].str.extract(
    r'([^@A-Z][-A-Z0-9:%_+~#=]+\.(?:COM?|NET|ORG|GOV))\b',
    flags=re.IGNORECASE
)

Output

                ALL_TEXT  regex_match
0  website is amazon.com   amazon.com
1      url is google.com   google.com
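
One detail worth checking in the output above: [^@A-Z] consumes a character, so the captured value appears to include the preceding space, whereas the lookbehind variant from the accepted answer does not. A minimal sketch using plain re to compare the two (variable names are illustrative):

```python
import re

text = 'website is amazon.com'

# Pattern from this answer: [^@A-Z] is part of the captured text.
consuming = re.compile(
    r'([^@A-Z][-A-Z0-9:%_+~#=]+\.(?:COM?|NET|ORG|GOV))\b',
    flags=re.IGNORECASE,
)
# Pattern from the accepted answer: the lookbehind checks without consuming.
lookbehind = re.compile(
    r'(?<![@A-Z])([-A-Z0-9:%_+~#=]+\.(?:COM?|NET|ORG|GOV))\b',
    flags=re.IGNORECASE,
)

with_space = consuming.search(text).group(1)
without_space = lookbehind.search(text).group(1)

print(repr(with_space))          # ' amazon.com'
print(repr(without_space))       # 'amazon.com'
print(repr(with_space.strip()))  # 'amazon.com'
```

If the leading space matters, one option in pandas is to chain .str.strip() onto the extracted column, or to switch to the lookbehind form.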
