
I'm trying to extract URLs from text, but I only get the last part, such as "com", instead of the full "amazon.com" or "google.com". I'm using the following regex:

import re
import pandas as pd

data = [['website is amazon.com'], ['url is google.com']]
reviews = pd.DataFrame(data, columns=['ALL_TEXT'])
reviews['regex_match'] = reviews['ALL_TEXT'].str.extract(r'[^@A-Z][-A-Z0-9:%_\+~#=]+\.(CO|COM|NET|ORG|GOV)\b', flags=re.IGNORECASE)

I tried wrapping the full regex in a capture group:

reviews['regex_match'] = reviews['ALL_TEXT'].str.extract(r'([^@A-Z][-A-Z0-9:%_\+~#=]+\.(CO|COM|NET|ORG|GOV)\b)', flags=re.IGNORECASE)

but then I get this error:

Wrong number of items passed 2, placement implies 1
user3242036

2 Answers


The error means that you are assigning the result of Series.str.extract to a single column (reviews['regex_match']), but your regex contains two capturing groups, so it returns two columns.
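To see where the two columns come from, you can count the capturing groups in each pattern with the standard re module. This is only a sketch: group_count is a hypothetical helper, and the "fixed" pattern is just the wrapped pattern with the inner group made non-capturing. Series.str.extract produces one column per capturing group, so the count below is the number of columns you would get back.

```python
import re

# Patterns from the question; only the grouping differs between them.
original = r'[^@A-Z][-A-Z0-9:%_\+~#=]+\.(CO|COM|NET|ORG|GOV)\b'
wrapped = r'([^@A-Z][-A-Z0-9:%_\+~#=]+\.(CO|COM|NET|ORG|GOV)\b)'
# Illustrative variant: same as "wrapped", inner group made non-capturing.
fixed = r'([^@A-Z][-A-Z0-9:%_\+~#=]+\.(?:CO|COM|NET|ORG|GOV)\b)'


def group_count(pattern):
    """Hypothetical helper: number of capturing groups in a pattern."""
    return re.compile(pattern, flags=re.IGNORECASE).groups


counts = {
    'original': group_count(original),
    'wrapped': group_count(wrapped),
    'fixed': group_count(fixed),
}
print(counts)  # {'original': 1, 'wrapped': 2, 'fixed': 1}
```

The wrapped pattern reports two groups, which matches the "passed 2, placement implies 1" part of the error message.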

You can use

>>> reviews['ALL_TEXT'].str.extract(r'(?<![@A-Z])([-A-Z0-9:%_+~#=]+\.(?:COM?|NET|ORG|GOV))\b', flags=re.I)
            0
0  amazon.com
1  google.com

Details:

  • (?<![@A-Z]) - a negative lookbehind that fails the match if there is a @ or an ASCII letter immediately to the left of the current location

  • ([-A-Z0-9:%_+~#=]+\.(?:COM?|NET|ORG|GOV)) - Capturing group 1 (this will be returned by Series.str.extract):

    • [-A-Z0-9:%_+~#=]+ - one or more ASCII letters/digits, -, :, %, _, +, ~, # or = chars
    • \. - a . char
    • (?:COM?|NET|ORG|GOV) - a non-capturing group matching co, com, net, org or gov
  • \b - a word boundary.

So, use a single capturing group for the value that goes into the single column on the left of the = operator, and use non-capturing groups whenever you need to group two or more alternatives.
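The pattern can also be tried outside pandas with plain re, which makes the role of each part easier to check. In this sketch, first_domain is a hypothetical helper, and the last two sample strings are made up for illustration (they are not from the question).

```python
import re

# Same pattern as above: lookbehind, one capturing group, word boundary.
PATTERN = re.compile(
    r'(?<![@A-Z])([-A-Z0-9:%_+~#=]+\.(?:COM?|NET|ORG|GOV))\b',
    flags=re.IGNORECASE,
)


def first_domain(text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


samples = [
    'website is amazon.com',   # from the question
    'url is google.com',       # from the question
    'mail bob@amazon.com',     # made-up: the lookbehind rejects text after @
    'see example.co',          # made-up: COM? also accepts "co"
]
results = [first_domain(s) for s in samples]
print(results)  # ['amazon.com', 'google.com', None, 'example.co']
```

The e-mail-like sample returns None because every possible start inside "amazon" is preceded by either @ or a letter, which is exactly what the negative lookbehind forbids.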

Wiktor Stribiżew

You get that error because the pattern contains 2 capture groups. You can turn the group for the extensions into a non-capturing group with (?: and keep a single capture group around the full pattern.

([^@A-Z][-A-Z0-9:%_+~#=]+\.(?:COM?|NET|ORG|GOV))\b
                           |__________________|
                             Non capture group
|______________________________________________|
                  Capture group

The updated code could look like

reviews['regex_match'] = reviews['ALL_TEXT'].str.extract(
    r'([^@A-Z][-A-Z0-9:%_+~#=]+\.(?:COM?|NET|ORG|GOV))\b',
    flags=re.IGNORECASE
)

Output

                ALL_TEXT  regex_match
0  website is amazon.com   amazon.com
1      url is google.com   google.com
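One detail that may be worth checking with this variant: [^@A-Z] consumes a character, so the captured value can start with the character before the domain (here a space), and a domain at the very start of the text has nothing to consume. The printed frame right-aligns values, so a leading space would not be visible there. The sketch below compares the consuming class with a lookbehind using plain re; capture is a hypothetical helper.

```python
import re

# Variant with a consuming negated class inside the capture group.
consuming = re.compile(
    r'([^@A-Z][-A-Z0-9:%_+~#=]+\.(?:COM?|NET|ORG|GOV))\b', re.IGNORECASE
)
# Variant with a zero-width negative lookbehind before the capture group.
lookbehind = re.compile(
    r'(?<![@A-Z])([-A-Z0-9:%_+~#=]+\.(?:COM?|NET|ORG|GOV))\b', re.IGNORECASE
)


def capture(regex, text):
    """Hypothetical helper: group 1 of the first match, or None."""
    match = regex.search(text)
    return match.group(1) if match else None


with_space = capture(consuming, 'website is amazon.com')
without_space = capture(lookbehind, 'website is amazon.com')
at_start_consuming = capture(consuming, 'amazon.com')
at_start_lookbehind = capture(lookbehind, 'amazon.com')

print(repr(with_space))           # ' amazon.com'
print(repr(without_space))        # 'amazon.com'
print(repr(at_start_consuming))   # None
print(repr(at_start_lookbehind))  # 'amazon.com'
```

If you keep the consuming class, trimming the result afterwards (for example with .str.strip() on the new column) is one way to drop the extra character when it is whitespace.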
The fourth bird