Could someone explain why this works:

import pandas as pd
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie','Indomie','Indomie']
})

df['websites']= pd.Series(['http://imtiazconsultancy.co.uk', 'http://www.fidelitymortgageservices.com','https://willinghandsmi.com/','http://noname.co.za','https://nakaiindianjewelry.wixsite.com/nakaiindianjewerly',
'https://www.tranzact.net/?utm_source=google&utm_medium=organic&utm_campaign=gmb-local-listings&utm_content=charlotte-university', 'http://noname.co.ja'])
df['websites'] = df['websites'].str.extract(r"http(.*).com")
# df['websites'] = df['websites'].str.extract(r"http(.*).com|http(.*).uk|http(.*).za|http(.*).ja|http(.*).net|http(.*).site|http(.*).jp|http(.*).gov|http(.*).org|http(.*).edu")
print(df)

but this doesn't?:

import pandas as pd
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie','Indomie','Indomie']
})

df['websites']= pd.Series(['http://imtiazconsultancy.co.uk', 'http://www.fidelitymortgageservices.com','https://willinghandsmi.com/','http://noname.co.za','https://nakaiindianjewelry.wixsite.com/nakaiindianjewerly',
'https://www.tranzact.net/?utm_source=google&utm_medium=organic&utm_campaign=gmb-local-listings&utm_content=charlotte-university', 'http://noname.co.ja'])
df['websites'] = df['websites'].str.extract(r"http(.*).com|http(.*).uk|http(.*).za|http(.*).ja|http(.*).net|http(.*).site|http(.*).jp|http(.*).gov|http(.*).org|http(.*).edu")
print(df)

It currently raises `ValueError: Columns must be same length as key`. I know I'm misusing the `|` operator, but I can't figure out what I need to change. Thanks!

Justin Benfit
  • Not the current issue, but you don't need to repeat the protocol for every alternative in the regex. Just put the TLDs in an "or" group: `str.extract(r"http(.*)[.](com|uk|za|ja|net|site|jp|gov|org|edu)")`, and put `.` in a character class so it is literal. – user3783243 Jan 28 '22 at 18:40
  • It is because the regex contains more than one capturing group. So you need `r"http(.*)\.(?:com|uk|za|ja|net|site|jp|gov|org|edu)"`, i.e. you use **one capturing** group to get the text you need and any number of **non-capturing groups** to group multiple patterns. – Wiktor Stribiżew Jan 28 '22 at 18:41
  • Oh that's cool! Thanks! – Justin Benfit Jan 28 '22 at 18:41
  • Thank you @WiktorStribiżew! – Justin Benfit Jan 28 '22 at 18:45
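
For anyone landing here later, a minimal sketch of what the comments above describe. It uses a shortened URL list (the last URL is trimmed for readability, which is an assumption of this example, not the asker's exact data). With the default `expand=True`, `str.extract` returns one column per capturing group, so a pattern with several `(.*)` groups yields a multi-column DataFrame that cannot be assigned to the single `websites` column. Keeping one capturing group and moving the alternatives into a non-capturing `(?:...)` group avoids that.

```python
import pandas as pd

df = pd.DataFrame({
    "websites": [
        "http://imtiazconsultancy.co.uk",
        "http://www.fidelitymortgageservices.com",
        "https://willinghandsmi.com/",
        "https://www.tranzact.net/?utm_source=google",
    ]
})

# Three capturing groups: the result has three columns, one per group.
multi = df["websites"].str.extract(r"http(.*).com|http(.*).uk|http(.*).net")
print(multi.shape)

# One capturing group; the TLD alternatives sit in a non-capturing group,
# and the dot is escaped so it only matches a literal ".".
pattern = r"http(.*)\.(?:com|uk|za|ja|net|site|jp|gov|org|edu)"
single = df["websites"].str.extract(pattern)
print(single.shape)

# expand=False returns a Series when there is exactly one group,
# which can be assigned straight back to the column.
df["websites"] = df["websites"].str.extract(pattern, expand=False)
print(df)
```

Note that the captured text still starts with `://` or `s://`, just as it does with the original `http(.*).com` pattern; tightening that is a separate question.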
