0

I have a list of URLs in formats such as "www.blah.com/en-us", and I need to cut off everything after "www.blah.com". I've tried the following:

import re
website = www.blah.com/en-us
cleanURL = re.sub('(.|\n)*?com', "", website)

Output: 'en-us'

So I'm getting the opposite of what I want. Sorry if this post isn't formatted correctly; this is my first time asking a question.

user94559
  • Strange, when I run your code, I don't get `en-us`, I get `NameError: name 'www' is not defined`. Are you sure this is the exact code you're running? – Kevin Jul 06 '17 at 19:30
  • Possibly a duplicate of https://stackoverflow.com/questions/27745/getting-parts-of-a-url-regex – Evan Wise Jul 06 '17 at 19:32

2 Answers

4

How about just using

website = "www.blah.com/en-us"
cleanURL = website.split("/",1)[0]

?
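Since the question mentions a list of URLs, here is a rough sketch of the same split applied to every entry. The sample list and the helper name `strip_path` are made up for illustration; only the first URL comes from the question.

```python
# Hypothetical sample data; the question only shows "www.blah.com/en-us".
urls = ["www.blah.com/en-us", "www.blah.com", "www.example.org/a/b/c"]

def strip_path(url):
    """Return everything before the first '/' (the whole string if there is none)."""
    return url.split("/", 1)[0]

clean_urls = [strip_path(u) for u in urls]
print(clean_urls)  # ['www.blah.com', 'www.blah.com', 'www.example.org']
```

Passing `1` as the second argument to `split` stops after the first separator, so long paths aren't split further than needed.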

Fulgen
2

Is regex a must? If the URLs you're processing have no protocol (e.g. http://), you could just use your_url_string.split('/', 1)[0], which splits on the first '/' and gives you the part before it.
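If some of the URLs might carry a protocol, the plain split would return something like 'http:' instead of the host. One possible way around that, sketched here with the standard library's `urllib.parse.urlsplit`, is shown below. The helper name `host_only` is hypothetical, and the approach assumes the inputs are either bare hosts or full URLs.

```python
from urllib.parse import urlsplit

def host_only(url):
    """Return just the host part, whether or not a protocol is present."""
    # urlsplit only fills in netloc when the string has a scheme or
    # starts with "//", so bare hosts get "//" prepended first.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    return urlsplit(url).netloc

print(host_only("www.blah.com/en-us"))         # www.blah.com
print(host_only("http://www.blah.com/en-us"))  # www.blah.com
```

Note that `netloc` also includes a port or user info if the URL contains them, so this may need adjusting for such inputs.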

Andrew Zick