1

I need a regex that will give me the following results from each example and I can't seem to get it right:

example.com yields -> nothing / empty

www.example.com yields -> nothing / empty

account.example.com yields -> account

mywww.example.com yields -> mywww

wwwboys.example.com yields -> wwwboys

cool-www.example.com yields -> cool-www

So it doesn't matter whether the subdomain contains 'www', as long as it isn't exactly 'www'. The subdomain can also contain hyphens.

orokusaki

3 Answers

1
x="""example.com yields -> nothing / empty

www.example.com yields -> nothing / empty

account.example.com yields -> account

mywww.example.com yields -> mywww

wwwboys.example.com yields -> wwwboys

cool-www.example.com yields -> cool-www"""

>>> import re
>>> re.findall(r"^([A-Za-z0-9-]+)\.(?<!^www\.)[A-Za-z0-9-]+\.[A-Za-z]+", x, re.MULTILINE)
['account', 'mywww', 'wwwboys', 'cool-www']
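For checking one hostname at a time rather than scanning a multi-line string, the same idea can be sketched with a negative lookahead at the start of the pattern. This is only an illustration: the names `SUBDOMAIN_RE` and `extract_subdomain` are made up here, and the pattern assumes the host has exactly three dot-separated labels.

```python
import re

# Hypothetical names, not part of the answer above.
# (?!www\.) rejects hosts whose first label is exactly "www";
# the trailing $ ensures the whole host has three labels.
SUBDOMAIN_RE = re.compile(r"^(?!www\.)([A-Za-z0-9-]+)\.[A-Za-z0-9-]+\.[A-Za-z]+$")

def extract_subdomain(host):
    """Return the subdomain of a three-label host, or "" if there is none."""
    match = SUBDOMAIN_RE.match(host)
    return match.group(1) if match else ""

for host in ("example.com", "www.example.com", "cool-www.example.com"):
    print(repr(host), "->", repr(extract_subdomain(host)))
```

Because the lookahead only rejects `www.` followed by a dot, labels such as `wwwboys` or `mywww` still pass.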
YOU
1
mystrings="""
example.com
www.example.com
account.example.com
mywww.example.com
wwwboys.example.com
cool-www.example.com
"""

junk = ["example.com", "www.example.com"]
for url in mystrings.split("\n"):
    url = url.strip()
    # Skip blank lines and hosts with no usable subdomain.
    if url and url not in junk:
        print("-->", url.split(".", 2)[0])

output

$ ./python.py
--> account
--> mywww
--> wwwboys
--> cool-www
ghostdog74
  • @ghostdog74 +1 Wow, so no re module needed? Thanks. – orokusaki Mar 10 '10 at 04:35
  • @ghostdog74 On second thought, that's way better. Now I can have a configuration setting to add more default not-allowed subdomains (like `api.example.com`, etc). – orokusaki Mar 10 '10 at 04:36
  • This fails for other input, like "www.google.com" or "www.stackoverflow.com", because it doesn't really check if the subdomain is "www". –  Mar 10 '10 at 04:57
  • Right, but the OP's sample strings are just that: samples, all using "example.com". For all you know, "example.com" may stand in for "google.com". – ghostdog74 Mar 10 '10 at 04:59
  • @Roger @ghost It's OK. It showed me what I needed to make my function. I'll post an extra answer below with my solution built on this. – orokusaki Mar 10 '10 at 05:33
0

Here's my solution based on ghostdog74's example:

OFF_LIMITS = ('api', 'www', 'secure', 'account')

def get_safe_subdomain_or_none(host):
    subdomain = None
    L = host.split('.')
    if len(L) is 3 and not L[0] in OFF_LIMITS:  # 3 ensures that you don't have a sub-sub domain, and that you don't have just `example.com`
        subdomain = L[0]
    return subdomain
orokusaki
  • Use == instead of *is* with numbers. What about "www.blah.example.com"? –  Mar 10 '10 at 05:54
  • @Roger In the case of `www.blah.example.com`, it returns `None` as it should, but I could modify it to sort out sub-sub domains. Also, I only used `is` instead of `==` because `is` is sort of like `===` and I know that it needs to be exactly `3`. Is that frowned upon in the Python world for esoteric style reasons or is it bad practice? Either way, I can change it. – orokusaki Mar 10 '10 at 17:07
  • I asked about www.blah... because it wasn't clear to me what behavior you wanted in that case. *is* is not sort of like ===; *is* checks object identity, while === (in other languages) checks value and type. You will almost exclusively use *is* with None and similar singletons. –  Mar 11 '10 at 05:39
  • @Roger Roger that. I read that before, but for some reason it didn't stay very long. Thank you. – orokusaki Mar 11 '10 at 16:56
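Taking the comments above on board, a minimal sketch of the same function with `==` in place of `is` might look like the following. The logic is otherwise unchanged from the answer. `len(L) is 3` happens to work in CPython only because small integers are cached, and Python 3.8+ emits a `SyntaxWarning` for `is` with a literal.

```python
OFF_LIMITS = ("api", "www", "secure", "account")

def get_safe_subdomain_or_none(host):
    """Return the first label of a three-label host unless it is reserved."""
    labels = host.split(".")
    # == compares values; `is` compares object identity and is meant
    # for singletons such as None.
    if len(labels) == 3 and labels[0] not in OFF_LIMITS:
        return labels[0]
    return None

for host in ("mywww.example.com", "www.example.com", "www.blah.example.com"):
    print(host, "->", get_safe_subdomain_or_none(host))
```

Because only the first label is compared, this also returns `None` for hosts such as `www.google.com`, which addresses the objection raised against the whole-host `junk` list.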