
I am brand new to Python and have been working with it for a few weeks. I have a list of strings and want to remove the first four and last four characters of each string. Alternatively, I would like to remove specific character patterns (not just specific characters).

I have been looking through the archives here but don't seem to find a question that matches this one. Most of the solutions I have found are better suited to removing specific characters.

Here's the list of strings I'm working with:

sites=['www.hattrick.com', 'www.google.com', 'www.wampum.net', 'www.newcom.com']

What I am trying to do is to isolate the domain names and get

[hattrick, google, wampum, newcom]

This question is NOT about isolating domain names from URLs (I have seen the questions about that), but rather about editing specific characters in strings in lists based upon location or pattern.

So far, I've tried .split, .translate, .strip but these don't seem appropriate for what I am trying to do because they either remove too many characters that match the search, aren't good for recognizing a specific pattern/grouping of characters, or cannot work with the location of characters within a string.

Any questions and suggestions are greatly appreciated, and I apologize if I'm asking this question the wrong way etc.

egon

5 Answers

def remove_cruft(s):
    return s[4:-4]

sites=['www.hattrick.com', 'www.google.com', 'www.wampum.net', 'www.newcom.com']
[remove_cruft(s) for s in sites]

result:

['hattrick', 'google', 'wampum', 'newcom']

If you know all of the strings you want to strip out, you can use replace to get rid of them. This is useful if you're not sure that all of your URLs will start with "www.", or if the TLD isn't three characters long.

def remove_bad_substrings(s):
    badSubstrings = ["www.", ".com", ".net", ".museum"]
    for badSubstring in badSubstrings:
        s = s.replace(badSubstring, "")
    return s

sites=['www.hattrick.com', 'www.google.com', 
'www.wampum.net', 'www.newcom.com', 'smithsonian.museum']
[remove_bad_substrings(s) for s in sites]

result:

['hattrick', 'google', 'wampum', 'newcom', 'smithsonian']
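
One caveat: replace removes a substring wherever it occurs, not only at the ends of the string. If that matters, a minimal sketch along these lines might help. It assumes you only want to trim one known prefix and one known suffix, and the names strip_affixes, prefixes and suffixes are made up for illustration:

```python
def strip_affixes(s, prefixes=("www.",), suffixes=(".com", ".net", ".museum")):
    """Remove at most one known prefix and one known suffix from s."""
    for prefix in prefixes:
        if s.startswith(prefix):
            s = s[len(prefix):]   # cut only at the start of the string
            break
    for suffix in suffixes:
        if s.endswith(suffix):
            s = s[:-len(suffix)]  # cut only at the end of the string
            break
    return s

sites = ['www.hattrick.com', 'www.wampum.net', 'shop.community.com']
trimmed = [strip_affixes(s) for s in sites]
print(trimmed)  # ['hattrick', 'wampum', 'shop.community']

# For comparison, replace also hits the ".com" inside ".community":
naive = 'shop.community.com'.replace(".com", "")
print(naive)    # shopmunity
```

This combines slicing with startswith/endswith, so the position of the match is checked before anything is removed.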
Kevin
  • 2
    why not just ``[s[4:-4] for s in sites]``? function seems overkill – jterrace Aug 06 '12 at 17:28
  • 4
    @jterrace, as OP is admittedly a beginner, I wanted the answer to be useful even if he does not know how list comprehensions work. Even if the last line is incomprehensible to him, he can still understand that `removeCruft` is doing the work he wants. – Kevin Aug 06 '12 at 17:33
  • 1
    Python recommends CamelCase only for classes, I would recommend you use names_with_underscores for funcs next time ;) – jamylak Aug 07 '12 at 04:07

You could use the tldextract module, which is much more robust than parsing the strings yourself:

>>> sites=['www.hattrick.com', 'google.co.uk',
...        'apps.s3.stackoverflow.com', 'whitehouse.gov']
>>> import tldextract
>>> [tldextract.extract(s).domain for s in sites]
['hattrick', 'google', 'stackoverflow', 'whitehouse']
jterrace

Is this what you mean:

>>> sites=['nosubdomain.net', 'ohcanada.ca', 'www.hattrick.com', 'www.google.com', 'www.wampum.net', 'www.newcom.com']
>>> print([x.split('.')[-2] for x in sites])
['nosubdomain', 'ohcanada', 'hattrick', 'google', 'wampum', 'newcom']
alan

Reading your subject, this is an answer, but maybe not what you are looking for.

for site in sites:
    print(site[:4]) # www.
    print(site[-4:]) # .com / .net / ...

You could also use regex:

import re
re.sub(r'^www\.', '', sites[0])  # removes a leading 'www.' if it exists
re.sub(r'\.\w+$', '', sites[0])  # removes the last dot and the chars after it
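
To apply both substitutions to every string in the list, the two patterns could be joined with | and used in a list comprehension. This is only a sketch: the name AFFIX is hypothetical, and it assumes each site has a single-label ending such as .com or .net.

```python
import re

# Hypothetical name: matches a leading 'www.' or the final '.label' of a string.
AFFIX = re.compile(r'^www\.|\.\w+$')

sites = ['www.hattrick.com', 'www.google.com', 'www.wampum.net', 'www.newcom.com']

# sub() replaces every match, so both ends are trimmed in one pass.
domains = [AFFIX.sub('', s) for s in sites]
print(domains)  # ['hattrick', 'google', 'wampum', 'newcom']
```

Note that a name like 'google.co.uk' would come out as 'google.co', since only the last dotted label is removed.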
Qiau

I'm not clear about your requirements for removing specific characters, but if all you want to do is remove the first and last four characters, you can use Python's built-in slicing:

s = s[4:-4]

This will give you the substring starting at index 4, up to but not including the 4th-last index of the string.
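
As a rough illustration of how the two bounds behave on their own and together (the variable names here are just for the example):

```python
s = 'www.hattrick.com'

head_removed = s[4:]    # drop the first four characters
tail_removed = s[:-4]   # drop the last four characters
both_removed = s[4:-4]  # drop both at once

print(head_removed)  # hattrick.com
print(tail_removed)  # www.hattrick
print(both_removed)  # hattrick

# Slicing never raises on short strings; it just returns an empty string.
too_short = 'abc'[4:-4]
print(repr(too_short))  # ''
```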

EDIT: here is a good question that provides lots of info about Python's slice notation.

Gordon Bailey