removing first four and last four characters of strings in list, OR removing specific character patterns

Question

I am brand new to Python and have been working with it for a few weeks. I have a list of strings and want to remove the first four and last four characters of each string or, alternatively, remove specific character patterns (not just specific characters).

I have been looking through the archives here but can't seem to find a question that matches this one. Most of the solutions I have found are better suited to removing specific characters.

Here's the list of strings I'm working with:

sites=['www.hattrick.com', 'www.google.com', 'www.wampum.net', 'www.newcom.com']

What I am trying to do is isolate the domain names and get:

[hattrick, google, wampum, newcom]

This question is NOT about isolating domain names from URLs (I have seen the questions about that), but rather about editing specific characters in strings in lists based upon location or pattern.

So far, I've tried .split, .translate, and .strip, but these don't seem appropriate for what I am trying to do: they either remove too many characters that match the search, aren't good at recognizing a specific pattern or grouping of characters, or cannot work with the location of characters within a string.
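As a small illustration of the "removes too many characters" problem with .strip: its argument is treated as a set of characters to trim from both ends, not as a prefix or suffix to match as a unit. The sample string is taken from the list above; the variable names are only for illustration.

```python
# str.strip(chars) trims any leading/trailing characters that appear in `chars`.
# It does not look for the substring 'www.' as a whole.
site = 'www.wampum.net'
stripped = site.strip('www.')

# The leading 'w' of 'wampum' is trimmed too, because 'w' is in the set.
print(stripped)  # ampum.net
```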

Any questions and suggestions are greatly appreciated, and I apologize if I'm asking this question the wrong way etc.

To get split to work for strings in a list: [string.split(delimiter) for string in list] — Jordan, Aug 06 '12 at 17:35

Kevin · Accepted Answer · 2012-08-07T11:43:45.813

15

def remove_cruft(s):
    return s[4:-4]

sites=['www.hattrick.com', 'www.google.com', 'www.wampum.net', 'www.newcom.com']
[remove_cruft(s) for s in sites]

result:

['hattrick', 'google', 'wampum', 'newcom']

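If you know all of the substrings you want to strip out, you can use replace to get rid of them. This is useful if you're not sure that all of your URLs will start with "www.", or if the TLD isn't always three characters long. Note that replace removes every occurrence of a substring, wherever it appears in the string, not just at the ends.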

def remove_bad_substrings(s):
    bad_substrings = ["www.", ".com", ".net", ".museum"]
    for bad_substring in bad_substrings:
        s = s.replace(bad_substring, "")
    return s

sites=['www.hattrick.com', 'www.google.com', 
'www.wampum.net', 'www.newcom.com', 'smithsonian.museum']
[remove_bad_substrings(s) for s in sites]

result:

['hattrick', 'google', 'wampum', 'newcom', 'smithsonian']
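If matches in the middle of a string should be left alone, one possible variation is to trim only at the start and end. The following is a minimal sketch under that assumption. The function name strip_affixes and its default prefix/suffix tuples are hypothetical, chosen to mirror the list above.

```python
def strip_affixes(s, prefixes=("www.",), suffixes=(".com", ".net", ".museum")):
    """Remove at most one known prefix and one known suffix from s."""
    for prefix in prefixes:
        if s.startswith(prefix):
            s = s[len(prefix):]
            break
    for suffix in suffixes:
        if s.endswith(suffix):
            s = s[:-len(suffix)]
            break
    return s

sites = ['www.hattrick.com', 'www.google.com',
         'www.wampum.net', 'www.newcom.com', 'smithsonian.museum']
result = [strip_affixes(s) for s in sites]
print(result)  # ['hattrick', 'google', 'wampum', 'newcom', 'smithsonian']

# An inner '.com' survives here, whereas str.replace would also remove it.
print(strip_affixes('www.dot.community.com'))  # dot.community
```

On Python 3.9 and later, str.removeprefix and str.removesuffix express the same idea more directly.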

edited Aug 07 '12 at 11:43

answered Aug 06 '12 at 17:27


2

why not just ``[s[4:-4] for s in sites]``? function seems overkill – jterrace Aug 06 '12 at 17:28
4

@jterrace, as OP is admittedly a beginner, I wanted the answer to be useful even if he does not know how list comprehensions work. Even if the last line is incomprehensible to him, he can still understand that `removeCruft` is doing the work he wants. – Kevin Aug 06 '12 at 17:33
1

Python recommends CamelCase only for classes, I would recommend you use names_with_underscores for funcs next time ;) – jamylak Aug 07 '12 at 04:07

score 5 · Answer 2 · answered Aug 06 '12 at 17:33

You could use the third-party tldextract module, which is much more robust than parsing the strings yourself:

>>> sites=['www.hattrick.com', 'google.co.uk',
           'apps.s3.stackoverflow.com', 'whitehouse.gov']
>>> import tldextract
>>> [tldextract.extract(s).domain for s in sites]
['hattrick', 'google', 'stackoverflow', 'whitehouse']

alan · Answer 3 · 2012-08-06T17:40:29.700

2

Is this what you mean?

>>> sites=['nosubdomain.net', 'ohcanada.ca', 'www.hattrick.com', 'www.google.com', 'www.wampum.net', 'www.newcom.com']
>>> print([x.split('.')[-2] for x in sites])
['nosubdomain', 'ohcanada', 'hattrick', 'google', 'wampum', 'newcom']
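Taking the second-to-last piece assumes the name ends in a single-label suffix such as .com or .ca. A short sketch of where that assumption holds and where it does not follows. The helper name second_to_last_label is hypothetical, and 'google.co.uk' and 'localhost' are illustrative inputs.

```python
def second_to_last_label(host):
    """Return the piece before the final dot-separated label, or None."""
    parts = host.split('.')
    return parts[-2] if len(parts) >= 2 else None

print(second_to_last_label('www.hattrick.com'))  # hattrick
print(second_to_last_label('google.co.uk'))      # co (not 'google')
print(second_to_last_label('localhost'))         # None (no dot to split on)
```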

edited Aug 06 '12 at 17:40

answered Aug 06 '12 at 17:34


Qiau · Answer 4 · 2012-08-06T17:34:35.360

1

Going by your title, this is an answer, though it may not be what you are looking for:

for site in sites:
    print(site[:4])   # first four characters, e.g. 'www.'
    print(site[-4:])  # last four characters, e.g. '.com' or '.net'

You could also use a regex:

import re
re.sub(r'^www\.', '', sites[0])  # removes a leading 'www.' if present
re.sub(r'\.\w+$', '', sites[0])  # removes the last dot and everything after it
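The two substitutions can also be combined into one pattern and applied to the whole list. This is a sketch only: the name AFFIXES is made up for illustration, and the sites list is copied from the question.

```python
import re

# Matches a leading 'www.' or the final dot plus the word characters after it.
AFFIXES = re.compile(r'^www\.|\.\w+$')

sites = ['www.hattrick.com', 'www.google.com', 'www.wampum.net', 'www.newcom.com']
domains = [AFFIXES.sub('', site) for site in sites]
print(domains)  # ['hattrick', 'google', 'wampum', 'newcom']
```

Because both alternatives are anchored, text in the middle of the string is never touched.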

edited Aug 06 '12 at 17:34

answered Aug 06 '12 at 17:28


thanks for the regex example - had been too chicken to look into it. – egon Aug 06 '12 at 18:04

score 0 · Answer 5 · edited May 23 '17 at 10:31

I'm not clear about your requirements for removing specific characters, but if all you want to do is remove the first and last four characters, you can use Python's built-in slicing:

s = s[4:-4]

This will give you the substring starting at index 4, up to but not including the 4th-last index of the string.
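To make the index arithmetic concrete, here is a brief sketch using one of the strings from the question. The variable names are only for illustration.

```python
s = 'www.hattrick.com'

# A negative stop index counts from the end, so -4 is the same as len(s) - 4.
middle = s[4:-4]
same = s[4:len(s) - 4]
print(middle)  # hattrick
print(same)    # hattrick

# Slicing never raises on short strings; it just returns an empty string.
short = 'abc'
print(repr(short[4:-4]))  # ''
```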

EDIT: here is a good question that provides lots of info about python's slice notation.
