
I am able to get the URLs from the search results page using the script below:

import urllib.parse

import requests
from requests_html import HTMLSession


def get_source(url):
    """Return the source code for the provided URL.

    Args:
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html.
    """
    try:
        session = HTMLSession()
        response = session.get(url)
        return response
    except requests.exceptions.RequestException as e:
        print(e)


def scrape_google(query):
    query = urllib.parse.quote_plus(query)
    response = get_source("https://www.google.co.uk/search?q=" + query)

    links = list(response.html.absolute_links)
    google_domains = ('https://www.google.',
                      'https://google.',
                      'https://webcache.googleusercontent.',
                      'http://webcache.googleusercontent.',
                      'https://policies.google.',
                      'https://support.google.',
                      'https://maps.google.',
                      'https://play.google.')

    # iterate over a copy so removals don't skip elements
    for url in links[:]:
        if url.startswith(google_domains):
            links.remove(url)

    return links

Now I want to get plain domains, without the https:// scheme, the www prefix, or anything else, like below:

wiki.org
itroasters.com

I also need to remove any duplicates.

Could anyone please help me get the expected result?

Thanks

  • This is not a web-scraping question, it's a simple text parsing question. Have you tried anything at all? Do you realize Python has a `urllib` module that includes a `parse` function? – Tim Roberts Jun 30 '23 at 05:45
  • Does this answer your question? [Extract domain name from URL in Python](https://stackoverflow.com/questions/44021846/extract-domain-name-from-url-in-python) – enricog Jun 30 '23 at 05:49
  • Thanks, Tim. I have updated the tags. Yes, I understand that urllib includes a parse function. As I'm new to Python, I couldn't find how to use that in this script. – Achyut Kumar Ch Jun 30 '23 at 05:51
  • I can understand why you might want to remove the scheme, but why might you want to remove 'www'? Remember, 'www.a.b.c' is not necessarily the same as 'a.b.c' – DarkKnight Jun 30 '23 at 06:34

1 Answer


The use-case for removing a 'www.' preamble to a netloc/path is not explained in the question and is probably unwise.

Here's a pattern that isolates the netloc/path, with or without a 'www.' preamble. This code also handles URLs that do not have a scheme. Any other prefixes that need to be removed can be added to the PREFIXES list:

from urllib.parse import urlparse
from typing import Iterator

# a list of prefixes to remove
# there's only one initially but this construct means that others
# could be added without needing to adjust the runtime code
PREFIXES = [
    'www.'
]

google_domains = [
    'https://www.google.',
    'https://google.',
    'https://webcache.googleusercontent.',
    'http://webcache.googleusercontent.',
    'https://policies.google.',
    'https://support.google.',
    'https://maps.google.',
    'https://play.google.',
]

def strip_scheme(urls: list[str], remove: bool = False) -> Iterator[str]:
    for url in urls:
        _, netloc, path, *_ = urlparse(url)
        # a URL without a scheme parses with an empty netloc,
        # so fall back to the path component in that case
        rv = netloc or path
        if remove:
            for prefix in PREFIXES:
                if rv.startswith(prefix):
                    rv = rv[len(prefix):]
                    break
        yield rv


print(set(strip_scheme(google_domains)))
print()
print(set(strip_scheme(google_domains, True)))

Output (set ordering is arbitrary, so yours may differ):

{'maps.google.', 'support.google.', 'webcache.googleusercontent.', 'google.', 'www.google.', 'policies.google.', 'play.google.'}

{'maps.google.', 'support.google.', 'webcache.googleusercontent.', 'google.', 'policies.google.', 'play.google.'}
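
To get the deduplicated plain domains the question asks for, you could feed the links returned by scrape_google through strip_scheme and collect the results in a set. A minimal sketch, assuming the question's scrape_google and the strip_scheme generator above are defined in the same module (the query string is just an example):

links = scrape_google('python url parsing')

# strip the scheme and the 'www.' prefix, then deduplicate via a set;
# sorted() just gives stable, readable output
domains = sorted(set(strip_scheme(links, remove=True)))

for domain in domains:
    print(domain)

Note that scheme-less URLs with a path (e.g. 'wiki.org/page') would still carry their path through strip_scheme, since the function yields the whole path when the netloc is empty.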
DarkKnight