
I'm completely new to Python and could really use some assistance.

I'm trying to parse a webpage and retrieve the email addresses from it. I've tried many things that I've read online, and all of them failed.

I realized that when I run BeautifulSoup(browser.page_source), it brings the source code through. However, for some reason it doesn't bring the email addresses or the business profiles with it.

Below is my code (don't judge :-))

import os, random, sys, time

from urllib.parse import urlparse

from selenium import webdriver

from bs4 import BeautifulSoup

from webdriver_manager.chrome import ChromeDriverManager

import lxml

browser = webdriver.Chrome('./chromedriver.exe')

url = ('https://www.yellowpages.co.za/search?what=accountant&where=cape+town&pg=1')
browser.get(url)

soup = BeautifulSoup(browser.page_source, 'lxml')  # Keep the parsed page and name the parser explicitly

Sidenote: My goal is to navigate the webpages based on search criteria and parse each page for the email addresses. I've figured out how to navigate the webpages and send keys; it's just the parsing that I'm stuck with. Your help would be greatly appreciated.

Noob
  • Does this answer your question? [Parsing Web Page's Search Results With Python](https://stackoverflow.com/questions/15044563/parsing-web-pages-search-results-with-python) – picklu May 16 '20 at 13:55

1 Answer

I recommend using the requests module to get the page source:

from requests import get

url = 'https://www.yellowpages.co.za/search?what=accountant&where=cape+town&pg=1'
src = get(url).text  # Gets the Page Source

After that, I searched for email-formatted strings and added them to a list:

src = src.split('<body>')[1]  # Splits it and gets the <body> part

emails = []

for ind, char in enumerate(src):
    if char == '@':
        email = char  # The email being built, starting from the '@'

        # Walk forward from the '@' until a delimiter or the end of the source
        pos = ind + 1
        while pos < len(src) and src[pos] not in '<>":':
            email += src[pos]  # Add to email
            pos += 1

        if '.' not in email or email.endswith('.'):  # This means that the email is
            continue                                 # not fully in the page

        # Walk backward from the '@' until a delimiter or the start of the source
        pos = ind - 1
        while pos >= 0 and src[pos] not in '<>":':
            email = src[pos] + email  # Add to email
            pos -= 1

        emails.append(email)
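
As a more compact alternative to the character-by-character scan above, a regular expression can do the same search in one call. This is only a sketch: the names EMAIL_RE, extract_emails and sample are made up for illustration, and the pattern is deliberately simple rather than a complete email validator, so it may miss unusual addresses or match strings that are not real addresses.

```python
import re

# Simplified pattern (an assumption for illustration, not a full validator):
# a local part, an '@', and a domain containing at least one dot.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(html):
    """Return every substring of html that looks like an email address."""
    return EMAIL_RE.findall(html)


# Hypothetical page fragment, used here instead of the real page source
sample = (
    '<a href="mailto:info@example.com">info@example.com</a>'
    '<p>sales@example.co.za</p>'
)
found = extract_emails(sample)
print(found)
```

With the real page you would pass src to extract_emails instead of sample. Like the loop above, this can only find addresses that are actually present in the text it is given.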

Finally, you can use set to remove duplicates and print the emails:

emails = set(emails)  # Remove Duplicates

print(*emails, sep='\n')
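
One thing to keep in mind is that a set does not preserve the order in which the emails were found and treats differently capitalized copies as distinct. If that matters, a minimal sketch along these lines may help. It uses a made-up sample list in place of the real results, and lowercasing the addresses is an assumption that suits most mail providers but is not guaranteed for the part before the '@'.

```python
# Hypothetical results, standing in for the list built from the page
found_emails = ["Info@Example.com", "info@example.com", "sales@example.co.za"]

# dict keys are unique and keep insertion order, so this removes
# duplicates while preserving the order of first appearance.
unique_emails = list(dict.fromkeys(e.lower() for e in found_emails))

print(*unique_emails, sep='\n')
```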
Rafael Setton
  • Thank you Rafael, I gave it a go and the same thing happened. It seems when I print the source code, it leaves out the entire first section which contains all the email addresses and only prints the last part. Any suggestions? – Noob May 16 '20 at 15:01
  • What do you mean by 'the last part'? – Rafael Setton May 18 '20 at 12:16
  • Basically there are a total of 3060 lines of code in the source code on the actual web page. When we parse the source code using Python it only takes the source code from line 1760 to 3060 – Noob May 20 '20 at 12:50