
I'm trying to extract some information from a website, but I don't know how to scrape the email address.

This code works for me:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup

url = "https://www.eurocham-cambodia.org/member/476/2-LEau-Protection"
uClient = uReq(url)
page_html = uClient.read()
uClient.close()
soup = BeautifulSoup(page_html,"lxml")

members = soup.findAll("b")
for member in members:
    member = members[0].text
print(member)

I wanted to extract the number and the link with soup.findAll(), but I couldn't find a way to get the text properly, so I used the SelectorGadget tool and tried this:

numbers = soup.select("#content li:nth-child(1)")
for number in numbers:
    number = numbers[0].text
print(number)

links = soup.select(".icon-globe+ a")
for link in links:
    link = links[0].text
print(link)

It prints correctly:

2 L'Eau Protection
 (+33) 02 98 19 43 86
http://www.2leau-protection.com/

Now, when it comes to extracting the email address, I'm stuck. I'm new to this; any advice would be appreciated, thank you!

Attempt 1

emails = soup.select("#content li:nth-child(2)")
for email in emails:
    email = emails[0].text
print(email)

I don't even understand what it prints:

//<![CDATA[
var l=new Array();
l[0]='>';l[1]='a';l[2]='/';l[3]='<';l[4]='|109';l[5]='|111';l[6]='|99';l[7]='|46';l[8]='|110';l[9]='|111';l[10]='|105';l[11]='|116';l[12]='|99';l[13]='|101';l[14]='|116';l[15]='|111';l[16]='|114';l[17]='|112';l[18]='|45';l[19]='|117';l[20]='|97';l[21]='|101';l[22]='|108';l[23]='|50';l[24]='|64';l[25]='|110';l[26]='|111';l[27]='|105';l[28]='|116';l[29]='|97';l[30]='|109';l[31]='|114';l[32]='|111';l[33]='|102';l[34]='|110';l[35]='|105';l[36]='|32';l[37]='>';l[38]='"';l[39]='|109';l[40]='|111';l[41]='|99';l[42]='|46';l[43]='|110';l[44]='|111';l[45]='|105';l[46]='|116';l[47]='|99';l[48]='|101';l[49]='|116';l[50]='|111';l[51]='|114';l[52]='|112';l[53]='|45';l[54]='|117';l[55]='|97';l[56]='|101';l[57]='|108';l[58]='|50';l[59]='|64';l[60]='|110';l[61]='|111';l[62]='|105';l[63]='|116';l[64]='|97';l[65]='|109';l[66]='|114';l[67]='|111';l[68]='|102';l[69]='|110';l[70]='|105';l[71]='|32';l[72]=':';l[73]='o';l[74]='t';l[75]='l';l[76]='i';l[77]='a';l[78]='m';l[79]='"';l[80]='=';l[81]='f';l[82]='e';l[83]='r';l[84]='h';l[85]=' ';l[86]='a';l[87]='<';
for (var i = l.length-1; i >= 0; i=i-1){
if (l[i].substring(0, 1) == '|') document.write("&#"+unescape(l[i].substring(1))+";");
else document.write(unescape(l[i]));}
//]]>

Attempt 2

emails = soup.select(".icon-mail~ a") #follow the same logic
for email in emails:
    email = emails[0].text
print(email)

Error

NameError: name 'email' is not defined

Attempt 3

emails = soup.select(".icon-mail~ a")
print(emails)

It prints an empty list:

[]

Attempts 4, 5, 6

email = soup.find("a",{"href":"mailto:"}) # Print "None"

email = soup.findAll("a",{"href":"mailto:"}) # Print empty "[]"

email = soup.select("a",{"href":"mailto:"}) # Prints a lot of information, but not what I need.
NK20

7 Answers


The urllib and BeautifulSoup combination may be insufficient in cases where a webpage renders its information through an API call or JavaScript. You are getting the very first state of the page, before it loads anything externally. That's why you may need to emulate a real browser somehow. You could do it by replicating the JavaScript calls yourself; however, there is a more convenient way.

The Selenium library is used for automating web tasks and test automation, and it can also be employed as a scraper. Since it drives a real browser engine (like Mozilla's Gecko or the Chrome driver), it tends to be more robust in most cases. Here is an example of how you can accomplish your task:

from selenium import webdriver

url = "https://www.eurocham-cambodia.org/member/476/2-LEau-Protection"


option = webdriver.ChromeOptions()
option.add_argument("--headless")  # run Chrome without opening a window
browser = webdriver.Chrome(executable_path="./chromedriver", options=option)  # driver binary next to the script

browser.get(url)

print(browser.find_element_by_css_selector(".icon-mail~ a").text)

The output is:

information@2leau-protection.com

Edit: You can obtain Selenium with pip install selenium, and you can find the Chrome driver here.
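
Note: newer Selenium releases (4.x) drop find_element_by_css_selector and can locate a matching driver themselves. A rough equivalent, assuming Selenium 4.6 or later (so Selenium Manager fetches the driver automatically):

from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://www.eurocham-cambodia.org/member/476/2-LEau-Protection"

options = webdriver.ChromeOptions()
options.add_argument("--headless")           # no visible browser window
browser = webdriver.Chrome(options=options)  # Selenium Manager resolves the driver
browser.get(url)
# find_element_by_css_selector was removed in Selenium 4
print(browser.find_element(By.CSS_SELECTOR, ".icon-mail~ a").text)
browser.quit()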

  • Selenium installed and Chrome driver downloaded, I tried your code and it gives me something like this.. Traceback (most recent call last): File "/Users/monicasen/Desktop/python/Eurocham Gamma.py", line 8, in browser = webdriver.Chrome(executable_path="./chromedriver", options=option) File "/Users/monicasen/anaconda3/lib/python3.7/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__ desired_capabilities=desired_capabilities) File "/Users/monicasen/anaconda3/lib/python3.7/site- – NK20 Sep 15 '19 at 13:33
  • Which OS you use? Are you sure selenium finds your webdriver? – Furkan Küçük Sep 15 '19 at 13:35
  • Mine is OS X 10.14.4. And no i'm not sure lol, I never use Selenium before, need to read the documentation – NK20 Sep 15 '19 at 13:42
  • Assuming you are using Windows, if Google Chrome is installed within your system, you can just use browser = webdriver.Chrome(options=option). If not, just download chromedriver for the Windows, place it in the path where your python file is, and use browser = webdriver.Chrome(executable_path="chromedriver.exe", options=option) – Furkan Küçük Sep 15 '19 at 13:43
  • what about Mac? – NK20 Sep 15 '19 at 13:47
  • Selenium is widely adopted and well documented. The official documentation for installation is here: https://github.com/SeleniumHQ/selenium/wiki/ChromeDriver (a minimal macOS sketch follows below) – Furkan Küçük Sep 15 '19 at 13:48
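
For macOS, the approach from the answer works unchanged; a minimal sketch, assuming the Mac chromedriver binary was downloaded and unzipped next to the script:

from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument("--headless")
# "./chromedriver" assumes the unzipped Mac binary sits in the same folder as
# this script; if macOS blocks it from running, clearing the quarantine flag
# with "xattr -d com.apple.quarantine ./chromedriver" usually helps
browser = webdriver.Chrome(executable_path="./chromedriver", options=option)
browser.get("https://www.eurocham-cambodia.org/member/476/2-LEau-Protection")
print(browser.find_element_by_css_selector(".icon-mail~ a").text)
browser.quit()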

The reason you cannot scrape that part of the website is that it is generated by JavaScript and is not present in the initial HTML. This can be checked with the following code snippet:

    import requests
    from lxml import html

    page = requests.get("https://www.eurocham-cambodia.org/member/476/2-LEau-Protection").text
    tree = html.fromstring(page)
    print(html.tostring(tree, pretty_print=True).decode())

which gives you the complete HTML document, but let us just focus on the div containing the user's profile:

    <div class="col-sm-12 col-md-6">
       <ul class="iconlist">
          <li>
             <i class="icon-phone"> </i>(+33) 02 98 19 43 86</li>

          <li>
              <i class="icon-mail"> </i><script type="text/javascript">
                //<![CDATA[
                var l=new Array();
    l[0]='>';l[1]='a';l[2]='/';l[3]='<';l[4]='|109';l[5]='|111';l[6]='|99';l[7]='|46';l[8]='|110';l[9]='|111';l[10]='|105';l[11]='|116';l[12]='|99';l[13]='|101';l[14]='|116';l[15]='|111';l[16]='|114';l[17]='|112';l[18]='|45';l[19]='|117';l[20]='|97';l[21]='|101';l[22]='|108';l[23]='|50';l[24]='|64';l[25]='|110';l[26]='|111';l[27]='|105';l[28]='|116';l[29]='|97';l[30]='|109';l[31]='|114';l[32]='|111';l[33]='|102';l[34]='|110';l[35]='|105';l[36]='|32';l[37]='>';l[38]='"';l[39]='|109';l[40]='|111';l[41]='|99';l[42]='|46';l[43]='|110';l[44]='|111';l[45]='|105';l[46]='|116';l[47]='|99';l[48]='|101';l[49]='|116';l[50]='|111';l[51]='|114';l[52]='|112';l[53]='|45';l[54]='|117';l[55]='|97';l[56]='|101';l[57]='|108';l[58]='|50';l[59]='|64';l[60]='|110';l[61]='|111';l[62]='|105';l[63]='|116';l[64]='|97';l[65]='|109';l[66]='|114';l[67]='|111';l[68]='|102';l[69]='|110';l[70]='|105';l[71]='|32';l[72]=':';l[73]='o';l[74]='t';l[75]='l';l[76]='i';l[77]='a';l[78]='m';l[79]='"';l[80]='=';l[81]='f';l[82]='e';l[83]='r';l[84]='h';l[85]=' ';l[86]='a';l[87]='<';
    for (var i = l.length-1; i >= 0; i=i-1){
    if (l[i].substring(0, 1) == '|') document.write("&#"+unescape(l[i].substring(1))+";");
    else document.write(unescape(l[i]));}
    //]]>
              </script>
           </li>
           <li>
            <i class="icon-globe"></i> <a href="http://www.2leau-protection.com/" target="_blank"><i style="background-color:#2C3E50"></i>http://www.2leau-protection.com/</a>
          </li>
        </ul>
     </div>

Look carefully: this is the same script that you scraped in Attempt 1 when you were trying to extract the emails.
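
In other words, the array spells out an <a href="mailto:..."> element in reverse order, and entries beginning with | are decimal character codes (the HTML entity &#109; is 'm'). A quick check in Python, using values taken from the script above:

# the loop writes the array backwards, so l[7], l[6], l[5], l[4] come out in
# that order; stripping the "|" and decoding the codes spells ".com"
print("".join(chr(int(s[1:])) for s in ["|46", "|99", "|111", "|109"]))  # .com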

Mukul Kumar Jha

I see that you already have perfectly acceptable answers, but when I saw that obfuscation script I was fascinated, and just had to "de-obfuscate" it.

from bs4 import BeautifulSoup
from requests import get
import re

page = "https://www.eurocham-cambodia.org/member/476/2-LEau-Protection"

content = get(page).content
soup = BeautifulSoup(content, "lxml")

exp = re.compile(r"(?:.*?='(.*?)')")
# Find any element with the mail icon
for icon in soup.findAll("i", {"class": "icon-mail"}):
    # the 'a' element doesn't exist, there is a script tag instead
    script = icon.next_sibling
    # the script tag builds a long array of single characters; let's grab them with the regex above
    chars = exp.findall(script.text)
    output = []
    # the javascript array is iterated backwards
    for char in reversed(list(chars)):
        # many characters use their ascii representation instead of simple text
        if char.startswith("|"):
            output.append(chr(int(char[1:])))
        else:
            output.append(char)
    # putting the array back together gets us an `a` element
    link = BeautifulSoup("".join(output), "lxml")
    # the email is the part of the href after `mailto: `
    email = link.findAll("a")[0]["href"][8:]
    print(email)
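
Run against the page above, this prints information@2leau-protection.com, the same address the Selenium answer finds.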
Paul Becotte
  • Lol, sorry for the obfuscation script, honestly I don't understand your script yet, BUT it works! Thank you a lot, now i'm going to study that – NK20 Sep 15 '19 at 15:54

If you want to find the email address, you can use a regex to do so. Import the module, search the text, extract the data, and put it in a list.

import re
..
text = soup.get_text()
emails = re.findall(r'[a-z0-9]+@(?:gmail|yahoo|rediff)\.com', text)
for email in emails:
    print(email)

Let me know the result. Happy coding!
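
Note that on this particular page the list will come out empty: as the other answers explain, the address is written into the DOM by JavaScript after the page loads, so it never appears in soup.get_text().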

Nishant Jalan

BeautifulSoup only handles the HTML of the page; it does not execute any JavaScript. The email address is generated with JavaScript as the document is loaded (probably to make it harder to scrape that information).

In this case it is generated by:

<script type="text/javascript">
    //<![CDATA[
    var l=new Array();
    l[0]='>';l[1]='a';l[2]='/';l[3]='<';l[4]='|109';l[5]='|111';l[6]='|99';l[7]='|46';l[8]='|110';l[9]='|111';l[10]='|105';l[11]='|116';l[12]='|99';l[13]='|101';l[14]='|116';l[15]='|111';l[16]='|114';l[17]='|112';l[18]='|45';l[19]='|117';l[20]='|97';l[21]='|101';l[22]='|108';l[23]='|50';l[24]='|64';l[25]='|110';l[26]='|111';l[27]='|105';l[28]='|116';l[29]='|97';l[30]='|109';l[31]='|114';l[32]='|111';l[33]='|102';l[34]='|110';l[35]='|105';l[36]='|32';l[37]='>';l[38]='"';l[39]='|109';l[40]='|111';l[41]='|99';l[42]='|46';l[43]='|110';l[44]='|111';l[45]='|105';l[46]='|116';l[47]='|99';l[48]='|101';l[49]='|116';l[50]='|111';l[51]='|114';l[52]='|112';l[53]='|45';l[54]='|117';l[55]='|97';l[56]='|101';l[57]='|108';l[58]='|50';l[59]='|64';l[60]='|110';l[61]='|111';l[62]='|105';l[63]='|116';l[64]='|97';l[65]='|109';l[66]='|114';l[67]='|111';l[68]='|102';l[69]='|110';l[70]='|105';l[71]='|32';l[72]=':';l[73]='o';l[74]='t';l[75]='l';l[76]='i';l[77]='a';l[78]='m';l[79]='"';l[80]='=';l[81]='f';l[82]='e';l[83]='r';l[84]='h';l[85]=' ';l[86]='a';l[87]='<';
    for (var i = l.length-1; i >= 0; i=i-1){
    if (l[i].substring(0, 1) == '|') document.write("&#"+unescape(l[i].substring(1))+";");
    else document.write(unescape(l[i]));}
    //]]>
</script>
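
To recover the address anyway, either drive a real browser (see the Selenium answer) or decode the array directly in Python (see the de-obfuscation answer above).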
Roger Lindsjö
import re

text = soup.get_text()
emails = re.findall(r'[a-z0-9]+@\S+\.com', text)
print(emails)

This is a much more convenient way to print the emails from a website.


I found this method more accurate...

from requests import get
import re

text = get(url).text  # url as defined in the question
emails = re.findall(r'[a-z0-9]+@\S+\.com', text)
  • .io, .org, .edu, .net, .co.uk, .mobi -- and a thousand other valid emails will not be captured, nor will first_name@ or first.lastName@ .. – mbunch Feb 10 '21 at 16:09
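
As the comment points out, that pattern misses most real-world addresses. A somewhat broader sketch (still only an approximation; a full email grammar is notoriously hard to match with a regex):

import re

# allow dots, plus signs, hyphens and underscores in the local part, and any
# multi-label domain (.io, .org, .co.uk, ...)
pattern = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
print(pattern.findall("first.last+tag@example.co.uk, info@site.io"))
# ['first.last+tag@example.co.uk', 'info@site.io']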