
Hope you are all well. I'm new to Python and using Python 2.7.

I'm trying to extract only the mailto links from this public business directory website: http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search
The addresses I'm looking for are the emails shown in every widget, from A to Z, in the full directory. Unfortunately, this directory does not have an API. I'm using BeautifulSoup, but with no success so far.
Here is my code:

import urllib
from bs4 import BeautifulSoup
website = raw_input("Type website here:>\n")
html = urllib.urlopen('http://'+ website).read()
soup = BeautifulSoup(html)

tags = soup('a') 

for tag in tags:
    print tag.get('href', None)

What I get is just hrefs on the actual website, like http://www.tecomdirectory.com and other links, rather than the mailto links or the websites in the widgets. I also tried replacing soup('a') with soup('target'), but no luck! Can anybody help me please?

  • Hi! Thanks for the reply! In the URL I read php?, so I assumed that there might have been some PHP in it! Sorry if not! Still new to coding. Regards – PIMg021 Sep 23 '16 at 13:27
  • Hi, can you please confirm that there is no PHP involved, so that I can edit the question and remove the php tag? – PIMg021 Sep 23 '16 at 13:29

1 Answer


You cannot just find every anchor; you need to specifically look for "mailto:" in the href. You can use the CSS selector a[href^=mailto:], which finds anchor tags whose href starts with "mailto:":

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search").content, "html.parser")

print([a["href"] for a in soup.select("a[href^=mailto:]")])

Or extract the text:

print([a.text for a in soup.select("a[href^=mailto:]")])

Using find_all("a") you would need to use a regex to achieve the same:

import re

soup.find_all("a", href=re.compile(r"^mailto:"))
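
Putting the pieces together, here is a minimal end-to-end sketch of the regex variant (it assumes the "html.parser" backend and simply strips the "mailto:" prefix to keep only the addresses):

import re

import requests
from bs4 import BeautifulSoup

url = ("http://www.tecomdirectory.com/companies.php"
       "?segment=&activity=&search=category&submit=Search")

soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Same result as the CSS selector above, this time via find_all + regex.
for a in soup.find_all("a", href=re.compile(r"^mailto:")):
    # The href looks like "mailto:someone@example.com"; drop the scheme
    # to keep only the email address.
    print(a["href"][len("mailto:"):])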
  • I modified the code: ' import urllib import requests from bs4 import BeautifulSoup website = 'www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search' html = urllib.urlopen('http://'+ website).read() soup = BeautifulSoup(requests.get(html).content) tags = soup('a') for tag in tags: print([a["href"] for a in soup.select("a[href^=mailto:]")]) ' However I get an error, with a traceback ending in requests.exceptions.InvalidSchema! – PIMg021 Sep 23 '16 at 14:37
  • Yeah, because you are passing HTML to requests; pass the URL and forget urllib, or just use urllib and forget requests. – Padraic Cunningham Sep 23 '16 at 14:46
  • Hi Padraic! Thank you for your patience. I modified the code, removed the urllib and passed the URL to requests. This is the code: ' import requests from bs4 import BeautifulSoup soup = BeautifulSoup(requests.get('http://www.tecomdirectory.com/companies.php?segment=a0uD0000001jeAxIAI').content) tags = soup('a') for tag in tags: print([a["href"] for a in soup.select("a[href^=mailto:]")]) ' However, I get empty list printouts. – PIMg021 Sep 23 '16 at 15:14
  • Open the URL in your browser and you will see why. – Padraic Cunningham Sep 23 '16 at 18:12
  • Now the problem is that the code starts and works, however it only gets as far as where the page says "load more"! How can I make it fetch the full directory? – PIMg021 Sep 23 '16 at 18:34
  • @PIMg021, You should ask a new question for that. You need to recreate what happens when you click the load more button. I would suggest looking in chrome tools or firebug when you click the button to get an idea of what is happening. – Padraic Cunningham Sep 23 '16 at 18:41
  • Thanks Padraic! Just posted a new question! Will do! Did you do any specific course on gathering data from websites? I'm doing Coursera, but I'm still missing a lot of info. :) – PIMg021 Sep 23 '16 at 18:48
  • @PIMg021, not really. You need to understand somewhat how HTTP works, get familiar with HTML, CSS selectors, XPaths, etc., but the best way to learn is by doing. – Padraic Cunningham Sep 23 '16 at 19:20
  • Thanks Padraic! Will do! This is the new question I posted; please let me know if you have any advice: http://stackoverflow.com/questions/39667624/python-2-7-beautifulsoup-email-scraping-stops-before-end-of-full-database?noredirect=1#comment66637861_39667624 – PIMg021 Sep 23 '16 at 19:37
  • 4
    Just an FYI, if you are using BeautifulSoup 4.7+, the select answer will not work as it will raise a `SelectorSyntaxError`. You need to quote the attribute value in 4.7+ as `:` is not part of a valid CSS identifier. BeautifulSoup <=4.6 has a very limited select system that does not follow the CSS spec, but 4.7+ uses SoupSieve which greatly expands CSS support, but also follows the CSS spec closely. – facelessuser Feb 28 '19 at 17:02
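
For reference, a minimal sketch of the quoted selector form that the comment above describes for BeautifulSoup 4.7+ (using the same page URL as in the answer and assuming the "html.parser" backend):

import requests
from bs4 import BeautifulSoup

url = ("http://www.tecomdirectory.com/companies.php"
       "?segment=&activity=&search=category&submit=Search")

soup = BeautifulSoup(requests.get(url).content, "html.parser")

# On bs4 4.7+ (SoupSieve) the attribute value must be quoted,
# because ":" is not a valid character in a bare CSS identifier.
print([a["href"] for a in soup.select('a[href^="mailto:"]')])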