Extract contact information from HTML with Python

Question

Here is a sample of the HTML:

<div class="yui3-u-5-6" id="browse-products">
<div id="kazbah-contact">
  <span class="contact-info-title">Contact 00Nothing:</span>
  <a href="mailto:info@00nothing.com">info@00nothing.com</a> | 800-410-2074
   | C/O Score X Score
    &nbsp;8118-D Statesville Rd
    ,
  Charlotte,
  NC
  28269
</div>
<div class="clearfix"></div>

I want to extract the contact information here: email, phone, and address. How should I do that with Python? Thanks.

Take a look at this: http://stackoverflow.com/questions/11709079/parsing-html-python — rafaelc, Apr 14 '15 at 22:40
@RafaelCardoso I read that. But how can I get the information after the "|"? Getting info@00nothing.com is easy, but getting the phone number and address is hard. — Artanis Wong, Apr 14 '15 at 23:05
Perhaps the documentation of [`split`](https://docs.python.org/3/library/stdtypes.html#str.split) will show you how you can extract those "hard" parts. Also, consider in the future that you'll get (better) answers if you show some form of code that you've tried yourself. If you specifically write that getting the email address is easy, then why haven't you copied the code you're using into your question? Check out [writing the perfect question](http://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/) and [how to ask](https://stackoverflow.com/help/how-to-ask). — Oliver W., Apr 15 '15 at 00:01

score 0 · Accepted Answer · answered Apr 15 '15 at 15:05

I use this code to extract the information:

# _*_ coding:utf-8 _*_
import urllib2
import urllib
import re
from bs4 import BeautifulSoup
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def grabHref(url,localfile):
    html = urllib2.urlopen(url).read()
    html = unicode(html,'gb2312','ignore').encode('utf-8','ignore')
    soup = BeautifulSoup(html)
    myfile = open(localfile,'wb')
    for link in soup.select("div > a[href^=http://www.karmaloop.com/kazbah/browse]"):
        for item in BeautifulSoup(urllib2.urlopen(link['href']).read()).select("div > a[href^=mailto]"):
            contactInfo = item.get_text()
            print link['href']
            print contactInfo

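            # Write inside the inner loop so contactInfo is always defined
            # and never carried over from a previous page.
            myfile.write(link['href'])
            myfile.write('\r\n')
            myfile.write(contactInfo)
            myfile.write('\r\n')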
    myfile.close()



def main():
    url = "http://www.karmaloop.com/brands"
    localfile = 'Contact.txt'
    grabHref(url,localfile)
if __name__=="__main__":
    main()

But I can still only get the email address this way. How can I get the phone number and address? Thanks.
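
One possible way to get all three fields is to take the whole text of the `kazbah-contact` div and split it on "|", as suggested in the comments on the question. The sketch below is written for Python 3 and uses only the standard library's `html.parser` instead of BeautifulSoup; the names `ContactDivText` and `parse_contact` are made up for this example. It assumes the div always follows the "email | phone | address" layout of the sample and contains no void tags or nested markup beyond what the sample shows. With BeautifulSoup, the same splitting could be applied to the div's `get_text()` result.

```python
from html.parser import HTMLParser

SAMPLE = """<div class="yui3-u-5-6" id="browse-products">
<div id="kazbah-contact">
  <span class="contact-info-title">Contact 00Nothing:</span>
  <a href="mailto:info@00nothing.com">info@00nothing.com</a> | 800-410-2074
   | C/O Score X Score
    &nbsp;8118-D Statesville Rd
    ,
  Charlotte,
  NC
  28269
</div>
<div class="clearfix"></div>
"""


class ContactDivText(HTMLParser):
    """Collect the text inside <div id="kazbah-contact">, minus the title span."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0          # > 0 while inside the contact div
        self.in_title = False   # True while inside the title span
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and (self.depth or attrs.get("id") == "kazbah-contact"):
            self.depth += 1
        elif tag == "span" and self.depth:
            classes = (attrs.get("class") or "").split()
            if "contact-info-title" in classes:
                self.in_title = True

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
        elif tag == "span":
            self.in_title = False

    def handle_data(self, data):
        if self.depth and not self.in_title:
            self.chunks.append(data)


def parse_contact(html):
    """Return a dict with email, phone and address (assumed '|'-separated)."""
    parser = ContactDivText()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Split into at most three parts and collapse runs of whitespace
    # (str.split() also treats the non-breaking space from &nbsp; as whitespace).
    fields = [" ".join(part.split()) for part in text.split("|", 2)]
    fields += [""] * (3 - len(fields))   # pad if a field is missing
    email, phone, address = fields
    address = address.replace(" ,", ",")  # tidy the comma left on its own line
    return {"email": email, "phone": phone, "address": address}


if __name__ == "__main__":
    print(parse_contact(SAMPLE))
```

Splitting with a maximum of two splits keeps the address in one piece even if it happens to contain a "|" itself.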

I get it now. But the CSS selector "div > a[href^=mailto]" may not match anything. I want to continue if "div > a[href^=mailto]" can't be found; how should I do that? — Artanis Wong, Apr 15 '15 at 17:04
I wrote `if BeautifulSoup(urllib2.urlopen(link['href']).read()).select("div > div[id^=kazbah-contact]") == False: continue`, but it doesn't work — Artanis Wong, Apr 15 '15 at 17:05
Welcome to Stack Overflow. This is not an answer. You should either edit your original question to include the new information, or open a separate question. — Bryan, Apr 17 '15 at 01:14
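
On the `== False` check from the comments above: BeautifulSoup's `select()` returns a list of matching elements, and that list is simply empty when nothing matches. An empty list never compares equal to `False`, which is the likely reason the check has no effect, but it is falsy, so `if not matches: continue` should do the skip. The sketch below illustrates this in Python 3 with plain lists standing in for `select()` results; `keep_pages_with_matches` and the example.com URLs are made up for the illustration.

```python
def keep_pages_with_matches(results):
    """Yield (url, matches) pairs, skipping pages where nothing matched.

    `results` is a hypothetical iterable of (url, matches) pairs; each
    `matches` stands in for the list returned by soup.select(...).
    """
    for url, matches in results:
        if not matches:   # an empty list is falsy, so this page is skipped
            continue
        yield url, matches


# Stand-in data: the first page has no contact link, the second has one.
demo = [
    ("http://example.com/brand-a", []),
    ("http://example.com/brand-b", ["info@example.com"]),
]

# An empty list is not equal to False, so an "== False" test never fires.
print([] == False)
for url, matches in keep_pages_with_matches(demo):
    print(url, matches)
```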
