
I'm new to scraping and am learning to use BeautifulSoup, but I'm having trouble scraping a table. This is the HTML I'm trying to parse:

<table id="ctl00_mainContent_DataList1" cellspacing="0" style="width:80%;border-collapse:collapse;">
    <tbody>
        <tr><td><table width="90%" cellpadding="5" cellspacing="0">...</table></td></tr>
        <tr><td><table width="90%" cellpadding="5" cellspacing="0">...</table></td></tr>
        <tr><td><table width="90%" cellpadding="5" cellspacing="0">...</table></td></tr>
        <tr><td><table width="90%" cellpadding="5" cellspacing="0">...</table></td></tr>
        ...

My code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

quote_page = 'https://www.bcdental.org/yourdentalhealth/findadentist.aspx'
page = urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')

table = soup.find('table', id="ctl00_mainContent_DataList1")
rows = table.findAll('tr')

I get AttributeError: 'NoneType' object has no attribute 'findAll'. I'm using Python 3.6 and a Jupyter notebook for this, in case that matters.

EDIT: The table data that I'm trying to parse only shows up on the page after a search is submitted (in the City field, select Burnaby and hit Search). The table ctl00_mainContent_DataList1 is the list of dentists that shows up after the search is submitted.
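A quick way to confirm this is to list the table ids that are actually present in the initially fetched HTML (a small diagnostic sketch; ctl00_mainContent_DataList1 does not appear before the search):

from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen('https://www.bcdental.org/yourdentalhealth/findadentist.aspx')
soup = BeautifulSoup(page, 'html.parser')

# print the id of every table on the initial page - the results table is not among them
print([t.get('id') for t in soup.find_all('table')])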

eh2699
  • I can't find the `ctl00_mainContent_DataList1` id on the page. Where is it exactly? – Evya Jan 02 '18 at 08:32
  • use `find_all` not `findAll` https://stackoverflow.com/questions/12339323/beautifulsoup-findall-find-all – Albin Paul Jan 02 '18 at 08:35
  • @AlbinPaul This has nothing to do with the issue – DeepSpace Jan 02 '18 at 08:35
  • there is no table with this ID on this page. Maybe when you click `Search` it sends this table. But that means you have to make a POST request with the expected data. – furas Jan 02 '18 at 08:36
  • The table with `id=ctl00_mainContent_DataList1` is only displayed after a search is performed. It doesn't exist when the page loads for the first time. – DeepSpace Jan 02 '18 at 08:36
  • The reason you are getting a 'NoneType' object is that your soup.find() is not returning any matches, so table is set to None. Are you sure you typed the id correctly? Is the element generated dynamically, or does it exist on page load? – Costa Nostra Jan 02 '18 at 08:36
  • Use the `DevTools` in Chrome/Firefox (Network tab) to see all the requests sent from the browser to the server when you use this page. You will also see what data you have to send with the POST request. Because this page is generated with `ASP.NET`, you will have to find `__VIEWSTATE` (and similar fields) in the HTML and use them in the POST request. – furas Jan 02 '18 at 08:43
  • Can you say whether the error refers to the table or to the rows? – Hamed Baziyad Jan 02 '18 at 21:18
  • @furas I'm sorry for asking such a basic thing, but I'm a novice to scraping and making requests... I found the __VIEWSTATE but it's *extremely* long. I tried using some online decoders but they don't seem to be working – eh2699 Jan 03 '18 at 06:30
  • don't decode - you have to send it back without decoding. – furas Jan 03 '18 at 09:00
  • What the heck man!! You have got two answers which clearly solves the issue but you don't bother to accept either of them as your answer. Are you expecting more answers to come @eh2699? I'm gonna take out mine. Thanks. – SIM Jan 04 '18 at 06:59
  • @Shahin no! I'm just new to stack overflow and wasn't fully educated about the protocols! I really appreciated your answer and learned a lot from it. I apologize for not ticking off the checkmark right away, I know to do that now – eh2699 Jan 04 '18 at 07:50

1 Answer


First: I use requests because it is easier to work with cookies, headers, etc.


The page is generated by ASP.NET and it sends the values __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION, which you have to send in the POST request too.

You have to load the page using GET first, and then you can get those values.
You can also use requests.Session() to keep the cookies, which may be needed too.

Next you have to copy those values, add the parameters from the form, and send it all using POST.

In the code I put only the parameters which are always sent.

'526' is the code for Vancouver. You can find the other codes in the <select> tag.
If you want other options, then you may have to add other parameters.

e.g. ctl00$mainContent$chkUndr4Ref: on is for Children: 3 & Under - Diagnose & Refer.
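For example, a minimal sketch of listing those city codes from the GET page (this assumes the city <select> is submitted under the same name as the POST parameter ctl00$mainContent$drpCity used below):

import requests
from bs4 import BeautifulSoup

url = 'https://www.bcdental.org/yourdentalhealth/findadentist.aspx'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# look up the city dropdown by its form name and print each option's value and label
# (assumption: the <select> name matches the POST parameter name)
city_select = soup.find('select', attrs={'name': 'ctl00$mainContent$drpCity'})
if city_select:
    for option in city_select.find_all('option'):
        print(option.get('value'), '-', option.get_text(strip=True))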

EDIT: Because inside <tr> there is a <table>, find_all('tr') returned too many elements (the outer tr and the inner tr), and later find_all('td') gave the same td many times. I changed find_all('tr') into find_all('table') and it should stop duplicating data.
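A tiny standalone illustration of that nesting problem, with made-up HTML:

from bs4 import BeautifulSoup

html = '<table id="outer"><tr><td><table><tr><td>data</td></tr></table></td></tr></table>'
outer = BeautifulSoup(html, 'html.parser').find('table', id='outer')

print(len(outer.find_all('tr')))     # 2 - the outer <tr> and the inner <tr>, so the same data is visited twice
print(len(outer.find_all('table')))  # 1 - only the inner table, so each td is visited once

The full code: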

import requests
from bs4 import BeautifulSoup

url = 'https://www.bcdental.org/yourdentalhealth/findadentist.aspx'

# --- session ---

s = requests.Session() # to automatically copy cookies
#s.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0'})

# --- GET request ---

# get page to get cookies and params
response = s.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# --- set params ---

params = {
    # session - copy from GET request
    #'EktronClientManager': '',
    #'__VIEWSTATE': '',
    #'__VIEWSTATEGENERATOR': '',
    #'__EVENTVALIDATION': '',
    # main options
    'ctl00$terms': '',
    'ctl00$mainContent$drpCity': '526',
    'ctl00$mainContent$txtPostalCode': '',
    'ctl00$mainContent$drpSpecialty': 'GP',
    'ctl00$mainContent$drpLanguage': '0',
    'ctl00$mainContent$drpSedation': '0',
    'ctl00$mainContent$btnSearch': '+Search+',
    # other options
    #'ctl00$mainContent$chkUndr4Ref': 'on',
}

# copy from GET request
for key in ['EktronClientManager', '__VIEWSTATE', '__VIEWSTATEGENERATOR', '__EVENTVALIDATION']:
    value = soup.find('input', id=key)['value']
    params[key] = value
    #print(key, ':', value)

# --- POST request ---

# get page with table - using params
response = s.post(url, data=params)#, headers={'Referer': url})
soup = BeautifulSoup(response.text, 'html.parser')

# --- data ---

table = soup.find('table', id='ctl00_mainContent_DataList1')

if not table:
    print('no table')
    #table = soup.find_all('table')
    #print('count:', len(table))
    #print(response.text)
else:   
    for row in table.find_all('table'):
        for column in row.find_all('td'):
            text = ', '.join(x.strip() for x in column.text.split('\n') if x.strip()).strip()
            print(text)

    print('-----')

Part of the result:

Map
Dr. Kashyap Vora, 6145 Fraser Street, Vancouver  V5W 2Z9
604 321 1869, www.voradental.ca
-----
Map
Dr. Niloufar Shirzad, Harbour Centre DentalL19 - 555 Hastings Street West, Vancouver  V6B 4N6
604 669 1195, www.harbourcentredental.com
-----
Map
Dr. Janice Brennan, 902 - 805 Broadway West, Vancouver  V5Z 1K1
604 872 2525
-----
Map
Dr. Rosemary Chang, 1240 Kingsway, Vancouver  V5V 3E1
604 873 1211
-----
Map
Dr. Mersedeh Shahabaldine, 3641 Broadway West, Vancouver  V6R 2B8
604 734 2114, www.westkitsdental.com
-----
furas
  • yes I figured out the city codes and was in the process of understanding more about scraping, thank you! will `to_csv` work for exporting the dataset? – eh2699 Jan 03 '18 at 10:42
  • if you create a `DataFrame` and put the data in it, then you can export it using `to_csv` (a minimal sketch follows after these comments). – furas Jan 03 '18 at 10:53
  • BTW: the [scrapy](https://scrapy.org/) framework exports to `csv`, `json` and `xml` automatically. But then you don't use `requests` and don't need `BeautifulSoup`. – furas Jan 03 '18 at 10:56
  • thanks, and was wondering - why does the code print 2-3 duplicates of each entry in the output? – eh2699 Jan 03 '18 at 21:32
  • see new code in the answer. Inside `tr` there is a `table`, so `find_all("tr")` returns the outer `tr` (which has `table/tr/td`) and the inner `tr` (which has the same `td`), so it duplicated values. – furas Jan 04 '18 at 09:20
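Regarding the to_csv question in the comments, a minimal sketch of collecting the results into a pandas DataFrame instead of printing them (it reuses the table variable and the parsing loop from the answer; the column names and the file name are just examples):

import pandas as pd

# reuse `table` from the answer's code: soup.find('table', id='ctl00_mainContent_DataList1')
records = []
for row in table.find_all('table'):
    entry = [', '.join(x.strip() for x in column.text.split('\n') if x.strip()).strip()
             for column in row.find_all('td')]
    records.append(entry)

df = pd.DataFrame(records)              # optionally pass columns=[...] to name the fields
df.to_csv('dentists.csv', index=False)  # 'dentists.csv' is just an example file name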