I'm trying to scrape a set of HTML pages with Beautiful Soup, but I'm getting some really weird results. I suspect this is because the page content is generated dynamically, and I'm not very experienced with web scraping. For the page below, all I want is the full information for the work type, but my output is nowhere near that. My code is below (thanks to all):

 import requests
 from bs4 import BeautifulSoup

 url = 'https://www.acc.co.nz/for-providers/treatment-recovery/work-type-detail-sheets/#/viewSheet/1416'
 r = requests.get(url)
 html_doc = r.text
 # Name the parser explicitly so bs4 does not have to guess one
 soup = BeautifulSoup(html_doc, 'html.parser')
 pretty_soup = soup.prettify()
 print(pretty_soup)
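
One likely contributor to the odd output, offered as a hedged note rather than a confirmed diagnosis: everything after the `#` in the URL is a fragment, which HTTP clients do not send to the server, so `requests` would only ever ask for the generic listing page. A minimal sketch using the standard library's `urllib.parse.urldefrag` shows the split:

```python
from urllib.parse import urldefrag

url = ('https://www.acc.co.nz/for-providers/treatment-recovery/'
       'work-type-detail-sheets/#/viewSheet/1416')

# urldefrag separates the part that is requested from the server
# from the fragment that stays on the client side
base_url, fragment = urldefrag(url)

print(base_url)   # the address the server actually receives
print(fragment)   # interpreted by the page's JavaScript in a browser
```

The `/viewSheet/1416` part is therefore something the page's own script would have to act on, which a plain HTTP request never triggers.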

Thanks to everyone who helped. I thought I'd share my final code below. Note that I borrowed heavily from another post, "Strip HTML from strings in Python". It would not have been possible without @Andrej Kesely.

import json
from html.parser import HTMLParser

import pandas as pd
import requests

url = "https://www.acc.co.nz/for-providers/treatment-recovery/work-type-detail-sheets/getSheets"

headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get(url, headers=headers)
data = json.loads(r.text)

# pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize
result = pd.json_normalize(data)

# Keep only the columns of interest (copy to avoid chained-assignment warnings)
result = result[['ANZSCO', 'Comments', 'Description', 'Group',
                 'EntryRequirements', 'JobTitle', 'PhysicalMentalDemands',
                 'WorkEnvironment', 'WorkTasks']].copy()


# Let's start cleaning up the data set

class MLStripper(HTMLParser):
    """Collects only the text content of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()


# Columns that always hold an HTML string
html_columns = ['WorkTasks', 'PhysicalMentalDemands', 'Description']

for col in html_columns:
    result[col] = result[col].apply(strip_tags)

# Columns that may be empty: fill the gaps before stripping tags
nullable_columns = ['Comments', 'EntryRequirements', 'WorkEnvironment']

for col in nullable_columns:
    result[col] = result[col].fillna('not_available')
    result[col] = result[col].apply(strip_tags)

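
The `fillna('not_available')` step is there because `strip_tags` expects a string, and presumably some records have no value in those columns. A minimal sketch of a variant that handles missing values itself, using only the standard library; the names `TextCollector` and `strip_tags_safe` and the sample string are hypothetical and not taken from the ACC data:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_safe(value, default='not_available'):
    # Hypothetical helper: anything that is not a string (None, NaN)
    # is treated as missing and replaced by the placeholder
    if not isinstance(value, str):
        return default
    parser = TextCollector()
    parser.feed(value)
    parser.close()
    return ''.join(parser.parts)


print(strip_tags_safe('<ul><li>Lifting &amp; carrying</li></ul>'))
print(strip_tags_safe(None))
```

With pandas this could be applied as `result[col] = result[col].apply(strip_tags_safe)`, which would make the separate `fillna` call unnecessary.
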
Ian_De_Oliveira

1 Answer

The page loads its content dynamically through Ajax. In the browser's network inspector you can see that it fetches all of its data from one very large JSON file located at https://www.acc.co.nz/for-providers/treatment-recovery/work-type-detail-sheets/getSheets. To load all the job data, you can use this script:

url = "https://www.acc.co.nz/for-providers/treatment-recovery/work-type-detail-sheets/getSheets"

import requests
import json

headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get(url, headers=headers)
data = json.loads(r.text)

# To pretty-print all the data, uncomment this line:
# print(json.dumps(data, indent=4, sort_keys=True))

for d in data:
    print(f'ID:\t{d["ID"]}')
    print(f'Job Title:\t{d["JobTitle"]}')
    print(f'Created:\t{d["Created"]}')
    print('*' * 80)

# Available keys in this JSON:
# ClassName
# LastEdited
# Created
# ANZSCO
# JobTitle
# Description
# WorkTasks
# WorkEnvironment
# PhysicalMentalDemands
# Comments
# EntryRequirements
# Group
# ID
# RecordClassName

This prints:

ID: 2327
Job Title:  Watch and Clock Maker and Repairer   
Created:    2017-07-11 11:33:52
********************************************************************************
ID: 2328
Job Title:  Web Administrator
Created:    2017-07-11 11:33:52
********************************************************************************
ID: 2329
Job Title:  Welder 
Created:    2017-07-11 11:33:52

...and so on

The comments at the end of the script list the keys you can use to access the data for a specific job.
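
For example, to pull out a single sheet such as the one behind `#/viewSheet/1416`, the list can be filtered on `ID`. This is only a sketch: it assumes the number in the page URL corresponds to the `ID` key, which I have not verified. The helper name `find_sheet` is hypothetical, and the sample records reuse a few values from the printed output above so the snippet runs without a network call.

```python
# Stand-in for the list returned by json.loads(r.text)
data = [
    {"ID": 2328, "JobTitle": "Web Administrator", "Created": "2017-07-11 11:33:52"},
    {"ID": 2329, "JobTitle": "Welder", "Created": "2017-07-11 11:33:52"},
]


def find_sheet(records, sheet_id):
    # Compare as strings, since the JSON may store IDs as int or str
    for record in records:
        if str(record.get("ID")) == str(sheet_id):
            return record
    return None


sheet = find_sheet(data, 2329)
print(sheet["JobTitle"] if sheet else "not found")
```

Returning `None` for an unknown ID keeps the caller in charge of deciding what to do when a sheet is missing.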

Andrej Kesely
  • @Andrej Kesely, could you point me to a package for putting the data into a dataframe? I tried json_normalize, but a lot of HTML passes through into the values, for example:

    • A relevant tertiary qualification or at least five years applicable experience (ANZSCO Skill Level 1). In some cases particular experience and/or on-job training may be required.

    – Ian_De_Oliveira Jul 20 '18 at 05:59
  • @Ian_De_Oliveira I don't use dataframes/pandas so I cannot help you much with this. But try to strip the tags from data, e.g. https://stackoverflow.com/questions/753052/strip-html-from-strings-in-python – Andrej Kesely Jul 20 '18 at 06:01
  • Kesely, thanks a lot again. I had looked at that neat function before; it worked, but it failed on the fields Comments, WorkEnvironment and EntryRequirements. Still working on it. – Ian_De_Oliveira Jul 20 '18 at 23:06
  • I managed to find the issue and have shared my whole code above. It would not have been possible without your neat approach. Thanks. – Ian_De_Oliveira Jul 21 '18 at 00:23