I'm trying to scrape a range of HTML files using Beautiful Soup, but I'm getting some really weird results. I think this is because the query is dynamic, and I'm not very experienced with web scraping. If you look at the website, all I'm trying to do in this case is get all the info for the work type, but my results are far from what I would like them to be. Please see my code below (thanks to all):
import requests
from bs4 import BeautifulSoup
url = 'https://www.acc.co.nz/for-providers/treatment-recovery/work-type-detail-sheets/#/viewSheet/1416'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning
pretty_soup = soup.prettify()
print(pretty_soup)
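A likely reason the output looks nothing like the page: everything after the `#` in a URL is a fragment, which the browser keeps for itself and never sends to the server. The short sketch below only splits the URL from the question with the standard library (no network call) to show what a plain `requests.get` would actually ask for. That the site's JavaScript then uses the fragment to load the sheet is an assumption based on how the URL looks, not something the sketch proves.

```python
from urllib.parse import urldefrag

url = 'https://www.acc.co.nz/for-providers/treatment-recovery/work-type-detail-sheets/#/viewSheet/1416'

# urldefrag separates the part sent to the server from the client-side fragment
base, fragment = urldefrag(url)

print(base)      # the address the server actually receives
print(fragment)  # stays in the browser; presumably read by the page's JavaScript
```

So every sheet ID would fetch the same HTML shell, and the sheet data itself has to come from a separate request.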
Thanks to everyone who helped. I thought I'd share the code below. Note that I used a lot of references from this other post, "Strip HTML from strings in Python". It would not have been possible without @Andrej Kesely.
import json

import pandas as pd
import requests

url = "https://www.acc.co.nz/for-providers/treatment-recovery/work-type-detail-sheets/getSheets"
# This header marks the request as an XMLHttpRequest (AJAX) call
headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get(url, headers=headers)
data = json.loads(r.text)
# pd.json_normalize is the public name since pandas 1.0
# (pandas.io.json.json_normalize is deprecated)
result = pd.json_normalize(data)
result = result[['ANZSCO', 'Comments', 'Description', 'Group',
                 'EntryRequirements', 'JobTitle', 'PhysicalMentalDemands',
                 'WorkEnvironment', 'WorkTasks']]
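# Let's start cleaning up the data set
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collects only the text content of an HTML string, dropping the tags."""

    def __init__(self):
        # Let HTMLParser initialise its own state; also decodes entities like &amp;
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        # Called for every run of text between tags
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()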
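# Columns that always contain HTML ('WorkTasks' was listed twice; once is enough)
html_columns = ['WorkTasks', 'PhysicalMentalDemands', 'Description']
for col in html_columns:
    result[col] = result[col].apply(strip_tags)

# Columns that may be missing: fill the gaps first, then strip the HTML
nullable_columns = ['Comments', 'EntryRequirements', 'WorkEnvironment']
for col in nullable_columns:
    result[col] = result[col].fillna('not_available')
    result[col] = result[col].apply(strip_tags)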
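To check the tag stripping without hitting the website, here is a minimal standalone sketch of the same idea run on a made-up HTML snippet. The sample string is hypothetical and not taken from the ACC data; the class mirrors the stripper above, written so that it runs on its own with only the standard library.

```python
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Keeps the text between tags and discards the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still holding back
    return s.get_data()


# Hypothetical sample, only for illustration
sample = '<p>Lifts <b>heavy</b> loads &amp; tools</p>'
cleaned = strip_tags(sample)
print(cleaned)
```

Note that `strip_tags` expects a string, which is why the columns that can be empty are filled with `'not_available'` before it is applied.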