I'm trying to scrape the axis labels of an online plot, along with some of their attributes such as the text colour. I rarely use scrapers, so I would really appreciate some help; this is probably an easy fix for anyone who uses them regularly. Here is my code:
from bs4 import BeautifulSoup
import requests

def get_IPF_transcriptome_groups():
    url = "https://research.cchmc.org/pbge/lunggens/lungDisease/celltype_IPF.html?cid=1"
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data)
    for d in soup.find('div', attrs={'id':'wrapper'}).find(
            'div', attrs={'class':'content'}).find(
            'div', attrs={'id':'ResPanel'}).find(
            'table', attrs={'id':'maintable'}).find(
            'tbody'):
        print(d)
I get an error:
'tbody'):
TypeError: 'NoneType' object is not iterable
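For what it's worth, the same error can be reproduced without touching the site. This is only a minimal sketch: the markup below is made up, and it assumes the real page's table has no 'tbody' tag in its source, which I have not confirmed.

```python
from bs4 import BeautifulSoup

# Made-up markup (an assumption, not the real page): rows sit directly
# under <table>, with no <tbody> in the source.
html = "<table id='maintable'><tr><td>cell</td></tr></table>"

# html.parser keeps the markup as written and does not add a missing <tbody>.
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", attrs={"id": "maintable"})
body = table.find("tbody")  # no such tag, so find() returns None

try:
    for d in body:  # iterating over None raises the TypeError
        print(d)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Searching for the rows directly avoids depending on <tbody>.
rows = table.find_all("tr")
print(len(rows))
```

If that assumption holds, the chain of find() calls works up to the table and only the last step returns None.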
I think the code can't get through the table body. The text I actually want to parse is buried deeper than this, inside several other tags including 'div', 'td', 'tr', 'g', etc., and looks like the following:
<tspan style="fill:#006600;font-size:7px;">CC002_33_N709_S503_C10</tspan>
where 'CC002_33_N709_S503_C10' is a sample reference and '#006600' is a colour. There are (I think) 540 lines like this. It would be really great if anyone could help. Many thanks.
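To make the goal concrete, here is a rough sketch of the output I'm after, run on a hard-coded snippet rather than the live page. The helper names parse_style and extract_labels are my own, and the surrounding 'svg', 'g' and 'text' tags are a guess at the nesting. It also assumes the 'tspan' elements are present in the HTML that requests returns, which may not be true if the plot is drawn in the browser by JavaScript.

```python
from bs4 import BeautifulSoup


def parse_style(style):
    """Split an inline style string such as 'fill:#006600;font-size:7px;' into a dict."""
    pairs = (item.split(":", 1) for item in style.split(";") if ":" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def extract_labels(html):
    """Return a list of (label text, fill colour) tuples, one per <tspan>."""
    soup = BeautifulSoup(html, "html.parser")
    labels = []
    for tspan in soup.find_all("tspan"):
        style = parse_style(tspan.get("style", ""))
        # fill is None when a <tspan> has no fill entry in its style
        labels.append((tspan.get_text(strip=True), style.get("fill")))
    return labels


# Hard-coded sample; the wrapping tags are assumed, the <tspan> is from the page.
sample = (
    '<svg><g><text>'
    '<tspan style="fill:#006600;font-size:7px;">CC002_33_N709_S503_C10</tspan>'
    '</text></g></svg>'
)

labels = extract_labels(sample)
print(labels)
```

Searching for 'tspan' directly with find_all means the exact chain of parent tags does not have to be spelled out.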
EDIT BASED ON RESPONSE FROM Uday:
Thanks for the suggestion. I've built 'findAll' into it and used an index to retrieve the next piece. One suggestion mentioned removing the 'tbody' tag, as it may not be part of the source code. Just adding 'tspan' doesn't seem to return what I need, though. Here is my updated code:
for d in soup.find('div', attrs={'id':'wrapper'}).find(
        'div', attrs={'class':'content'}).find(
        'div', attrs={'id':'ResPanel'}).find(
        'table', attrs={'id':'maintable'}).findAll(
        'tr')[2].findAll('td')[0].find('div', attrs={'id':'sigheatmapcontainer'}):
    print(d)
Any further suggestions would be really helpful.