web scrape nested text features

Question

I'm trying to scrape the axis text of an online plot, along with some of its features such as the text colour. I very rarely use scraping, so I would really appreciate a bit of help. This is probably an easy fix for anyone who regularly uses scrapers. Here is my code:

from bs4 import BeautifulSoup
import requests

def get_IPF_transcriptome_groups():
    url = "https://research.cchmc.org/pbge/lunggens/lungDisease/celltype_IPF.html?cid=1"
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data)

    for d in soup.find('div', attrs={'id':'wrapper'}).find(
            'div', attrs={'class':'content'}).find(
                    'div', attrs={'id':'ResPanel'}).find(
                            'table', attrs={'id':'maintable'}).find(
                                    'tbody'):
        print(d)

I get an error:

    'tbody'):

    TypeError: 'NoneType' object is not iterable

I think the code can't get through the table body. The actual text I am looking to parse is buried a bit deeper than this, inside several other tags including 'div', 'td', 'tr', 'g', etc., and looks like the following:

<tspan style="fill:#006600;font-size:7px;">CC002_33_N709_S503_C10</tspan>

where 'CC002_33_N709_S503_C10' is a sample reference and '#006600' refers to a colour. There are (I think) 540 lines like this. It would be really great if anyone could help. Many thanks

EDIT BASED ON RESPONSE FROM Uday:

Thanks for the suggestion. I've built 'findAll' into it and used an index to retrieve the next piece. Another suggestion mentioned removing the 'tbody' tag, as it may not be part of the source code. Just adding 'tspan' doesn't seem to return what I need, though. Here is my updated code:

    for d in soup.find('div', attrs={'id':'wrapper'}).find(
        'div', attrs={'class':'content'}).find(
                'div', attrs={'id':'ResPanel'}).find(
                        'table', attrs={'id':'maintable'}).findAll(
                                'tr')[2].findAll('td')[0].find('div', attrs={'id':'sigheatmapcontainer'}): 
                                        print(d)

Any further suggestions would be really helpful.

The error `TypeError: 'NoneType' object is not iterable` means that the final call (`find('tbody')`) returned `None`, which cannot be iterated over in a **for loop**. Try `find_all('tbody')` to get the table content. I think you might need `find('tbody').find_all('tspan')` – ExtractTable.com, Aug 02 '17 at 13:57
The code runs without crashing up to 'sigheatmapcontainer', but the result seems to be empty, and if I try to find the next div with .find('div', attrs={'id':'highcharts-40'}) it returns TypeError: 'NoneType' object is not iterable – user3062260, Aug 02 '17 at 15:17
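
If the rendered SVG markup is available (for instance, copied from the browser's element inspector after the chart has been drawn), the sample names and fill colours can be pulled out of the `tspan` elements with the standard library alone. This is only a minimal sketch, assuming the markup looks like the single `tspan` line quoted in the question; `parse_style` and `TspanCollector` are hypothetical names, and the accepted answer notes that this markup is not present in the page source that `requests.get` returns.

```python
from html.parser import HTMLParser


def parse_style(style):
    """Split an inline style string such as 'fill:#006600;font-size:7px;' into a dict."""
    pairs = (item.split(":", 1) for item in style.split(";") if ":" in item)
    return {key.strip(): value.strip() for key, value in pairs}


class TspanCollector(HTMLParser):
    """Collect (text, fill colour) for every <tspan> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._style = None  # style dict of the <tspan> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tspan":
            self._style = parse_style(dict(attrs).get("style") or "")

    def handle_data(self, data):
        if self._style is not None and data.strip():
            self.rows.append((data.strip(), self._style.get("fill")))

    def handle_endtag(self, tag):
        if tag == "tspan":
            self._style = None


# The one example line given in the question.
markup = '<tspan style="fill:#006600;font-size:7px;">CC002_33_N709_S503_C10</tspan>'
collector = TspanCollector()
collector.feed(markup)
print(collector.rows)
```

Each entry in `collector.rows` pairs a sample reference with its `fill` value, or with `None` when the element has no `fill` in its style.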

Dan-Dev · Accepted Answer · 2017-08-02T20:20:12.597


The data you want is fetched from another URL by JavaScript, using a POST request. (It is NOT in the HTML source of this web page, and it won't even render using Dryscrape.) It is returned as JSON, which is nice and easy to parse. The following code will fetch all the data. How to interpret the data is another question, but maybe you know better than I do.

from bs4 import BeautifulSoup
import requests
import json
# Fetch the data.
url = "https://research.cchmc.org/pbge/lunggens/celltypeIPF"
r = requests.post(url, data = {'id':'1'})
data=r.text
soup = BeautifulSoup(data, "lxml")
d = soup.find('p')
# now you have the json containing all the data.
jn = json.loads(d.text)
print(json.dumps(jn, indent=2))

Outputs the raw data pretty printed.

You can parse the JSON however you like, e.g. with pandas:

import pandas as pd
...
# json_normalize already returns a DataFrame; pandas >= 1.0 exposes it as pd.json_normalize
df = pd.json_normalize(jn)
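
Since the meaning of the returned fields is not documented here, it can help to summarise the shape of the JSON before flattening it. The helper below is a rough sketch using only the standard library; `summarise` is a hypothetical name, and the sample payload is invented for illustration (only the key name `celltypecnt` appears in this thread, and its real contents are unknown).

```python
import json


def summarise(node, path="$", depth=2):
    """Yield one line per visited path describing the type and size found there."""
    if isinstance(node, dict):
        yield f"{path}: dict with {len(node)} keys"
        if depth > 0:
            for key, value in node.items():
                yield from summarise(value, f"{path}.{key}", depth - 1)
    elif isinstance(node, list):
        yield f"{path}: list of {len(node)} items"
        if node and depth > 0:
            # Describe only the first element, assuming the list is homogeneous.
            yield from summarise(node[0], f"{path}[0]", depth - 1)
    else:
        yield f"{path}: {type(node).__name__}"


# Made-up stand-in for the real response (assumption, not actual server data).
sample = json.loads('{"celltypecnt": [{"name": "A", "count": 3}], "note": "x"}')
lines = list(summarise(sample))
print("\n".join(lines))
```

Calling `summarise(jn)` on the real response would list its top-level keys and the type of each, which is a reasonable starting point for working out which field drives the axis colours.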

edited Aug 02 '17 at 20:20

answered Aug 02 '17 at 19:17


Thanks very much for your insight on this, it's definitely parsed out the relevant data. Would you mind explaining some of the code a bit? For example, how does "requests.post(url, data = {'id':'1'})" differ from the "requests.get(url)" that I had originally used? Presumably this structures the data into a dict-like format? Also, how could you tell that the data was fetched from another URL with a POST request by JavaScript? Is there something in the HTML indicating this? Lastly, the text colour changes across the x-axis. I think it's somehow defined by the data in "df['celltypecnt'][0]" but am not sure. – user3062260 Aug 06 '17 at 10:01
In your original URL you had a query string, ?cid=1. The JavaScript sent a POST request with {'id':'1'}. Yes, the POST data is a dict-like structure, but it is carried in an HTTP request. I examined the request with "Live Headers" to see what was going on behind the scenes, and that is how I found the POST request. I cannot interpret the data; you would need to correlate it somehow or examine the JavaScript that renders it, so I don't know what the colour on the x-axis is. – Dan-Dev Aug 06 '17 at 11:43
Great, thanks so much for your help, I'll read up on query strings and live headers a bit more. – user3062260 Aug 06 '17 at 21:27
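
To make the GET/POST distinction from the comments above concrete, the sketch below builds both requests with the standard library and inspects them without sending anything. It assumes the form-encoded body `id=1` described in the answer; whether the server expects any additional headers is not known.

```python
from urllib.parse import urlencode, urlsplit
from urllib.request import Request

# GET: the parameter travels in the URL's query string and there is no body.
get_req = Request(
    "https://research.cchmc.org/pbge/lunggens/lungDisease/celltype_IPF.html?cid=1"
)

# POST: the parameter travels in the request body, form-encoded.
post_req = Request(
    "https://research.cchmc.org/pbge/lunggens/celltypeIPF",
    data=urlencode({"id": "1"}).encode("ascii"),
)

# Nothing is sent over the network; the request objects are only inspected.
for req in (get_req, post_req):
    print(req.get_method(), repr(urlsplit(req.full_url).query), req.data)
```

With `requests`, passing a dict as `data=` to `requests.post` form-encodes it into the body in the same way, which is why the answer's call carries `id=1` even though the URL has no query string.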
