BeautifulSoup Missing ID

Question

I am trying to scrape the div with id="ideas_body" from this site, but it seems to be missing. I have tried the different parsers suggested in this post (Missing parts on Beautiful Soup results), but none of them worked.

Here is my code:

import requests
from bs4 import BeautifulSoup

# Fetch the page; bs4 loads lxml on its own when that parser is requested
url = 'https://www.com/ideas#'
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)

And here are the parsers I have tried, without success:

soup = BeautifulSoup(page.content, 'lxml-xml')
soup = BeautifulSoup(page.content, 'html.parser')
soup = BeautifulSoup(page.content, 'html.parser-xml')  # not a valid parser name; bs4 raises FeatureNotFound
soup = BeautifulSoup(page.content, 'html5lib')

So how can I find this element by its ID in order to scrape it?

I don't see `class="ideas_body"` in the HTML. I see `id="ideas_body"`. — Barmar, Jul 05 '19 at 19:03
This page probably uses JavaScript to add this element, in which case `BeautifulSoup` alone won't help because it can't run JavaScript. — furas, Jul 05 '19 at 19:04
Are you sure the DIV is missing? It's there but it's empty, presumably because it fills it in with JavaScript. — Barmar, Jul 05 '19 at 19:05
Instead of scraping the web page, call the API that the web page uses to fill in the DIV. — Barmar, Jul 05 '19 at 19:07
The usual reason to scrape web pages is because there's no equivalent API. But obviously there is in this case. — Barmar, Jul 05 '19 at 19:18
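
To make the point in the comments concrete, here is a minimal sketch of how one might check whether the element is truly missing or merely empty before any JavaScript runs. The `RAW_HTML` string and the `describe_element` helper are hypothetical stand-ins for illustration, not the real page or part of any library.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML the server sends before any JavaScript runs.
RAW_HTML = """
<html><body>
  <div id="ideas_body"></div>
  <script src="app.js"></script>
</body></html>
"""


def describe_element(html, element_id):
    """Return 'missing', 'empty', or 'populated' for the element with this id."""
    soup = BeautifulSoup(html, 'html.parser')
    element = soup.find(id=element_id)
    if element is None:
        return 'missing'
    # The element counts as populated if it has visible text or any child tag.
    has_text = bool(element.get_text(strip=True))
    has_child_tag = element.find(True) is not None
    return 'populated' if (has_text or has_child_tag) else 'empty'


print(describe_element(RAW_HTML, 'ideas_body'))  # prints: empty
```

If the real response gives 'empty' rather than 'missing', switching parsers will not help, since the content is filled in later by a script.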

andreilozhkin · Accepted Answer · 2019-07-05T21:06:49.663

As mentioned in the comments, there is no need to scrape the page. You can call the API directly to get the data you need.

If you need more than 30 results, change 'per_page' in form_data.

import requests


form_data = {'type': 'idea',
             'show': 'all',
             'sort': 'new',
             'per_page': 30,
             'gotodate': '04/06/2019',
             'ls': 'all',
             'loc': 'all',
             'marketcap_l': 0,
             'shorten_name': 1
             }

response = requests.post('https://www.valueinvestorsclub.com/messages/loadmsgs', data=form_data)

ideas = response.json()['result']

Hope it helps!

edited Jul 05 '19 at 21:06

answered Jul 05 '19 at 19:26

andreilozhkin

This is close, though it's the wrong endpoint. It should be ~/ideas/loadideas rather than ~/messages/loadmsgs (and thus different `form_data`). – user53526356 Jul 07 '19 at 13:54
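
Since the comment above says the endpoint and form fields may need changing, one way to sketch the answer's approach is to keep both as parameters. The defaults below are copied from the answer and are not verified. The `fetch_ideas` helper, its `session` argument, and the 10-second timeout are assumptions added for illustration. The correct fields for ~/ideas/loadideas are not given in this thread, so they would have to be read from the browser's network tab.

```python
import requests

# Endpoint and fields copied from the answer. The follow-up comment says the
# right path may be ~/ideas/loadideas with different fields, so treat both as
# placeholders to verify against the site's own requests.
DEFAULT_URL = 'https://www.valueinvestorsclub.com/messages/loadmsgs'
DEFAULT_FORM = {
    'type': 'idea',
    'show': 'all',
    'sort': 'new',
    'per_page': 30,
    'gotodate': '04/06/2019',
    'ls': 'all',
    'loc': 'all',
    'marketcap_l': 0,
    'shorten_name': 1,
}


def fetch_ideas(url=DEFAULT_URL, session=None, **overrides):
    """POST the form and return the 'result' entry of the JSON reply.

    Keyword overrides replace individual form fields, e.g. per_page=50.
    """
    form_data = {**DEFAULT_FORM, **overrides}
    http = session or requests  # hypothetical hook: any object with .post()
    response = http.post(url, data=form_data, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.json()['result']


if __name__ == '__main__':
    # Performs a real network request when run as a script.
    print(fetch_ideas(per_page=50))
```

Passing a different `url` and field overrides lets the same helper target whichever endpoint turns out to be correct, without editing the function body.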
