I am a Python programmer. I want to extract all of the table data from the link below using the BeautifulSoup library.

This is the link: https://finance.yahoo.com/quote/GC%3DF/history?p=GC%3DF

  • As a Python programmer you'll probably be familiar with the 'requests' module. That's your best starting point for this exercise –  Aug 23 '21 at 15:22

1 Answer

You'll want to look into web scraping tutorials.

Here's one to get you started: https://realpython.com/python-web-scraping-practical-introduction/

This kind of thing can get complicated with complex mark-up, and I'd say the page linked in the question qualifies as slightly complex. Basically, you want to find either the container div with the classes "Pb(10px) Ovx(a) W(100%)" or the table container with the data-test attribute "historical-prices", then drill down from there to the data you need.

However, if you insist on using the BeautifulSoup library, here's a tutorial for that: https://realpython.com/beautiful-soup-web-scraper-python/ Scroll down to step 3, "Parse HTML Code With Beautiful Soup".

Install the libraries (the code below also needs requests): python -m pip install requests beautifulsoup4

Then, use the following code to scrape the page:

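import requests
from bs4 import BeautifulSoup

URL = "https://finance.yahoo.com/quote/GC%3DF/history?p=GC%3DF"
# Assumption: the site may reject requests that lack a browser-like User-Agent.
HEADERS = {"User-Agent": "Mozilla/5.0"}
page = requests.get(URL, headers=HEADERS, timeout=10)
page.raise_for_status()  # stop early on an HTTP error status

# Parse the downloaded HTML
soup = BeautifulSoup(page.content, "html.parser")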

Then, find the table container with data-test attribute of "historical-prices" which I mentioned earlier:

results = soup.find(attrs={"data-test" : "historical-prices"})

Thanks to this other StackOverflow post for the info on the attrs parameter: "Extracting an attribute value with beautifulsoup".
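To make the two lookups mentioned above concrete, here is a minimal sketch run against a small inline HTML sample instead of the live page. SAMPLE_HTML and its cell values are made up for illustration, and the real page's mark-up may well differ. It also shows that find() returns None when nothing matches, which is worth checking before drilling down.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real mark-up may differ.
SAMPLE_HTML = """
<div class="Pb(10px) Ovx(a) W(100%)">
  <table data-test="historical-prices">
    <tbody><tr><td>Aug 20, 2021</td><td>1,781.00</td></tr></tbody>
  </table>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Option 1: look the table up by its data-test attribute.
by_attr = sample_soup.find(attrs={"data-test": "historical-prices"})

# Option 2: look the wrapper div up by its exact class string,
# then take the table inside it.
wrapper = sample_soup.find("div", class_="Pb(10px) Ovx(a) W(100%)")
by_class = wrapper.find("table") if wrapper is not None else None

# find() returns None when nothing matches, so check before drilling down.
if by_attr is None:
    raise SystemExit("Table not found; the mark-up may have changed.")
```

If the lookup comes back as None on the real page, it is worth printing part of page.text to see what mark-up was actually returned.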

From there, you'll want to drill down. I'm not really sure how to do this step properly, as I have never done this in Python before, and there are multiple ways to go about it. My preferred way would be to call the find method and then the find_all method (findAll is its older name) on the initial result. I've left out recursive=False here, because with it find only looks at direct children and returns None if the tbody is nested any deeper:

result_set = results.find("tbody").find_all("tr")

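Alternatively, you may be able to use the deprecated findChildren method. Note that it returns a list of matches, so you have to loop over that list rather than call findChildren on it directly:

result_set = results.findChildren("tbody", recursive=False)
# Collect the direct tr children of every tbody that was found
result_set2 = [tr for tbody in result_set for tr in tbody.findChildren("tr", recursive=False)]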

You may need a loop over the result set at each drill-down level. The page you mentioned doesn't make things easy, mind you, and you may have to drill down several times to reach the proper tr elements. Of course, the code above is only example code and has not been properly tested.
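As a rough illustration of the drill-down loop, the sketch below pulls the header and the row cells out of a small inline table. The extract_rows helper, SAMPLE_HTML and the values in it are all made up for this example, and it assumes the data sits in ordinary th and td cells, which may not hold for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with made-up values; the real page's mark-up may differ.
SAMPLE_HTML = """
<table data-test="historical-prices">
  <thead><tr><th>Date</th><th>Open</th><th>Close*</th></tr></thead>
  <tbody>
    <tr><td>Aug 20, 2021</td><td>1,781.00</td><td>1,781.80</td></tr>
    <tr><td>Aug 19, 2021</td><td>1,787.40</td><td>1,781.00</td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return (header, rows) as lists of cell strings, or ([], []) if no table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(attrs={"data-test": "historical-prices"})
    if table is None:
        return [], []

    # Column names come from the th cells.
    header = [th.get_text(strip=True) for th in table.find_all("th")]

    # Drill down: tbody -> tr -> td, collecting the text of each cell.
    rows = []
    body = table.find("tbody")
    for tr in (body.find_all("tr") if body is not None else []):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip rows that contain no td cells
            rows.append(cells)
    return header, rows


header, rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(dict(zip(header, row)))
```

On the real page you would pass page.content (or page.text) to extract_rows instead of SAMPLE_HTML.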