-1

I scrape wsj site for posts about commodities and futures. The structure of HTML page leads me to finding this piece of html to get all the info I need from one post

This is this piece HTML:

  <article class="WSJTheme--story--XB4V2mLz WSJTheme--design-refresh--2eDQsiEp WSJTheme--design-refresh-4u--WkTDMafN " data-id="SB10596806121828464523104587634361262943098">
     <div class="WSJTheme--articleType--34Gt-vdG "> <span class="">Commodities</span></div>
     <div class="WSJTheme--headline--7VCzo7Ay ">
      <h2 class="WSJTheme--headline--unZqjb45 undefined ">
        <a class="" href="https://www.wsj.com/articles/a-gold-mine-takeover-highlights- 
         increasing-mining-sector-risk-11628433248"><span class="WSJTheme--headlineText-- 
         He1ANr9C ">A Gold Mine Takeover Highlights Increasing Mining-Sector Risk </span></a> 
      </h2>
    </div>
    <p class="WSJTheme--summary--lmOXEsbN typography--serif--1CqEfjrc ">
      <span class="WSJTheme- -summaryText--2LRaCWgJ ">
        Kyrgyzstan’s nationalization of Centerra Gold’s large mining operation is one of the 
        most brazen moves in recent years by a country to assert control over valuable natural 
        resources, mining and legal experts say.
      </span>
      <span class="WSJTheme--stats--2HBLhVc9 "></span></p>
    <div class="">
      <p class="WSJTheme--byline--1oIUvtQ3 ">Jacquie McNish and Joe Wallace</p>

      <div class="WSJTheme--timestamp--2zjbypGD ">

      <p aria-label="Updated August 8, 2021" class="WSJTheme--timestamp--22sfkNDv ">August 8, 2021</p>

     </div>

My code for scraping looks like:

      def scrape(self, src):
    source = requests.get(src).text
    soup = BeautifulSoup(source, 'lxml')
    for article in soup.find_all('article'):
        headline = article.h2.a.span.text
        if not headline:
            continue
        print(headline)

After scraping all this posts I get this:

What Parents With Unvaccinated Kids Need to Know About the Delta Variant This Summer

JPMorgan, Goldman Call Time on Work-From-Home. Their Rivals Are Ready to Pounce.

Some Vaccinated People Are Dying of Covid-19. Here’s Why Scientists Aren’t Surprised. Video Shows Demolition of Miami-Area Condo Building

How the EV Industry Is Trying to Fix Its Charging Bottleneck

Watch Chinese Astronauts’ First Spacewalk Outside New Space Station

While I should get:

A Gold Mine Takeover Highlights Increasing Mining-Sector Risk etc...

That's a link to site a scrape from: https://www.wsj.com/news/markets/oil-gold-commodities-futures

1 Answers1

2

You need to pass in your HTTP headers. check out this link and this. pass in Accept-Language and User-Agent as headers.

header = {
"Accept-Language": "en-US,en;q=0.9,ta;q=0.8",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"

}



def scrape(src):
source = requests.get(src,headers=header).text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all('article'):
    headline = article.h2.a.span.text
    if not headline:
        continue
    print(headline)
MendelG
  • 14,885
  • 4
  • 25
  • 52