
I am trying to extract the articleBody parts of the HTML. I can get the contents of the script tags, but not the "articleBody" part inside them. Please find my code below:

import requests
from bs4 import BeautifulSoup as bs
from lxml import etree
url_req="https://www.bbc.com/news/live/world-europe-61792068"
response=requests.get(url=url_req,verify=True)
soup=bs(response.text, "lxml")
soup = soup.encode('ascii', 'ignore').decode('ascii')
with open('file.xml', 'w') as f:
    f.write(soup)
with open("file.xml") as fp:
    soup = bs(fp, "lxml")
df=soup.find_all("script")

Below is the result I am hoping to get. There are many articleBody entries inside the script tags. I want the output to contain only those articleBody entries, without any other parts of the script tags.

For example:

"articleBody":"Russia's aggression in Ukraine is a game-changer, Nato Secretary General Jens Stoltenberg has said. Stoltenberg has been speaking in Brussels where defence ministers from member countries of the military alliance and a handful of other allies have been meeting to discuss the situation in Ukraine. He says progress has been made in many areas and, in a meeting with the Ukrainian defence minister last night, they discussed the "imperative need for our continued support as Russia conducts a relentless war of attrition against Ukraine". Stoltenberg says Ukraine's allies have announced additional assistance, "including much-needed heavy weapons and long range systems" and also discussed plans to support the country for the longer term and to step up Nato's "presence,

larsks
Babiqowski
    Provide what exactly you want to get. Provide a sample of what you are getting and what you expect to get. – iamtrappedman Sep 18 '22 at 13:10
    It looks as if all the content you want is contained within the ` – larsks Sep 18 '22 at 13:27

1 Answer


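This is how you can get every articleBody from this URL. The article text is embedded as JSON in a script tag of type application/ld+json, so you can find that tag, parse its contents with json.loads, and read the articleBody of each entry under liveBlogUpdate: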

import json

import requests
from bs4 import BeautifulSoup as bs

url_req = "https://www.bbc.com/news/live/world-europe-61792068"
response = requests.get(url=url_req, verify=True)

soup = bs(response.text, "html.parser")

# The live-blog data is embedded as JSON-LD in this script tag.
articles_html = soup.find("script", {"type": "application/ld+json", "class": "qa-seo-data-script"})

# Decode the JSON text inside the tag into a Python dict.
articles = json.loads(articles_html.text)

# Each entry in 'liveBlogUpdate' is one post; print only its article body.
for i in articles['liveBlogUpdate']:
    print(i['articleBody'])
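
If some entries under liveBlogUpdate turn out not to carry an articleBody key, the loop above would raise a KeyError. Here is a minimal sketch of a more defensive version of the JSON step only. The helper name extract_article_bodies and the sample payload are made up for illustration (they are not taken from the BBC page), and the sketch runs offline on that sample string instead of the live response.

```python
import json

# Hypothetical, trimmed-down stand-in for the text inside the
# application/ld+json script tag; the real payload has more fields.
SAMPLE_JSON_LD = """
{
  "@type": "LiveBlogPosting",
  "liveBlogUpdate": [
    {"@type": "BlogPosting", "headline": "First update", "articleBody": "First body text."},
    {"@type": "BlogPosting", "headline": "No body here"},
    {"@type": "BlogPosting", "headline": "Third update", "articleBody": "Third body text."}
  ]
}
"""


def extract_article_bodies(json_ld_text):
    """Return the articleBody of every update that has one."""
    data = json.loads(json_ld_text)
    # Fall back to an empty list if the key is missing entirely.
    updates = data.get("liveBlogUpdate", [])
    # Skip entries without an articleBody instead of raising KeyError.
    return [u["articleBody"] for u in updates if "articleBody" in u]


bodies = extract_article_bodies(SAMPLE_JSON_LD)
for body in bodies:
    print(body)
```

With the real page you would pass articles_html.text to the helper in place of the sample string.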
iamtrappedman