Python scraping BeautifulSoup, LXML

Question

I am trying to extract the "articleBody" values from the HTML of a page. I can get the contents of the script tags, but not the "articleBody" part inside them. My code is below:

import requests
from bs4 import BeautifulSoup as bs
from lxml import etree
url_req = "https://www.bbc.com/news/live/world-europe-61792068"
response = requests.get(url=url_req, verify=True)
soup = bs(response.text, "lxml")
soup = soup.encode('ascii', 'ignore').decode('ascii')
with open('file.xml', 'w') as f:
    f.write(soup)
with open("file.xml") as fp:
    soup = bs(fp, "lxml")
df = soup.find_all("script")

Below is the result I am hoping to get. There are many "articleBody" entries inside the script tags. I want the output to contain only those articleBody parts, without anything else from the script tags.

For example:

"articleBody":"Russia's aggression in Ukraine is a game-changer, Nato Secretary General Jens Stoltenberg has said. Stoltenberg has been speaking in Brussels where defence ministers from member countries of the military alliance and a handful of other allies have been meeting to discuss the situation in Ukraine. He says progress has been made in many areas and, in a meeting with the Ukrainian defence minister last night, they discussed the "imperative need for our continued support as Russia conducts a relentless war of attrition against Ukraine". Stoltenberg says Ukraine's allies have announced additional assistance, "including much-needed heavy weapons and long range systems" and also discussed plans to support the country for the longer term and to step up Nato's "presence,

Provide what exactly you want to get. Provide a sample of what you are getting and what you expect to get. — iamtrappedman, Sep 18 '22 at 13:10
It looks as if all the content you want is contained within the ` — larsks, Sep 18 '22 at 13:27

score 1 · Answer 1 · answered Sep 18 '22 at 13:59

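This is how you can get every articleBody from this URL. The data is stored as JSON in a script tag of type application/ld+json, so find that tag, parse its text with json.loads, and loop over the liveBlogUpdate entries.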

import json

import requests
from bs4 import BeautifulSoup as bs

url_req = "https://www.bbc.com/news/live/world-europe-61792068"
response = requests.get(url=url_req, verify=True)

soup = bs(response.text, "html.parser")
# The live-blog data is embedded as JSON-LD in a single script tag
articles_html = soup.find("script", {"type": "application/ld+json", "class": "qa-seo-data-script"})

# Decode the JSON text held by the script tag
articles = json.loads(articles_html.text)

# Each entry in 'liveBlogUpdate' is one post; print its body text
for i in articles['liveBlogUpdate']:
    print(i['articleBody'])
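
The code above depends on the class name qa-seo-data-script and on the top-level liveBlogUpdate key, both of which could change. As a more tolerant variant, here is a minimal sketch that walks the decoded JSON recursively and collects every "articleBody" string wherever it appears. The helper name iter_article_bodies and the sample JSON are made up for illustration; they are not taken from the BBC page.

```python
import json


def iter_article_bodies(node):
    """Yield every "articleBody" string found anywhere in decoded JSON.

    Hypothetical helper for illustration; not part of any library.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "articleBody" and isinstance(value, str):
                yield value
            else:
                # Keep descending into nested objects and arrays
                yield from iter_article_bodies(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_article_bodies(item)


# Made-up sample that only mimics the shape used in the answer
sample_json = """
{
  "@type": "LiveBlogPosting",
  "liveBlogUpdate": [
    {"@type": "BlogPosting", "headline": "First", "articleBody": "First update text."},
    {"@type": "BlogPosting", "headline": "Second", "articleBody": "Second update text."}
  ]
}
"""

bodies = list(iter_article_bodies(json.loads(sample_json)))
for body in bodies:
    print(body)
```

With the real page, the same helper could be fed json.loads(articles_html.text) in place of the sample string.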
