I'm trying to extract data from a website to finish a small data analysis project. Here is the code I use to fetch the page, followed by the HTML source I'm dealing with (all the divs I want to extract data from have exactly the same structure).
import requests
from bs4 import BeautifulSoup

url = "https://www.rystadenergy.com/newsevents/news/press-releases/"
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")
<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="Oil Markets" data-month="11" data-year="2020">
<a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/prices-at-stake-if-opec-increases-output-in-january-a-200-million-barrel-glut-will-build-through-may/">
<small class="mb-3 d-flex flex-wrap justify-content-between">
<time datetime="2020-11-30">
November 30, 2020
</time>
<span>
Oil Markets
</span>
</small>
<h5 class="mb-0">
Prices at stake: If OPEC+ increases output in January, a 200 million-barrel glut will build through May
</h5>
</a>
</div>
Fortunately, I succeeded in extracting the titles of the articles and their publishing dates. I first created a bs4.element.ResultSet and then wrote a loop to iterate over each date as follows, and it worked properly (the same approach worked for the article titles).
divs = soup.find_all('div', class_='col-12 col-md-6 col-lg-4 mt-3 news-events-list__item')
dates = []
for container in divs:
    date = container.find('time')
    dates.append(date['datetime'])
However, when I tried to extract the category of each article, which sits between the <span></span> tags (Oil Markets in my case), I got the error 'NoneType' object has no attribute 'text'. The code I used was:
for container in divs:
    topic = container.find('span').text
    topics.append(topic)
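To show what I think the error means, here is a minimal sketch on a made-up two-item sample (the HTML string and the names sample_html, items and results are hypothetical, not the real page). As far as I understand, find() returns None when a container has no matching tag, so calling .text on that result would raise the error above:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second item deliberately has no <span>.
sample_html = """
<div class="news-events-list__item"><time datetime="2020-11-30"></time><span> Oil Markets </span></div>
<div class="news-events-list__item"><time datetime="2020-11-27"></time></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
items = soup.find_all("div", class_="news-events-list__item")

results = []
for item in items:
    span = item.find("span")
    # find() gives None when the tag is missing, so guard before using .text
    results.append(None if span is None else span.text.strip())

print(results)
```

So at least one container on the real page presumably lacks a <span>, although I have not confirmed which one.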
The weird thing is that when I print(topics), I get a list with more elements than expected (almost 800, sometimes even more), and it mixes strings, bs4 element tags and None values. Here is a snapshot of the list:
</span>, <span> E&P, Oil Markets, Supply Chain </span>, <span> Oil Markets, Gas Markets </span>, <span> Supply Chain </span>, <span> Gas Markets </span>, <span> E&P </span>, <span> Shale </span>, <span> Corporate </span>, <span> E&P </span>, <span> Oil Markets </span>, <span> Supply Chain, Other, Renewables </span>, <span> Gas Markets </span>, <span> Oil Markets </span>, <span> Gas Markets </span>, <span> Gas Markets </span>, <span> E&P </span>, <span> Gas Markets </span>, <span> E&P </span>, <span> Supply Chain </span>, <span> Shale </span>, None, <span> Corporate </span>, <span> Shale </span>, None, <span> Renewables </span>, <span> Renewables </span>, <span> Renewables </span>, <span> E&P </span>, <span> E&P </span>, <span> E&P </span>, <span> E&P </span>, <span> Oil Markets </span>, <span> E&P </span>, <span> Supply Chain </span>, ' Oil Markets ', ' Oil Markets ', ' Supply Chain, Renewables ', ' Oil Markets ', ' Renewables ', ' E&P ', ' Renewables ', ' Supply Chain ', ' Shale ', ' E&P ', ' Shale ', ' Gas Markets ', ' Gas Markets ', ' Supply Chain ', ' Oil Markets ', ' Shale ', ' Oil Markets ', ' Corporate, Oil Markets, Other ', ' Shale ', ' Renewables ', ' Shale ', ' Supply Chain ',
My aim is to extract the categories as a list of strings (there should be 207 categories in total) so that I can later put them in a data frame along with the dates and titles.
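For reference, this is a rough sketch of the output shape I'm after, again on a made-up sample rather than the real page. The function name extract_rows and the sample titles are hypothetical, and falling back to the div's data-category attribute when the <span> is missing is only an assumption on my part, based on the attribute visible in the HTML above:

```python
from bs4 import BeautifulSoup

# Hypothetical sample mimicking the page structure; the second item has no <span>.
SAMPLE_HTML = """
<div class="col-12 news-events-list__item" data-category="Oil Markets">
  <a href="#"><small><time datetime="2020-11-30">November 30, 2020</time>
  <span> Oil Markets </span></small><h5 class="mb-0"> First title </h5></a>
</div>
<div class="col-12 news-events-list__item" data-category="Shale">
  <a href="#"><small><time datetime="2020-11-27">November 27, 2020</time></small>
  <h5 class="mb-0"> Second title </h5></a>
</div>
"""


def extract_rows(html):
    """Return one dict per article with date, category and title as strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="news-events-list__item"):
        time_tag = item.find("time")
        span = item.find("span")
        title = item.find("h5")
        rows.append({
            "date": time_tag["datetime"] if time_tag is not None else None,
            # Assumed fallback: use data-category when the <span> is absent.
            "category": (span.get_text(strip=True) if span is not None
                         else item.get("data-category")),
            "title": title.get_text(strip=True) if title is not None else None,
        })
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

A list of dicts like this could then be passed to a data frame constructor, with one row per article.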
I've tried the solutions here, here and here, but with no success. Could someone help me fix this problem?