
I'm trying to extract data from a website in order to finish a small data analysis project. Here is my setup code and the HTML source I'm dealing with (all the divs I want to extract data from have exactly the same structure):

import requests
from bs4 import BeautifulSoup

url = "https://www.rystadenergy.com/newsevents/news/press-releases/"
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")


   <div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="Oil Markets" data-month="11" data-year="2020">
     <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/prices-at-stake-if-opec-increases-output-in-january-a-200-million-barrel-glut-will-build-through-may/">
      <small class="mb-3 d-flex flex-wrap justify-content-between">
       <time datetime="2020-11-30">
        November 30, 2020
       </time>
       <span>
        Oil Markets
       </span>
      </small>
      <h5 class="mb-0">
       Prices at stake: If OPEC+ increases output in January, a 200 million-barrel glut will build through May
      </h5>
     </a>
    </div>

Fortunately, I succeeded in extracting the titles of the articles and their publishing dates. I first created a bs4.element.ResultSet and then wrote a loop to iterate through each date, as follows, and it worked properly (the same happened for the titles of the articles).

divs = soup.find_all('div', class_='col-12 col-md-6 col-lg-4 mt-3 news-events-list__item')

dates = []
for container in divs:
    date = container.find('time')
    dates.append(date['datetime'])
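
A minimal sketch of that title loop, assuming the title sits in the <h5> heading shown in the HTML sample above:

titles = []
for container in divs:
    # the article title is the text of the <h5> heading inside each div
    title = container.find('h5').text.strip()
    titles.append(title)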

However, when I tried to extract the category of each article, which lives between <span></span> tags ("Oil Markets" in my case), I got the error 'NoneType' object has no attribute 'text'. The code I used to do so was:

topics = []
for container in divs:
    topic = container.find('span').text
    topics.append(topic)

The weird thing here is that when I print(topics), I get a list containing more elements than the actual number of articles (almost 800 elements, sometimes even more), and the elements are mixed: strings and bs4 element tags at the same time. Here is a snapshot of the list I got:

</span>, <span> E&amp;P, Oil Markets, Supply Chain </span>, <span> Oil Markets, Gas Markets </span>, <span> Supply Chain </span>, <span> Gas Markets </span>, <span> E&amp;P </span>, <span> Shale </span>, <span> Corporate </span>, <span> E&amp;P </span>, <span> Oil Markets </span>, <span> Supply Chain, Other, Renewables </span>, <span> Gas Markets </span>, <span> Oil Markets </span>, <span> Gas Markets </span>, <span> Gas Markets </span>, <span> E&amp;P </span>, <span> Gas Markets </span>, <span> E&amp;P </span>, <span> Supply Chain </span>, <span> Shale </span>, None, <span> Corporate </span>, <span> Shale </span>, None, <span> Renewables </span>, <span> Renewables </span>, <span> Renewables </span>, <span> E&amp;P </span>, <span> E&amp;P </span>, <span> E&amp;P </span>, <span> E&amp;P </span>, <span> Oil Markets </span>, <span> E&amp;P </span>, <span> Supply Chain </span>, ' Oil Markets ', ' Oil Markets ', ' Supply Chain, Renewables ', ' Oil Markets ', ' Renewables ', ' E&P ', ' Renewables ', ' Supply Chain ', ' Shale ', ' E&P ', ' Shale ', ' Gas Markets ', ' Gas Markets ', ' Supply Chain ', ' Oil Markets ', ' Shale ', ' Oil Markets ', ' Corporate, Oil Markets, Other ', ' Shale ', ' Renewables ', ' Shale ', ' Supply Chain ',

My aim is to extract the categories as a list of strings (there should be 207 categories in total) in order to populate them later into a data frame along with the dates and titles.
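
A minimal sketch of that final step, assuming dates, titles and topics end up as equal-length lists of strings:

import pandas as pd

# one row per article: publishing date, title and category
df = pd.DataFrame({'date': dates, 'title': titles, 'category': topics})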

I've tried the solutions here and here and here but with no success. I was wondering if someone can help me to fix this problem.

Rami_Kh
  • Using your html example (enclosed with `...` tags) - cannot reproduce with `soup = BeautifulSoup(html,'html.parser')` and the rest of your code. – wwii Dec 02 '20 at 18:51
  • Please post the complete Traceback. Your [mre] should show how you **make** `soup`. – wwii Dec 02 '20 at 18:55
  • I think he must provide a more complete DOM example; probably when he calls `find_all('div'.....)` he gets the desired `div` elements plus others. – Nestor Dec 02 '20 at 19:01
  • When you [catch the error](https://docs.python.org/3/tutorial/errors.html#handling-exceptions) and inspect/print relevant data in the except suite - is it what you expect? – wwii Dec 02 '20 at 19:03
  • Most likely one of the div tags in `divs` does not have a span tag *in* it. – wwii Dec 02 '20 at 19:06
  • @wwii I used results = requests.get(url) and then soup = BeautifulSoup(results.text, "html.parser"). I have already edited the code. – Rami_Kh Dec 02 '20 at 19:10
  • Still cannot reproduce: please provide a [mre]. If one of the div tags in `divs` is missing a `span` tag then `container.find('span')` will return None. – wwii Dec 02 '20 at 19:15
  • @wwii The code for dates worked properly but when it comes to category which lives in span it returns the mentioned error. The other weird thing that the length of the list is changing whenever I run the code again! – Rami_Kh Dec 02 '20 at 19:16
  • @wwii I did update the code again and included the url I want to scrape data from. The div I posted is a standard one. I mean, all the data I want to extract live in the div, and the other divs share exactly the same structure. – Rami_Kh Dec 02 '20 at 19:20
  • Possibly related:['NoneType' object has no attribute 'text' in BeautifulSoup](https://stackoverflow.com/questions/53980144/nonetype-object-has-no-attribute-text-in-beautifulsoup). You should search with the error message and read all the Q&A's - see if any reply. Search terms: `beautifulsoup .find error that 'NoneType' object has no attribute 'text' site:stackoverflow.com` – wwii Dec 02 '20 at 19:21
  • @wwii I did so already but with no satisfactory results, unfortunately. – Rami_Kh Dec 02 '20 at 19:23
  • @Rami_Kh I updated my answer, I hope it helps – Nestor Dec 02 '20 at 20:15
  • @Nestor Thank you so much. It worked for me. But I have two questions if you don't mind. 1- What exactly is the purpose of try and except in the loop? 2- Why does BeautifulSoup sometimes return duplicate values? When I ran my code, the (topic) list contained duplicate values. Thanks again – Rami_Kh Dec 03 '20 at 08:52
  • @Rami_Kh Glad it helped you. About the questions: 1 - `try...except` will catch the Exception raised when you try to access the `.text` attribute if the result of calling `find` is None (as said in the answer, it is one way to do it). I suggest you take a look at the Python docs about it. 2 - I suppose those duplicate values are due to some articles having the same category/topic – Nestor Dec 03 '20 at 12:57

1 Answer


Your code is fine; you just have to add a try/except to avoid crashing on the articles that don't have a category.

The snippet below illustrates it:

from bs4 import BeautifulSoup
import requests

html = BeautifulSoup(requests.get('https://www.rystadenergy.com/newsevents/news/press-releases/').text, 'html.parser')

divs = html.find_all('div', class_='col-12 col-md-6 col-lg-4 mt-3 news-events-list__item')

for container in divs:
    topic = container.find('span')
    if not topic:
        # this div has no <span> element, i.e. no category
        print(container)

Output:

<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="" data-month="1" data-year="2020"> <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/winners-gullkronen-2020/"> <small class="mb-3 d-flex flex-wrap justify-content-between"> <time datetime="2020-01-28">January 28, 2020</time> </small> <h5 class="mb-0"> Rystad Energy announces winners for Gullkronen 2020 </h5> </a> </div>
<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="" data-month="1" data-year="2020"> <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/nominees-gullkronen-2020/"> <small class="mb-3 d-flex flex-wrap justify-content-between"> <time datetime="2020-01-23">January 23, 2020</time> </small> <h5 class="mb-0"> Rystad Energy announces nominees for Gullkronen 2020 </h5> </a> </div>

As you see, no span element.

So in your case:

topics = []
for container in divs:
    try:
        topic = container.find('span').text.strip()
    except AttributeError:
        # find('span') returned None, so accessing .text raised AttributeError
        topic = ''
    finally:
        topics.append(topic)

Note that this is just one way to do it :)
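
As an alternative (a minimal sketch, not part of the original answer), the category could also be read from each div's data-category attribute, which in the sample HTML mirrors the span text and is left empty for the articles that have no span:

topics = []
for container in divs:
    # data-category is "" for the divs without a <span> element
    topics.append(container.get('data-category', '').strip())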

Nestor
  • Are you saying you could not reproduce the OP's problem? If so this isn't an answer. – wwii Dec 02 '20 at 18:56
  • `No is not` - if this is not an answer, you should delete it and post a comment. – wwii Dec 02 '20 at 19:01
  • Like I did - just say you could not reproduce the problem. – wwii Dec 02 '20 at 19:04
  • @wwii I have updated the post. I removed previous comments to clean the comment section :) – Nestor Dec 02 '20 at 20:06
  • Instead of the try/except you could also `continue` if topic is None... `topic = container.find(...); if topic is None: continue; topic = topic.text.strip()`. – wwii Dec 02 '20 at 20:50
  • Perhaps he wants to match the articles' titles against the category, so knowing that the category is empty could be handy – Nestor Dec 02 '20 at 21:03