
This is a little personal utility I built for fun. I have a Listbox where the titles and times of the news are scraped from 2 links and printed after clicking the "View Titles" button. This works correctly. All ok!


Now I would like to select a news title from the Listbox, click the "View Content" button, and view the content of that news item in the multiline textbox below. Note that the title is the same as the link to the news content. But I have a problem with the function that should do this:

def content():
    if title.select:

        #click on title-link
        driver.find_element_by_tag_name("title").click()

        #Download Content to class for every title
        content_download =(" ".join([span.text for span in div.select("text mbottom")]))

    #Print Content in textbox
        textbox_download.insert(tk.END, content_download)

So I imagined that to get this, I would have to simulate a click on the news title to open it (in the HTML it is title), then select the text of the content (in the HTML it is text mbottom), and then copy it into the textbox. Is that the right approach? What do you think? Obviously I have written the code poorly and it doesn't work; I'm not very good at scraping. Could anyone help me? Thank you

The complete code is below (it runs correctly and scrapes the titles and times; I don't call the content function from the button). Aside from the function above, the code works well and fetches the titles and news times.

from tkinter import *
from tkinter import ttk
import tkinter as tk
import sqlite3
import random
import tkinter.font as tkFont

window=Tk()
window.title("x")
window.geometry("800x800")

textbox_title = tk.Listbox(window, width=80, height=16, font=('helvetic', 12), selectbackground="#960000", selectforeground="white", bg="white") # previously this was self.tutti_pronostici, to display the calls from the other window
textbox_title.place(x=1, y=1)

textbox_download = tk.Listbox(window, width=80, height=15, font=('helvetic', 12), selectbackground="#960000", selectforeground="white", bg="white") # previously this was self.tutti_pronostici, to display the calls from the other window
textbox_download.place(x=1, y=340)

#Download All Titles and Time
def all_titles():

    allnews = []

    import requests
    from bs4 import BeautifulSoup

    # mock browser request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
    }


    #ATALANTA
    site_atalanta = requests.get('https://www.tuttomercatoweb.com/atalanta/', headers=headers)
    soup = BeautifulSoup(site_atalanta.content, 'html.parser')

    news = soup.find_all('div', attrs={"class": "tcc-list-news"})

    for each in news:
        for div in each.find_all("div"):
            time= (div.find('span', attrs={'class': 'hh serif'}).text)
            title=(" ".join([span.text for span in div.select("a > span")]))

            news = (f" {time} {'ATALANTA'}, {title} (TMW)")
            allnews.append(news)        


    #BOLOGNA
    site_bologna = requests.get('https://www.tuttomercatoweb.com/bologna/', headers=headers)
    soup = BeautifulSoup(site_bologna.content, 'html.parser')

    news = soup.find_all('div', attrs={"class": "tcc-list-news"})

    for each in news:
        for div in each.find_all("div"):
            time= (div.find('span', attrs={'class': 'hh serif'}).text)
            title=(" ".join([span.text for span in div.select("a > span")]))

            news = (f" {time} {'BOLOGNA'}, {title} (TMW)")
            allnews.append(news)           
                            

    allnews.sort(reverse=True)

    for news in allnews:
        textbox_title.insert(tk.END, news)

#Download Content of News
def content():
    if titolo.select:

        #click on title-link
        driver.find_element_by_tag_name("title").click()

        #Download Content to class for every title
        content_download =(" ".join([span.text for span in div.select("text mbottom")]))

        #Print Content in textbox
        textbox_download.insert(tk.END, content_download)



button = tk.Button(window, text="View Titles", command= lambda: [all_titles()])
button.place(x=1, y=680)

button2 = tk.Button(window, text="View Content", command= lambda: [content()])
button2.place(x=150, y=680)

window.mainloop()
  • I would first scrape all data - title, time, content - and later display it. – furas Mar 29 '22 at 22:18
  • @furas In my code there are only the titles from 2 web pages, but in reality I would like to add a hundred or more. Are you saying I should scrape the content up front (in addition to title and time)? That would be a very long scrape once other sites are added. I would like to scrape only the content of the title I select in the Listbox - that is, scrape a single (1) content item when the button is clicked. I select a title, click the button, and its content is scraped –  Mar 29 '22 at 22:28
  • if it is all on the same page then it would be simpler and wouldn't have to take longer. To get the content you have to load the page again in the browser, and that may take more time. And you have to remember whether a title needs to be loaded from `.../atalanta/` or `.../bologna/`, but you always search for titles on the last loaded page - `bologna` - even when you search for a title from `atalanta` – furas Mar 29 '22 at 22:32
  • you could at least keep each title together with the URL of its content page - then you could load the content page directly, without loading `.../atalanta/` again, searching for the title, clicking it, and loading the content page – furas Mar 29 '22 at 22:34
  • @furas Can you give me a runnable example of what you mean? I'm not sure I understand. Thanks (obviously, if it solves the problem, I will upvote and accept the answer) –  Mar 29 '22 at 22:37

1 Answer


When you get the title and time, you can also grab the link to the page with the details - and keep them as a pair.

            news = f" {time} '{place}', {title} (TMW)"
            link = div.find('a')['href']

            results.append( [news, link] )

and later you can display only the news text, but when you select a title you can get its index, take the link from allnews, and download the page directly - using requests instead of driver

def content():
    # tuple with indexes of all selected titles
    selection = listbox_title.curselection()
    print('selection:', selection)

    if selection:
        
        item = allnews[selection[-1]]
        print('item:', item)

        url = item[1]
        print('url:', url)

To select the full news text you have to use select(".text.mbottom") with the dots - the dots make it a CSS class selector, whereas "text mbottom" would look for tags with those names.
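
A minimal sketch of the difference, using made-up HTML:

from bs4 import BeautifulSoup

html = '<div class="text mbottom"><p>Body of the news</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# "text mbottom" means: a <mbottom> tag inside a <text> tag - it matches nothing
print(soup.select("text mbottom"))    # []

# ".text.mbottom" means: one element carrying both classes - it matches the div
print(soup.select(".text.mbottom"))   # [<div class="text mbottom">...</div>]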

And to display the news content it would be better to use Text() instead of Listbox().
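
For example, a minimal side-by-side sketch of the two widgets (a hypothetical snippet, not part of the original code):

import tkinter as tk

root = tk.Tk()

# Listbox stores one string per row - good for titles, awkward for article text
listbox = tk.Listbox(root)
listbox.insert('end', 'some title')
listbox.delete(0, 'end')           # indices are row numbers

# Text stores free-flowing multiline text - better for the news content
text = tk.Text(root)
text.insert('end', 'first line\nsecond line')
text.delete('1.0', 'end')          # indices are 'line.column' strings

root.mainloop()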

Because you run the same code for ATALANTA and BOLOGNA, I moved it into a function get_data_for(place), and now I can even use a for-loop to run it for more places.

for place in ['atalanta',  'bologna']: 
    results = get_data_for(place)
    allnews += results

Full working code (1) - I tried to keep only the important elements.

I used pack() instead of place() because it lets the window be resized, and it will then also resize the Listbox() and Text().

import tkinter as tk   # PEP8: `import *` is not preferred
from tkinter import ttk
import requests
from bs4 import BeautifulSoup

# PEP8: all imports at the beginning

# --- functions ---   # PEP8: all functions directly after imports

def get_data_for(place):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
    }
    
    results = []
    
    response = requests.get(f'https://www.tuttomercatoweb.com/{place}/', headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    news = soup.find_all('div', attrs={"class": "tcc-list-news"})

    for each in news:
        for div in each.find_all("div"):
            time = div.find('span', attrs={'class': 'hh serif'}).text
            title = " ".join(span.text for span in div.select("a > span"))
            news = f" {time} {place.upper()}, {title} (TMW)"
            link = div.find('a')['href']
            results.append( [news, link] )
    
    return results

def all_titles():
    global allnews  # inform function to use global variable instead of local variable

    allnews = []

    for place in ['atalanta',  'bologna']: 
        print('search:', place)
        results = get_data_for(place)
        print('found:', len(results))
        allnews += results

    allnews.sort(reverse=True)

    listbox_title.delete('0', 'end')

    for news in allnews:
        listbox_title.insert('end', news[0])

#Download Content of News
def content():
    # tuple with indexes of all selected titles
    selection = listbox_title.curselection()
    print('selection:', selection)

    if selection:
        
        item = allnews[selection[-1]]
        print('item:', item)
        url = item[1]
        print('url:', url)

        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
        }
             
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')

        content_download = "\n".join(item.get_text() for item in soup.select("div.text.mbottom"))

        text_download.delete('1.0', 'end') # remove previous content
        text_download.insert('end', content_download)

# --- main ---

allnews = []  # global variable with default value at start

window = tk.Tk()
window.geometry("800x800")

listbox_title = tk.Listbox(window, selectbackground="#960000", selectforeground="white", bg="white")
listbox_title.pack(fill='both', expand=True, pady=5, padx=5)

text_download = tk.Text(window, bg="white")
text_download.pack(fill='both', expand=True, pady=0, padx=5)

buttons_frame = tk.Frame(window)
buttons_frame.pack(fill='x')

button1 = tk.Button(buttons_frame, text="View Titles", command=all_titles)  # don't use `[]` to execute functions
button1.pack(side='left', pady=5, padx=5)

button2 = tk.Button(buttons_frame, text="View Content", command=content)   # don't use `[]` to execute functions
button2.pack(side='left', pady=5, padx=(0,5))

window.mainloop()

Result: (screenshot of the running app)


EDIT:

Problem with sorting: today's titles end up at the end of the list but they should be at the beginning - all because the entries are sorted using only the time, while they would need to be sorted by date and time, or by some other numeric key.

You can enumerate the tcc-list-news blocks so that every day gets its own number, and then the entries sort (almost) correctly. Because you want to sort in reverse order, you may need -number instead of number to get the correct order.

    for number, each in enumerate(news):
        for div in each.find_all("div"):
            time  = div.find('span', attrs={'class': 'hh serif'}).text
            title = " ".join(span.text for span in div.select("a > span"))
            news  = f" {time} {place.upper()}, {title} (TMW)"
            link  = div.find('a')['href']
            results.append( [-number, news, link] )

and after sorting

    for number, news, url in allnews:
        listbox_title.insert('end', news)

Full working code (2)

import tkinter as tk   # PEP8: `import *` is not preferred
from tkinter import ttk
import requests
from bs4 import BeautifulSoup

# PEP8: all imports at the beginning

# --- functions ---   # PEP8: all functions directly after imports

def get_data_for(place):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
    }
    
    results = []
    
    response = requests.get(f'https://www.tuttomercatoweb.com/{place}/', headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    news = soup.find_all('div', attrs={"class": "tcc-list-news"})

    for number, each in enumerate(news):
        for div in each.find_all("div"):
            time  = div.find('span', attrs={'class': 'hh serif'}).text
            title = " ".join(span.text for span in div.select("a > span"))
            news  = f" {time} {place.upper()}, {title} (TMW)"
            link  = div.find('a')['href']
            results.append( [-number, news, link] )
    
    return results

def all_titles():
    global allnews  # inform function to use global variable instead of local variable

    allnews = []

    for place in ['atalanta',  'bologna']: 
        print('search:', place)
        results = get_data_for(place)
        print('found:', len(results))
        allnews += results

    allnews.sort(reverse=True)

    listbox_title.delete('0', 'end')
    
    for number, news, url in allnews:
        listbox_title.insert('end', news)

#Download Content of News
def content():
    # tuple with indexes of all selected titles
    selection = listbox_title.curselection()
    print('selection:', selection)

    if selection:
        
        item = allnews[selection[-1]]
        print('item:', item)
        url = item[2]
        print('url:', url)

        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
        }
             
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')

        content_download = "\n".join(item.get_text() for item in soup.select("div.text.mbottom"))

        text_download.delete('1.0', 'end') # remove previous content
        text_download.insert('end', content_download)

# --- main ---

allnews = []  # global variable with default value at start

window = tk.Tk()
window.geometry("800x800")

listbox_title = tk.Listbox(window, selectbackground="#960000", selectforeground="white", bg="white")
listbox_title.pack(fill='both', expand=True, pady=5, padx=5)

text_download = tk.Text(window, bg="white")
text_download.pack(fill='both', expand=True, pady=0, padx=5)

buttons_frame = tk.Frame(window)
buttons_frame.pack(fill='x')

button1 = tk.Button(buttons_frame, text="View Titles", command=all_titles)  # don't use `[]` to execute functions
button1.pack(side='left', pady=5, padx=5)

button2 = tk.Button(buttons_frame, text="View Content", command=content)   # don't use `[]` to execute functions
button2.pack(side='left', pady=5, padx=(0,5))

window.mainloop()

BTW

Because you sort in reverse order, you get 00:30 bologna before 00:30 atalanta - to get 00:30 atalanta before 00:30 bologna you would have to keep time and place as separate values and use key= in sort() to pass a function that reverses only the time but not the place and number. Maybe it would be simpler to put everything in a pandas.DataFrame, which has a better method for sorting.
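
A sketch of that key= idea, assuming entries were kept as [number, time, place, title] with time as an 'HH:MM' string (a hypothetical layout, different from the lists used above):

def sort_key(item):
    number, time, place, title = item
    hh, mm = time.split(':')
    # negate the time (converted to minutes) so it sorts descending,
    # while number, place and title still sort ascending
    return (number, -(int(hh) * 60 + int(mm)), place, title)

allnews.sort(key=sort_key)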


Version with pandas.DataFrame and sort_values()

df = df.sort_values(by=['number', 'time', 'place', 'title'], ascending=[True, False, True, True])

If you use the order 'title', 'place' instead of 'place', 'title', then the same titles from different places end up next to each other.
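
That is, the same call with the last two columns swapped:

df = df.sort_values(by=['number', 'time', 'title', 'place'], ascending=[True, False, True, True])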

Full working code (3)

import tkinter as tk   # PEP8: `import *` is not preferred
from tkinter import ttk
import requests
from bs4 import BeautifulSoup
import pandas as pd

# PEP8: all imports at the beginning

# --- functions ---   # PEP8: all functions directly after imports

def get_data_for(place):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
    }
    
    results = []
    
    response = requests.get(f'https://www.tuttomercatoweb.com/{place}/', headers=headers)
    print('url:', response.url)
    print('status:', response.status_code)
    #print('html:', response.text[:1000])
    
    soup = BeautifulSoup(response.content, 'html.parser')

    news = soup.find_all('div', attrs={"class": "tcc-list-news"})

    for number, each in enumerate(news):
        for div in each.find_all("div"):
            time  = div.find('span', attrs={'class': 'hh serif'}).text
            title = " ".join(span.text for span in div.select("a > span"))
            news = f" {time} {place.upper()}, {title} (TMW)"
            link  = div.find('a')['href']
            results.append( [number, time, place, title, news, link] )
    
    return results

def all_titles():
    global df
    
    allnews = []  # local variable

    for place in ['atalanta',  'bologna']: 
        print('search:', place)
        results = get_data_for(place)
        print('found:', len(results))
        allnews += results
        text_download.insert('end', f"search: {place}\nfound: {len(results)}\n")

    df = pd.DataFrame(allnews, columns=['number', 'time', 'place', 'title', 'news', 'link'])
    df = df.sort_values(by=['number', 'time', 'place', 'title'], ascending=[True, False, True, True])
    df = df.reset_index()
                      
    listbox_title.delete('0', 'end')
    
    for index, row in df.iterrows():
        listbox_title.insert('end', row['news'])

#Download Content of News
def content():
    # tuple with indexes of all selected titles
    selection = listbox_title.curselection()
    print('selection:', selection)

    if selection:
        
        item = df.iloc[selection[-1]]
        #print('item:', item)
        
        url = item['link']
        #print('url:', url)

        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
        }
             
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')

        content_download = "\n".join(item.get_text() for item in soup.select("div.text.mbottom"))

        text_download.delete('1.0', 'end') # remove previous content
        text_download.insert('end', content_download)

# --- main ---

df = None

window = tk.Tk()
window.geometry("800x800")

listbox_title = tk.Listbox(window, selectbackground="#960000", selectforeground="white", bg="white")
listbox_title.pack(fill='both', expand=True, pady=5, padx=5)

text_download = tk.Text(window, bg="white")
text_download.pack(fill='both', expand=True, pady=0, padx=5)

buttons_frame = tk.Frame(window)
buttons_frame.pack(fill='x')

button1 = tk.Button(buttons_frame, text="View Titles", command=all_titles)  # don't use `[]` to execute functions
button1.pack(side='left', pady=5, padx=5)

button2 = tk.Button(buttons_frame, text="View Content", command=content)   # don't use `[]` to execute functions
button2.pack(side='left', pady=5, padx=(0,5))

window.mainloop()

EDIT:

Last version, with a scrollbar on the Listbox, a ScrolledText widget for the content, a double-click binding on the titles, and requests_cache to cache downloaded pages. (screenshot of the final app)

Full working code (4)

import tkinter as tk   # PEP8: `import *` is not preferred
from tkinter import ttk
from tkinter.scrolledtext import ScrolledText  # https://docs.python.org/3/library/tkinter.scrolledtext.html
import requests
import requests_cache  # https://github.com/reclosedev/requests-cache
from bs4 import BeautifulSoup
import pandas as pd

# PEP8: all imports at the beginning

# --- functions ---   # PEP8: all functions directly after imports

def get_data_for(place):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
    }

    results = []

    response = requests.get(f'https://www.tuttomercatoweb.com/{place}/', headers=headers)
    print('url:', response.url)
    print('status:', response.status_code)
    #print('html:', response.text[:1000])

    soup = BeautifulSoup(response.content, 'html.parser')

    news = soup.find_all('div', attrs={"class": "tcc-list-news"})

    for number, each in enumerate(news):
        for div in each.find_all("div"):
            time  = div.find('span', attrs={'class': 'hh serif'}).text
            title = " ".join(span.text for span in div.select("a > span"))
            news = f" {time} {place.upper()}, {title} (TMW)"
            link  = div.find('a')['href']
            results.append( [number, time, place, title, news, link] )

    return results

def all_titles():
    global df

    allnews = []  # local variable

    for place in ['atalanta',  'bologna']:
        print('search:', place)
        results = get_data_for(place)
        print('found:', len(results))
        allnews += results
        text_download.insert('end', f"search: {place}\nfound: {len(results)}\n")

    df = pd.DataFrame(allnews, columns=['number', 'time', 'place', 'title', 'news', 'link'])
    df = df.sort_values(by=['number', 'time', 'place', 'title'], ascending=[True, False, True, True])
    df = df.reset_index()

    listbox_title.delete('0', 'end')

    for index, row in df.iterrows():
        listbox_title.insert('end', row['news'])

def content(event=None):   # `command=` executes without `event`, but `bind` executes with `event` - so it needs default value
    # tuple with indexes of all selected titles
    selection = listbox_title.curselection()
    print('selection:', selection)

    if selection:

        item = df.iloc[selection[-1]]
        #print('item:', item)

        url = item['link']
        #print('url:', url)

        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
        }

        # keep downloaded pages in an SQLite database
        # https://github.com/reclosedev/requests-cache
        # https://sqlite.org/index.html
        session = requests_cache.CachedSession('titles')
        response = session.get(url, headers=headers)
        #response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')

        content_download = "\n".join(item.get_text() for item in soup.select("div.text.mbottom"))

        text_download.delete('1.0', 'end') # remove previous content
        text_download.insert('end', content_download)

# --- main ---

df = None

window = tk.Tk()
window.geometry("800x800")

# ---
# [Tkinter: How to display Listbox with Scrollbar — furas.pl](https://blog.furas.pl/python-tkitner-how-to-display-listbox-with-scrollbar-gb.html)

frame_title = tk.Frame(window)
frame_title.pack(fill='both', expand=True, pady=5, padx=5)

listbox_title = tk.Listbox(frame_title, selectbackground="#960000", selectforeground="white", bg="white")
listbox_title.pack(side='left', fill='both', expand=True)

scrollbar_title = tk.Scrollbar(frame_title)
scrollbar_title.pack(side='left', fill='y')

scrollbar_title['command'] = listbox_title.yview
listbox_title.config(yscrollcommand=scrollbar_title.set)

listbox_title.bind('<Double-Button-1>', content)  # it executes `content(event)`

# ----

text_download = ScrolledText(window, bg="white")
text_download.pack(fill='both', expand=True, pady=0, padx=5)

# ----

buttons_frame = tk.Frame(window)
buttons_frame.pack(fill='x')

button1 = tk.Button(buttons_frame, text="View Titles", command=all_titles)  # don't use `[]` to execute functions
button1.pack(side='left', pady=5, padx=5)

button2 = tk.Button(buttons_frame, text="View Content", command=content)   # don't use `[]` to execute functions
button2.pack(side='left', pady=5, padx=(0,5))

window.mainloop()
– furas
  • Fantastic. I noticed that if I click on View Titles a second time (for example to refresh), the titles are duplicated. For example: search ATALANTA 69, search BOLOGNA 69, and then again ATALANTA 69, BOLOGNA 69. Is there a way to remove the titles from the previous search and display only the titles of the new search? –  Mar 29 '22 at 23:32
  • it needs a call to delete all items from the `Listbox()` - similar to the call I use to remove the previous text from the `Text()` – furas Mar 29 '22 at 23:34
  • `listbox_title.delete('0', 'end')` - I added this in the code. – furas Mar 29 '22 at 23:37
  • BTW: [Tkinter: How to display Listbox with Scrollbar](https://blog.furas.pl/python-tkitner-how-to-display-listbox-with-scrollbar-gb.html) – furas Mar 29 '22 at 23:40
  • I'm sorry if I answer intermittently, in pauses, but I'm studying your code for a moment before replying. I noticed a strange thing. The items are scraped up to 23:38 yesterday, while today's are not scraped. For example, here is the atalanta page ( https://ibb.co/mGRR842 ). I also checked the HTML and the class still seems to be tcc-list-news. Can you see why it doesn't scrape today's titles? Can you solve this strange problem? Thank you –  Mar 30 '22 at 00:29
  • it scrapes all titles, but you sort all news using only `time`, so the days get mixed and today's news ends up at the end of the list. You would have to keep `date time` (like `2022.03.30 02:46`) to sort correctly. OR you shouldn't sort, and instead use a different method to combine the values for `ATALANTA` and `BOLOGNA`. If you remove `.sort()` then you get the titles in the correct order, but first sorted correctly only for `ATALANTA`, and after that sorted correctly only for `BOLOGNA` – furas Mar 30 '22 at 00:45
  • I added code which uses `-number` to keep the days in the correct order. – furas Mar 30 '22 at 01:05
  • I added version with `pandas.DataFrame` – furas Mar 30 '22 at 01:18
  • Thank you so much for the additional things you added. Now it's late. Give me time to try the code tomorrow (which will obviously be well written) and I'll get back to you tomorrow. Thanks, see you tomorrow –  Mar 30 '22 at 01:40
  • if you have a new problem then create a new question on a new page. Stack Overflow is not a forum, and each question should resolve only one problem. – furas Mar 30 '22 at 02:28
  • No, I don't have a new problem; I just thanked you :) But now I have looked carefully at your code, and there are 2 problems related to the main question about content, probably due to distraction, because last night it was late and you were tired. I have saved each of your modified and updated versions. The problems seem to have appeared after your change involving the time. I continue in subsequent comments –  Mar 30 '22 at 12:25
  • PROBLEM 1: From your version 5 or 6 onward, content is no longer scraped. I click on the button but I get an error. The error started when you fixed the time (the last one or two versions before using Pandas). Content scraping doesn't work with Pandas either, while your early versions up to 4 or 5 (before adjusting the time) worked fine with content. PROBLEM 2: In the last 2 versions with Pandas, content is not scraped, while with the previous versions, at the date change, content was scraped. –  Mar 30 '22 at 12:27
  • I'm testing your various versions right now. Up to version 4 everything is ok. The problems with content scraping start from when you changed the time. Can you fix the content scraping both in the version with Pandas and in the version without Pandas but with the time adjusted? It was probably a mistake due to distraction and fatigue at night. Thank you –  Mar 30 '22 at 12:31
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/243449/discussion-between-furas-and-jas-99). – furas Mar 30 '22 at 15:58
  • Excuse me, but only now, after some time, do I realize that when "for place" contains 'lazio', i.e. for place in ['lazio'], your code does not work. It does not print the news headlines. The HTML and CSS of the page are the same; they haven't changed. So the site.com/lazio page is the same as when you created the code. I get this: jsonStr = re.search('({.*})', jsonStr).group(1) AttributeError: 'NoneType' object has no attribute 'group'. Could you take a look at your code again, please? The problem has been in your code for some time, but I'm only realizing it now. Thank you –  Jul 17 '22 at 07:59
  • this answer is 3 months old, and 3 months ago the code was working for me - maybe the page changed structure and now it needs new code. OR the server detects you and sends different HTML, or sends a warning (or Captcha) for bots/scripts. First you should check what you get in `response.text` – furas Jul 17 '22 at 10:29
  • the code (last version 4) still works for me. What is `jsonStr`? I don't have it in my code. And if you get JSON data, then why don't you use the `json` module for it? In `re` patterns `.*` doesn't match a `new line` by default (it works with every line separately) - but JSON may send data formatted as `{ new_line data new_line }`. It may need the flag `re.DOTALL` in `search(..., re.DOTALL)` to also catch the `new line` – furas Jul 17 '22 at 10:36
  • PART 1. It had been a long time and I didn't remember well. After your answer, I created another question (https://stackoverflow.com/questions/71686996/scrape-time-title-and-content-not-from-a-news-list-but-from-cover-and-column-c) using your code, but asking whether they could additionally help me scrape the cover story and the news headlines in the right-hand column. Your code worked fine, and the code they suggested to me later also worked fine. I don't know why, but only 1 team (page) out of 20 was causing me an error. –  Jul 22 '22 at 22:40
  • PART 2. After a few hours I no longer got the error and everything scraped correctly. I wanted to investigate and ask for help to avoid this little problem in the future. I did not understand what it was, considering that I have had the problem only 1 time in 20. Does it cause any problems for you? I reopened that post, so I will gladly accept your answer as a solution. Thank you –  Jul 22 '22 at 22:42
  • sometimes servers may have some problem to generate data (ie. too many users at the same time). And sometimes internet may have problem to send data from one computer to another (ie. broken connection). And this may need to write code which repeate your request few times before it resigns. Other problem can be when server has complex system to detect bots/scripts - it may intentionally block your request to make you like harder. It may need to use random sleep between requests (because real human can't read hundreds pages in few milliseconds) or use proxy servers to simulate many users. – furas Jul 23 '22 at 13:39