I am learning to scrape websites. I need to collect document titles and the links to them. I can already do this, but the resulting links are sometimes not in the format I need. Here is a snippet of the output I get:
['Плотность населения субъектов Российской Федерации', 'http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm']
Плотность населения субъектов Российской Федерации
http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm
['Численность мужчин и женщин', '/storage/mediabank/yKsfiyjR/demo13.xls']
Численность мужчин и женщин
/storage/mediabank/yKsfiyjR/demo13.xls
As you can see, the first link is complete, while the second is only a path relative to the site root. For links like the second one, I need to prepend a prefix that I know in advance (https://rosstat.gov.ru), but only when the link is in that relative format. In other words, I want the output to look like this:
['Плотность населения субъектов Российской Федерации', 'http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm']
Плотность населения субъектов Российской Федерации
http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm
['Численность мужчин и женщин', 'https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls']
Численность мужчин и женщин
https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls
How should I do this? Here is my current code:
import requests
from bs4 import BeautifulSoup
URL = "https://rosstat.gov.ru/folder/12781"
response = requests.get(URL).text
soup = BeautifulSoup(response, 'lxml')
block = soup.find('div', class_="col-lg-8 order-1 order-lg-1")
list_info_block_row = block.find_all('div', class_='document-list__item document-list__item--row')
list_info_block_col = block.find_all('div', class_='document-list__item document-list__item--col')
sources = []
for text_block_row in list_info_block_row:
    new_list = []
    title_element_row = text_block_row.find('div', class_='document-list__item-title')
    preprocessing_title = title_element_row.text.strip()
    link_element_row = text_block_row.find('a').get('href')
    new_list.append(preprocessing_title)
    new_list.append(link_element_row)
    print(new_list)
    print(title_element_row.text.strip())
    print(link_element_row)
    print('\n\n')
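One possible approach is to resolve every href against the site root with `urljoin` from the standard library, which leaves absolute URLs untouched and completes relative ones. The sketch below assumes the prefix is `https://rosstat.gov.ru` (taken from the desired output above); `BASE_URL` and `absolutize` are hypothetical names introduced only for illustration.

```python
from urllib.parse import urljoin

# Assumed site root, taken from the desired output in the question.
BASE_URL = "https://rosstat.gov.ru"


def absolutize(href, base=BASE_URL):
    """Return href unchanged if it is already absolute, otherwise resolve it against base."""
    return urljoin(base, href)


# The two link formats shown in the question.
links = [
    "http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm",
    "/storage/mediabank/yKsfiyjR/demo13.xls",
]

resolved = [absolutize(href) for href in links]
for link in resolved:
    print(link)
```

In the loop above, this would amount to wrapping the lookup, for example `link_element_row = absolutize(text_block_row.find('a').get('href'))`. If an explicit condition is preferred, a check such as `href.startswith('/')` followed by `BASE_URL + href` should give the same result for these two formats, though `urljoin` also copes with other relative forms.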