I am trying to scrape https://media.info/newspapers/titles, which lists newspapers from A to Z. I first have to scrape the URL of every newspaper, and then scrape more information from each newspaper's page.
Below is my code to collect the URLs of all the newspapers from A to Z:
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # Chrome assumed; any driver works

driver.get('https://media.info/newspapers/titles')
time.sleep(2)

# Collect the A-Z pager links.
page_title = []
pages = driver.find_elements(By.XPATH, "//div[@class='pages']//a")
for page in pages:
    page_title.append(page.get_attribute("href"))

# Visit each letter page and collect every newspaper URL on it.
names = []
for url in page_title:
    driver.get(url)
    time.sleep(1)
    links = driver.find_elements(By.XPATH, "//div[@class='info thumbBlock']//a")
    for link in links:
        names.append(link.get_attribute("href"))
len(names)  # -> 1688

names[0:5]
['https://media.info/newspapers/titles/abergavenny-chronicle',
 'https://media.info/newspapers/titles/abergavenny-free-press',
 'https://media.info/newspapers/titles/abergavenny-gazette-diary',
 'https://media.info/newspapers/titles/the-abingdon-herald',
 'https://media.info/newspapers/titles/academies-week']
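For what it's worth, the link extraction itself does not need a browser: the pager links are plain anchors, so the same extraction works on static HTML. A minimal offline sketch against a simplified stand-in for the pager markup (the real page structure may differ):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the A-Z pager markup; the real page may differ.
html = """
<div class="pages">
  <a href="https://media.info/newspapers/titles/a">A</a>
  <a href="https://media.info/newspapers/titles/b">B</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
page_links = [a["href"] for a in soup.select("div.pages a")]
print(page_links)
```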
Moving further, I need to scrape information such as the owner, postal address, email, etc., for which I wrote the code below.
from selenium.common.exceptions import NoSuchElementException


def cell_text(soup, label):
    """Return the text of the <td> next to the <th> containing `label`, or None."""
    cell = soup.select_one(f'th:contains("{label}") + td')
    return cell.text if cell else None


test = []
c = 0
for url in names:
    driver.get(url)
    time.sleep(2)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    try:
        name = driver.find_element(By.XPATH, "//*[@id='mainpage']/article/div[3]/h1").text
        try:
            twitter = driver.find_element(
                By.XPATH, "//*[@id='mainpage']/article/table[3]/tbody/tr/td[1]/a"
            ).text
        except NoSuchElementException:
            twitter = None
        try:
            twitter_followers = driver.find_element(
                By.XPATH, "//*[@id='mainpage']/article/table[3]/tbody/tr/td[1]/small"
            ).text.replace(' followers', '').lstrip('(').rstrip(')')
        except NoSuchElementException:
            twitter_followers = None

        # find_elements never raises; it returns an empty list when nothing matches,
        # so no try/except is needed here.
        people = [person.text for person in driver.find_elements(By.XPATH, "//div[@class='columns']")]

        owner = cell_text(soup, "Owner")
        postal_address = cell_text(soup, "Postal address")
        telephone = cell_text(soup, "Telephone")
        main_email = cell_text(soup, "Main email")
        personal_email = cell_text(soup, "Personal email")

        website_link = soup.select_one('th:contains("Official website") + td > a')
        company_website = website_link.get('href') if website_link else None

        # Guard against a missing website: requests.get(None) would raise and
        # silently skip the whole record via the outer except.
        is_wordpress = None
        if company_website:
            r2 = requests.get(company_website)
            soup2 = BeautifulSoup(r2.content, 'lxml')
            generator = soup2.find("meta", {"name": "generator"})
            if generator:
                is_wordpress = generator.get('content')

        news_data = {
            "Name": name,
            "Owner": owner,
            "Postal Address": postal_address,
            "Main Email": main_email,
            "Telephone": telephone,
            "Personal Email": personal_email,
            "Company Website": company_website,
            "Twitter_Handle": twitter,
            "Twitter_Followers": twitter_followers,
            "People": people,
            "Is Wordpress?": is_wordpress,
        }
        test.append(news_data)
        c += 1
        print("completed", c)
    except Exception as exc:
        print(f"There is an exception with {url}: {exc}")
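The `th + td` adjacent-sibling selector I am relying on can be checked offline against a small snippet. One thing I noticed while testing: newer soupsieve releases deprecate `:contains` in favour of `:-soup-contains`. The table markup below is a simplified assumption, not the site's real HTML:

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for one title page's details table (real markup may differ).
html = """
<table>
  <tr><th>Owner</th><td>Example Media Ltd</td></tr>
  <tr><th>Telephone</th><td>01234 567890</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# ':-soup-contains' is the non-deprecated spelling of ':contains' in soupsieve.
owner = soup.select_one('th:-soup-contains("Owner") + td').text
print(owner)
```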
I am using both Selenium and BeautifulSoup (with requests) to scrape the data, and the code does fulfil the requirements.
- Firstly, is it good practice to mix Selenium and BeautifulSoup in the same code like this?
- Secondly, the code takes a long time to run. Is there an alternative way to reduce its runtime?
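One direction I have been wondering about for the runtime: since the detail pages appear to be static HTML, could I drop Selenium for that part and fan the fetches out over a thread pool? A minimal sketch of the fan-out, where `scrape_one` is a hypothetical placeholder for the real requests + BeautifulSoup fetch:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical placeholder: the real version would do session.get(url)
# and parse the response with BeautifulSoup.
def scrape_one(url):
    return {"url": url, "name": url.rsplit("/", 1)[-1]}

names = [
    "https://media.info/newspapers/titles/abergavenny-chronicle",
    "https://media.info/newspapers/titles/the-abingdon-herald",
]

# pool.map preserves input order, so results line up with names.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(scrape_one, names))

print(results)
```

Would something like this be safe here, or is there a better approach?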