I am practicing scraping data in Python, and here is my attempt:

import requests
from bs4 import BeautifulSoup

url = 'https://learndataanalysis.org/python-tutorial/page/10'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
links = [i['href'] for i in soup.select('h2.entry-title a')]
print(links)

The code gets the links of the webpage. I can use this line to get the titles of each tutorial:

[i.text for i in soup.select('h2.entry-title a')]

How can I build a single list containing both the links and the titles, and then export the results to an Excel file?

I simply need one column for the article titles and another column for the link to each article.

costaparas
YasserKhalil

1 Answer

You can actually do it with a single list comprehension.

Basically, what you have is the right approach; you just need to create a list of lists using your list comprehension.

For each match returned by soup.select, you can extract both the text and href together.

Then, using the csv module, you can create a csv.writer and pass this list of lists to its writerows method to create a CSV file that can be opened in Excel or other tools, used for data processing, etc.

You can also optionally prepend a header to the list of lists, if you want, e.g. ['Title', 'URL'].

Here is a full working example:

from bs4 import BeautifulSoup

import csv
import requests

url = 'https://learndataanalysis.org/python-tutorial/page/10'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')

data = [[i.text, i['href']] for i in soup.select('h2.entry-title a')]

# optional, if you want to add a header line
data.insert(0, ['Title', 'URL'])

# newline='' is recommended by the csv docs; it avoids blank rows on Windows
with open('output_data.csv', 'w', newline='') as output_file:
    writer = csv.writer(output_file, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerows(data)

Note that csv.QUOTE_ALL isn't strictly necessary, but it's often a good idea to force quoting on all fields.
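
To see what the quoting option changes, here is a small sketch that writes the same rows with the default csv.QUOTE_MINIMAL and with csv.QUOTE_ALL. The rows and the to_csv_text helper are made up for illustration, and the output goes to an in-memory buffer rather than a file:

```python
import csv
import io

# Hypothetical sample rows standing in for the scraped data
rows = [
    ['Title', 'URL'],
    ['Lists, tuples and sets', 'https://example.com/a'],
]


def to_csv_text(rows, quoting):
    """Write rows to an in-memory buffer and return the CSV text."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer, quoting=quoting)
    writer.writerows(rows)
    return buffer.getvalue()


# Default behaviour: only fields that need it (here, the title containing a comma) are quoted
minimal = to_csv_text(rows, csv.QUOTE_MINIMAL)

# QUOTE_ALL: every field is quoted, whether it needs it or not
quote_all = to_csv_text(rows, csv.QUOTE_ALL)

print(minimal)
print(quote_all)
```

Either way, a title that contains a comma stays in a single column; QUOTE_ALL just makes the quoting uniform across all fields.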


If you want to export to XLSX format instead, it's best to use the pandas module:

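import pandas as pd

# `data` should contain only the rows here, without the optional
# ['Title', 'URL'] header row, since the column names are passed separately
df = pd.DataFrame(data, columns=['Title', 'URL'])
df.to_excel('output_data.xlsx')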

By default, this also exports the row numbers (the DataFrame index). If you prefer to omit them, you can pass index=False to to_excel, or use the pandas.ExcelWriter class, as in this post.


Edit:

If you also want to extract the dates, you can do so with a separate list comprehension (since the date information is in a different HTML element altogether).

Then, you can use zip to combine the information together.

data = [[i.text, i['href']] for i in soup.select('h2.entry-title a')]
dates = [i.text for i in soup.select('span.published')]
data = [i + [j] for i, j in zip(data, dates)]
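
One thing to watch with this approach: zip stops at the shortest input, so if an article has no date element, the rows and dates no longer line up. Here is a rough sketch of that behaviour, using made-up sample values instead of the real page:

```python
# Hypothetical sample values standing in for the scraped results
data = [
    ['First tutorial', 'https://example.com/first'],
    ['Second tutorial', 'https://example.com/second'],
    ['Third tutorial', 'https://example.com/third'],
]
dates = ['February 1, 2021', 'February 2, 2021']  # one date is missing

# zip stops at the shortest input, so the third row is silently dropped
combined = [row + [date] for row, date in zip(data, dates)]

# A simple length check makes the mismatch visible before exporting
lengths_match = len(data) == len(dates)

print(len(combined), lengths_match)
```

If the lengths differ, it is probably safer to inspect the page structure than to export misaligned rows. On Python 3.10 and later, zip(data, dates, strict=True) raises a ValueError in that situation.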
costaparas
  • Amazing. Thank you very much. I will study the code carefully so as to learn new skills which I don't know. – YasserKhalil Feb 14 '21 at 07:20
  • Is it possible to use comprehension to add a third item. I mean what if I need to add the date of the article by editing this line `data = [[i.text, i['href']] for i in soup.select('h2.entry-title a')]`? – YasserKhalil Feb 14 '21 at 07:34
  • @yasserkhalil sure, I've updated the answer to also extract the dates. – costaparas Feb 14 '21 at 07:48
  • What if the dates are in the same HTML element? I wish to know if it is possible to deal with multiple cases in one comprehension? – YasserKhalil Feb 14 '21 at 07:50
  • @yasserkhalil Sure, it would work the same way, instead of `[i.text, i['href']]` it would be something like `[i.text, i['href'], extra]` for any `extra` field you want. Each inner list becomes a row of the file, and you can add as many columns as you need. – costaparas Feb 14 '21 at 07:52
  • Thank you very very much. Best Regards – YasserKhalil Feb 14 '21 at 07:56