You can actually do it with a single list comprehension.
Basically, what you have is the right approach; you just need to build a list of lists with your list comprehension. For each match returned by soup.select, you can extract both the text and the href together. Then, using the csv module, you can pass that list of lists to the writer's writerows method to create the CSV file for viewing in Excel or other tools, further data processing, etc. You can also optionally prepend a header row to the list of lists, e.g. ['Title', 'URL'].
Here is a full working example:
from bs4 import BeautifulSoup
import csv
import requests
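# fetch the listing page and parse it with BeautifulSoup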
url = 'https://learndataanalysis.org/python-tutorial/page/10'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
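# extract the title text and href from each article heading link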
data = [[i.text, i['href']] for i in soup.select('h2.entry-title a')]
# optional, if you want to add a header line
data.insert(0, ['Title', 'URL'])
with open('output_data.csv', 'w', newline='') as output_file:  # newline='' avoids blank rows on Windows
    writer = csv.writer(output_file, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerows(data)
Note that csv.QUOTE_ALL isn't strictly necessary, but it's often a good idea to force quoting on all fields.
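For illustration, here is a small sketch (using a made-up row, not data from the page above) of the difference: with the default QUOTE_MINIMAL, only fields that contain the delimiter get quoted, while QUOTE_ALL quotes every field:
import csv
import io

row = ['Pandas Tutorial, Part 1', 'https://example.com/post']  # made-up row for illustration

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(row)
print(buf.getvalue())  # "Pandas Tutorial, Part 1",https://example.com/post

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(row)
print(buf.getvalue())  # "Pandas Tutorial, Part 1","https://example.com/post"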
If you want to export to XLSX format instead, it's best to use the pandas module:
import pandas as pd
# if you prepended the ['Title', 'URL'] header row above, skip it here,
# since pandas supplies the column names itself
df = pd.DataFrame(data[1:], columns=['Title', 'URL'])
df.to_excel('output_data.xlsx')
By default, this will also export the row index. If you prefer to omit it, you can pass index=False to to_excel, or use the pandas.ExcelWriter class, as in this post.
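For example, here is a minimal sketch of both options, assuming the same df as above:
import pandas as pd

# simplest option: suppress the index column directly
df.to_excel('output_data.xlsx', index=False)

# equivalent, using the ExcelWriter class
with pd.ExcelWriter('output_data.xlsx') as writer:
    df.to_excel(writer, index=False)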
Edit:
If you also want to extract the dates, you can do so with a separate list comprehension (since the date information is in a different HTML element altogether). Then you can use zip to combine the two lists.
data = [[i.text, i['href']] for i in soup.select('h2.entry-title a')]
dates = [i.text for i in soup.select('span.published')]
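# append each date to its corresponding [title, url] row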
data = [i + [j] for i, j in zip(data, dates)]
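If you then want to write the combined data out, here is a minimal sketch reusing the CSV code from above, with a 'Date' label added to the optional header row (the column names are just illustrative):
# optional header row, now with a Date column
data.insert(0, ['Title', 'URL', 'Date'])

with open('output_data.csv', 'w', newline='') as output_file:
    writer = csv.writer(output_file, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerows(data)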