I'm new to Python and I've managed to piece a script together but I'm struggling with writing to CSV despite having read lots about it.
My script (below) crawls a list of imported urls (pages to crawl) and reads all the paragraphs (p tags) within each section that has a class of 'holder'. There are 4 'holder' sections in total.
I want to write the output to CSV where 'section' is the column header and each 'paragraph' forms the corresponding row.
Is this possible?
Here is my script:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas
import csv

filename = "results.csv"
csv_writer = csv.writer(open(filename, 'w'))

contents = []
with open('list.csv','r') as csvf: # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url) # Add each url to list contents

p = [[],[],[],[]] # One list of paragraphs per section
for url in contents:
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")
    n = 0
    for container in soup.find_all("section", {'class':'holder whyuse'}): # Ignore this section.
        container.decompose()
    for container in soup.find_all("section", {'class':'holder centertc'}): # Ignore this section.
        container.decompose()
    for container in soup.find_all("section",attrs={'class': 'holder'}):
        print('==','Section',n+1,'==')
        for paragraph in container.find_all("p"):
            p[n].append(paragraph)
            print(paragraph)
        n += 1

w = pandas.DataFrame(p).transpose()
w.columns=['Section 1','Section 2','Section 3','Section 4']
w.to_csv(filename)
This currently prints 4 sections with the paragraphs for each section, whereas I want the print('==','Section',n+1,'==') output to form the CSV column headers and the print(paragraph) output to generate the cell values in each column.
I presume I need some form of grouping to create 4 sections with associated paragraphs and export to a CSV.
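To make the grouping idea concrete, here is a minimal sketch of what I think is needed. It assumes the paragraphs have already been collected per url and per section as plain strings; the names pages and group_into_rows and the shortened strings are made up for illustration and are not part of my script.

```python
# Hypothetical input: one entry per url, each holding 4 lists of
# paragraph strings (one list per 'holder' section).
pages = [
    [   # first url
        ["<p>a1</p>", "<p>a2</p>", "<p>a3</p>"],  # Section 1
        ["<p>b1</p>"],                            # Section 2
        ["<p>c1</p>", "<p>c2</p>"],               # Section 3
        ["<p>d1</p>"],                            # Section 4
    ],
    [   # second url
        ["<p>e1</p>", "<p>e2</p>"],
        ["<p>f1</p>", "<p>f2</p>"],
        ["<p>g1</p>"],
        ["<p>h1</p>", "<p>h2</p>", "<p>h3</p>"],
    ],
]


def group_into_rows(pages):
    """Return one row per url, joining each section's paragraphs into one cell."""
    return [["".join(section) for section in sections] for sections in pages]


rows = group_into_rows(pages)
for row in rows:
    print(row)
```

Each row then has exactly 4 cells, so it lines up with the 4 column headers regardless of how many paragraphs a section contains.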
Example output from the current script after scraping 2 urls from the import:
== Section 1 ==
<p>This is paragraph one in section one from the first url.</p>
<p>This section one has another paragraph here too in the first url.</p>
<p>Finally a third paragraph in section one of the first url.</p>
== Section 2 ==
<p>This is paragraph one in section two of the first url and only has one paragraph.</p>
== Section 3 ==
<p>This is the first paragraph in section 3 of the first url.</p>
<p>Section three in the first url has a second paragraph.</p>
== Section 4 ==
<p>Section four also only has one paragraph in the first url.</p>
== Section 1 ==
<p>This is the first paragraph in the second url.</p>
<p>The second url only has two paragraphs in section one.</p>
== Section 2 ==
<p>This is a paragraph in section two of the second url.</p>
<p>This is the second paragraph in section two of the second url.</p>
== Section 3 ==
<p>Section 3 in the second url only has one paragraph and this is it.</p>
== Section 4 ==
<p>This is the first paragraph in section four of the second url.</p>
<p>Section four of the second url also has this second paragraph.</p>
<p>Section four of the second url has three paragraphs.</p>
So the CSV needs 4 column headers (Section 1, Section 2, Section 3, Section 4), with one row per url, and each column needs the corresponding paragraphs, e.g. the column 'Section 1' will be populated with:
Col 1 / Section 1 - Row 1:
<p>This is paragraph one in section one from the first url.</p><p>This section one has another paragraph here too in the first url.</p><p>Finally a third paragraph in section one of the first url.</p>
Col 1 / Section 1 - Row 2:
<p>This is the first paragraph in the second url.</p><p>The second url only has two paragraphs in section one.</p>
Col 2 / Section 2 - Row 1:
<p>This is paragraph one in section two of the first url and only has one paragraph.</p>
Col 2 / Section 2 - Row 2:
<p>This is a paragraph in section two of the second url.</p><p>This is the second paragraph in section two of the second url.</p>
And so on for Section 3 and Section 4.
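For reference, the file I'm after would look roughly like what the sketch below writes. It uses only the standard csv module and an in-memory buffer, with shortened stand-in strings rather than real scraped output; headers, rows, buffer and parsed are illustrative names.

```python
import csv
import io

headers = ["Section 1", "Section 2", "Section 3", "Section 4"]

# Stand-in rows: one row per url, one cell per section, with all of a
# section's paragraphs joined into a single string.
rows = [
    ["<p>url 1, s1, p1</p><p>url 1, s1, p2</p>", "<p>url 1, s2</p>",
     "<p>url 1, s3</p>", "<p>url 1, s4</p>"],
    ["<p>url 2, s1</p>", "<p>url 2, s2, p1</p><p>url 2, s2, p2</p>",
     "<p>url 2, s3</p>", "<p>url 2, s4</p>"],
]

# Write the header row followed by the data rows.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(headers)
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the text back to check that the layout survives a round trip;
# cells containing commas are quoted by the writer and unquoted here.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[0])
print(len(parsed) - 1, "data rows")
```

With a real file, the same writer calls would go inside a block such as with open('results.csv', 'w', newline='') as f, passing f to csv.writer instead of the buffer.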