
I'm new to Python and I've managed to piece a script together, but I'm struggling with writing the output to CSV despite having read a lot about it.

My script (below) crawls a list of imported URLs (pages to crawl) and reads all the paragraphs (p tags) inside each section that has a class of 'holder'. There are four 'holder' sections in total.

I want to write the output to CSV where 'section' is the column header and each 'paragraph' forms the corresponding row.

Is this possible?

Here is my script:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas
import csv

filename = "results.csv"
csv_writer = csv.writer(open(filename, 'w'))

contents = []
with open('list.csv','r') as csvf: # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url) # Add each url to list contents

p = [[],[],[],[]]

for url in contents: 
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")
    n = 0

    for container in soup.find_all("section", {'class':'holder whyuse'}): # Ignore this section.
        container.decompose()

    for container in soup.find_all("section", {'class':'holder centertc'}): # Ignore this section.
        container.decompose()

    for container in soup.find_all("section",attrs={'class': 'holder'}):

        print('==','Section',n+1,'==')
        for paragraph in container.find_all("p"):
            p[n].append(paragraph)
            print(paragraph)
        n += 1

w = pandas.DataFrame(p).transpose()
w.columns=['Section 1','Section 2','Section 3','Section 4']
w.to_csv(results.csv)

That currently outputs 4 sections with paragraphs for each section, whereas I want the print('==','Section',n,'==') output to form the CSV column headers and the print(paragraph) output to generate the cell values in each column.

I presume I need some form of grouping to create 4 sections with associated paragraphs and export to a CSV.
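For what it's worth, one way to sketch that grouping is below. The HTML snippet is a made-up stand-in for a fetched page (your script would get it from urlopen): every p tag within a 'holder' section is joined into a single string, so each page yields one list with one entry per section.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one fetched page (the real script reads this
# from urlopen(url)).
html = """
<section class="holder"><p>One</p><p>Two</p></section>
<section class="holder"><p>Three</p></section>
"""

soup = BeautifulSoup(html, "html.parser")
# Join every <p> in a section into one string: one entry per section,
# so the whole list is one CSV row for this page.
row = ["".join(str(p) for p in sec.find_all("p"))
       for sec in soup.find_all("section", class_="holder")]
print(row)  # ['<p>One</p><p>Two</p>', '<p>Three</p>']
```

Appending one such row per URL gives exactly the rows-per-page shape described above.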

Example output from the current script when scraping two URLs from the import:

== Section 1 ==
<p>This is paragraph one in section one from the first url.</p>
<p>This section one has another paragraph here too in the first url.</p>
<p>Finally a third paragraph in section one of the first url.</p>
== Section 2 ==
<p>This is paragraph one in section two of the first url and only has one paragraph.</p>
== Section 3 ==
<p>This is the first paragraph in section 3 of the first url.</p>
<p>Section three in the first url has a second paragraph.</p>
== Section 4 ==
<p>Section four also only has one paragraph in the first url.</p>
== Section 1 ==
<p>This is the first paragraph in the second url.</p>
<p>The second url only has two paragraphs in section one.</p>
== Section 2 ==
<p>This is a paragraph in section two of the second url.</p>
<p>This is the second paragraph in section two of the second url.</p>
== Section 3 ==
<p>Section 3 in the second url only has one paragraph and this is it.</p>
== Section 4 ==
<p>This is the first paragraph in section four of the second url.</p>
<p>Section four of the second url also has this second paragraph.</p>
<p>Section four of the second url has three paragraphs.</p>

So the CSV needs 4 column headers (Section 1, Section 2, Section 3, Section 4), and each column needs the corresponding paragraphs, e.g. the column 'Section 1' will be populated with:

Col 1 / Section 1 - Row 1:
<p>This is paragraph one in section one from the first url.</p><p>This section one has another paragraph here too in the first url.</p><p>Finally a third paragraph in section one of the first url.</p>

Col 1 / Section 1 - Row 2:
<p>This is the first paragraph in the second url.</p><p>The second url only has two paragraphs in section one.</p>

Col 2 / Section 2 - Row 1:
<p>This is paragraph one in section two of the first url and only has one paragraph.</p>

Col 2 / Section 2 - Row 2:
<p>This is a paragraph in section two of the second url.</p>
<p>This is the second paragraph in section two of the second url.</p>

Etc etc
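Given that layout, the export itself only needs the standard csv module: write the four headers once, then one row per URL, where each cell already holds the joined paragraphs for that section. The rows below are dummy strings standing in for the scraped values.

```python
import csv

# Dummy data standing in for the scraped pages: one inner list per crawled
# URL, one entry per 'holder' section (paragraphs already joined).
pages = [
    ["<p>S1 a</p><p>S1 b</p>", "<p>S2 a</p>", "<p>S3 a</p>", "<p>S4 a</p>"],
    ["<p>S1 a</p>", "<p>S2 a</p><p>S2 b</p>", "<p>S3 a</p>", "<p>S4 a</p>"],
]

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Section 1", "Section 2", "Section 3", "Section 4"])
    writer.writerows(pages)  # one row per URL, one column per section
```

This sidesteps pandas entirely, which also avoids the index column issue mentioned in the answer's comments.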

chappers
  • your table is rather strange. do you want to generate 1 csv or each url or do you want all urls into 1 csv? – weasel Feb 25 '20 at 11:48
  • Hi. All URLs into one CSV. The sections are the column headers in the CSV. Rows will be the p-tags. Thanks. – chappers Feb 25 '20 at 11:50
  • It's probably possible, but without a [mre], it difficult to be specific because there's no input data to do development and testing with. – martineau Feb 25 '20 at 11:50
  • how do you want to separate data from different urls? – weasel Feb 25 '20 at 11:51
  • The script outputs the data into sections, so I just need the sections to be the column headers and the p-tag values being the rows for each column. – chappers Feb 25 '20 at 11:59
  • I imagine it is something like making a dataframe, and appending the data for the first blank cell. quite annoying the mods close questions so readily, as I believe I understand what you are asking but the question is closed – weasel Feb 25 '20 at 12:34
  • It's been re-opened @inyrface if you want to help in any way. Thanks – chappers Feb 25 '20 at 17:14
  • Could you please show us **the output** of your script and/or one of the URLs that you are using? Otherwise it's very difficult to understand the problem. – gboffi Feb 26 '20 at 09:28
  • Ok ive added an example of the output. – chappers Feb 26 '20 at 09:54
  • After spending some time trying to improve this question I've posted an answer that, due to some vagueness I still perceive in the Q, I'm not sure that is completely relevant for you. —— Could you please be so kind and tell me if my answer is OK? If not, I'd prefer to delete it... – gboffi Feb 27 '20 at 10:37
  • Answer deleted, thank you for your prompt reply. – gboffi Feb 27 '20 at 22:38

1 Answer

p = [[],[],[],[]]

for url in contents: 
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")
    n = 0

    for container in soup.find_all("section",attrs={'class': 'holder'}):

        print('==','Section',n+1,'==')
        for paragraph in container.find_all("p"):
            p[n].append(paragraph)
            print(paragraph)
        n += 1

w = pandas.DataFrame(p).transpose()
w.columns=['Section 1','Section 2','Section 3','Section 4']
w.to_csv(csvname)
weasel
  • Thanks. Tried it, but it changes the number of the printed 'sections' and throws the following error: AssertionError: 4 columns passed, passed data had 9 columns. Any way I can share my live script with you? – chappers Feb 27 '20 at 14:18
  • Thanks for your time on this. Getting there. It prints the sections correctly now, but on the CSV export I get an error: NameError: name 'l' is not defined – chappers Feb 28 '20 at 09:00
  • ah, just change the variable and it is fine – weasel Feb 28 '20 at 12:24
  • Thanks again. I now get the error: NameError: name 'results' is not defined. I've pasted my full script again in the question above. Sorry to be a pain! – chappers Feb 28 '20 at 15:03
  • Oh, I've fixed it with: w.to_csv('results.csv'). Thanks. Quick question - column one in the CSV has the numbers 0 to 8. Can I somehow exclude these? – chappers Feb 28 '20 at 15:08
  • that's just the index; you could look up other questions https://stackoverflow.com/questions/20845213/how-to-avoid-python-pandas-creating-an-index-in-a-saved-csv – weasel Feb 28 '20 at 23:48
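For future readers: the linked question boils down to passing index=False to to_csv, which suppresses that 0..n row-number column. A tiny sketch with dummy data:

```python
import pandas as pd

# Dummy frame standing in for the scraped sections.
df = pd.DataFrame({"Section 1": ["a"], "Section 2": ["b"]})

# index=False omits the automatic 0..n row-number column from the output.
csv_text = df.to_csv(index=False)
print(csv_text)  # Section 1,Section 2  /  a,b
```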