I am a serious newbie. I have a project and am running into trouble. I am to create a program to:
1.) scrape web links from a web site, 2.) remove duplicates, 3.) make sure all web links are in URI format, and 4.) write them to a CSV.
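I am running into trouble around step 3 specifically. In plain terms, I think the conversion I need for each scraped link looks roughly like this (a rough sketch of my intent, not working code; to_uri is just a name I made up, and 'https://www.census.gov' is the site I am scraping):

# Rough sketch of what I think "make sure all web links are in URI format" means
# for one href pulled off the page (to_uri is a made-up name, just for illustration).
def to_uri(href):
    if href.startswith('#'):
        return href[1:]                          # drop the leading '#' fragment marker
    elif href.startswith('/'):
        return 'https://www.census.gov' + href   # relative path -> absolute URL
    elif href.endswith('.gov'):
        return href + '/'                        # bare domain -> add a trailing slash
    else:
        return href                              # already looks like a full URI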
The first bit of code I am sharing below was one of my numerous failed attempts. The trouble seems to be either that I am failing to convert my set back to a list (and the set is not mutable), or that something I am doing in Jupyter is causing it to lose its connection to the program, so it doesn't recognize the way I am referencing the links I scraped. Please tell me where I am messing up.
FAILED ATTEMPT:
# save link as a BeautifulSoup object
soup = BeautifulSoup
r = urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html').read()
soup = BeautifulSoup(r, "html.parser")
links = set([a['href'] for a in soup.find_all('a', href=True)])
print(set)
print(links)
f = open('JessicasExport.csv', 'w', newline='')
writer = csv.writer(f, delimiter=',', lineterminator='\r')
set = MyList
MyList = [set]
ctr = 0
for x in MyList:
    MyList.update([x])
    if not MyList:
        ''
    elif hrefs.startswith(['#']):
        MyList.add(hrefs[1:])
    elif hrefs.startswith(['/']):
        MyList.add(['https://www.census.gov' + hrefs])
    elif hrefs.endswith(['.gov']):
        MyList.add([hrefs + '/'])
    else:
        MyList.add([hrefs])
    writer.writerow([MyList])
    del MyList[:]
    ctr += 1
print('Number of urls written to CSV:', ctr)
f.close()
RESULTING ERROR:
AttributeError                            Traceback (most recent call last)
<ipython-input-5-35e0479f6c2e> in <module>
      5 ctr = 0
      6 for x in MyList:
----> 7     MyList.update([x])
      8     if not MyList:
      9         ''

AttributeError: 'list' object has no attribute 'update'
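Reading up on that error, my understanding is that update() and add() belong to sets, while lists grow with append(), which seems to be what the traceback is telling me:

# Checking my understanding of the AttributeError above:
my_list = ['a', 'b']
my_list.append('c')        # works: lists grow with append()
my_set = {'a', 'b'}
my_set.update(['c'])       # works: sets grow with update() or add()
# my_list.update(['c'])    # AttributeError: 'list' object has no attribute 'update'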
Then I tweaked it and tried the version below. It successfully printed my scraped links, but it did not write anything to the CSV and did not convert the links that were not in URI format. Yet it produced NO errors, so I am perplexed. Any help is greatly appreciated! I have been waiting on a response from my teacher for a few days and am anxious to make progress.
PARTIALLY SUCCESSFUL ATTEMPT: no errors, but no file written and links not converted to URI format
import csv
import requests
from bs4 import BeautifulSoup
import urllib.request
import os
soup = BeautifulSoup
r = urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html').read()
soup = BeautifulSoup(r, "html.parser")
links = set([a['href'] for a in soup.find_all('a', href=True)])
print(set)
print(links)
f = open('check.csv', 'w', newline='')
writer = csv.writer(f, delimiter=',', lineterminator='\r')
Myset = set()
MyList = [Myset]
ctr = 0
for x in Myset:
    MyList.append([x])
    if not MyList:
        ''
    elif hrefs.startswith(['#']):
        MyList.add(hrefs[1:])
    elif hrefs.startswith(['/']):
        MyList.add(['https://www.census.gov' + hrefs])
    elif hrefs.endswith(['.gov']):
        MyList.add([hrefs + '/'])
    else:
        MyList.add([hrefs])
    writer.writerow([MyList])
    del MyList[:]
    ctr += 1
f.close()
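For what it is worth, this is the shape I *think* the write-out step should have, using the links set from the scrape and the to_uri idea I sketched at the top. It is untested and just my intent, so please correct me if this is the wrong picture:

# Untested sketch of my intent: one cleaned-up link per CSV row.
import csv

with open('check.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    ctr = 0
    for href in links:                   # links is the set scraped above
        writer.writerow([to_uri(href)])  # to_uri is the helper sketched near the top
        ctr += 1
print('Number of urls written to CSV:', ctr)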
Thank you to all who review and make recommendations! I really want to understand.