
I am a serious newbie. I have a project and am running into trouble. I am to create a program to:

1.) scrape web links from a web site, 2.) remove duplicates, 3.) make sure all web links are in URI format, and 4.) write to a csv.

I am running into trouble around step 3. The first bit of code below is one of my numerous failed attempts. The trouble seems to be either that I am failing to convert my set back to a list and the set is not mutable, or that something I am doing in Jupyter causes it to lose its connection to the program, so it doesn't recognize the way I am referencing the links I scraped. Please tell me where I am messing up.

FAILED ATTEMPT:

    # save link as BeautifulSoup object
    soup= BeautifulSoup
    r= urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html').read()
    soup = BeautifulSoup(r,"html.parser") 
    links=set([a['href'] for a in soup.find_all('a',href=True)])  
    print(set) 
    print(links) 

    f=open('JessicasExport.csv','w', newline='') 
    writer=csv.writer(f,delimiter=',', lineterminator= '\r')
    set=MyList
    MyList=[set]
    ctr=0
    for x in MyList:
        MyList.update([x])
        if not MyList:
       ''
       elif hrefs.startswith(['#']):
            MyList.add(hrefs[1:])
       elif hrefs.startswith(['/']):
            MyList.add (['https://www.census.gov'+ hrefs])
       elif hrefs.endswith(['.gov']):
            MyList.add ([hrefs + '/'])
       else:
           MyList.add([hrefs])
    
           writer.writerow([MyList])
           del MyList[:]
           ctr += 1


     print('Number of urls written to CSV:' , ctr)
     f.close()

out []: #RESULTING ERROR

    AttributeError                            Traceback (most recent call last)
    <ipython-input-5-35e0479f6c2e> in <module>
          5 ctr=0
          6 for x in MyList:
    ----> 7     MyList.update([x])
          8     if not MyList:
          9         ''

    AttributeError: 'list' object has no attribute 'update'

Then I tweaked it and tried the code below. It successfully printed my scraped links, but it did not write anything to the CSV and did not correct the links that were not in URI format. It produced no errors, though, so I am perplexed. Any help is greatly appreciated! I have been waiting on a response from my teacher for a few days and am anxious to make progress.

PARTIALLY SUCCESSFUL ATTEMPT: no errors, but no file written and links not converted to URI format

    import csv
    import requests
    from bs4 import BeautifulSoup
    import urllib.request
    import os



    soup= BeautifulSoup
    r= urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html').read()
    soup = BeautifulSoup(r,"html.parser") 
    links=set([a['href'] for a in soup.find_all('a',href=True)]) 
    print(set) 
    print(links) 


    f=open('check.csv', 'w', newline='')
    writer = csv.writer(f, delimiter=',', lineterminator='\r')
    Myset = set()
    MyList= [Myset]
    ctr=0    
    for x in Myset:
        MyList.append ([x])
        if not MyList:
           ''
        elif hrefs.startswith(['#']):
            MyList.add(hrefs[1:])
        elif hrefs.startswith(['/']):
            MyList.add (['https://www.census.gov'+ hrefs])
        elif hrefs.endswith(['.gov']):
            MyList.add ([hrefs + '/'])
        else:
            MyList.add([hrefs])
    
            writer.writerow([MyList])
            del MyList[:]
            ctr += 1

            f.close()

Thank you to all who review and make recommendations! I really want to understand.

  • What exactly do you want to do with `Myset = set()`? You have basically declared `Myset` to be a new (and thus empty) set, followed by an attempt to iterate through it. Since `Myset` has nothing in it, your for loop basically does nothing. Your links are stored in the variable `links`. – Mercury Oct 23 '20 at 00:33
  • Thank you both very much for taking the time to review and give me this feedback! That makes a lot of sense. I tried `myset=(links)` but I got a NameError. It was as though it didn't recognize `links`. I appreciate your taking the time to point out my error! I'm in a 6-week class just as an introduction, and with no background I clearly needed to start with something more elementary. – Jessica Adams-Giles Oct 23 '20 at 23:53
  • Where did you get the `NameError`? `myset = links` should have worked (the parentheses you added do nothing), because `links` has already been defined. But why do you want to assign `links` to the variable `myset` without changing anything? – Felipe Whitaker Oct 25 '20 at 04:10
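
The two points raised in these comments can be checked in isolation. The following is only a small sketch: the three hrefs are made-up stand-ins for the scraped `links` set, and `empty`, `visited`, and `same` are hypothetical names used for illustration.

```python
# Stand-in for the scraped set; the real one comes from soup.find_all(...)
links = {"/data.html", "#content", "https://www.census.gov"}

# Iterating over a freshly created set: the loop body never runs
empty = set()
visited = []
for x in empty:
    visited.append(x)

# Plain assignment only binds a second name to the same set object;
# wrapping the right-hand side in parentheses changes nothing
same = (links)

print(visited)        # the list stays empty
print(same is links)  # both names refer to one object
```

If `links` raises a NameError in Jupyter, one likely cause is that the cell defining it was not run in the current kernel session, although that is an assumption about the asker's setup.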

1 Answer


@Mercury is right: you defined an empty set (`Myset`, with a capital M, which PEP 8's naming conventions advise against for variables) and then put it into a list (`MyList`, also capitalized). What are you trying to achieve? Also, instead of writing an empty string under the first `if`, you may want to learn about the `pass` statement.
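
As a minimal illustration of `pass`, assuming a made-up list of hrefs (the names `hrefs` and `kept` below are hypothetical and not taken from the question's code):

```python
hrefs = ["", "#top", "/data.html"]  # made-up sample values

kept = []
for href in hrefs:
    if not href:
        pass  # explicit "do nothing" branch, clearer than a bare ''
    else:
        kept.append(href)

print(kept)
```

A bare `''` also runs without error, since it is just an expression that gets discarded, but `pass` states the intent directly.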

You might need to `pip install lxml` for this (lxml is a Python library used for parsing).


    import csv

    import requests
    from bs4 import BeautifulSoup

    def update_url(url):
        # placeholder: put your URI clean-up here; for now the url is returned unchanged
        return url

    req = requests.get('https://www.census.gov/programs-surveys/popest.html')
    assert req.status_code == 200, f"Request returned with status {req.status_code}"

    soup = BeautifulSoup(req.content, "lxml")
    # a set keeps a single copy of each href, which removes duplicates
    links = set(a['href'] for a in soup.find_all('a', href=True))

    with open('file_name.csv', 'w', newline='') as file:
        writer = csv.writer(file, delimiter=',', lineterminator='\r')
        for url in links:
            new_url = update_url(url)  # treat them as you wish
            # wrap the url in a list so it is written as one cell,
            # not one cell per character
            writer.writerow([new_url])
    # the with statement closes the file automatically
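
One possible body for `update_url`, sketched from the rules in the question's `if`/`elif` chain (strip a leading `#`, prefix site-relative paths with the site address, add a trailing slash after `.gov`). The constant `BASE_URL` and the choice to return `None` for empty hrefs are assumptions made for this sketch. Note that `str.startswith` and `str.endswith` accept a string or a tuple of strings, so a list argument such as `startswith(['#'])` raises a `TypeError`.

```python
BASE_URL = "https://www.census.gov"  # assumed site root, taken from the question

def update_url(href):
    """Apply the question's clean-up rules to a single href."""
    href = href.strip()
    if not href:
        return None                # assumption: signal "nothing to write"
    if href.startswith("#"):
        return href[1:]            # drop the leading '#', as in the question
    if href.startswith("/"):
        return BASE_URL + href     # site-relative path -> absolute URL
    if href.endswith(".gov"):
        return href + "/"          # add the trailing slash
    return href                    # already in the expected form

print(update_url("/data.html"))
print(update_url("https://www.census.gov"))
```

Whether stripping the `#` really yields a valid URI is a question for the assignment; this sketch only mirrors the rules as the question states them.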
Felipe Whitaker
  • I want it to go through my list and replace the segments of the URLs that are not in URI format. In the for loop, each time I write `elif href.startswith` I get a NameError that says hrefs is not defined. I have searched all over; can anyone help me understand why? – Jessica Adams-Giles Oct 24 '20 at 21:54
  • Maybe `hrefs` is not defined, but `href` is, if you are talking about the argument inside `soup.find_all`. You got the error because the function did not recognise the argument you passed. – Felipe Whitaker Oct 25 '20 at 04:08
  • I corrected that and now I get `NameError: name 'href' is not defined`. Any suggestions? I am struggling with how to define href above my for loop so that it recognizes that href is the attribute of 'a' without redefining href. – Jessica Adams-Giles Oct 25 '20 at 13:29
  • I don't understand why it doesn't remember what href is in the for loop when it is defined in the code above. Is it because, once the program has run through that block of code, it no longer associates it with the for loop below? – Jessica Adams-Giles Oct 25 '20 at 14:27
  • The code is always read from start to finish, remembering everything it has already passed. Besides that, your code (at least the part you posted) does not define `href`; it only passes it as a keyword argument to `soup.find_all` and uses the string `'href'` as a key. Furthermore (I think I now understand your question), you are calling `hrefs` when you actually want the variable `x`, which is the loop variable of your `for` loop. One last thing: `Myset` is an empty set, so it does not contain any of the hrefs you want. The object that has them is `links`, which is a set. – Felipe Whitaker Oct 25 '20 at 20:10
  • With what I said in mind, I think that if you change your `for` loop to `for hrefs in links:`, it might work like a charm. – Felipe Whitaker Oct 25 '20 at 20:11
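
The suggestion in the last comment can be sketched as follows. This is an illustration under assumptions, not the asker's final program: the three links are made up, `BASE` is a hypothetical constant, and an in-memory `io.StringIO` buffer stands in for the real CSV file so the example runs without touching the disk or the network.

```python
import csv
import io

BASE = "https://www.census.gov"  # assumed site root
links = {"/data.html", "/about.html", "https://www.census.gov/popest"}  # made-up sample

buffer = io.StringIO()  # stand-in for open('check.csv', 'w', newline='')
writer = csv.writer(buffer)

ctr = 0
for href in sorted(links):      # the for statement is what defines `href`
    if href.startswith("/"):    # a plain string argument, not a list
        href = BASE + href
    writer.writerow([href])     # one url per row, in a single cell
    ctr += 1

print("Number of urls written to CSV:", ctr)
print(buffer.getvalue())
```

The name after `for` is created by the loop itself, which is why nothing needs to be defined above it. `sorted` is used here only to make the row order predictable, since sets have no guaranteed order.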