Python: HTML scraper to extract information between certain words from multiple pages with the same base URL

Question

Here is what I have so far:

import csv
import requests 
from bs4 import BeautifulSoup
from itertools import izip

grant_number = ['0901289','0901282','0901260']
#IMPORTANT NOTE: PLACE GRANT NUMBERS BETWEEN STRINGS WITH NO SPACES

start = 'this site'
end = 'Please report errors'
#start and end show the words that come right before the publication data
my_string = []
#my_string is an empty list for the publication data


for x in grant_number:      # Number of pages plus one 
    url = "http://nsf.gov/awardsearch/showAward?AWD_ID={}".format(x)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    soup_string = str(soup)
    my_string[int(x)] = soup_string[(soup_string.index(start)+len(start)):soup_string.index(end)]
with open('NSF.csv', 'wb') as f:
    #Default Filename is NSF.csv ; This can be changed by editing the first field after 'open('
    writer = csv.writer(f)
    writer.writerows(izip(grant_number, my_string))
#this imports the lists into a csv file with two columns, grant number on left, publication data on right

Python is telling me that in

line 26, in
    my_string[int(x)] = soup_string[(soup_string.index(start)+len(start)):soup_string.index(end)]
IndexError: list assignment index out of range

How do I fix this?

score 1 · Accepted Answer · answered Jun 05 '16 at 19:45

The problem is that my_string[int(x)] = ... tries to assign to index int(x) of my_string, but my_string is an empty list, so that index does not exist. Python lists do not grow on index assignment; assigning to a missing index raises IndexError.

You probably want to append to your initially empty list instead.

for x in grant_number:      # Number of pages plus one 
    url = "http://nsf.gov/awardsearch/showAward?AWD_ID={}".format(x)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    soup_string = str(soup)
    my_string.append(soup_string[(soup_string.index(start)+len(start)):soup_string.index(end)])
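
A minimal sketch of the same fix in a form that can be tested without network access, assuming Python 3 (where the built-in zip replaces itertools.izip and the csv module expects a text-mode file opened with newline=''). The helper names extract_between and write_rows are hypothetical and not part of the question's code. The helper uses str.find rather than str.index, so a page that lacks either marker yields an empty string instead of raising ValueError.

```python
import csv
import io


def extract_between(text, start, end):
    """Return the text between the first `start` and the next `end`.

    Returns '' if either marker is missing (an assumption made for this sketch).
    """
    begin = text.find(start)
    if begin == -1:
        return ''
    begin += len(start)
    stop = text.find(end, begin)
    if stop == -1:
        return ''
    return text[begin:stop]


def write_rows(f, grant_numbers, pages, start, end):
    """Write one CSV row per grant: grant number, then the extracted text."""
    writer = csv.writer(f)
    for number, page in zip(grant_numbers, pages):
        writer.writerow([number, extract_between(page, start, end)])


# Demo with an in-memory file and a made-up page string instead of a real request.
buf = io.StringIO()
write_rows(buf, ['0901289'], ['this siteXPlease report errors'],
           'this site', 'Please report errors')
```

With real pages, pages would be the list built by the loop above, and f would come from open('NSF.csv', 'w', newline='').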

Jeremy Gordon


Thank you. This is awesome! Any idea how to search between those two phrases without getting back HTML code, just strings? – epicdoe Jun 05 '16 at 19:50
Once you have your soup object you can get the textual data in a number of ways, e.g. soup_string = soup.get_text(' ', strip=True), which drops the tags and keeps only the text. More on identifying what's visible here: http://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text – Jeremy Gordon Jun 05 '16 at 19:58
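
As a rough illustration of the text-only approach discussed in these comments, the sketch below removes script and style elements, flattens the page with BeautifulSoup's get_text, and then slices between the two marker phrases. The function name visible_text_between and the sample HTML are hypothetical; the real NSF pages may need different markers or extra cleanup.

```python
from bs4 import BeautifulSoup


def visible_text_between(html, start, end):
    """Return the tag-free text found between `start` and `end`, or '' if missing."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents are never shown to the reader.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join all remaining text fragments with single spaces.
    text = soup.get_text(" ", strip=True)
    begin = text.find(start)
    if begin == -1:
        return ""
    begin += len(start)
    stop = text.find(end, begin)
    if stop == -1:
        return ""
    return text[begin:stop].strip()


# Made-up page used only to demonstrate the helper.
sample_html = (
    "<html><head><script>var x = 1;</script></head><body>"
    "<p>this site</p><ul><li>Paper <b>A</b></li></ul>"
    "<p>Please report errors</p></body></html>"
)
result = visible_text_between(sample_html, "this site", "Please report errors")
```

In the loop from the answer, r.content would take the place of sample_html.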
