Hi, I have written a program that scrapes a page for URLs by extracting the "href" attribute of each link and appending it to a base URL. Each URL is then written into a cell in a Google Sheet via gspread.
The problem I am having is that every time I run the program it starts from row 1 again. So I want to find the first empty cell in the output column and resume the program from there.
entire_wks = gsr.fetchEnitreSheet()
numrows = len(entire_wks.col_values(1))
for x in range(1, numrows + 1):
    print(x)
    chem = entire_wks.cell(x, 1).value
    for item in soup.find_all('a'):
        if chem in str(item):
            url = base_url + item.get('href')  # pulls the href from the web page
            print("updating cell, row=", x, "with url=", url)
            entire_wks.update_cell(x, 2, url)
            time.sleep(1)  # just to stop the Sheets API getting bombarded with too-frequent requests
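For reference, the href-extraction step can be sketched without a live page. This dependency-free version uses the standard library's html.parser in place of BeautifulSoup's find_all('a'); base_url and the HTML snippet here are made-up stand-ins, not from the actual site:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Collects the href attribute of every <a> tag, mirroring what
    # soup.find_all('a') plus item.get('href') does in the loop above.
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.hrefs.append(value)

base_url = "https://example.com"  # made-up stand-in
html = '<a href="/chem/acetone">acetone</a><a href="/chem/benzene">benzene</a>'

collector = LinkCollector()
collector.feed(html)
urls = [base_url + h for h in collector.hrefs]
print(urls)  # full URLs built the same way as base_url + item.get('href')
```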
So I think I need something like this:
numrows = len(entire_wks.col_values(1))
last_cell = entire_wks.col(1).get_highest_row()  ### I MADE THIS UP ###
for x in range(last_cell, numrows + 1):
    # then the rest of the code to insert the new URLs into the blank cells
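One possible angle, for comparison: gspread has no get_highest_row(), but Worksheet.col_values() already gives you most of what you need, since it returns a column's values with trailing empty cells trimmed. A minimal sketch of the resume logic, written as a helper over the plain list so it can be shown without live credentials (the behaviour for mid-column gaps assumes col_values() returns '' for empty cells before the last filled one):

```python
def first_empty_row(col_values):
    """Return the 1-based row number of the first empty cell.

    col_values is the list returned by Worksheet.col_values(),
    which trims trailing empty cells but (assumed here) keeps ''
    for gaps in the middle of the column.
    """
    for i, value in enumerate(col_values, start=1):
        if value in (None, ''):
            return i            # a gap inside the column
    return len(col_values) + 1  # column is solid; resume just past it

# With a live worksheet this would look something like:
#   start = first_empty_row(entire_wks.col_values(2))
#   for x in range(start, numrows + 1):
#       ...
print(first_empty_row(['url1', 'url2', 'url3']))  # → 4
print(first_empty_row(['url1', '', 'url3']))      # → 2
```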
[A screenshot of the Google Spreadsheet]
Could anyone enlighten me on how I could go about this?