
Hi, I have written a program to scrape a page for URLs by extracting the "href" attribute of each link and appending it to a base URL. Each URL is then written into a cell on Google Sheets via gspread.

The problem I am having is that every time I run the program it starts from row 1 again. So I want to find the first row whose URL cell is still empty, and have the program resume from there.

entire_wks = gsr.fetchEnitreSheet()

numrows = len(entire_wks.col_values(1))

for x in range(1, numrows + 1):
    print(x)
    chem = entire_wks.cell(x, 1).value
    for item in soup.find_all('a'):
        if chem in str(item):
            url = base_url + item.get('href')  # pulls the href from the web page
            print("updating cell, row=", x, "with url=", url)
            entire_wks.update_cell(x, 2, url)
            time.sleep(1)  # just to stop the Sheets API getting bombarded with too frequent requests

So I think I need something like this:

numrows = len(entire_wks.col_values(1))
last_cell = entire_wks.col(1).get_highest_row()  ### I MADE THIS UP ###

for x in range(last_cell, numrows + 1):
    # then the rest of the code to insert the new URLs into the blank cells
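To make the idea concrete, here is a rough, untested sketch of what I'm imagining, reusing entire_wks, soup and base_url from the code above. It assumes that gspread's col_values() only returns values up to the last non-empty cell in a column, and that the URL column (column 2) is filled top-to-bottom with no gaps, so its length gives the last row that already has a URL:

import time

numrows = len(entire_wks.col_values(1))      # rows with a chemical name in column 1
last_filled = len(entire_wks.col_values(2))  # rows that already contain a URL in column 2

# resume on the first row after the last filled URL cell instead of row 1
for x in range(last_filled + 1, numrows + 1):
    chem = entire_wks.cell(x, 1).value
    for item in soup.find_all('a'):
        if chem in str(item):
            url = base_url + item.get('href')
            print("updating cell, row=", x, "with url=", url)
            entire_wks.update_cell(x, 2, url)
            time.sleep(1)  # avoid hitting the Sheets API too often

That way a re-run would skip over rows whose URL cell is already filled in, but I'm not sure whether this is the right approach.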

A screenshot of the Google Spreadsheet

Could anyone enlighten me on how I could go about this?

  • It starts from the first row because you tell it to in `for x in range(1,numrows+1):`. It's actually quite difficult to understand exactly what your objective looks like; please show a portion of the spreadsheet and the expected behaviour – roganjosh Feb 16 '19 at 12:06
  • I've edited the question to show a screenshot of the Google Sheet – Westworld Feb 16 '19 at 15:42
  • 1
    I think this as an answer here: https://stackoverflow.com/a/42476314/6241235 You want to find the next available row. – QHarr Feb 16 '19 at 16:03

0 Answers