I am currently spending the week in a place with very spotty internet service while trying to scrape online data for a project. In particular, I am visiting each URL from a list of URLs and scraping a specific piece of data from each website to put into a CSV. The list of URLs is fairly large (33,000+ URLs), and I am finding it difficult to pick up where I left off when the internet goes down. Is there a way to do this quickly? Here is what I have:
import csv

def makeCSV(csv_src):
    count = 40  # index of the next URL to process; everything before it is already in the CSV
    # new_src is the path to the file of URLs, defined elsewhere
    with open(new_src, 'r') as f, open(csv_src, 'a', newline='') as fcsv:
        wr = csv.writer(fcsv, quoting=csv.QUOTE_ALL)
        for i, url in enumerate(f):
            if i < count:
                continue  # skip URLs that were scraped before the connection dropped
            speaking, studying, entry, incorrect, correct = mineLearnerData(url)
            wr.writerow([speaking, studying, incorrect, correct])
            count += 1
            print(count)
f is the file of URLs I am using; for each URL in that file I send specific pieces of data to the CSV at csv_src. count is the index of the next URL to be looked at. Ideally I'd like to work that index out automatically with something like len(fcsv.readlines()), but I keep getting ascii errors when I try.
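For reference, what I was imagining for the len(fcsv.readlines()) idea is roughly the sketch below. rows_written is just a name I made up, and opening the CSV with utf-8 and newline='' is a guess at what might avoid the ascii errors, not something I know is right:

import csv

def rows_written(csv_src):
    # Count the rows already in the CSV so the scrape can resume at that index.
    # The explicit encoding/newline arguments are assumptions on my part.
    try:
        with open(csv_src, 'r', newline='', encoding='utf-8') as fcsv:
            return sum(1 for _ in csv.reader(fcsv))
    except FileNotFoundError:
        return 0  # nothing scraped yet, so start from the first URL

# and then the hard-coded count = 40 in makeCSV would become:
# count = rows_written(csv_src)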
Also, I'm open to suggestions regarding efficient ways to do this, as I am completely new to the data collection & cleaning process.