I am currently spending the week in a place with very spotty Internet service while trying to scrape online data for a project. In particular, I am visiting each URL from a list of URLs and scraping a specific piece of data from each site into a CSV. The list is fairly large (33,000+ URLs), and I am finding it difficult to pick up where I left off when the connection goes down. Is there a quick way to do this? Here is what I have:

def makeCSV(csv_src):
    #END_TOKEN = " __END__ENTRY__"
    with open(new_src, 'r') as f, open(csv_src, 'a') as fcsv:
        count = 40
        for i, url in enumerate(f):
            while i >= count and count < len(f.readlines()):
                count += 1
                wr = csv.writer(fcsv, quoting=csv.QUOTE_ALL)
                speaking, studying, entry, incorrect, correct = mineLearnerData(url)
                data = [speaking, studying, incorrect, correct]
                wr.writerow(data)
                #f2.write(str(entry + END_TOKEN) + '\n')
                print(count)
    f.close(); fcsv.close()

f is the file of URLs I am using; I scrape specific information from each URL in that file and write it to the CSV at csv_src. count is the index of the next URL to be looked at. Ideally I'd also like to be able to use something like len(fcsv.readlines()) to work out where I left off, but I keep getting ASCII decoding errors.
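
Roughly what I had in mind for counting the already-written rows (I am guessing that an explicit encoding would avoid the ASCII errors):

import csv

def rows_already_written(csv_src):
    # Count how many rows are already in the output CSV so scraping
    # can resume at that index in the URL list.
    try:
        with open(csv_src, 'r', newline='', encoding='utf-8') as fcsv:
            return sum(1 for _ in csv.reader(fcsv))
    except FileNotFoundError:
        return 0  # nothing written yet, start from the first URL

The idea would be that count = rows_already_written(csv_src) replaces the hard-coded 40.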

Also, I'm open to suggestions regarding efficient ways to do this, as I am completely new to the data collection & cleaning process.

  • Sometimes it is easier to read all rows into memory, add the new rows, and write all rows back to the file. – furas Nov 27 '15 at 04:47

1 Answer

Do not invoke f.readlines() more than once.

CSV is not a format suitable for modification. You should only use it for import/export of data.

For your use case, I would use a lightweight in-process database such as sqlite3, which provides transactions and crash recovery.
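
Something along these lines (a rough sketch, not a drop-in replacement: the table and column names are placeholders, and I'm assuming mineLearnerData returns the same five values as in your function):

import csv
import sqlite3

def scrape_with_resume(url_src, db_src, csv_src):
    conn = sqlite3.connect(db_src)
    conn.execute("""CREATE TABLE IF NOT EXISTS results (
                        url TEXT PRIMARY KEY,
                        speaking TEXT, studying TEXT,
                        incorrect TEXT, correct TEXT)""")

    # Load the URL list once; INSERT OR IGNORE makes reruns harmless.
    with open(url_src, 'r') as f:
        conn.executemany("INSERT OR IGNORE INTO results (url) VALUES (?)",
                         ((line.strip(),) for line in f if line.strip()))
    conn.commit()

    # Only rows that were never scraped are processed, so a crash or
    # dropped connection simply resumes where it stopped.
    todo = conn.execute("SELECT url FROM results WHERE speaking IS NULL").fetchall()
    for (url,) in todo:
        speaking, studying, entry, incorrect, correct = mineLearnerData(url)
        conn.execute("UPDATE results SET speaking=?, studying=?, incorrect=?, correct=? WHERE url=?",
                     (speaking, studying, incorrect, correct, url))
        conn.commit()  # one transaction per URL: at most one URL of work is lost

    # Export everything to CSV in one pass at the end.
    with open(csv_src, 'w', newline='') as fcsv:
        wr = csv.writer(fcsv, quoting=csv.QUOTE_ALL)
        wr.writerows(conn.execute(
            "SELECT speaking, studying, incorrect, correct FROM results WHERE speaking IS NOT NULL"))
    conn.close()

Re-running the script after a dropped connection just picks up the unscraped URLs, and the CSV is written in one go from the database at the end, which keeps it as a pure export.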

– Has QUIT--Anony-Mousse