
I have a list of 8,000 website URLs. I would like to scrape the text off of the websites and save everything as a CSV file. To do this I wanted to save each page's text in a list. This is my code so far, which is producing a "MemoryError".

import os
from splinter import *
import csv
import re
from inscriptis import get_text
from selenium.common.exceptions import WebDriverException


executable_path = {'executable_path' :'./phantomjs'}
browser = Browser('phantomjs', **executable_path)
links = []


with open('./Hair_Salons.csv') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    for row in spamreader:
        for r in row:
            links.append(r)

for l in links[:]:  # iterate over a copy: removing from a list while iterating over it skips items
    if 'yelp' in l:
        links.remove(l)

df = []

for k in links:
    temp = []
    temp2 = []
    browser.visit(k)

    if len(browser.find_link_by_partial_text('About'))>0:
        about = browser.find_link_by_partial_text('About')
        print(about['href'])
        try:
            browser.visit(about['href'])
            temp.append(get_text(browser.html)) # <----- This is where the error is occurring
        except WebDriverException:
            pass
    else:
        browser.visit(k)
        temp.append(get_text(browser.html))
    for s in temp:
        ss = re.sub(r'[^\w]', ' ', s)
        temp2.append(ss)

    temp2 = ' '.join(temp2)
    print(temp2.strip())

    df.append(temp2.strip())

with open('Hair_Salons text', 'w') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerow(df)

How can I avoid getting a memory error?

briyan
  • Send the data to a file during the loop rather than saving it all for later – doctorlove Jun 27 '17 at 14:28
  • @doctorlove how would I do that? I have tried it, but I seem to overwrite my file each time the loop runs. – briyan Jun 27 '17 at 14:32
  • You should be clearing your "browser = Browser('phantomjs', **executable_path)" every time you move on to the next site. Something like "driver.quit()" (see the sketch below). This is likely your memory issue. – chocksaway Jun 27 '17 at 14:34
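
A minimal sketch of the browser-recycling idea from the last comment, assuming splinter's browser.quit() and the same executable_path as in the question; the every-100-sites interval is an arbitrary choice for illustration:

from splinter import Browser

executable_path = {'executable_path': './phantomjs'}
browser = Browser('phantomjs', **executable_path)

for i, k in enumerate(links):
    # Restart PhantomJS periodically so its memory footprint cannot grow unbounded.
    # The 100-site interval is a guess; tune it to your machine.
    if i and i % 100 == 0:
        browser.quit()
        browser = Browser('phantomjs', **executable_path)
    browser.visit(k)
    # ... scrape and write out the page text here ...

browser.quit()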

1 Answer


If you can't hold all your data in memory, then don't. At a high level, your code has this structure:

for k in links:
    temp = []
    temp2 = []
    browser.visit(k)

    # do stuff that fills in temp

    for s in temp:
        ss = re.sub(r'[^\w]', ' ', s)
        temp2.append(ss)

    temp2 = ' '.join(temp2)
    print(temp2.strip())

    df.append(temp2.strip())

with open('Hair_Salons text', 'w') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerow(df)

So, you accumulate everything in df and only write it at the end - the data isn't used inside the loop at all. Instead of df.append(temp2.strip()), write each row to the file right there, as sketched below. Make sure you either open the file once, outside the loop (probably the more sensible option), or open it for appending inside the loop (using 'a' instead of 'w').
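
A minimal sketch of that restructuring, reusing the links list and browser object from the question (the 'About'-link handling is omitted for brevity, and the Hair_Salons_text.csv filename is an assumption):

import csv
import re

from inscriptis import get_text
from selenium.common.exceptions import WebDriverException

# Open the output file once, before the loop, so nothing piles up in memory.
with open('Hair_Salons_text.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for k in links:
        try:
            browser.visit(k)
            text = get_text(browser.html)
        except WebDriverException:
            continue  # skip sites that fail to load
        cleaned = re.sub(r'[^\w]', ' ', text).strip()
        wr.writerow([k, cleaned])  # one row per site, written immediately

Each row holds the URL and its cleaned text, so the output is a valid two-column CSV rather than one enormous row.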

doctorlove
  • I think I understand; however, the file isn't being opened every time the loop runs as it is now, is it? I was under the impression that it opens once, after all the text is in df. The memory issue seems to be at temp.append(get_text(browser.html)) – briyan Jun 27 '17 at 14:43
  • That is correct - you seem to open the file once, after (trying to) read all the data into memory. I am suggesting opening it once, before the loop, and writing one line at a time as you read the data. Or possibly re-opening it in the loop, but that's a bit daft. – doctorlove Jun 27 '17 at 14:45
  • Alright, I will try when I come home, and accept this if it works! – briyan Jun 27 '17 at 14:46