
I have the script below, which I currently use. It extracts data from a website using multiple URLs stored in an Excel file. I want to save the extracted data to an Excel file after each extraction.

In short, I need a program which:

  1. Opens the Excel file.
  2. Appends the extracted data as a new row at the bottom (see the openpyxl sketch after the script below).
  3. Closes the Excel file.

Currently, my code fetches all of the data first and only then writes it to a CSV file. Because of that, I am not able to extract information for more than 1000 links.

Thanks

# Imports added so the snippet is self-contained; the driver setup, exampleData
# and extract_record are defined elsewhere in the original script.
import pandas as pd
from pandas import ExcelWriter
from bs4 import BeautifulSoup

final = []      # list of URLs read from the Excel file
records = []    # extracted rows

for item in exampleData:
    final.append(item[0])

for url in final:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    results = soup.find_all('div', {'id': 'dp'})

    for item in results:
        record = extract_record(item)
        if record:
            records.append(record)

driver.close()

df = pd.DataFrame(records)
df.columns = ['item_name', 'price_Final', 'Final_bullet','prod_desc','Image_link', 'Stock', 'Prod_spec ','Quantity_avail','Prod_spec','Brand_info','Add_info','Delivered_by','BTG']
writer = ExcelWriter('Desktop/Data Scrap.xlsx')
df.to_excel(writer, 'Sheet1', index=False)
writer.save()
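
For reference, the open/append/close pattern from the requirements above could look something like the sketch below, using openpyxl (this is only an assumption about one possible approach, not part of the original script; append_record is a hypothetical helper, and the workbook path is taken from the code above):

from openpyxl import load_workbook

def append_record(path, record):
    # record is one extracted row (a list of values)
    wb = load_workbook(path)   # 1. open the existing workbook
    ws = wb.active
    ws.append(record)          # 2. write the record into the next empty row
    wb.save(path)              # 3. write the workbook back to disk

# e.g. called once per successfully extracted record:
# append_record('Desktop/Data Scrap.xlsx', record)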

  • I may be alone in this, but it's not clear to me from your description what problem you are describing with "Currently, my code fetches all of the data first and only then writes it to a CSV file. Because of that, I am not able to extract information for more than 1000 links." You show no code related to that (I don't see a CSV file being accessed), nor why 1000 links is a limitation. Also, you should call `driver.quit()`, not `driver.close()`; you are not cleaning up the driver session properly. – Booboo Nov 05 '20 at 11:36
  • I am very sorry for the confusion. With my current script, the extracted output is only saved to the CSV file after all the links have been successfully processed. Not a single row of data is stored if the script breaks for any reason. I want to modify my current script so that the program saves the file with every line and does not wait until all the code runs successfully. – Arshad Ali Nov 05 '20 at 11:57
  • The confusion arose because (1) you are writing an `xlsx` file, not a csv file, and (2) you said you can only process 1000 links but did not say why. Now it seems you are saying you need this just in case something goes wrong with the processing of one of the links. Of course, you can always have a `try/except` handler in your URL-processing loop, and if you get an error, either skip the URL or just break out of the loop. Wouldn't that work for you? – Booboo Nov 05 '20 at 12:33
  • @Booboo Thanks a lot for your reply. 1. I can extract details for 1000 links at a time; after the 1000th link my system hangs, I think due to a temporary file, but I don't know the exact reason. And I have a try/except handler in my code, and the reason for the break was always some sort of network failure, internet issue, etc., and once the code breaks I need to run the whole script again. I think there is no option to resume from where it left off. – Arshad Ali Nov 06 '20 at 07:13
  • See [Append existing excel sheet with new dataframe using python pandas](https://stackoverflow.com/questions/38074678/append-existing-excel-sheet-with-new-dataframe-using-python-pandas). Look at the response that talks about using a CSV file. That seems to be the more efficient way to go. When you are done, you can convert the CSV to an XLSX file. – Booboo Nov 06 '20 at 13:04
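
As an aside, the per-URL try/except handling suggested in the comments might look roughly like this sketch, reusing the names from the question's script (failed_urls is a hypothetical list added here for illustration):

failed_urls = []
for url in final:
    try:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        for item in soup.find_all('div', {'id': 'dp'}):
            record = extract_record(item)
            if record:
                records.append(record)
    except Exception as exc:      # e.g. a network failure or timeout
        failed_urls.append(url)   # remember the URL and keep going
        print(f'Failed on {url}: {exc}')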

1 Answer


See my comment about how you might modify your code to append to an Excel or CSV file. The following code takes your current code (but would be equally useful once your code is modified to append as you go along) and changes it to use a thread pool for retrieving URLs. Not only will this increase performance, but if the retrieval of one URL hangs, you still have N - 1 threads left for processing (assuming they don't start hanging, too).

The tricky bit, which also greatly increases performance, is to start up a Selenium driver only once per thread rather than for each URL request. A reference to the driver for each thread in the pool is kept in thread-local storage. The reference to the thread-local storage is ultimately removed (and, for good measure, garbage collection is run) to force the destructor call on the Driver instances, which ensures that driver.quit() is called for each driver instance.

Again, this code can be modified so that the results returned from each call to process_url are appended to an existing xlsx or csv file, rather than appending the results to records, building one large dataframe, and creating the spreadsheet in one shot.

from concurrent.futures import ThreadPoolExecutor, TimeoutError
from selenium import webdriver
from bs4 import BeautifulSoup
from pandas import ExcelWriter
import pandas as pd
import threading
import gc


threadLocal = threading.local()

class Driver:
    def __init__(self):
        # example of how you might be creating your driver:
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit() # clean up driver when we are cleaned up

def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver

def process_url(url):
    driver = create_driver()
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    results = soup.find_all('div',{'id':'dp'})
    records = []
    for item in results:
        record = extract_record(item)
        if record:
            records.append(record)
    return records

N_THREADS = 20 # play around with this number
TIMEOUT = 10 # maximum time for process_url to complete
records = []
with ThreadPoolExecutor(N_THREADS) as executor:
    futures = [executor.submit(process_url, url) for url in final]
    for future in futures:
        try:
            new_records = future.result(timeout=TIMEOUT)
            records.extend(new_records)
        except TimeoutError:
            print('TimeoutError: this thread may be permanently hung.')

# clean up drivers
threadLocal = None # run destructors
gc.collect()
df = pd.DataFrame(records)
df.columns = ['item_name', 'price_Final', 'Final_bullet','prod_desc','Image_link', 'Stock', 'Prod_spec ','Quantity_avail','Prod_spec','Brand_info','Add_info','Delivered_by','BTG']
writer = ExcelWriter('Desktop/Data Scrap.xlsx')
df.to_excel(writer,'Sheet1',index=False)
writer.save()

To Append to a CSV File As You Go Along

Change the loop that processes the Future instances to the following (this also requires import csv):

import csv  # needed for csv.writer below

for i, future in enumerate(futures):
    try:
        new_records = future.result(timeout=TIMEOUT)
        with open('Desktop/Data Scrap.csv', 'w' if i == 0 else 'a', newline='') as csvfile:
            csvwriter = csv.writer(csvfile, delimiter=',')
            if i == 0:
                csvwriter.writerow(['item_name', 'price_Final', 'Final_bullet','prod_desc','Image_link', 'Stock', 'Prod_spec ','Quantity_avail','Prod_spec','Brand_info','Add_info','Delivered_by','BTG'])
            for record in new_records:
                csvwriter.writerow(record)
    except TimeoutError:
        print('TimeoutError: this thread may be permanently hung.')
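
When all URLs have been processed, the CSV can then be converted to an XLSX file in one shot, as suggested in the comments; a minimal sketch, assuming the file paths used above:

import pandas as pd

# Read the accumulated CSV and write it out as a spreadsheet.
df = pd.read_csv('Desktop/Data Scrap.csv')
df.to_excel('Desktop/Data Scrap.xlsx', sheet_name='Sheet1', index=False)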