
I have written a Python script that scrapes products from AliExpress.

Here is my code:

from selenium.webdriver.edge.options import Options
from selenium import webdriver
from pymongo import MongoClient
from time import sleep
from lxml import html
import pandas as pd


options = Options()
options.headless = True
driver = webdriver.Edge(executable_path=r"C:\Users\aicha\Desktop\mycode\aliexpress_scrap\scrap\codes\msedgedriver",options=options)
url = 'https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&CatId=0&SearchText=bluetooth+earphones&ltype=wholesale&SortType=default&page={}'
baseurl = 'https://www.aliexpress.com'

results = []

for page_nb in range(1, 2):
    print('---', page_nb, '---')

    driver.get(url.format(page_nb))
    sleep(2)

    # Scroll one viewport at a time until the page stops growing,
    # so every lazy-loaded product card is rendered.
    current_offset = 0
    while True:
        driver.execute_script("window.scrollBy(0, window.innerHeight);")
        sleep(.5)  # give the JavaScript time to add elements
        new_offset = driver.execute_script("return window.pageYOffset;")
        print(new_offset, current_offset)
        if new_offset <= current_offset:
            break
        current_offset = new_offset

    sleep(3)

    tree = html.fromstring(driver.page_source)

    for product in tree.xpath('//div[@class="JIIxO"]//a'):
        title = product.xpath('.//h1/text()')
        
        if title:
            title = title[0]
            
            price = product.cssselect('div.mGXnE._37W_B span')
            price = [x.text for x in price]

            currency = price[0]
            price = ''.join(price[1:])
            stars = product.xpath('.//span[@class="eXPaM"]/text()')
            if stars:
                stars = stars[0]
            else:
                stars = 'None'
                
            nb_sold = product.xpath('.//span[@class="_1kNf9"]/text()')
            if nb_sold:
                nb_sold = nb_sold[0]
            else:
                nb_sold = 'None'
            supl = product.xpath('.//a[@class="ox0KZ"]/text()')
            if supl:
                supl = supl[0]
            else:
                supl = 'None'

            ship_cost = product.xpath('.//span[@class="_2jcMA"]/text()')
            if ship_cost:
                ship_cost = ship_cost[0]
            else:
                ship_cost = 'None'
            
            product_links = product.xpath('./@href')
            if product_links:
                product_links = baseurl + product_links[0]
            
            row = [title, price, currency, stars, nb_sold, ship_cost, supl, product_links]
            results.append(row)
            print('len(results):', len(results))

driver.quit()  # close the browser once all pages are scraped
df = pd.DataFrame(results, columns=["Title", "Price", "Currency", "Stars", "Orders", "Shipcost", "Supplier", "Productlinks"])

####### Insert in database #############
client = MongoClient("mongodb://localhost:27017/")
collection = client['db2']['aliex2']
data = df.to_dict(orient='records')
collection.insert_many(data)

My question:

What I need is to add a timer that measures how long the whole process takes and returns the elapsed time. I also want a way to create a new collection after each scraping run, because when I run the code a second time, the data ends up in the old collection.

I appreciate any help. Thank you!

  • check here https://stackoverflow.com/questions/1557571/how-do-i-get-time-of-a-python-programs-execution –  Apr 12 '22 at 14:46
  • Does this answer your question? [How do I get time of a Python program's execution?](https://stackoverflow.com/questions/1557571/how-do-i-get-time-of-a-python-programs-execution) – HedgeHog Apr 12 '22 at 14:49

2 Answers


Your problem may be solved by the code below:

from selenium.webdriver.edge.options import Options
from selenium import webdriver
from pymongo import MongoClient
from time import sleep
from lxml import html
import pandas as pd
import time as Time


options = Options()
options.headless = True
driver = webdriver.Edge(executable_path=r"C:\Users\aicha\Desktop\mycode\aliexpress_scrap\scrap\codes\msedgedriver",options=options)
url = 'https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&CatId=0&SearchText=bluetooth+earphones&ltype=wholesale&SortType=default&page={}'
baseurl = 'https://www.aliexpress.com'

results = []

for page_nb in range(1, 2):
    print('---', page_nb, '---')

    driver.get(url.format(page_nb))
    sleep(2)

    # Scroll until the page stops growing so all products are loaded.
    current_offset = 0
    while True:
        driver.execute_script("window.scrollBy(0, window.innerHeight);")
        sleep(.5)  # give the JavaScript time to add elements
        new_offset = driver.execute_script("return window.pageYOffset;")
        print(new_offset, current_offset)
        if new_offset <= current_offset:
            break
        current_offset = new_offset

    sleep(3)

    tree = html.fromstring(driver.page_source)

    for product in tree.xpath('//div[@class="JIIxO"]//a'):
        start_time = Time.time()  # start the per-product timer
        title = product.xpath('.//h1/text()')
        
        if title:
            title = title[0]
            
            price = product.cssselect('div.mGXnE._37W_B span')
            price = [x.text for x in price]

            currency = price[0]
            price = ''.join(price[1:])
            stars = product.xpath('.//span[@class="eXPaM"]/text()')
            if stars:
                stars = stars[0]
            else:
                stars = 'None'
                
            nb_sold = product.xpath('.//span[@class="_1kNf9"]/text()')
            if nb_sold:
                nb_sold = nb_sold[0]
            else:
                nb_sold = 'None'
            supl = product.xpath('.//a[@class="ox0KZ"]/text()')
            if supl:
                supl = supl[0]
            else:
                supl = 'None'

            ship_cost = product.xpath('.//span[@class="_2jcMA"]/text()')
            if ship_cost:
                ship_cost = ship_cost[0]
            else:
                ship_cost = 'None'
            
            product_links = product.xpath('./@href')
            if product_links:
                product_links = baseurl + product_links[0]
            difference_time = Time.time() - start_time  # time taken to scrape this product

            row = [title, price, currency, stars, nb_sold, ship_cost, supl, product_links, difference_time]
            results.append(row)
            print('len(results):', len(results))

driver.quit()  # close the browser once all pages are scraped
df = pd.DataFrame(results, columns=["Title", "Price", "Currency", "Stars", "Orders", "Shipcost", "Supplier", "Productlinks", "Time Taken"])

####### Insert in database #############
client = MongoClient("mongodb://localhost:27017/")
collection = client['db2']['aliex2']
data = df.to_dict(orient='records')
collection.insert_many(data)

I measured the time taken by each iteration of the loop and stored that difference in the dataframe.

Devam Sanghvi
  • Actually it works, but I want to print the elapsed time in my terminal, not store it in my database –  Apr 13 '22 at 09:39
  • what do you mean by terminal? – Devam Sanghvi Apr 13 '22 at 09:51
  • From where you want to check time write this code:`start_time = Time.time()` and at last you can find difference time using `diffrence_time = Time.time() - start_time` – Devam Sanghvi Apr 13 '22 at 09:53
  • Thank you it works for this question. But I need also a method to create new collection after every execution code –  Apr 13 '22 at 10:19
  • Glad to help you. Please create a new question for the other problem; I will surely help you – Devam Sanghvi Apr 13 '22 at 10:27
  • It is already in my question –  Apr 13 '22 at 10:27
  • But I can not understand your second requirement what you want as output? all scrap data and time is stored as dataframe is it not ok? – Devam Sanghvi Apr 13 '22 at 10:34
  • I want after each execution to store data in a new collection because when I execute the code for the second time, I get the data in the same collection. –  Apr 13 '22 at 10:46
  • Is it fix or dynamic that how many time you will execute the code? – Devam Sanghvi Apr 13 '22 at 10:58
  • No, they are two separate questions. –  Apr 13 '22 at 10:59
  • so please create a new question with all information and requirement If you need Help – Devam Sanghvi Apr 13 '22 at 11:01
  • Here is the link of my question https://stackoverflow.com/questions/71870535/how-to-create-new-collection-datatabase-after-each-scraping-execution. I would be grateful for any help –  Apr 14 '22 at 11:00
  • I think your question is solved by the answer of @Leon Menkreo. I will be glad to help if you need anything more. – Devam Sanghvi Apr 14 '22 at 11:33
  • No, I still need help because I don't know how to edit my Django view –  Apr 14 '22 at 11:52
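The second request discussed in this thread — writing each run into a fresh collection — can be sketched by deriving the collection name from a timestamp. This is only a minimal sketch: the `aliex` prefix and the `db2` database are placeholders matching the question's code, and the pymongo usage assumes the same local server.

```python
from datetime import datetime

def run_collection_name(prefix="aliex"):
    # Build a name unique to this run, e.g. "aliex_20220413_101900",
    # so repeated executions never write into the same collection.
    return f"{prefix}_{datetime.now():%Y%m%d_%H%M%S}"

# With pymongo (assuming a local server, as in the question):
# collection = MongoClient("mongodb://localhost:27017/")["db2"][run_collection_name()]
# collection.insert_many(data)
```

MongoDB creates the collection lazily on first insert, so no extra setup call is needed.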

For this, you can use the `time` module.

import time
#Beginning of your code
timeStart = time.time()
....
....
....
print("%s seconds elapsed " % (time.time() - timeStart))

#End of your code

At the beginning, you should obtain the current time and assign it to a variable. After that, subtract it from the current time at the end of your code.
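The same pattern can use `time.perf_counter()`, which is monotonic and intended for measuring elapsed intervals, unlike `time.time()`, which can jump if the system clock changes. The summation below is just a stand-in for the scraping work.

```python
import time

start = time.perf_counter()
total = sum(range(1_000_000))  # stand-in for the real scraping work
elapsed = time.perf_counter() - start
print(f"{elapsed:.3f} seconds elapsed")
```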