I am crawling data from a website and need to recover some links from a list of products. First, I identified one of the links with Inspect Element.

Then I used requests to fetch the full source code of that page as text:

import requests

source_code = requests.get(link)
plain_text = source_code.text  # response body decoded as a string

Then I searched for the link in my text editor, and it was not there. I am working with BeautifulSoup4, and I have already tried several different ways of crawling the page to get the list of products, but they all give the same result.

My suspicion is that the list of products is generated by some code (probably JavaScript) when someone opens the page, but I am not sure. I have spent several hours trying to make this work, so any hint would be appreciated.
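
To illustrate that suspicion, here is a minimal sketch using only the standard library. `SAMPLE_HTML`, `AnchorCollector`, and `static_hrefs` are hypothetical names made up for this example, and the sample page is invented, not the real site. It shows that a parser reading the raw source only sees anchors that are literally in the HTML, not the ones a script would add in a browser.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag present in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def static_hrefs(html):
    """Return the hrefs found in the HTML source, without running scripts."""
    parser = AnchorCollector()
    parser.feed(html)
    return parser.hrefs


# Hypothetical page: one static link, plus a script that would add a
# product link only when a browser executes it.
SAMPLE_HTML = """
<html><body>
<a href="/home">Home</a>
<div id="products"></div>
<script>
  var a = document.createElement("a");
  a.href = "/productDetails?id=1";
  document.getElementById("products").appendChild(a);
</script>
</body></html>
"""

hrefs = static_hrefs(SAMPLE_HTML)
print(hrefs)  # only the static link appears; the product link does not
```

If the real page behaves like this sample, the downloaded source will never contain the product links, whichever HTML parser is used on it.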

Renato Sanhueza
  • 534
  • 7
  • 27

1 Answer

Python never ceases to amaze me. After a lot of work, I can answer my own question: I found a Python library, Ghost.py, that drives a headless WebKit browser (in the same spirit as PhantomJS) and lets us run JavaScript code from inside a Python program:

from ghost import Ghost
import re

def filterProductLinks(links):  # filter out the useless links using a regex
    pLinks = []
    for l in links:
        # keep only URLs that contain "productDetails"
        if re.match(".*productDetails.*", str(l)):
            pLinks.append(l)
    return pLinks  # list of item URLs (40 max)

def getProductLinks(url):  # get the links generated by JavaScript code
    ghost = Ghost(wait_timeout=100)
    ghost.open(url)
    # run JavaScript in the loaded page and collect the href of every <a> element
    links = ghost.evaluate("""
                    var links = document.querySelectorAll("a");
                    var listRet = [];
                    for (var i=0; i<links.length; i++){
                        listRet.push(links[i].href);
                    }
                    listRet;
                """)
    # the first element of the returned value holds the script's result
    pLinks = filterProductLinks(links[0])
    return pLinks

#Test
pLinks= getProductLinks('http://www.lider.cl/walmart/catalog/category.jsp?id=CF_Nivel3_000042&pId=CF_Nivel1_000003&navAction=jump&navCount=0#categoryCategory=CF_Nivel3_000042&pageSizeCategory=20&currentPageCategory=1&currentGroupCategory=1&orderByCategory=lowestPrice&lowerLimitCategory=0&upperLimitCategory=0&&504')
for l in pLinks:
    print(l)
print(len(pLinks))

The JavaScript code is not mine; I took it from the Ghost.py documentation.
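
The filtering step can be tried without a browser. The following is a minimal sketch, not part of the original solution: `filter_product_links` and the `example.com` URLs are hypothetical, and removing duplicates is an extra step added here on the assumption that the same product may be linked more than once on a page.

```python
import re

# Same idea as the regex in the answer: keep URLs containing "productDetails".
PRODUCT_PATTERN = re.compile(r"productDetails")


def filter_product_links(links):
    """Keep product-detail URLs, preserving order and dropping duplicates."""
    seen = set()
    result = []
    for link in links:
        link = str(link)
        if PRODUCT_PATTERN.search(link) and link not in seen:
            seen.add(link)
            result.append(link)
    return result


# Hypothetical input, standing in for the list returned by the browser.
sample = [
    "http://example.com/catalog/productDetails.jsp?id=1",
    "http://example.com/help",
    "http://example.com/catalog/productDetails.jsp?id=1",
    "http://example.com/catalog/productDetails.jsp?id=2",
]
filtered = filter_product_links(sample)
print(filtered)
```

Using `re.search` avoids the `.*` padding that `re.match` needs, since `search` looks for the pattern anywhere in the string.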