I'm trying to scrape the titles of all the products listed on a page of an e-commerce site (in this case, Flipkart). The products to be scraped depend on the keyword entered by the user. A typical URL generated for the keyword 'XYZXYZ' would be:

http://www.flipkart.com/search?q=XYZXYZ&as=off&as-show=on&otracker=start

Using this link as a template, I wrote the following script to scrape the titles of all the products listed on the results page for the keyword entered:

import requests
from bs4 import BeautifulSoup

def flipp(k):
    url = "http://www.flipkart.com/search?q=" + str(k) + "&as=off&as-show=on&otracker=start"
    ss = requests.get(url)
    src = ss.text
    obj = BeautifulSoup(src)
    for e in obj.findAll("a", {'class' : 'lu-title'}):
        title = e.string
        print unicode(title)

h = raw_input("Enter a keyword:")
print flipp(h)

However, the above script prints None as its output. When I tried to debug each step, I found that the requests module seems unable to get the source code of the webpage. What is happening here?

Manas Chaturvedi

2 Answers

This does the trick:

import requests
from bs4 import BeautifulSoup
import re

def flipp(k):
    url = "http://www.flipkart.com/search?q=" + str(k) + "&as=off&as-show=on&otracker=start"
    ss = requests.get(url)
    src = ss.text
    obj = BeautifulSoup(src)
    for e in obj.findAll("a",class_=re.compile("-title")):
        title = e.text
        print title.strip()

h = raw_input("Enter a keyword:") # I used 'Python' here
print flipp(h)

Out[1]:
Think Python (English) (Paperback)
Learning Python (English) 5th  Edition (Hardcover)
Python in Easy Steps : Makes Programming Fun ! (English) 1st Edition (Paperback)
Python : The Complete Reference (English) (Paperback)
Natural Language Processing with Python (English) 1st Edition (Paperback)
Head First Programming: A learner's guide to programming using the Python language (English) 1st  Edition (Paperback)
Beginning Python (English) (Paperback)
Programming Python (English) 4Th Edition (Hardcover)
Computer Science with Python Language Made Simple - (Class XI) (English) (Paperback)
HEAD FIRST PYTHON (English) (Paperback)
Raspberry Pi User Guide (English) (Paperback)
Core Python Applications Programming (English) 3rd  Edition (Paperback)
Write Your First Program (English) (Paperback)
Programming Computer Vision with Python (English) 1st Edition (Paperback)
An Introduction to Python (English) (Paperback)
Fundamentals of Python: Data Structures (English) (Paperback)
Think Complexity (English) (Paperback)
Foundations of Python Network Programming: The comprehensive guide to building network applications with Python (English) 2nd Edition (Soft Cover)
Python Programming for the Absolute Beginner (English) (Paperback)
EXPERT PYTHON PROGRAMMING BEST PRACTICES FOR DESIGNING,CODING & DISTRIBUTING YOUR PYTHON 1st Edition (Paperback)
None
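
The class name of the title links appears to vary with the keyword (lu-title for some searches, pu-title for others), which is why the code above matches on a pattern rather than on one fixed class. The following is a minimal offline sketch of that idea, not the answer's own method: it swaps BeautifulSoup for the standard library's html.parser, is written for Python 3, and runs on a made-up HTML snippet. The names TitleLinkParser, extract_titles, and SAMPLE_HTML are hypothetical, and the pattern is anchored to the end of each class name, which is slightly stricter than the unanchored "-title" used above.

```python
import re
from html.parser import HTMLParser

# Assumption: product title links carry a class ending in "-title"
# (for example lu-title or pu-title, as seen in this thread).
TITLE_CLASS = re.compile(r"-title$")


class TitleLinkParser(HTMLParser):
    """Collects the text of <a> tags that have a class ending in '-title'."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # text chunks gathered while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        classes = (dict(attrs).get("class") or "").split()
        if any(TITLE_CLASS.search(name) for name in classes):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.titles.append("".join(self._buffer).strip())
            self._buffer = None


def extract_titles(html):
    """Returns the stripped link text of every matching <a> tag, in order."""
    parser = TitleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Made-up markup standing in for a search results page.
SAMPLE_HTML = """
<div><a class="lu-title" href="/p/1"> Think Python </a></div>
<div><a class="pu-title fk-font-13" href="/p/2">Some Phone</a></div>
<div><a class="other" href="/p/3">Not a product title</a></div>
"""

print(extract_titles(SAMPLE_HTML))
```

Because the function returns a list instead of printing, an empty result is easy to tell apart from a failed request when debugging.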
Md. Mohsin
  • This solution returns None as well. Also, I couldn't find any class named 'pu-title' in the source code. – Manas Chaturvedi Sep 28 '14 at 09:06
  • This code works perfectly for me, so we must be doing something differently. Can you please let me know which keyword you are using? – Md. Mohsin Sep 28 '14 at 09:15
  • Tried inputting 'java' and 'python'. The script returns None. And I still couldn't find any class named 'pu-title' in the source code of the webpage. Can you explain how you came up with the parameters inside your findAll function ? – Manas Chaturvedi Sep 28 '14 at 09:18
  • I see it now: changing the keyword changes the class as well. I searched for 'mobile' and it had "pu-title"; with 'python' it's "lu-title". Let me repost my answer in a better way, give me a few minutes. – Md. Mohsin Sep 28 '14 at 09:24
  • I'm not sure what's wrong, but I still get None as my output even with your edited solution. This is really strange ! – Manas Chaturvedi Sep 28 '14 at 13:40
  • Which Python version are you using? What OS? How are you running the script? – Md. Mohsin Sep 28 '14 at 14:07
  • I'm using Python 2.7 to run my scripts on Windows using IDLE. – Manas Chaturvedi Sep 28 '14 at 14:14
  • And you are using the code (updated in my answer above) as it is, without changing anything, and it yields only None as output? Are you sure about it? I tried three times and got the same output each time. – Md. Mohsin Sep 28 '14 at 15:17
  • I know it's hard to believe, but I'm using the exact same code that you posted as the solution. I don't see any reason why the code shouldn't work, and the solution seems correct, so I'm going to accept this as the correct answer anyway. But I'm still getting None as output for some weird reason. – Manas Chaturvedi Sep 28 '14 at 15:21
  • Well, it's really hard to imagine. I am trying to guess all the possible reasons, but I couldn't think of any. I am sure the code is right. An error message would at least be helpful, but you say there is no error :-( Do let me know if you find something, I would be glad to help. – Md. Mohsin Sep 28 '14 at 15:27

The problem is that flipp has no return statement and you're therefore printing None (which is the default return value of any Python function in the absence of a return statement).
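
As a small illustration of that point, here is a sketch in Python 3 syntax (the code in the question is Python 2). The function names print_titles and collect_titles and the sample list are made up for the example and stand in for the scraped titles; no network request is involved.

```python
def print_titles(titles):
    """Prints each title; there is no return statement, so the result is None."""
    for title in titles:
        print(title)


def collect_titles(titles):
    """Builds and returns a list, so the caller gets the data back."""
    return [title.strip() for title in titles]


# Made-up stand-in for titles pulled from a results page.
sample = ["  Think Python ", "Learning Python"]

result = print_titles(sample)  # the titles are printed as a side effect
print(result)                  # the implicit return value of print_titles

cleaned = collect_titles(sample)
print(cleaned)                 # the caller decides what to do with the list
```

Calling the first function on its own line, without wrapping it in print, would also avoid the trailing None; returning a list is simply the more reusable design.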

It could be that you're using keywords that have no results but I'm getting a page back with that script just fine.

Simeon Visser
  • I don't think that the absence of a return statement is a problem here. I wrote a similar crawler for extracting data from Amazon and it works just fine. – Manas Chaturvedi Sep 28 '14 at 09:10
  • @ManasChaturvedi that's fine but in this code you're not returning anything from the function and that's why you're getting `None`. I don't see how you can claim it's not the problem when, in fact, that is the reason why you're getting `None`. – Simeon Visser Sep 28 '14 at 13:21
  • I replaced 'print title' with 'return title', expecting the first product's title on that webpage to be displayed, followed by termination of the loop. However, I still get None in my output. – Manas Chaturvedi Sep 28 '14 at 13:50