
There are quite a few similar questions about this, and I've been comparing mine against them (getting text from clustered nodes, etc.). Somehow, though, my for loop isn't iterating over all of the elements; it only grabs the text from the first element of the node, and I'm not sure why.

from requests import get
from bs4 import BeautifulSoup

url = 'https://shopee.com.my/'
l = []

headers = {'User-Agent': 'Googlebot/2.1 (+http://www.google.com/bot.html)'}

response = get(url, headers=headers)
html_soup = BeautifulSoup(response.text, 'html.parser')


def findDiv():
    try:
        for container in html_soup.find_all('div', {'class': 'section-trending-search-list'}):
            topic = container.select_one('div._1waRmo')
            if topic:
                print(1)
                d = {'Titles': topic.text.replace("\n", "")}
                print(2)
                l.append(d)
        return d
    except:
        d = None

findDiv()
print(l)

The HTML elements I'm trying to access (screenshot of the page markup omitted):

  • Shouldn't this line read `topic = container.select_one('._1waRmo')`, in other words, just the class name? Also, the line `html_soup.find_all('div', {'class': 'section-trending-search-list'})` will find only the root element; don't you need `html_soup.find_all('div')` to enumerate all divs? Or, if you want to enumerate everything under the div with class `_25qBG5`, find that first (call it, say, `toplevel`), then `options = toplevel.find_all('div')` and then `for option in options`. – DJanssens Jan 08 '19 at 08:07
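
A minimal sketch of the distinction the comment is drawing, assuming the `_25qBG5` and `_1waRmo` class names from the question's markup: `select_one` returns only the first match, while `select` (like `find_all`) returns every match.

# Illustrative only; the class names are taken from the question's
# screenshot and may have changed on the live site.
trending = html_soup.select_one('div._25qBG5')    # the single root node
if trending:
    for topic in trending.select('div._1waRmo'):  # ALL titles beneath it
        print(topic.text.strip())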

2 Answers


Try this: `toplevel` finds the root of the options, then we find all the divs under it. I hope this is what you want.

from requests import get
from bs4 import BeautifulSoup

url = 'https://shopee.com.my/'
l = []

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}

response = get(url, headers=headers)
html_soup = BeautifulSoup(response.text, 'html.parser')


def findDiv():
    try:
        toplevel = html_soup.find('._25qBG5')
        for container in toplevel.find_all('div'):
            topic = container.select_one('._1waRmo')
            if topic:
                print(1)
                d = {'Titles': topic.text.replace("\n", "")}
                print(2)
                l.append(d)
        return d
    except:
        d = None

findDiv()
print(l)

This enumerates fine with a local file. When I tried with the URL given, the website wasn't returning the HTML you show.
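
To see what the site actually returns, you can dump the response to disk and inspect it (a quick sketch; the dump.html filename is arbitrary):

# Save the raw response for offline inspection.
with open('dump.html', 'w', encoding='utf-8') as f:
    f.write(response.text)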

from requests import get
from bs4 import BeautifulSoup

url = 'path_in_here\\test.html'
l = []

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}

example = open(url, "r")
text = example.read()

#response = get(url, headers=headers)
#html_soup = BeautifulSoup(response.text, 'html.parser')
html_soup = BeautifulSoup(text, 'html.parser')

print(text)

def findDiv():
    #try:
        print("finding toplevel")
        toplevel = html_soup.find("div", {"class": "_25qBG5"})
        print("found toplevel")
        divs = toplevel.findChildren("div", recursive=True)
        print("found divs")

        for container in divs:
            print("loop")
            topic = container.select_one('._1waRmo')
            if topic:
                print(1)
                d = {'Titles': topic.text.replace("\n", "")}
                print(2)
                l.append(d)
        return d
    #except:
    #    d = None
    #    print("error")

findDiv()
print(l)
  • `toplevel = html_soup.find('._25qBG5')` sadly returns an empty value. It is what I'm looking for, and I understand the concept, but somehow it returns `None` when I call it. – Minial Jan 08 '19 at 08:23
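
For context on that `None`: BeautifulSoup's `find` expects a tag name, not a CSS selector, so `find('._25qBG5')` searches for a tag literally named `._25qBG5`. `select_one` is the CSS-selector counterpart; the three calls below illustrate the difference.

html_soup.find('._25qBG5')                   # None: no tag is named '._25qBG5'
html_soup.select_one('._25qBG5')             # CSS class selector, as intended
html_soup.find('div', {'class': '_25qBG5'})  # equivalent find() form
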
from requests import get
from bs4 import BeautifulSoup

url = 'https://shopee.com.my/'
l = []

headers = {'User-Agent': 'Googlebot/2.1 (+http://www.google.com/bot.html)'}

response = get(url, headers=headers)
html_soup = BeautifulSoup(response.text, 'html.parser')


def findDiv():
    try:
        for container in html_soup.find_all('div', {'class': '_25qBG5'}):
            topic = container.select_one('div._1waRmo')
            if topic:
                d = {'Titles': topic.text.replace("\n", "")}
                l.append(d)
        return d
    except:
        d = None

findDiv()
print(l)

Output:

[{'Titles': 'school backpack'}, {'Titles': 'oppo case'}, {'Titles': 'baby chair'}, {'Titles': 'car holder'}, {'Titles': 'sling beg'}]

Again, I suggest you use Selenium. If you run this again, you will see that you get a different set of 5 dictionaries in the list: every time you make a request, they serve 5 random trending items. But they do have a 'change' button. If you use Selenium, you might be able to just click that and keep scraping all trending items; a rough sketch of that idea follows.
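
A minimal sketch, assuming the `_25qBG5`/`_1waRmo` class names from above still apply; the `button.change-btn` selector is a hypothetical placeholder, since the real class name has to be read from the live page.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Firefox()
driver.get('https://shopee.com.my/')
time.sleep(5)  # crude wait for the JavaScript-rendered content to load

titles = set()
for _ in range(10):  # rotate the trending list a few times
    for el in driver.find_elements(By.CSS_SELECTOR, '._25qBG5 ._1waRmo'):
        titles.add(el.text.strip())
    # 'button.change-btn' is a HYPOTHETICAL selector for the 'change'
    # button; inspect the page to find the real one.
    driver.find_element(By.CSS_SELECTOR, 'button.change-btn').click()
    time.sleep(1)

driver.quit()
print(titles)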

Bitto
  • My mistake on the header, as I'd forgotten to re-edit it to include `'User-Agent': 'Googlebot/2.1 (+http://www.google.com/bot.html)'` – Minial Jan 08 '19 at 08:25
  • @Minial I will delete my answer and come up with the right solution. – Bitto Jan 08 '19 at 08:28
  • Thank you for the recommendation, I might read up more on **Selenium** because I'm sure it'll smooth out a lot of things for me in the future. – Minial Jan 08 '19 at 08:43