Web-scrape key words that are each on a different url with python and bs4

Question

I am currently working on a script that pulls info from this page.

I want to pull the alt attribute of each item's image, which I have successfully done with:

image = soup.find_all('img')
for i in image:
    print(i['alt'])

This gets me the code I need for each item. I also want to find the name of each item and append it to the code, so I know which code belongs to which item.

But I cannot find the item names in the HTML of the "all" shop page (which is what I am currently downloading with urllib2 and read()). The name only appears on the category page, or on the page you reach by clicking an item, where the size selection and add-to-cart button appear.

I want to print the item code together with the item name and color. I'm thinking of building the URLs for each category and collecting the information that way, but that would take me a while.

I was wondering if anyone could help me out and provide a quick and easy script I could use to perform these tasks.

I am using Python 2.7, bs4, and urllib2.

I have also attempted this script. Can anyone tell me why it does not work? I have been trying to fix it for hours.

from bs4 import BeautifulSoup as bs
import urllib2

URL1 = ('http://www.supremenewyork.com/shop/all/jackets')
sauce1 = urllib2.urlopen(URL1).read()
soup1 = bs(sauce1,'lxml')

for name1 in soup1.find_all(attrs={'class':'name-link'}):
    image1 = soup1.find_all('img')
    for i in image1:
        code1 = (i['alt'])+ '  '+(name1.text)
    print(code1)

(Not sure why the indents aren't correct on here.) The script prints name1.text perfectly, but i['alt'] ALWAYS prints the last value it can find, so the output ends up like this:

Bxvxpc8 dng  

Supreme®/Nike®/NBA Teams Warm-Up Jacket
Bxvxpc8 dng  Denim
Bxvxpc8 dng  Supreme®/Nike®/NBA Teams Warm-Up Jacket
Bxvxpc8 dng  White
Bxvxpc8 dng  Supreme®/Nike®/NBA Teams Warm-Up Jacket
Bxvxpc8 dng  Black
Bxvxpc8 dng  


Washed Work Trench Coat
Bxvxpc8 dng  

Floral
Bxvxpc8 dng  


Washed Work Trench Coat
Bxvxpc8 dng  Dusty Teal
Bxvxpc8 dng  


Washed Work Trench Coat
Bxvxpc8 dng  Black
Bxvxpc8 dng  Washed Work Trench Coat
Bxvxpc8 dng  White

I have tried switching the two variables around, but then the code prints fine and the name is always the last color it can find. Please help.

score 0 · Accepted Answer · answered Mar 12 '18 at 05:58

I've used the requests module instead of urllib2, which I recommend you use.

The basic strategy here is to get the links to all the jackets, and then scrape the name and style of each one individually.

Complete code:

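import requests
from bs4 import BeautifulSoup

# Fetch the category page that lists every jacket.
r = requests.get('http://www.supremenewyork.com/shop/all/jackets')
soup = BeautifulSoup(r.text, 'lxml')

# Each product tile is a div with class "inner-article".
for item in soup.find_all('div', class_='inner-article'):
    url = item.a['href']           # relative link to this item's own page
    alt = item.find('img')['alt']  # the code stored in the image's alt text
    # Fetch the item's page, where the name and style (color) appear.
    req = requests.get('http://www.supremenewyork.com' + url)
    jacket_soup = BeautifulSoup(req.text, 'lxml')
    name = jacket_soup.find('h1', itemprop='name').text
    style = jacket_soup.find('p', itemprop='model').text

    print(alt, name, style)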

Output:

Zbjng0wx ys Supreme®/Nike®/NBA Teams Warm-Up Jacket Denim
U 6wgdlykaw Supreme®/Nike®/NBA Teams Warm-Up Jacket White
Uqwzgxwn aw Supreme®/Nike®/NBA Teams Warm-Up Jacket Black
 ncksopv9nw Washed Work Trench Coat Floral
Iucyf1nlqi0 Washed Work Trench Coat Dusty Teal
Aiqn291frva Washed Work Trench Coat Black
Ttnqgbiqexi Washed Work Trench Coat White
Rlpgq3dzbdk Reflective Taping Hooded Pullover Orange
7cmo7ppbbv8 Reflective Taping Hooded Pullover Tan
Gyy1gqljohi Reflective Taping Hooded Pullover Green
Xgjdkznfyxi Reflective Taping Hooded Pullover Black
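
As for why the script in the question always shows the same code: the inner loop runs to completion for every name and overwrites the variable on each pass, so only the last alt survives by the time print is reached. Here is a minimal sketch of that behaviour using plain lists. The values in `codes` and `names` are made-up stand-ins, and `nested` and `paired` are hypothetical helpers, not part of the page or of bs4.

```python
# Made-up stand-ins for the img alt values and the link texts.
codes = ['Code1', 'Code2', 'Code3']
names = ['Jacket A', 'Jacket B', 'Jacket C']


def nested(codes, names):
    """Mimic the loop structure from the question."""
    rows = []
    for name in names:
        for code in codes:
            row = code + '  ' + name  # overwritten on every inner pass
        rows.append(row)              # only the last code is left here
    return rows


def paired(codes, names):
    """Walk both lists in step instead of nesting the loops."""
    return [code + '  ' + name for code, name in zip(codes, names)]


print(nested(codes, names))
print(paired(codes, names))
```

Note that `zip` only helps when the two lists line up one to one. On the real page the name and the color come back as separate `name-link` elements, so the lists would not line up, which is why the code above reads the `img` from inside each `inner-article` instead.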

Keyur Potdar

@Keyur Potdar Hey man, your solution worked, but when I print I get u's and lots of quotes and parentheses that I would like to get rid of. I tried name.remove('u') and things like that, but none of them work. This is what it prints: ('Zbjng0wx ys', u'Supreme\xae/Nike\xae/NBA Teams Warm-Up Jacket', u'Denim') – bobo Mar 12 '18 at 15:23
Don't worry about the u's; they just mark unicode strings. Have a look at this question: https://stackoverflow.com/questions/1207457/convert-a-unicode-string-to-a-string-in-python-containing-extra-symbols – Keyur Potdar Mar 12 '18 at 15:25
@Keyur Potdar OK, thank you. I just have one more question; you don't have to answer, but feel free to. I want this code to pull the info from the Supreme website in the United Kingdom (UK), since they release these codes and items earlier than the US and Canada websites and I could use them before the release. I looked some things up online and they mention proxies, but that looks very complicated to me. – bobo Mar 12 '18 at 15:28
I keep getting this error: print (alt, str(name), str(style)) UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 7: ordinal not in range(128). Please help, I just want to get rid of the u, the quotes, and the parentheses. – bobo Mar 12 '18 at 15:53
I've never encountered these problems, so I don't know which solution works. But have a look at this question: https://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20 . Just Google the errors and you will get good answers on SO. – Keyur Potdar Mar 12 '18 at 15:55
I have fixed the problem, you just have to get rid of the ( ) lmaooo – bobo Mar 13 '18 at 02:12
I just saw that you are using Python 2.x; the `print()` syntax is for Python 3.x – Keyur Potdar Mar 13 '18 at 02:12
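
To spell out what happened in this thread: on Python 2.7, `print(alt, name, style)` is the print statement applied to a tuple, so what gets printed is the tuple's repr. That is where the parentheses, quotes, and u prefixes come from. Calling `str()` on a unicode string that contains ® fails because it tries to encode with the ascii codec. A minimal sketch of one way around both, assuming the three values are already unicode strings (as bs4 returns them); `format_row` is a hypothetical helper name.

```python
def format_row(alt, name, style):
    """Hypothetical helper: join the three fields into one unicode string."""
    return u' '.join([alt, name, style])


row = format_row(
    u'Zbjng0wx ys',
    u'Supreme\xae/Nike\xae/NBA Teams Warm-Up Jacket',
    u'Denim',
)

# print with a single argument behaves the same on Python 2.7 and 3.x,
# because the parentheses are then just grouping, not a tuple.
print(row)
```

On Python 2.7 this can still raise UnicodeEncodeError if the terminal's encoding is ascii; in that case printing `row.encode('utf-8')` is one option.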
