
I'm scraping a page and found that with my XPath and regex methods I can't seem to get to a set of values that are within a div class.

I have tried the method described here: How to get all the li tag within div tag, as well as the current logic shown below, which is from my file.

    # PRODUCT ATTRIBUTES (STYLE, SKU, BRAND)
    # need to figure out how to loop thru a class and pull out the 2 list tags
    prodattr = re.compile(r'<div class=\"pdp-desc-attr spec-prod-attr\">([^<]+)</div>', re.IGNORECASE)
    prodattrmatches = re.findall(prodattr, html)
    for m in prodattrmatches:
        m = re.compile(r'<li class=\"last last-item\">([^<]+)</li>', re.IGNORECASE)
        stymatches = re.findall(m, html)

    # STYLE
    sty = re.compile(r'<li class=\"last last-item\">([^<]+)</li>', re.IGNORECASE)
    stymatches = re.findall(sty, html)

    # BRAND
    brd = re.compile(r'<li class=\"first first-item\">([^<]+)</li>', re.IGNORECASE)
    brdmatches = re.findall(brd, html)

The above is the current code that is NOT working; everything comes back empty. For the purpose of my testing I'm merely writing the data, if any, out with the print command so I can see it on the console.

    itmDetails2 = dets['sku'] +","+ dets['description']+","+ dets['price']+","+ dets['brand']

Within the console this is what I get, which is what I expect; the generic messages are just placeholders until I get this logic figured out.

SKUE GOES HERE,adidas Women's Essentials Tricot Track Jacket,34.97, BRAND GOES HERE

Here is the HTML block that contains the values I'm after:

<div class="pdp-desc-attr spec-prod-attr">
    <ul class="prod-attr-list">
        <li class="first first-item">Brand: adidas</li>
        <li>Country of Origin: Imported</li>
        <li class="last last-item">Style: F18AAW400D</li>   
    </ul>
</div>

2 Answers


Do not use Regex to parse HTML

There are better and safer ways to do this.

Take a look at this code, which uses Parsel and BeautifulSoup to extract the li tags from your sample HTML:

from parsel import Selector
from bs4 import BeautifulSoup

html = ('<div class="pdp-desc-attr spec-prod-attr">'
           '<ul class="prod-attr-list">'
             '<li class="first first-item">Brand: adidas</li>'
             '<li>Country of Origin: Imported</li>'
             '<li class="last last-item">Style: F18AAW400D</li>'
           '</ul>'
         '</div>')

# Using parsel
sel = Selector(text=html)

for li in sel.xpath('//li'):
    print(li.xpath('./text()').get())

# Using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

for li in soup.find_all('li'):
    print(li.text)

Output:

Brand: adidas
Country of Origin: Imported
Style: F18AAW400D
Brand: adidas
Country of Origin: Imported
Style: F18AAW400D
  • I can't install other applications without approval from upper management. So is the above example doable using Scrapy/Spyder/Anaconda? – CubanGT Apr 18 '19 at 20:21
  • I tried this and it still doesn't find it: xpath("//div[@class='pdp-desc-attr spec-prod-attr']//li/text()").extract() – CubanGT Apr 18 '19 at 21:05
  • Yes, the above code works with Scrapy, can you pass the URL you are trying to scrape? Maybe the content you want is generated by Javascript. – Luiz Rodrigues da Silva Apr 18 '19 at 21:35
  • r = requests.get('https://www.dickssportinggoods.com/p/adidas-womens-essentials-tricot-track-jacket-18adiwtrcttrckjckapo/18adiwtrcttrckjckapo') – CubanGT Apr 18 '19 at 21:37
  • If I do an "Inspect" in Chrome so that I go directly to that line of code, you will find the above example of the elements I'm looking at; if I right-click and select Copy XPath on that element, this is what is returned: //*[@id="container_3074457345618270305"]/div/div[2]/div[2]/div[3]/div[2]/div/div[2]/div[2]/ul/li[1] – CubanGT Apr 18 '19 at 21:40
  • Are you using Scrapy? Why are you using requests to get the response? – Luiz Rodrigues da Silva Apr 18 '19 at 21:46
  • I tried to run scrapy shell the_url_you_pasted and I was able to get the element using the xpath provided. – Luiz Rodrigues da Silva Apr 18 '19 at 21:47
  • I'm confused then; we are using Scrapy but can't seem to get to the elements we want using XPath or regex. I just tried the above suggestion with BeautifulSoup and that actually worked: # Using BeautifulSoup soup = BeautifulSoup(html, "html.parser") for li in soup.find_all('li'): print(li.text). BUT this returns ALL li elements on the page. How can I use the above to return only the 2 or 3 items we need? (see the scoped sketch below) – CubanGT Apr 19 '19 at 14:02
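
A minimal sketch addressing that last comment, assuming html already holds the fetched page source (e.g. requests.get(url).text as in the comment above): scope the search to the pdp-desc-attr spec-prod-attr div first, so only its li items come back. If the live page still returns nothing, double-check that the response actually contains the div (the comments above discuss whether it might be JavaScript-rendered).

from bs4 import BeautifulSoup

# Assumption: `html` is the fetched page source, e.g. html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")

# Limit the search to the attribute block instead of the whole page;
# class_ with the full string matches the exact class value from the question's HTML
attr_div = soup.find("div", class_="pdp-desc-attr spec-prod-attr")

details = {}
if attr_div is not None:
    for li in attr_div.find_all("li"):
        text = li.get_text(strip=True)        # e.g. "Brand: adidas"
        if ": " in text:
            key, value = text.split(": ", 1)  # split the label from the value
            details[key] = value

print(details)  # e.g. {'Brand': 'adidas', 'Country of Origin': 'Imported', 'Style': 'F18AAW400D'}

# The equivalent scoped XPath with parsel/Scrapy (from the comments above):
# response.xpath("//div[@class='pdp-desc-attr spec-prod-attr']//li/text()").extract()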

I would use an HTML parser and look for the class of the ul. Using bs4 4.7.1:

from bs4 import BeautifulSoup as bs

html = '''
<div class="pdp-desc-attr spec-prod-attr">
    <ul class="prod-attr-list">
        <li class="first first-item">Brand: adidas</li>
        <li>Country of Origin: Imported</li>
        <li class="last last-item">Style: F18AAW400D</li>   
    </ul>
</div>
'''

soup = bs(html, 'lxml')

for item in soup.select('.prod-attr-list:has(> li)'):
    print([sub_item.text for sub_item in item.select('li')])
  • When I try to install it I get this message: (base) C:\>pip install beautifulsoup4 Requirement already satisfied: beautifulsoup4 in c:\programdata\anaconda3\lib\site-packages (4.6.0) twisted 18.7.0 requires PyHamcrest>=1.9.0, which is not installed. distributed 1.21.8 requires msgpack, which is not installed. You are using pip version 10.0.1, however version 19.0.3 is available. You should consider upgrading via the 'python -m pip install --upgrade pip' command. – CubanGT Apr 19 '19 at 13:17
  • Check what version your bs4 is. The last part of the message is saying you can upgrade pip if you want – QHarr Apr 19 '19 at 13:33
  • How do I know if I already have it installed? I checked the path and see that there is a beautifulsoup4-4.6.0-py3.6.egg-info folder in site-packages – CubanGT Apr 19 '19 at 13:36
  • Package Version ---------------------------------- --------- asn1crypto 0.24.0 astroid 1.6.3 astropy 3.0.2 attrs 18.1.0 Automat 0.7.0 Babel 2.5.3 backcall 0.1.0 backports.shutil-get-terminal-size 1.0.0 beautifulsoup4 4.6.0 bitarray 0.8.1 – CubanGT Apr 19 '19 at 13:37
  • So you need to upgrade your version of bs4 – QHarr Apr 19 '19 at 13:38
  • Adding the bs4 import and just trying to create the soup, I get this error – CubanGT Apr 19 '19 at 13:38
  • from bs4 import beautifulsoup ImportError: cannot import name 'beautifulsoup' – CubanGT Apr 19 '19 at 13:39
  • It is BeautifulSoup – QHarr Apr 19 '19 at 13:43
  • I think I got it; at least it's not complaining about the import anymore. I'll try out the suggestion above and see if I can get it to work. – CubanGT Apr 19 '19 at 13:44
  • Yeah, I changed it to that and it's not complaining – CubanGT Apr 19 '19 at 13:44
  • Tried the above sample and received this error: 'Unsupported or invalid CSS selector: "%s"' % token) ValueError: Unsupported or invalid CSS selector: "li)" (see the workaround sketch below) – CubanGT Apr 19 '19 at 13:49
  • Also, can you provide the url? – QHarr Apr 19 '19 at 20:06
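
Regarding the "Unsupported or invalid CSS selector" error in the comments: the :has() pseudo-class needs bs4 4.7+ (which pulls in Soup Sieve), and the installed version reported above appears to be 4.6.0. A minimal sketch of a workaround that avoids :has(), assuming html holds the same sample markup as in the answer:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html.parser avoids the extra lxml dependency

# Find the ul by its class, then read its li children -- no :has() needed
ul = soup.find("ul", class_="prod-attr-list")
if ul is not None:
    print([li.get_text(strip=True) for li in ul.find_all("li")])
    # e.g. ['Brand: adidas', 'Country of Origin: Imported', 'Style: F18AAW400D']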