BeautifulSoup: Parse JavaScript dynamic content

Question

I am developing a Python web scraper with BeautifulSoup that parses "product listings" from this website and extracts some information for each listing (e.g., price, vendor). I can extract all of this information except one field, the product quantity, which seems to be hidden from the raw HTML. Looking at the webpage through my browser, what I see is (unid = units):

product_name       1 unid      $10.00

but the html for that doesn't show any integer value that I can extract. It shows this html text:

<div class="e-col5 e-col5-offmktplace ">
  <div class="kWlJn zYaQqZ gQvJw">&nbsp;</div> 
  <div class="imgnum-unid"> unid</div>
</div>

My question is how do I get this hidden content of e-col5 which stores the product quantity?

import re
import requests
from bs4 import BeautifulSoup

page = requests.get("https://ligamagic.com.br/?view=cards%2Fsearch&card=Hapatra%2C+Vizier+of+Poisons")
soup = BeautifulSoup(page.content, 'html.parser')
vendor = soup.find_all('div', class_="estoque-linha", mp="2")
print(vendor[1].find(class_='e-col1').find('img')['title'])
print(vendor[1].find(class_='e-col2').find_all(class_='ed-simb')[1].string)
print(vendor[1].find(class_='e-col5'))

EDIT: "Hidden content" here means content that is dynamically updated by JavaScript.

It appears that the supposedly hidden content is actually dynamically updated with JavaScript. — Luke, Dec 25 '18 at 18:54
What is the proper way to parse this type of content @LukaszSalitra? — delirium, Dec 25 '18 at 19:02
@delirium in the general case it's hard. In your specific case you may want to look into the JavaScript to see what it's doing and basically re-implement it in your parser. — rvs, Dec 25 '18 at 19:15

score 2 · Accepted Answer · answered Dec 25 '18 at 20:30

The unid (quantity) is stored in a JavaScript array in the page source:

vetFiltro[0]=["e3724364",0,1,....];

The third element (1) is the unid. You can extract it with a regex:

# e-col5
unitID = vendor[1].get('id').replace('line_', '') # line_e3724364 => e3724364
regEx = r'"%s",\d,(\d+)' % unitID
unit = re.search(regEx, page.text).group(1)
print(unit + ' unids')
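
To read the quantity for every listing rather than one, the same idea can be wrapped in a small helper. This is a minimal sketch, assuming the array entries keep the shape shown above (quoted id, one digit, then the quantity). The function name `quantities_by_id` and the sample text are made up for illustration.

```python
import re


def quantities_by_id(page_text, line_ids):
    """Map each listing id (e.g. 'line_e3724364') to its quantity, or None."""
    result = {}
    for line_id in line_ids:
        # line_e3724364 => e3724364
        unit_id = line_id.replace('line_', '', 1)
        # Same pattern as above: "<id>",<digit>,<quantity>
        match = re.search(r'"%s",\d,(\d+)' % re.escape(unit_id), page_text)
        result[line_id] = int(match.group(1)) if match else None
    return result


# Made-up sample that mimics the page's inline JavaScript
sample = 'vetFiltro[0]=["e3724364",0,1,5];vetFiltro[1]=["e1111111",0,12,3];'
print(quantities_by_id(sample, ['line_e3724364', 'line_e1111111', 'line_missing']))
```

Returning None for an id that is not found keeps one bad listing from stopping the whole run.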

answered Dec 25 '18 at 20:30

ewwink


Thanks for the help! How did you find that out? Can I process any other JavaScript fields like that (e.g., price)? – delirium Dec 25 '18 at 20:42
Unfortunately, I can't find a way to get the price. – ewwink Dec 25 '18 at 20:51
thanks anyway. Could you still comment on how you found out about `vetFiltro`? – delirium Dec 25 '18 at 20:55
Every vendor has an ID like `line_e3724364`; with the `line_` prefix removed, I searched for it in the page source. And you're welcome. – ewwink Dec 25 '18 at 20:59

Fabian · Answer 2 · 2018-12-25T20:28:41.300

1

If you take a closer look, the unid is rendered from an image: the div's background image contains the digits, and a CSS class shifts it to show the correct number.

For example unid 1:

.jLsXy {
    background-image: url(arquivos/up/comp/imgunid/files/img/181224lSfWip8i1lmcj2a520836c8932ewcn.jpg);
}

is the image containing numbers.

.gBpKxZ {
background-position: -424px -23px;
}

is the class for number 1

So find the CSS rule that matches each number and build a lookup table. This is the easy way, but not the best way.

Edit: It seems the position (class) changes each time the page is reloaded, so it is harder to match the number with the image :( The number 1 could be taken from many places.
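
If the class names change on reload, one option is to key the lookup table on the background position instead of the class name. This is only a rough sketch: it assumes the pixel offsets stay stable for a given sprite image, which has not been verified here, and that the page's CSS is available as text. `POSITION_TO_DIGIT` holds only the one pair given above, and the function names are invented for illustration.

```python
import re

# Only the pair from this answer is known; the remaining digits would have to
# be filled in by inspecting the sprite image (assumption: offsets are stable).
POSITION_TO_DIGIT = {(-424, -23): 1}

# Matches rules such as: .gBpKxZ { background-position: -424px -23px; }
RULE = re.compile(
    r'\.([A-Za-z0-9_-]+)\s*\{\s*background-position:\s*'
    r'(-?\d+)px\s+(-?\d+)px\s*;?\s*\}'
)


def class_positions(css_text):
    """Return {class_name: (x, y)} for every background-position rule."""
    return {name: (int(x), int(y)) for name, x, y in RULE.findall(css_text)}


def digit_for(div_classes, positions):
    """Look up the digit shown by a div, given its list of class names."""
    for name in div_classes:
        if name in positions:
            return POSITION_TO_DIGIT.get(positions[name])
    return None


css = """
.jLsXy { background-image: url(sprite.jpg); }
.gBpKxZ {
background-position: -424px -23px;
}
"""
positions = class_positions(css)
print(digit_for(['jLsXy', 'gBpKxZ'], positions))
```

Since the same digit may appear at several places in the image, the table can hold more than one position per digit.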

Edit 2: I was using Chrome DevTools. If you inspect the unid, you will find the CSS for each class as well. After opening the image URL, it was clear.

edited Dec 25 '18 at 20:28

answered Dec 25 '18 at 20:02

Fabian


Thanks for your help :) ! How did you discover that the number is an image? – delirium Dec 25 '18 at 20:24
@delirium check second edit :) if you need more explanation just ask me :) – Fabian Dec 25 '18 at 20:29

0x48piraj · Answer 3 · 2018-12-25T21:04:37.540

@ewwink found out the way to pull out unid but was unable to pull out prices. I have tried to pull out prices in this answer.

Target div snippet:

<div mp="2" id="line_e3724364" class="estoque-linha primeiro"><div class="e-col1"><a href="b/?p=e3724364" target="_blank"><img title="Rayearth Games" src="//www.lmcorp.com.br/arquivos/up/ecom/comparador/155937.jpg"></a></div><div class="e-col9-mobile"><div class="e-mob-edicao"><img src="//www.lmcorp.com.br/arquivos/up/ed_mtg/AKH_R.gif" height="19"></div><div class="e-mob-edicao-lbl"><p>Amonkhet</p></div><div class="e-mob-preco e-mob-preco-desconto"><font color="gray" class="mob-preco-desconto"><s>R$ 1,00</s></font><br>R$ 0,85</div></div><div class="e-col2"><a href="./?view=cards/search&amp;card=ed=akh" class="ed-simb"><img src="//www.lmcorp.com.br/arquivos/up/ed_mtg/AKH_R.gif" height="21"></a><font class="nomeedicao"><a href="./?view=cards/search&amp;card=ed=akh" class="ed-simb">Amonkhet</a></font></div><div class="e-col3"><font color="gray" class="mob-preco-desconto"><s>R$ 1,00</s></font><br>R$ 0,85</div>
                            <div class="e-col4 e-col4-offmktplace">
                                <img src="https://www.lmcorp.com.br/arquivos/img/bandeiras/pten.gif" title="Português/Inglês"> <font class="azul" onclick="cardQualidade(3);">SP</font>

                            </div>
                        <div class="e-col5 e-col5-offmktplace "><div class="cIiVr lHfXpZ mZkHz">&nbsp;</div> <div class="imgnum-unid"> unid</div></div><div class="e-col8 e-col8-offmktplace "><div><a target="_blank" href="b/?p=e3724364" class="goto" title="Visitar Loja">Ir à loja</a></div></div></div>

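If we look closely, we can see that the prices appear as plain text after each "R$", so a regular expression can pull them out of every listing: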

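# Each listing div has an id that starts with "line"
for item in soup.find_all('div', {"id": re.compile('^line')}):
    # Raw string so that \$ matches a literal dollar sign
    print(re.findall(r"R\$ (.*?)</div>", str(item), re.DOTALL))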

Output [truncated]:

['10,00</s></font><br/>R$ 8,00', '10,00</s></font><br/>R$ 8,00']
['9,50</s></font><br/>R$ 8,55', '9,50</s></font><br/>R$ 8,55']
['9,50</s></font><br/>R$ 8,55', '9,50</s></font><br/>R$ 8,55']
['9,75</s></font><br/>R$ 8,78', '9,75</s></font><br/>R$ 8,78']
[]
[]

This extracts large chunks that contain the prices, but it returns nothing for several listings.
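
The raw chunks can be reduced to numbers by matching only the amount that follows each `R$`. This is a minimal sketch, assuming prices always use the format seen above (comma as the decimal separator, and a dot as the thousands separator if one appears). `parse_prices` is a name made up for this example.

```python
import re
from decimal import Decimal

# Amount after "R$": digits (optionally with dots), a comma, two decimals
PRICE = re.compile(r'R\$\s*([\d.]+,\d{2})')


def parse_prices(html):
    """Return every 'R$ x,yy' amount in the markup as a Decimal, in order."""
    return [Decimal(m.replace('.', '').replace(',', '.'))
            for m in PRICE.findall(html)]


# e-col3 cell copied from the target div above
cell = ('<div class="e-col3"><font color="gray" class="mob-preco-desconto">'
        '<s>R$ 1,00</s></font><br>R$ 0,85</div>')
full_price, discounted = parse_prices(cell)
print(full_price, discounted)  # 1.00 0.85
```

This only helps for listings whose prices are present as text; the listings that return an empty list still need another approach.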

To get all the data, we can combine Selenium with an OCR API. The following snippet captures a screenshot of an element of interest:

from selenium import webdriver
from selenium.webdriver.common.by import By
from PIL import Image
from io import BytesIO

fox = webdriver.Firefox()
fox.get('https://ligamagic.com.br/?view=cards%2Fsearch&card=Hapatra%2C+Vizier+of+Poisons')
#element = fox.find_element(By.ID, 'line_e3724364')
# find_element (singular) returns one element; find_elements returns a list,
# which has no .location or .size
element = fox.find_element(By.TAG_NAME, 's')
location = element.location
size = element.size
png = fox.get_screenshot_as_png() # takes a screenshot and returns it as PNG bytes
fox.quit()

im = Image.open(BytesIO(png)) # uses PIL library to open image in memory

left = location['x']
top = location['y']
right = location['x'] + size['width']
bottom = location['y'] + size['height']


im = im.crop((left, top, right, bottom)) # defines crop points
im.save('screenshot.png') # saves new cropped image

Took help from https://stackoverflow.com/a/15870708.

We can iterate over the listings as we did above with re.findall() and save one image per element. Once we have all the images, we can use OCR.space to extract the text. Here's a quick snippet:

import requests


def ocr_space_file(filename, overlay=False, api_key='api_key', language='eng'):

    payload = {'isOverlayRequired': overlay,
               'apikey': api_key,
               'language': language,
               }
    with open(filename, 'rb') as f:
        r = requests.post('https://api.ocr.space/parse/image',
                          files={filename: f},
                          data=payload,
                          )
    return r.content.decode()

e = ocr_space_file(filename='1.png')

print(e) # prints JSON

1.png :

JSON response from ocr.space :

{"ParsedResults":[{"TextOverlay":{"Lines":[],"HasOverlay":false,"Message":"Text overlay is not provided as it is not requested"},"TextOrientation":"0","FileParseExitCode":1,"ParsedText":"RS 0',85 \r\n","ErrorMessage":"","ErrorDetails":""}],"OCRExitCode":1,"IsErroredOnProcessing":false,"ProcessingTimeInMilliseconds":"1996","SearchablePDFURL":"Searchable PDF not generated as it was not requested."}

It gives us, "ParsedText" : "RS 0',85 \r\n".
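
Since the response is JSON, the text can be pulled out with the standard `json` module. This is a minimal sketch using a shortened copy of the response above. The clean-up step that turns `RS 0',85` into a number is a guess at how to handle this one OCR misread, not a general fix, and both function names are made up here.

```python
import json
import re
from decimal import Decimal


def parsed_text(response_json):
    """Join the ParsedText of every result in the OCR JSON response."""
    data = json.loads(response_json)
    return ''.join(r.get('ParsedText', '') for r in data.get('ParsedResults', []))


def price_from_ocr(text):
    """Keep only digits and commas; return a Decimal, or None if unusable."""
    cleaned = re.sub(r'[^\d,]', '', text)
    if not re.fullmatch(r'\d+,\d{2}', cleaned):
        return None
    return Decimal(cleaned.replace(',', '.'))


# Shortened copy of the response shown above (raw string keeps \r\n as JSON escapes)
response = r'''{"ParsedResults":[{"ParsedText":"RS 0',85 \r\n","ErrorMessage":""}],"OCRExitCode":1,"IsErroredOnProcessing":false}'''

text = parsed_text(response)
print(repr(text))
print(price_from_ocr(text))
```

Stripping every non-digit character is lossy: if the OCR reads a digit as a letter, the result will be wrong or None, so the output is worth spot-checking against the page.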

Nice work! How do I associate images with each listing? – delirium Dec 26 '18 at 10:25
