
Here is my code:

import bs4 as bs
from urllib.request import urlopen

page = urlopen("https://www.netimoveis.com/locacao/minas-gerais/belo-horizonte/bairros/santo-antonio/apartamento/#1/").read()

soup = bs.BeautifulSoup(page, "lxml")

div_lista_locacao = soup.select("div#lista-locacao")[0]

ul_tags = list(div_lista_locacao.children)

print("ul_tags = ",ul_tags)

(You can see I printed a list containing the children of the div_lista_locacao).

The output:

ul_tags =  ['\n']

(And it only shows a line break, even though there are actual children to it as you can see below).

This is the HTML of my source:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" style="" class=" js flexbox flexboxlegacy canvas canvastext webgl no-touch geolocation postmessage no-websqldatabase indexeddb hashchange history draganddrop websockets rgba hsla multiplebgs backgroundsize borderimage borderradius boxshadow textshadow opacity cssanimations csscolumns cssgradients no-cssreflections csstransforms csstransforms3d csstransitions fontface generatedcontent video audio localstorage sessionstorage webworkers applicationcache svg inlinesvg smil svgclippaths"
  lang="pt">
<head></head>
<body id="topo_geral" itemscope="" itemtype="http://schema.org/WebPage">
  <div id="container-hero" class="container-fluid"></div>
  <div id="resultado" class="container-fluid page-container">
    <!-- DESKTOP -->
    <div id="banner-resultado" class="col col-xs-12 col-sm-12 col-md-12 col-lg-12 text-center hide"></div>
    <div class="row hidden-xs hidden-sm">
      <div class="col col-xs-12 col-sm-12 col-md-3 col-lg-3 filtro-resultado"></div>
      <div class="col col-xs-12 col-sm-12 col-md-9 col-lg-9 box-resultado-hidden-xs hidden-sm"></div>
      <button id="btn-ordenacao-por-valor" data-ordenar="asc" class="btn btn-valor btn-branco"></button>
      <ul class="nav nav-tabs" role="tablist" id="myTab"></ul>
      <div class="tab-content">
        <div role="tabpanel" class="tab-pane active" id="locacao">
          <!-- Currently manipulating the tag below. This is the "div_lista_locacao" variable. -->
          <div id="lista-locacao" class="col col-xs-12 col-sm-12 col-md-12 col-lg-12 nopadmar">
            <!-- I need to iterate over the 'ul' tags below and parse the text inside them,
                 but they won't show up in the .children list. -->
              <ul class="ul-resultado paginacao paginacao_numero_1" style="display: block;"></ul>
              <ul class="ul-resultado paginacao paginacao_numero_2" style="display: block;"></ul>
              <ul class="ul-resultado paginacao paginacao_numero_3" style="display: none;"></ul>
          </div>
        </div>
      </div>
    </div>
  </div>
</body>

</html>

I can reply with the contents inside the 'ul' tags if requested, but I thought it wouldn't be necessary for this particular question.

I'm using "lxml" to parse it, but I've already tried changing it to "html.parser", "html5lib" and "xml", all giving similar results.

So, is it the parser? Is it the library I used to download the web page? Did it not download this section? Or maybe a BeautifulSoup bug? I don't know.

3 Answers


As already mentioned in an answer by @facelessuser, the content is loaded dynamically with JavaScript.

The good news is that you can make the same AJAX request from Python and get the JSON response, which contains all the data you need. Here I am just printing out the price.

import json
from urllib.request import urlopen

# Same XHR endpoint the page itself calls; it returns JSON instead of HTML
page = urlopen("https://www.netimoveis.com/locacao/minas-gerais/belo-horizonte/bairros/santo-antonio/apartamento/?pagina=1&busca=%7B%22valorMinimo%22%3Anull%2C%22valorMaximo%22%3Anull%2C%22quartos%22%3Anull%2C%22suites%22%3Anull%2C%22banhos%22%3Anull%2C%22vagas%22%3Anull%2C%22idadeMinima%22%3Anull%2C%22areaMinima%22%3Anull%2C%22areaMaxima%22%3Anull%2C%22bairros%22%3A%5B%22santo-antonio%22%5D%2C%22ordenar%22%3Anull%7D&outrasPags=true&quantidadeDeRegistro=20&first=false").read()
# The listings are stored under the 'lista' key
properties = json.loads(page)['lista']
for item in properties:
    print(item['valorLocacaoFormat'])

Output

R$ 1.490,00
R$ 2.300,00
R$ 1.480,00
R$ 1.600,00
R$ 1.700,00
R$ 2.100,00
R$ 1.600,00
...

Note: To find the AJAX URL that I am using, open the network tab in your browser's developer tools and go to the page URL. You can see the XHR request being made.
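The long query string above is easier to read once it is decoded: it is a handful of parameters, one of which (`busca`) is a URL-encoded JSON object. The following is a minimal sketch of rebuilding that URL with the standard library. The helper name `build_search_url` is hypothetical, and whether the endpoint accepts other `pagina` values is an assumption that has not been verified here.

```python
import json
from urllib.parse import urlencode

# Path taken from the request URL in the answer above
BASE_URL = ("https://www.netimoveis.com/locacao/minas-gerais/belo-horizonte/"
            "bairros/santo-antonio/apartamento/")


def build_search_url(page):
    """Rebuild the XHR URL for a given page number (hypothetical helper)."""
    # Decoded content of the 'busca' parameter; None becomes JSON null
    filters = {
        "valorMinimo": None,
        "valorMaximo": None,
        "quartos": None,
        "suites": None,
        "banhos": None,
        "vagas": None,
        "idadeMinima": None,
        "areaMinima": None,
        "areaMaxima": None,
        "bairros": ["santo-antonio"],
        "ordenar": None,
    }
    query = urlencode({
        "pagina": page,
        # Compact separators reproduce the JSON without extra spaces
        "busca": json.dumps(filters, separators=(",", ":")),
        "outrasPags": "true",
        "quantidadeDeRegistro": 20,
        "first": "false",
    })
    return BASE_URL + "?" + query


print(build_search_url(1))
```

Calling `build_search_url(1)` reproduces the URL used in the snippet above, so the result can be passed to `urlopen` in the same way.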

[Screenshot: the request in the browser's network tab]

Bitto
  • Maybe a stupid question, but still: regarding your end note: when I open the network tab (Firefox), I see 12 `GET`s. How can you, in principle, tell which one contains the relevant URL? Thanks. – Jack Fleeting Feb 24 '19 at 11:32
  • @JackFleeting You can further isolate it by clicking on the XHR sub-tab under the network tab in Firefox. If you click on any of the requests, a side panel opens with a "Response" tab that shows the response. I have included a screenshot in the answer. The AJAX URL is in the "Headers" tab of the same side panel. – Bitto Feb 24 '19 at 13:46
  • Bitto Bennichan: Learned something new today; thanks! – Jack Fleeting Feb 24 '19 at 14:14

I think the ul content is loaded dynamically with JavaScript after the page loads. Running your script, and printing out div_lista_locacao, I get:

[<div class="col col-xs-12 col-sm-12 col-md-12 col-lg-12 nopadmar" id="lista-locacao">
</div>]

As you can see, there are no ul elements to select in that div. You may need to use something like Selenium to get the dynamic content and then select the uls once you have the full HTML. A plain HTTP request is not sufficient, because JavaScript must run to load the lists into the div element first.
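To make the difference concrete, here is a small, dependency-free sketch that lists the tags nested inside an element with a given id. It uses the standard library's `html.parser` rather than BeautifulSoup, and both HTML strings are hand-written samples (assumptions modelled on the output above and on the question's markup), not real downloads. The names `NestedTagCollector` and `tags_inside` are made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NestedTagCollector(HTMLParser):
    """Record the names of all tags nested inside the element with target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # 0 means we are outside the target element
        self.nested_tags = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.nested_tags.append(tag)
            if tag not in VOID_TAGS:
                self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


def tags_inside(html, element_id):
    parser = NestedTagCollector(element_id)
    parser.feed(html)
    parser.close()
    return parser.nested_tags


# Sample of what the server sends: the div is empty
server_html = '<div id="lista-locacao" class="col nopadmar">\n</div>'
# Sample of what the browser shows after JavaScript has run
browser_html = ('<div id="lista-locacao"><ul class="ul-resultado"></ul>'
                '<ul class="ul-resultado"></ul></div>')

print(tags_inside(server_html, "lista-locacao"))
print(tags_inside(browser_html, "lista-locacao"))
```

Feeding the decoded response from `urlopen` into `tags_inside` in place of the sample strings would show whether any `ul` tags are present before JavaScript runs.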

facelessuser

The content is loaded dynamically with JavaScript, as @facelessuser and @Bitto said. If you go to the page, click view-source, and search for your id, you won't see any ul.

In this case, Selenium is a more powerful way to get elements that are rendered by JavaScript.

If you haven't installed the driver yet, you can get it from http://chromedriver.chromium.org/getting-started

Full code:

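from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
# Selenium 4 style: the chromedriver path is passed through a Service object
service = Service(executable_path=r'/Users/omertekbiyik/PycharmProjects/bitirme/chromedriver')
driver = webdriver.Chrome(service=service, options=options)
driver.get('https://www.netimoveis.com/locacao/minas-gerais/belo-horizonte/bairros/santo-antonio/apartamento/#1/')

# The browser runs the page's JavaScript, so the div now has its ul children
x = driver.find_elements(By.CSS_SELECTOR, "div[id='lista-locacao']")

for a in x:
    print(a.text)

# End the browser session
driver.quit()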

Output:

Apartamento para alugar de 3 quartos
Santo Antônio - Rua Engenheiro Zoroastro Torres, 149
More na região nobre do Santo Antônio! Local tranquilo com comércio próximo, esquina com Av. Prudente de Moraes. Prédio familiar com 08 andares e 02 elevadores, 02 aptos por andar,
3
quartos
2
suítes
3
banhos
2
vagas
R$ 1.490
condomínio: R$ 1100
código: 724362
96 m²
Apartamento para alugar de 3 quartos
Santo Antônio - Rua Paulo Afonso, 587
ALUGUE SEM FIADOR pelo melhor preço: 1 + 11 parcelas de R$ 292,50**Mediante aprovação de ficha cadastral do locatário pela seguradoraO seu próximo lar na melhor localização do bair
3
quartos
1
suíte
2
banhos
2
vagas
R$ 2.300
condomínio: R$ 1452
código: 677116
175 m²
... (and so on, through all the ul tags)

You can also print the full HTML inside the div like this:

for a in x:
    print(a.get_attribute('innerHTML'))
Omer Tekbiyik