0

I'm working on my first web crawler, and I'm trying to get data about telephone numbers in Mexico. The website that provides the data is: site, and it works with XHR requests. I have this code so far:

from requests import Request, Session
import xml.etree.ElementTree as ET
import requests
import lxml.etree as etree

url = 'https://sns.ift.org.mx:8081/sns-frontend/consulta-numeracion/numeracion-geografica.xhtml'

s = Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Content-Type': 'text/html; charset=UTF-8',
}

str1 = s.post(url, headers=headers) #Loading the page
xhtml=str1.text.encode('utf-8')

#Saving the first response, to get the ViewState
text_file = open("loaded.txt", "w")
text_file.write(xhtml)
text_file.close()
x = ET.fromstring(xhtml)

namespace = "{http://www.w3.org/1999/xhtml}"
path = './/*[@id="javax.faces.ViewState"]'

e = x.findall(path.format(namespace))
for i in e:
    VS = i.attrib['value'] #ViewState

print VS #ViewState

At this point I have the ViewState of the page. Next I send a new POST with the form data, the number I want to look up, and the ViewState.

data = {
    "javax.faces.partial.ajax": "true",
    "javax.faces.source": "FORM_myform:BTN_publicSearch",
    "javax.faces.partial.execute": "@all",
    "javax.faces.partial.render": "FORM_myform:P_containerConsulta+FORM_myform:P_containerpoblaciones+FORM_myform:P_containernumeracion+FORM_myform:P_containerinfo+FORM_myform:P_containerLocal+FORM_myform:P_containerDesplegable",
    "FORM_myform:BTN_publicSearch": "FORM_myform:BTN_publicSearch",
    "FORM_myform": "FORM_myform",
    "FORM_myform:TXT_NationalNumber": "6564384757",
    "javax.faces.ViewState=": VS #ViewState
}

req = s.post(url, data=data, headers=headers)
#Saving the new response, this is supposed to bring the results
text_file = open("Output.txt", "w")
text_file.write(req.text.encode('utf-8'))
text_file.close()

The problem is that the response I get is the full HTML of the page without the information, and I noticed that it comes with a new ViewState. I believe that's why the lookup is not being performed. Also, I don't want to use Selenium because the server has no graphical interface, and I need to look up a lot of numbers daily.

UPDATE: I believe that the problem relies on JSF. I need to know how to handle the data and the JSF values.

Neto A
  • @Neto A, it would be better if you could provide a search input that is capable of producing results. – SIM Nov 22 '17 at 09:36
  • @Shahin I'm sorry for not being clear. In the textbox "Numero Nacional", enter the number "6564384757". The info that I need is located at the bottom of the first table: "Proveedor de telefonia que atiende el numero". In this case the value is "AXTEL". – Neto A Nov 22 '17 at 16:05
  • @Neto A, even after putting the number in the right box, the search button is still grayed out, so I can't use it. See the link https://www.dropbox.com/s/y9zfzpsdao9kup5/Untitled.jpg?dl=0 – SIM Nov 22 '17 at 16:26
  • @Shahin I've noticed that the button works with a POST on every keypress, so copy-paste doesn't work. Try to delete the last 2 digits and then type them yourself one by one. – Neto A Nov 22 '17 at 16:37
  • The problem does not 'rely' on JSF; JSF is working fine. You first need to do a GET, retrieve the ViewState field, and send it along in subsequent POSTs. See https://stackoverflow.com/questions/12175763/how-to-programmatically-send-post-request-to-jsf-page-without-using-html-form – Kukeltje Nov 23 '17 at 19:03
  • I do, but when I send the POST with the ViewState, the page that comes back has a different ViewState. So I'm trying the solution proposed in this thread: https://stackoverflow.com/questions/8623870/how-can-i-programmatically-upload-a-file-to-a-website/ – Neto A Nov 23 '17 at 19:32
  • @BalusC I don't think this is a duplicate, because this question is about Python. I have not received an answer to my question in Python yet. – Neto A Nov 24 '17 at 19:14
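
On the changing ViewState discussed in the comments above: a JSF partial/ajax response is normally an XML document shaped like partial-response > changes > update, and the refreshed ViewState usually arrives as one of those update elements, next to the re-rendered HTML fragments. A minimal sketch of reading such a response, assuming that layout; the sample XML and the function names are illustrative and not captured from the site:

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; a real response from the site will differ.
SAMPLE = (
    "<partial-response><changes>"
    '<update id="FORM_myform:P_containerinfo">'
    "<![CDATA[<span>AXTEL</span>]]></update>"
    '<update id="javax.faces.ViewState"><![CDATA[-123:456]]></update>'
    "</changes></partial-response>"
)


def parse_partial_response(xml_text):
    """Map each <update> id to the text (CDATA) it carries."""
    root = ET.fromstring(xml_text)
    return {u.get("id"): (u.text or "") for u in root.iter("update")}


def find_view_state(updates):
    """Return the refreshed ViewState, or None if the response has none.

    The id is matched by substring because some JSF versions prefix it
    (an assumption about the server, not something the thread confirms).
    """
    for update_id, value in updates.items():
        if update_id and "javax.faces.ViewState" in update_id:
            return value
    return None
```

Calling parse_partial_response(SAMPLE) gives a dict of HTML fragments keyed by component id, and find_view_state on that dict gives the value to send with the next POST. If the response is a full HTML page instead of this XML, the server most likely did not treat the request as a partial/ajax postback.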

2 Answers

0

To use requests to get data from a website, you first need this:

r = requests.get(url)

After that, I would print the text of the response held in the 'r' variable, like so:

print(r.text)

Then I would use a loop, treat the response text like an array (r.text[0]), and check all of it for anything that looks like a phone number. This is just one way to do what you are trying to do with your web crawler, and it doesn't use XML at all.

So in all, my code would look like this...

import requests

url = "myurl"
r = requests.get(url)
# The Response object itself has no length and cannot be indexed;
# the page body is available as a string in r.text.
text = r.text
counter = 0
length = len(text)
while counter != length:
    # A digit may be the start of a phone number.
    if text[counter].isdigit():
        data = text[counter:counter+12]
        print(data)
    counter += 1
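
The character-by-character scan above can also be written with a regular expression. A small sketch, assuming the numbers of interest are runs of exactly 10 digits like the examples in this thread (6564384757); find_phone_candidates is a hypothetical helper name:

```python
import re


def find_phone_candidates(text, digits=10):
    """Return every run of exactly `digits` consecutive digits in text."""
    # The lookarounds reject runs that are part of a longer digit sequence.
    pattern = r"(?<!\d)\d{%d}(?!\d)" % digits
    return re.findall(pattern, text)
```

With a response from requests, this would be called as find_phone_candidates(r.text). It only spots digit runs that are already in the page; it does not submit the search form.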
  • I'm supposed to enter the number, and the page gives me information about it, such as carrier, city, etc. That's why I use POST to send the number, and I actually receive XML; you can see it if the page is loaded: – Neto A Nov 21 '17 at 20:36
  • Well, is there even any info on the phone numbers that the site provides? And even if there is, it might not be info that they simply put on their site, so you might not be able to get it with a web crawler. –  Nov 21 '17 at 20:40
  • I have a list of phones, and I need to know the carrier company of each one of those, for example if I put my cellphone number there, it says that the carrier company is AT&T. The same for other companies. – Neto A Nov 21 '17 at 20:49
0

You should try with curl, something like

#!/bin/bash

CURL='/usr/bin/curl --connect-timeout 5 --max-time 50'
URL='https://sns.ift.org.mx:8081/sns-frontend/consulta-numeracion/numeracion-geografica.xhtml'
CURLARGS='-sD - -j'
NUM='6564193195'
c_FRONTAPPID="$($CURL $CURLARGS $URL)"
arr=($c_FRONTAPPID)

i=0
for var in "${arr[@]}"
do
  if [[ $var == *"FRONTAPPID="* ]]; then
        FRONTAPPID=$(echo "$var" | sed 's/.*FRONTAPPID=\(.*\);.*/\1/' | sed 's/!/"'"'"'!'"'"'"/g')
        #echo $var
        #echo $FRONTAPPID       
  fi
  if [[ $var == *"id=\"javax.faces.ViewState\""* ]]; then
        VIEWSTATE=$(echo ${arr[i+1]} | sed 's/.*"\(.*\)".*/\1/')
        #echo ${arr[i+1]}
        #echo $VIEWSTATE
  fi
  ((i++))
done

($CURL 'https://sns.ift.org.mx:8081/sns-frontend/consulta-numeracion/numeracion-geografica.xhtml' -X POST \
  -H 'Host: sns.ift.org.mx:8081' \
  -H 'Accept: application/xml, text/xml, */*; q=0.01' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0' \
  --compressed \
  -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
  -H 'Faces-Request: partial/ajax' \
  -H 'X-Requested-With: XMLHttpRequest' \
  -H 'Referer: https://sns.ift.org.mx:8081/sns-frontend/consulta-numeracion/numeracion-geografica.xhtml' \
  -H "Cookie: FRONTAPPID=$FRONTAPPID" \
  -H 'Connection: keep-alive' \
  --data "javax.faces.partial.ajax=true&javax.faces.source=FORM_myform:BTN_publicSearch&javax.faces.partial.execute=@all&javax.faces.partial.render=FORM_myform:P_containerConsulta+FORM_myform:P_containerpoblaciones+FORM_myform:P_containernumeracion+FORM_myform:P_containerinfo+FORM_myform:P_containerLocal+FORM_myform:P_containerDesplegable&FORM_myform:BTN_publicSearch=FORM_myform:BTN_publicSearch&FORM_myform=FORM_myform&FORM_myform:TXT_NationalNumber=$NUM&javax.faces.ViewState=$VIEWSTATE" )
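
Since the question asks for Python, the same flow can be sketched with requests. This is written for Python 3 and is untested against the live site. The helper names (extract_view_state, build_search_payload, lookup) are made up for illustration, and the session argument is assumed to be a requests.Session, which keeps the FRONTAPPID cookie between the two calls. Compared with the question's code, it loads the page with GET, uses the key javax.faces.ViewState without the trailing "=", separates the render ids with spaces (a literal "+" in a dict would be sent as %2B), and leaves Content-Type to requests so the body goes out as application/x-www-form-urlencoded.

```python
from html.parser import HTMLParser

URL = ("https://sns.ift.org.mx:8081/sns-frontend/"
       "consulta-numeracion/numeracion-geografica.xhtml")

# Headers taken from the curl command above that mark the POST as a JSF ajax call.
AJAX_HEADERS = {
    "Accept": "application/xml, text/xml, */*; q=0.01",
    "Faces-Request": "partial/ajax",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": URL,
}

RENDER_IDS = ["P_containerConsulta", "P_containerpoblaciones",
              "P_containernumeracion", "P_containerinfo",
              "P_containerLocal", "P_containerDesplegable"]


class _ViewStateParser(HTMLParser):
    """Remembers the value of the first javax.faces.ViewState input."""

    def __init__(self):
        super().__init__()
        self.view_state = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "input" and self.view_state is None
                and attrs.get("name") == "javax.faces.ViewState"):
            self.view_state = attrs.get("value")


def extract_view_state(html):
    """Return the ViewState found in an HTML page, or None."""
    parser = _ViewStateParser()
    parser.feed(html)
    return parser.view_state


def build_search_payload(number, view_state):
    """Form fields for the search, mirroring the --data string above."""
    return {
        "javax.faces.partial.ajax": "true",
        "javax.faces.source": "FORM_myform:BTN_publicSearch",
        "javax.faces.partial.execute": "@all",
        # Spaces here are encoded as "+" by requests, as in the curl body.
        "javax.faces.partial.render": " ".join(
            "FORM_myform:" + name for name in RENDER_IDS),
        "FORM_myform:BTN_publicSearch": "FORM_myform:BTN_publicSearch",
        "FORM_myform": "FORM_myform",
        "FORM_myform:TXT_NationalNumber": number,
        "javax.faces.ViewState": view_state,
    }


def lookup(session, number):
    """GET the page, then POST the search; returns the raw response text."""
    page = session.get(URL, timeout=(5, 50))
    page.raise_for_status()
    view_state = extract_view_state(page.text)
    if view_state is None:
        raise ValueError("no javax.faces.ViewState field found in the page")
    response = session.post(URL, data=build_search_payload(number, view_state),
                            headers=AJAX_HEADERS, timeout=(5, 50))
    response.raise_for_status()
    return response.text
```

Usage would be along the lines of lookup(requests.Session(), "6564384757") after import requests. The page may need extra steps that this sketch does not cover, for example the per-keypress POSTs mentioned in the comments on the question.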
MrCow