
I want to submit a form on the following web page: http://www.hzzo-net.hr/statos_OIB.htm

First, I use the 2captcha service to bypass the reCAPTCHA:

library(httr)  # POST(), GET(), content()
library(xml2)  # xml_text()

# parameters
api_key <- "c+++"
api_url <- "http://2captcha.com/in.php"
site_key <- "6Lc3SAgUAAAAALFnYxUbXlcJ8I9grvAPC6LFTKQs"
hzzo_url <- "http://www.hzzo-net.hr/statos_OIB.htm"

# submit the captcha to 2captcha (the parameters go in the query string)
req_url <- paste0(api_url, "?key=", api_key, "&method=userrecaptcha&googlekey=",
                  site_key, "&pageurl=", hzzo_url)
get_response <- POST(req_url)
hzzo_content <- content(get_response)
hzzo_content <- xml_text(hzzo_content)
# the numeric captcha ID is the only run of digits in the reply
captcha_id <- stringr::str_extract_all(hzzo_content[[1]], "\\d+")[[1]]

# poll 2captcha until the captcha is solved
Sys.sleep(16L)
captcha2_solve <- function(api_key, captcha_id){
  req_url <- paste0("http://2captcha.com/res.php?key=", api_key, "&action=get&id=", captcha_id)
  result <- GET(req_url)
  captcha_content <- content(result)
  hzzo_response <- xml_text(captcha_content)
  # the reply is either "CAPCHA_NOT_READY" or "OK|<token>"
  strsplit(hzzo_response, "\\|")
}
hzzo_response <- captcha2_solve(api_key, captcha_id)
# compare only the first element, so the condition always has length one
while(hzzo_response[[1]][1] == "CAPCHA_NOT_READY"){
  Sys.sleep(16L)
  hzzo_response <- captcha2_solve(api_key, captcha_id)
}
# keep only the token that follows "OK|"
hzzo_response <- hzzo_response[[1]][[2]]

After executing this code I get the response token that has to be entered into the reCAPTCHA textarea. This part works as expected. The response looks like this:

"03AHqfIOmo9BlCsCKyg-lDes4oW-U3PWgCtATRUqXFcEV032acDgGoOzrV8GiZNDzCF4TbCVLcY8HZ8hR1JqO11YdRExvgPDL0EUsjCZdI0rUm_LnBRRifyb66X7V6r4n8CIm1si3EKmw36XIcZK7MGrHSNWRrj2aGzWAYO8ceobViOICOhkYe9Bsfv64tUHWvHSqNIoesD_FHplbWG3B0eMag5341NyycjpNLxgNCwVzA8mhCU3oQUcloze-mIclFMZ7J_nbVhXdy8-qipF5ZFH4xIhSQXHH-TqxyaGQFjKdgLch7MuDEQVRcQGo1o4QuSEoeCTjlPn3Mah5vC8zKrnqfbMgiOVOIDJFGvFY4KOivbBzYTz5nW9g"

After that, I should submit the form. This is the part I can't get right.

I tried to add all arguments to POST:

parameters <- list(
  'upoib' = "93335620125", # example of number to enter
  'g-recaptcha-response' = hzzo_response
)

test <- POST(
  "http://www.hzzo-net.hr/statos_OIB.htm",
  body = toJSON(parameters), 
  encode = "json",
  verbose()
)

but this just gives me the initial page.

How can I submit the form once I have the reCAPTCHA response token? Is it possible to submit it with the httr package, or do I have to use Selenium? The code can be in R or Python (I just need the last part, the POST call).

Mislav
  • I suggest you take a look at the requests (http://docs.python-requests.org/en/master/) package in Python. – KelvinS Aug 12 '18 at 16:44
  • I am familiar with the httr package in R, which I suppose is similar to the requests package in Python. I have already used it in the code above and got a 200 status code, but the HTML source is not right. So I am not sure reading the documentation would help (and I would need something like 3 days to study it from beginning to end). – Mislav Aug 12 '18 at 16:51
  • I made a quick manual test, and it seems the POST request also expects `x` and `y` as parameters (I don't know what they mean) in addition to `g-recaptcha-response` and `upoib`. Could that be the problem? – KelvinS Aug 12 '18 at 17:01
  • When I used html_form from the rvest package in R, it showed 2 inputs with no name (that is, ""). One is a select and one is an image. The select can be from 1 to 4. – Mislav Aug 12 '18 at 17:04

1 Answer


If you inspect the HTML you'll see that the form's action is ../cgi-bin/statos_OIB.cgi, which means that the form is submitted to http://www.hzzo-net.hr/cgi-bin/statos_OIB.cgi, so you must use that URL.
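
As a quick check of how that relative action resolves, here is a minimal Python sketch using only the standard library. The variable names `page_url`, `action` and `submit_url` are just illustrative, and the action value is the one quoted above.

```python
from urllib.parse import urljoin

# URL of the page that contains the form
page_url = "http://www.hzzo-net.hr/statos_OIB.htm"
# value of the form's action attribute, as it appears in the HTML
action = "../cgi-bin/statos_OIB.cgi"

# resolve the relative action against the page URL, as a browser would
submit_url = urljoin(page_url, action)
print(submit_url)
```

The same approach should work for any form whose action is a relative path: resolve it against the URL of the page that serves the form, then post to the result.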

Also, after some testing I discovered that the server returns a 500 response, unless a valid Referer (http://www.hzzo-net.hr/statos_OIB.htm) is provided.

I'm not familiar with R, but I can provide an example in Python, using the requests library.

import requests

url = "http://www.hzzo-net.hr/cgi-bin/statos_OIB.cgi"
hzzo_response = 'your token'
data = {
    'upoib': '93335620125', 
    'g-recaptcha-response': hzzo_response
}
headers = {'referer': 'http://www.hzzo-net.hr/statos_OIB.htm'}
r = requests.post(url, data=data, headers=headers)
html = r.text

print(html)

After studying the httr docs I managed to 'translate' the above code into R. The code produces correct results if a valid token is supplied.

library(httr)

url <- "http://www.hzzo-net.hr/cgi-bin/statos_OIB.cgi"
hzzo_response <- "your token"
parameters <- list(
  'upoib' = "93335620125", 
  'g-recaptcha-response' = hzzo_response
)
test <- POST(
  url,
  body = parameters, 
  add_headers(Referer = 'http://www.hzzo-net.hr/statos_OIB.htm'),
  encode = "form",
  verbose()
)
html <- content(test, 'text', encoding = 'UTF-8')

print(html)
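
To make the difference between encode = "form" and the JSON body from the question visible, here is a small standard-library Python sketch that builds both bodies and prepares the request without sending it. The token is a placeholder, and the assumption that the CGI script only reads form-encoded fields is mine, based on the fact that the form-encoded version is the one that worked.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

url = "http://www.hzzo-net.hr/cgi-bin/statos_OIB.cgi"
data = {
    "upoib": "93335620125",
    "g-recaptcha-response": "your token",  # placeholder, not a real token
}

# body an HTML form sends (what encode = "form" or requests' data= produces)
form_body = urlencode(data)
# body the JSON-encoded attempt from the question would send instead
json_body = json.dumps(data)

print(form_body)
print(json_body)

# build (but do not send) the form-encoded POST with the Referer header
req = Request(
    url,
    data=form_body.encode("ascii"),
    headers={"Referer": "http://www.hzzo-net.hr/statos_OIB.htm"},
    method="POST",
)
print(req.get_method(), req.full_url)
print(req.get_header("Referer"))
```

Nothing is sent over the network here; passing `req` to `urllib.request.urlopen` would perform the actual submission.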
t.m.adam
  • Hi @t.m.adam, hope you are doing great. I've created a post [here](https://stackoverflow.com/questions/65403238/cant-find-the-right-way-to-grab-part-numbers-from-a-webpage-using-requests) which seems to be a bit tricky to solve, so I thought to let you know as you are very good at it. Thanks. – MITHU Dec 22 '20 at 06:43