
Scope:

I am developing a C# application to simulate queries against this site. I am quite familiar with simulating web requests to reproduce the steps a human would take, but in code.

If you want to try it yourself, type this number into the CNPJ box: 08775724000119, enter the captcha, and click Confirmar.

I've already dealt with the captcha, so it's not a problem anymore.

Problem:

As soon as I execute the POST request for a "CNPJ", an exception is thrown:

The remote server returned an error: (403) Forbidden.

Fiddler Debugger Output:

Link for Fiddler Download

This is the request generated by my browser, not by my code

POST https://www.sefaz.rr.gov.br/sintegra/servlet/hwsintco HTTP/1.1
Host: www.sefaz.rr.gov.br
Connection: keep-alive
Content-Length: 208
Cache-Control: max-age=0
Origin: https://www.sefaz.rr.gov.br
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Referer: https://www.sefaz.rr.gov.br/sintegra/servlet/hwsintco
Accept-Encoding: gzip,deflate,sdch
Accept-Language: pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Cookie: GX_SESSION_ID=gGUYxyut5XRAijm0Fx9ou7WnXbVGuUYoYTIKtnDydVM%3D; JSESSIONID=OVuuMFCgQv9k2b3fGyHjSZ9a.undefined


//    PostData : 
_EventName=E%27CONFIRMAR%27.&_EventGridId=&_EventRowId=&_MSG=&_CONINSEST=&_CONINSESTG=08775724000119&cfield=rice&_VALIDATIONRESULT=1&BUTTON1=Confirmar&sCallerURL=http%3A%2F%2Fwww.sintegra.gov.br%2Fnew_bv.html

Code samples and References used:

I'm using a self-developed library to wrap the POST and GET requests.

The request object has the same parameters (Host, Origin, Referer, Cookies, ...) as the one issued by the browser (logged by Fiddler above).

I've also set the certificate validation callback on ServicePointManager:

ServicePointManager.ServerCertificateValidationCallback = 
    new RemoteCertificateValidationCallback (delegate { return true; });

After all that configuration, I am still getting the Forbidden exception.

Here is how I simulate the request, and where the exception is thrown:

        try
        {
            this.Referer = Consts.REFERER;

            // PARAMETERS: URL, POST DATA, ThrownException (bool)
            response = Post (Consts.QUERYURL, postData, true);
        }
        catch (Exception ex)
        {
            string s = ex.Message;
        }

Thanks in advance for any help / solution to my problem

Update 1:

I was missing the request for the homepage, which generates the cookies (thanks @W0lf for pointing that out).

Now there's another weird thing: Fiddler is not showing my cookies on the request, but here they are: CookieJar

Marcello Grechi Lins
  • what a lousy CAPTCHA system! – Cristian Lupascu Jan 03 '13 at 13:22
  • Can you post all the code that you use to build the request? Is the Fiddler data above for a request generated by your program, or from the browser? – Cristian Lupascu Jan 03 '13 at 13:35
  • @W0lf The request issued by the browser. I guess there's a cookie missing, but I'm not sure where it was generated. Double-checking it now; if it does not help, I will post the code – Marcello Grechi Lins Jan 03 '13 at 13:40
  • @W0lf Captcha validation is done via JavaScript, pff. Even if it was client side, each captcha is generated by an index from 1 to 191, which is also lousy. This is not a dynamically generated captcha. https://www.sefaz.rr.gov.br/sintegra/images/images/60.jpg is ALWAYS the word "debt". – Marcello Grechi Lins Jan 03 '13 at 17:43
  • The part of it not being dynamic is surprisingly not the most stupid thing about it. The worst part is that the validation is done in JS. If you basically pass `cfield=much&_VALIDATIONRESULT=1` to every request you should be fine. – Cristian Lupascu Jan 04 '13 at 07:59

2 Answers


I made a successful request using the browser and recorded it in Fiddler.

The only things that differ from your request are:

  • my browser sent no value for the sCallerURL parameter (I have sCallerURL= instead of sCallerURL=http%3A%2F%2Fwww....)
  • the session ids are different (obviously)
  • my Accept-Language values are different (I'm pretty sure this is not important)
  • the Content-Length is different (obviously)

Update

OK, I thought the Fiddler trace was from your application. In case you are not setting cookies on your request, do this:

  • before posting data, do a GET request to https://www.sefaz.rr.gov.br/sintegra/servlet/hwsintco. If you examine the response, you'll notice the website sends two session cookies.
  • when you do the POST request, make sure to attach the cookies you got at the previous step

If you don't know how to store the cookies and use them in the other request, take a look here.

Update 2

The problems

OK, I managed to reproduce the 403, figured out what caused it, and found a fix.

What happens in the POST request is that:

  • the server responds with status 302 (temporary redirect) and the redirect location
  • the browser follows the redirect (basically does a GET request to that location), also sending the two cookies.

.NET's HttpWebRequest attempts to do this redirect seamlessly, but in this case there are two issues (that I would consider bugs in the .NET implementation):

  1. the GET request after the POST (the redirect) carries the same Content-Type as the POST request (application/x-www-form-urlencoded). For GET requests this shouldn't be specified

  2. cookie handling issue (the most important issue) - The website sends two cookies: GX_SESSION_ID and JSESSIONID. The second has a path specified (/sintegra), while the first does not.

Here's the difference: the browser assigns a default path of / (root) to the first cookie, while .NET assigns it the request URL path (/sintegra/servlet/hwsintco).

Due to this, the last GET request (after redirect) to /sintegra/servlet/hwsintpe... does not get the first cookie passed in, as its path does not correspond.
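The effect of the cookie path can be checked without touching the network. The following is a minimal sketch of a path-matching rule in the style of RFC 6265 (section 5.1.4), not .NET's actual implementation. `CookiePathDemo` and `PathMatches` are hypothetical names introduced here, and `/sintegra/servlet/hwsintpe` is only a stand-in for the redirect target, whose full path is elided above.

```csharp
using System;

public static class CookiePathDemo
{
    // Returns true when a cookie scoped to cookiePath would be sent
    // with a request for requestPath (RFC 6265 style path-match).
    public static bool PathMatches(string requestPath, string cookiePath)
    {
        // Identical paths always match.
        if (requestPath == cookiePath)
            return true;

        // Otherwise the cookie path must be a prefix of the request path...
        if (!requestPath.StartsWith(cookiePath, StringComparison.Ordinal))
            return false;

        // ...and the prefix must end on a path-segment boundary.
        if (cookiePath.EndsWith("/", StringComparison.Ordinal))
            return true;

        return requestPath[cookiePath.Length] == '/';
    }
}
```

Under this rule, `PathMatches("/sintegra/servlet/hwsintpe", "/sintegra/servlet/hwsintco")` is false, while the same request path matches both `/` and `/sintegra` (the path of JSESSIONID). That is consistent with only GX_SESSION_ID going missing on the final GET.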

The fixes

  • For the redirect problem (GET with content-type), the fix is to do the redirect manually, instead of relying on .NET for this.

To do this, tell it to not follow redirects:

postRequest.AllowAutoRedirect = false;

and then read the redirect location from the POST response and manually do a GET request on it.
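A minimal sketch of the "read the redirect location" step, assuming the Location header may be either absolute or relative to the POST URL (the trace above does not show which). `RedirectHelper` and `ResolveRedirect` are hypothetical names. The helper only computes the target URI; issuing the follow-up GET is left to the caller.

```csharp
using System;

public static class RedirectHelper
{
    // Turns the value of a Location header into an absolute URI,
    // resolving it against the URI of the request that was redirected.
    public static Uri ResolveRedirect(Uri requestUri, string location)
    {
        if (string.IsNullOrEmpty(location))
            throw new InvalidOperationException("The response carried no Location header.");

        // An absolute location is returned unchanged; a relative one
        // is combined with requestUri.
        return new Uri(requestUri, location);
    }
}
```

With auto-redirect disabled, this could be called as `RedirectHelper.ResolveRedirect(postRequest.RequestUri, response.Headers[HttpResponseHeader.Location])`, and the result passed to `WebRequest.Create` for the manual GET.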

  • For the cookie problem, the fix I found was to take the misplaced cookie from the CookieContainer, set its path correctly, and add it back to the container in the correct location.

This is the code to do it:

private void FixMisplacedCookie(CookieContainer cookieContainer)
{
    // assumes the first cookie returned is GX_SESSION_ID, the one sent without a path
    var misplacedCookie = cookieContainer.GetCookies(new Uri(Url))[0];

    misplacedCookie.Path = "/"; // instead of "/sintegra/servlet/hwsintco"

    // place the cookie in the right place...
    cookieContainer.SetCookies(
        new Uri("https://www.sefaz.rr.gov.br/"),
        misplacedCookie.ToString());
}

Here's all the code to make it work:

using System;
using System.IO;
using System.Net;
using System.Text;

namespace XYZ
{
    public class Crawler
    {

        const string Url = "https://www.sefaz.rr.gov.br/sintegra/servlet/hwsintco";

        public void Crawl()
        {
            var cookieContainer = new CookieContainer();

            /* initial GET Request */
            var getRequest = (HttpWebRequest)WebRequest.Create(Url);
            getRequest.CookieContainer = cookieContainer;
            ReadResponse(getRequest); // nothing to do with this, because captcha is f#@%ing dumb :)

            /* POST Request */
            var postRequest = (HttpWebRequest)WebRequest.Create(Url);

            postRequest.AllowAutoRedirect = false; // we'll do the redirect manually; .NET does it badly
            postRequest.CookieContainer = cookieContainer;
            postRequest.Method = "POST";
            postRequest.ContentType = "application/x-www-form-urlencoded";

            var postParameters =
                "_EventName=E%27CONFIRMAR%27.&_EventGridId=&_EventRowId=&_MSG=&_CONINSEST=&" +
                "_CONINSESTG=08775724000119&cfield=much&_VALIDATIONRESULT=1&BUTTON1=Confirmar&" +
                "sCallerURL=";

            var bytes = Encoding.UTF8.GetBytes(postParameters);

            postRequest.ContentLength = bytes.Length;

            using (var requestStream = postRequest.GetRequestStream())
                requestStream.Write(bytes, 0, bytes.Length);

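            // With auto-redirect disabled, the 302 comes back as a normal response.
            // Read the Location header before the response is disposed; the body is not needed.
            string redirectLocation;
            using (var webResponse = postRequest.GetResponse())
                redirectLocation = webResponse.Headers[HttpResponseHeader.Location];

            // Resolve against the POST URL so a relative Location also works.
            var finalGetRequest = (HttpWebRequest)WebRequest.Create(new Uri(new Uri(Url), redirectLocation));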


            /* Apply fix for the cookie */
            FixMisplacedCookie(cookieContainer);

            /* do the final request using the correct cookies. */
            finalGetRequest.CookieContainer = cookieContainer;

            var responseText = ReadResponse(finalGetRequest);

            Console.WriteLine(responseText); // Hooray!
        }

        private static string ReadResponse(HttpWebRequest getRequest)
        {
            using (var responseStream = getRequest.GetResponse().GetResponseStream())
            using (var sr = new StreamReader(responseStream, Encoding.UTF8))
            {
                return sr.ReadToEnd();
            }
        }

        private void FixMisplacedCookie(CookieContainer cookieContainer)
        {
            // assumes the first cookie returned is GX_SESSION_ID, the one sent without a path
            var misplacedCookie = cookieContainer.GetCookies(new Uri(Url))[0];

            misplacedCookie.Path = "/"; // instead of "/sintegra/servlet/hwsintco"

            // place the cookie in the right place...
            cookieContainer.SetCookies(
                new Uri("https://www.sefaz.rr.gov.br/"),
                misplacedCookie.ToString());
        }
    }
}
Cristian Lupascu
  • I'm using a CookieJar in my requests library that manages to save all cookies. I've just executed the request for the homepage, as you suggested; still no luck, though the cookies are there. – Marcello Grechi Lins Jan 03 '13 at 13:50
  • The cookies are found in code, but Fiddler is not showing them. Fiddler shows no cookies on the requests issued by my code – Marcello Grechi Lins Jan 03 '13 at 13:54
  • @MarcelloGrechiLins if Fiddler does not show the cookies, it means that they are probably not sent. Double-check your code to make sure you attach them to the second request. – Cristian Lupascu Jan 03 '13 at 13:56
  • @MarcelloGrechiLins it was a tricky problem, but I think I solved it. Please see the update to my answer. – Cristian Lupascu Jan 03 '13 at 16:28
  • Thanks. That solved my problem. Would you mind telling me the steps you used to troubleshoot it? – Marcello Grechi Lins Jan 03 '13 at 16:57
  • I just used Fiddler to compare the requests done by browser vs the ones done by the program. After noticing the differences, I came up with the fixes I described above. – Cristian Lupascu Jan 03 '13 at 17:02

Sometimes HttpWebRequest needs proxy initialization:

request.Proxy = new WebProxy(); // in my case it doesn't need parameters, but you can set it to your proxy address

user2095816