
What I'm doing:
I'm developing a multithreaded web scraper, that's all.
I need to submit a form before extracting the data from the page, so the flow is this:

  1. GET request to example.com/path/doc.jsp (my data).
  2. Check whether the confirmation form is present in the page source. If it is, continue to step 3 (my data is not there yet, so the form must be submitted first); otherwise return, since there is no form to submit and my data is already there.
  3. GET request to example.com/path/sub/other.jsp (necessary key value).
  4. POST request to example.com/path/submit.jsp (send values).
  5. Check the response to the POST request: if it is OK go to step 6, otherwise go back to step 1.
  6. GET request to example.com/path/doc.jsp (my data, again; since I submitted the form, my data will now be present).
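The steps above can be sketched as a bounded loop, so that a server that keeps rejecting the form cannot trap a worker forever. This is only a minimal sketch: `FormFlow`, `Run`, and the three delegates are hypothetical names (not part of the code in this question), and the network calls are passed in as delegates so the control flow can be read on its own.

```csharp
using System;

public static class FormFlow
{
    // Hypothetical helper, not from the original code.
    // getDoc      : steps 1 and 6, downloads doc.jsp
    // needsSubmit : step 2, true when the confirmation form is in the page
    // submit      : steps 3-4, fetches the key value and posts the form
    public static string Run(Func<string> getDoc,
                             Func<string, bool> needsSubmit,
                             Action<string> submit,
                             int maxRounds)
    {
        for (int round = 0; round < maxRounds; round++)
        {
            string page = getDoc();
            if (!needsSubmit(page))
                return page;   // no form in the page: the data is there

            // Whether the POST is accepted (step 6) or rejected (back to step 1),
            // the next thing to do is fetch doc.jsp again, so step 5 folds into
            // the needsSubmit check of the next iteration.
            submit(page);
        }
        return null;           // gave up after maxRounds attempts
    }
}
```

Returning `null` after `maxRounds` is just one possible choice; throwing or logging the URL would work as well.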

Everything works fine except when the response to the POST request (step 4) tells me to go back to step 1.

The Problem:
One of the form values has to be extracted from the cookies, so I use the GetCookies() function. But, as I said, if the response tells me to go back to step 1, ALL requests after that (both GET and POST) are missing cookies (and have weird ones added). See the image below:
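To compare what the container holds with what Fiddler shows on the wire, it may help to dump the cookies that apply to a URL before each request. A minimal sketch, assuming a plain `CookieContainer`; `CookieDebug` and `Describe` are hypothetical names, not part of WebClientEx:

```csharp
using System;
using System.Net;
using System.Text;

public static class CookieDebug
{
    // Hypothetical helper: lists every cookie the container would send to 'uri'.
    public static string Describe(CookieContainer jar, Uri uri)
    {
        var sb = new StringBuilder();
        foreach (Cookie c in jar.GetCookies(uri))
        {
            sb.AppendLine(string.Format(
                "{0}={1}; domain={2}; path={3}; expires={4}",
                c.Name, c.Value, c.Domain, c.Path, c.Expires));
        }
        return sb.ToString();
    }
}
```

Calling it with the doc.jsp, other.jsp, and submit.jsp URLs before each request should show at which step the cookie set changes.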

[Image: Cookie Error]
Image Explanation:

  • The first call is the GET request to doc.jsp, where my data is.
  • The second call is the other.jsp request, since the confirmation form is present in the doc.jsp source code.
  • The third call is when I submit all values.
  • The fourth call is the GET request to doc.jsp again, since the response to the form submission (the third call) told me to repeat the process. Basically, calls 4 to 6 are the same as calls 1 to 3, but with the cookies messed up.


My Code:

public class CWeb : IDisposable
{
    private WebClientEx _wc;
    private string _originalUrl;

    public CWeb()
    {
        _wc = new WebClientEx(new CookieContainer());
    }

    public string downloadPage(string url)
    {
        _originalUrl = url;
        string pgSrc = "error";
        int tries = 0;

        while (tries < 3 && pgSrc == "error")
        {
            try
            {
                pgSrc = _wc.DownloadString(url);
            }
            catch (Exception err)
            {
                tries += 1;
                pgSrc = "error";
                ...
            }
        }

        if (needSubmit(pgSrc)) // needSubmit just performs IndexOf on pgSrc
            do
            {
                pgSrc = sendForm(pgSrc);
            } while (needSubmit(pgSrc));

        return WebUtility.HtmlDecode(pgSrc);
    }

    public string sendForm(string pageSource)
    {
        // 1- Get Cookie Value
        string cookie = _wc.CookieContainer.GetCookies(new Uri(_originalUrl))["JSESSIONID"].Value;

        // 2- Get hidden values in pageSource parameter
        // skip this, since there's no web request here, only some html parsing
        // with Html Agility Pack
        ...

        // 3- Get key value
        string tmpStr = _wc.DownloadString("http://example.com/path/sub/other.jsp");
        ... more html parsing ...

        // 4- Build form
        NameValueCollection nvc = new NameValueCollection();
        nvc["param1"] = cookie;
        nvc["param2"] = key;
        ...

        // 5- Send
        _wc.UploadValues("http://example.com/path/submit.jsp", nvc);

        // 6- Return
        return _wc.DownloadString(_originalUrl);
    }

    public void Dispose()
    {
        _wc.Dispose();
    }
}
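One thing to watch in sendForm: `GetCookies(...)["JSESSIONID"]` returns null when no cookie of that name applies to the URL, so reading `.Value` directly throws a NullReferenceException exactly in the "missing cookies" case. A minimal sketch of a null-safe lookup; `CookieLookup` and `ValueOrNull` are hypothetical names:

```csharp
using System;
using System.Net;

public static class CookieLookup
{
    // Hypothetical helper: returns the cookie value, or null when no cookie
    // named 'name' applies to 'uri'.
    public static string ValueOrNull(CookieContainer jar, Uri uri, string name)
    {
        Cookie cookie = jar.GetCookies(uri)[name];
        return cookie == null ? null : cookie.Value;
    }
}
```

A null result can then be logged together with the URL, which makes it easier to see when the session cookie disappears.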


Main Program:

static void Main(string[] args)
{
    // Load tons of 'doc' url list from database...
    List<string> urls = new List<string>();
    ...

    Parallel.ForEach(urls, (url) =>
        {
            using (CWeb crawler = new CWeb())
            {
                string pageData = crawler.downloadPage(url);
                ... parse html data here ...
            }
        });
}


My Environment:

  • Using Visual Studio Professional 2013.
  • Target Framework is .NET Framework 4.5.
  • Platform x86 (debug).
  • WebClientEx is an extended version of WebClient that works with cookies. Get it here PasteBin. I tried to implement BugFix_CookieDomain() (from this question), but even with that fix the problem still occurs.
  • All my URLs include the http:// prefix.

  • Used Fiddler to see the requests information.

  • English is not my native language... '-'
Kiritonito

1 Answer


I use System.Net.WebRequest for something similar to what you are doing. When using HTTP (the HttpWebRequest subclass of WebRequest), it handles cookies through a property called CookieContainer. I have also noticed cookies being added to, and apparently removed from, the cookie container. My belief is that this is entirely controlled by the server side (the web app you are making requests to), which is capable of adding additional cookies.

Furthermore, cookies have an expiry date, a discard flag, and a domain, so if the expiry date elapses, the server sets the discard flag, or the domain changes, the list of applicable cookies could change.
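Besides domain and expiry, a cookie's path also decides which requests it is sent with. Here is a small sketch of that scoping with a plain CookieContainer. The cookie name `SUBKEY` and both values are made up for illustration, and whether the real server scopes its cookies like this is only an assumption:

```csharp
using System;
using System.Net;

public static class CookieScopeDemo
{
    // Builds a container with one site-wide cookie and one cookie
    // limited to /path/sub (names and values are illustrative only).
    public static CookieContainer BuildJar()
    {
        var jar = new CookieContainer();
        var site = new Uri("http://example.com/path/sub/other.jsp");
        jar.Add(site, new Cookie("JSESSIONID", "abc", "/"));
        jar.Add(site, new Cookie("SUBKEY", "xyz", "/path/sub"));
        return jar;
    }

    // Number of cookies the container would attach to a request for 'url'.
    public static int CountFor(CookieContainer jar, string url)
    {
        return jar.GetCookies(new Uri(url)).Count;
    }
}
```

Comparing `CountFor` for http://example.com/path/sub/other.jsp and http://example.com/path/doc.jsp shows that the path-limited cookie applies only to the first URL, so a different cookie list per URL does not by itself mean the container lost anything.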

Not sure if this is helpful, but I tried.

Jeremy