What I'm doing:
I'm developing a multithreaded web scraper.
I need to submit a form before extracting the data from the page, so the flow is:
1. GET request to example.com/path/doc.jsp (my data).
2. Check whether the confirmation form is present in the page source. If it is, continue to step 3 (my data is not present yet; the form must be submitted first); otherwise return (there is no form to submit and my data is already there).
3. GET request to example.com/path/sub/other.jsp (retrieves a required key value).
4. POST request to example.com/path/submit.jsp (sends the values).
5. Check the response to the POST request: if it is OK, go to step 6; otherwise go back to step 1.
6. GET request to example.com/path/doc.jsp (my data again; now that the form has been submitted, the data will be present).
Everything works fine, except when the response to the POST request (step 4) tells me to go back to step 1.
The Problem:
One of the form values has to be extracted from the cookies, so I use the GetCookies()
function. But, as I said, if the response tells me to go back to step 1, every request (both GET and POST) after that point has missing cookies (and strange ones added). See the image below:
- The first call is the GET request to doc.jsp, where my data is.
- The second call is the GET request to other.jsp, since the confirmation form was present in the doc.jsp source code.
- The third call is the POST request that submits all the values.
- The fourth call is the GET request to doc.jsp again, since the response to the form submission (the third call) told me to repeat the process. Basically, calls 4 to 6 are the same as calls 1 to 3, but with the cookies broken.
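Since the image may not be visible, this is roughly how I log the container state before each call. Note that DumpCookies is a hypothetical debug helper I wrote just for inspection; it is not part of the scraper itself:

```csharp
using System;
using System.Net;

// Hypothetical debug helper (not part of the scraper): prints the cookies
// that a CookieContainer would attach to a request for a given URL.
static void DumpCookies(CookieContainer container, string url)
{
    Console.WriteLine("Cookies sent to " + url + ":");
    foreach (Cookie c in container.GetCookies(new Uri(url)))
    {
        Console.WriteLine("  " + c.Name + "=" + c.Value +
            " (domain: " + c.Domain + ", path: " + c.Path + ")");
    }
}
```

Calling this before calls 1 to 3 shows JSESSIONID as expected; before calls 4 to 6 it comes back empty or with the strange extra cookies.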
My Code:
public class CWeb : IDisposable
{
    private WebClientEx _wc;
    private string _originalUrl;

    public CWeb()
    {
        _wc = new WebClientEx(new CookieContainer());
    }

    public string downloadPage(string url)
    {
        _originalUrl = url;
        string pgSrc = "error";
        int tries = 0;
        while (tries < 3 && pgSrc == "error")
        {
            try
            {
                pgSrc = _wc.DownloadString(url);
            }
            catch (Exception err)
            {
                tries += 1;
                pgSrc = "error";
                ...
            }
        }
        if (needSubmit(pgSrc)) // needSubmit just performs an IndexOf on pgSrc
        {
            do
            {
                pgSrc = sendForm(pgSrc);
            } while (needSubmit(pgSrc));
        }
        return WebUtility.HtmlDecode(pgSrc);
    }
    public string sendForm(string pageSource)
    {
        // 1- Get cookie value
        string cookie = _wc.CookieContainer.GetCookies(new Uri(_originalUrl))["JSESSIONID"].Value;

        // 2- Get hidden values from the pageSource parameter
        // skipped here: no web request involved, only some HTML parsing
        // with Html Agility Pack
        ...

        // 3- Get key value
        string tmpStr = _wc.DownloadString("http://example.com/path/sub/other.jsp");
        ... more html parsing ...

        // 4- Build form
        NameValueCollection nvc = new NameValueCollection();
        nvc["param1"] = cookie;
        nvc["param2"] = key;
        ...

        // 5- Send
        _wc.UploadValues("http://example.com/path/submit.jsp", nvc);

        // 6- Return
        return _wc.DownloadString(_originalUrl);
    }

    public void Dispose()
    {
        _wc.Dispose();
    }
}
Main Program:
static void Main(string[] args)
{
    // Load tons of 'doc' URLs from the database...
    List<string> urls = new List<string>();
    ...

    Parallel.ForEach(urls, (url) =>
    {
        using (CWeb crawler = new CWeb())
        {
            string pageData = crawler.downloadPage(url);
            ... parse html data here ...
        }
    });
}
My Environment:
- Using Visual Studio Professional 2013.
- Target Framework is .NET Framework 4.5.
- Platform x86 (debug).
- WebClientEx is an extended version of WebClient that supports cookies; get it here: PasteBin. I tried to implement BugFix_CookieDomain() (from this question), but even with that fix the problem still occurs. All my URLs include the http:// prefix.
- I used Fiddler to inspect the request details.
- English is not my native language... '-'
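For reference, the BugFix_CookieDomain() I tried is the commonly posted reflection-based workaround; the sketch below is my approximation of it, not the exact code from the linked question. It duplicates each entry in the container's private domain table that is keyed with a leading dot (".example.com") under a dot-less key, so that GetCookies() can match it against a plain host name:

```csharp
using System;
using System.Collections;
using System.Net;
using System.Reflection;

// Approximate sketch of the reflection-based CookieContainer workaround.
// m_domainTable is a private field, so this relies on implementation details
// of the .NET Framework and may break on other runtimes.
static void BugFix_CookieDomain(CookieContainer cookieContainer)
{
    Type containerType = typeof(CookieContainer);
    Hashtable table = (Hashtable)containerType.InvokeMember("m_domainTable",
        BindingFlags.NonPublic | BindingFlags.GetField | BindingFlags.Instance,
        null, cookieContainer, new object[] { });
    ArrayList keys = new ArrayList(table.Keys);
    foreach (string key in keys)
    {
        // Duplicate ".example.com" entries under "example.com" so that
        // GetCookies(new Uri("http://example.com/...")) can find them.
        if (key.StartsWith("."))
        {
            string dotlessKey = key.Substring(1);
            if (!table.ContainsKey(dotlessKey))
                table[dotlessKey] = table[key];
        }
    }
}
```

I call this right before every GetCookies() lookup, but calls 4 to 6 still come out with the wrong cookies.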