
I am scraping a website built on WebSphere.

I see that whenever a user logs in, the browser hits four URLs before reaching the home page.

The third URL contains an encrypted-looking value like this:

 L0lDU0NTSUpKZ2tLQ2xFS0NXXXXXXXXXXXXXXXXXXX..XXXXXXXXXvZD1vbkxvYWQ!

The full URL looks like this:

   http://example.com/escares/wps/myportal/!ut/p/c1/XXXXXXXXXX/dl2/d1/L0lDU0NTSUpKZ2tLQ2xFS0NXXXXXXXXXXXXXXXXXXX..XXXXXXXXXvZD1vbkxvYWQ!

The problem is that this encrypted value changes with every login.

Is there an algorithm in WebSphere that generates this kind of URL? Or is there any way I can replicate this encrypted value?

Has anyone done crawling/scraping on a WebSphere site?

2 Answers


wps/myportal suggests a WebSphere Portal login. The 'encrypted' URI you're seeing is most likely a hash used to maintain the user's login session.

The best way to replicate this is to supply your web scraping program with a username and password to access the portal section of the website so it can POST a login while scraping. The website itself will generate the session info. You will need to instruct your scraping application to follow any dynamic URLs that are generated. Usually this is done by following any URLs in the HTML supplied by the server after logging in.
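The "follow any URLs in the HTML supplied by the server" step can be sketched with nothing more than Python's standard library (the login POST itself is omitted, and the sample page and host name below are hypothetical; a real scraper would feed in the body of the post-login response):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html, base_url):
    """Return every link in the page, resolved against the page's URL."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

# Hypothetical post-login body: the session-specific portal URL appears as a link.
html = '<a href="/escares/wps/myportal/!ut/p/c1/SOMEHASH/dl2/d1/TOKEN!">Home</a>'
print(extract_links(html, "http://example.com/"))
# → ['http://example.com/escares/wps/myportal/!ut/p/c1/SOMEHASH/dl2/d1/TOKEN!']
```

The point is that the per-login hash never needs to be computed by the scraper; it only needs to be read out of whatever HTML the server sends back and followed.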

As an example, scrapy can be configured to follow any URLs in target pages when scraping:

https://doc.scrapy.org/en/latest/intro/tutorial.html#following-links

Although you are using your own solution to scrape the contents of the portal for a logged-in user, hopefully the logic and progression illustrated in my examples help steer you in the right direction for resolving what appears to be a session/cookie storage issue.

  • Actually I do this via HTTP requests. What I do is: I log in to the website with a form-urlencoded request, which generates a token and a session. But the website goes through 2-3 URLs before reaching the home page, and those URLs contain the hashed IDs. I tried to figure out whether they appear in the response, but they do not. Is there any other way? – Manish Gadhock Jul 04 '18 at 14:17
  • So there are 2-3 HTTP redirects before arriving at the home page? Is there any data at these URLs that needs to be captured? Typically these redirects are responsible for setting a cookie on the client side to keep track of the session. If you need to keep track of the session using your scraper, consider how scrapy can keep track of a session cookie for each spider that runs: https://stackoverflow.com/questions/4981440/scrapy-how-to-manage-cookies-sessions?noredirect=1&lq=1 – Chris Slothouber Jul 04 '18 at 14:41
  • Can you tell me what scraping software you are using? Or are you attempting to write your own? I am providing examples from a well documented web scraping framework called scrapy because it is very easy to use and has solved a lot of the common challenges like the one you are facing. – Chris Slothouber Jul 04 '18 at 14:44
  • I am attempting this on my own. Also, I do send requests to these 2-3 redirects, and all of the site's cookies get set by them. What I am not able to get is the URL with that hash, since it keeps changing for each new login. – Manish Gadhock Jul 04 '18 at 20:36
  • Hashes are by definition unique, and are what links the cookie stored on the client with the session information on the server. The WebSphere server handles creating the hash, so you'll need to capture it somehow in your code. Have you taken a look at the content of the new cookie when you log in and compared it with the URL? Each authenticated session is going to have a unique hash as the session identifier, and your scraping project will need to handle associating the session hash and/or session cookie so it may access the resources visible to an authenticated client. – Chris Slothouber Jul 04 '18 at 21:03
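The cookie bookkeeping discussed in these comments can be sketched with the standard library's Set-Cookie parser (the header values below are hypothetical, though JSESSIONID and LtpaToken2 are cookie names WebSphere commonly uses; in a real scraper each value would come from a redirect response in the login chain):

```python
from http.cookies import SimpleCookie

def merge_set_cookie(jar, set_cookie_header):
    """Fold one Set-Cookie header from a redirect response into our jar."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_header)
    for name, morsel in cookie.items():
        jar[name] = morsel.value
    return jar

def cookie_header(jar):
    """Build the Cookie header to send on the next request."""
    return "; ".join(f"{name}={value}" for name, value in sorted(jar.items()))

# Each redirect in the login chain may set another session cookie;
# all of them must be carried forward on subsequent requests.
jar = {}
merge_set_cookie(jar, "JSESSIONID=0000abc123; Path=/; HttpOnly")
merge_set_cookie(jar, "LtpaToken2=xyz789; Path=/")
print(cookie_header(jar))
# → JSESSIONID=0000abc123; LtpaToken2=xyz789
```

Libraries like scrapy or Python's requests.Session do exactly this accumulation automatically, which is why hand-rolled HTTP clients are where this problem usually surfaces.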

Chris has answered the question and it helped me; this line in particular:

Usually this is done by following any URLs in the HTML supplied by the server after logging in.

I just want to add an update for Node.js: the same thing can be achieved with the request module, using cheerio to parse the HTML that comes back in the response.

P.S.: In case anyone is wondering where I found that dynamic URL: it was in an HTML form that came back in the response, as the action attribute of that form.
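For reference, the "read the dynamic URL off the form's action" step described in the P.S. can be sketched in Python's standard library as well (this answer used Node's request and cheerio; the sample form below is hypothetical and the idea is the same):

```python
from html.parser import HTMLParser

class FormActionFinder(HTMLParser):
    """Remembers the action attribute of the first <form> it sees."""
    def __init__(self):
        super().__init__()
        self.action = None

    def handle_starttag(self, tag, attrs):
        if tag == "form" and self.action is None:
            self.action = dict(attrs).get("action")

def find_form_action(html):
    """Return the first form's action URL, or None if there is no form."""
    parser = FormActionFinder()
    parser.feed(html)
    return parser.action

# Hypothetical response body: the per-login URL is the form's action.
html = '<form method="post" action="/escares/wps/myportal/!ut/p/c1/HASH!"><input type="hidden"></form>'
print(find_form_action(html))
# → /escares/wps/myportal/!ut/p/c1/HASH!
```

The scraper then submits its next request to whatever URL this returns, so the per-login hash is handled without ever being replicated client-side.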