I'm at a loss as to how to get this sample code working, and I was hoping someone could review it and assess my assumptions about what may be wrong.

Problem: I would like to use Matlab to access a webpage that is protected by a login screen. I am able to use wget and it works fine; however, wget does not run the ajax/javascript embedded within the page. Therefore, I have turned to the urlread2 function available from the Matlab File Exchange. All examples below are based on this function.

Example:

I am trying to log in to a financial website, but I get the same error when testing with other sites, so for my example I am going to use fitbit.com. To mimic the behaviour of a browser, I pass the following combined headers into urlread2 (I have split the code to make it easier to see what I'm doing):

value = 'www.fitbit.com'; % Host takes a bare hostname, not a full URL
header = http_createHeader('Host',value);
value = 'keep-alive';
header2 = http_createHeader('Connection',value);
value = '278';
header3 = http_createHeader('Content-Length',value);
value = 'max-age=0';
header4 = http_createHeader('Cache-Control',value);
value = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8';
header5 = http_createHeader('Accept',value);
value = 'https://www.fitbit.com';
header6 = http_createHeader('Origin',value);
value = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36';
header7 = http_createHeader('User-Agent',value);
value = 'application/x-www-form-urlencoded';
header8 = http_createHeader('Content-Type',value);
value = 'https://www.fitbit.com/login';
header9 = http_createHeader('Referer',value);
value = 'gzip, deflate';
header10 = http_createHeader('Accept-Encoding',value);
value = 'en-US,en;q=0.8';
header11 = http_createHeader('Accept-Language',value);
%Generate a combined header as required by urlread2
combined_header = [header header2 header3 header4 header5 header6 header7 header8 header9 header10 header11];

With the header information defined, I generate the query string required (this is for the post operation):

queryString = 'email=myemail&password=mypassword&login=Log+In';  
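One thing worth checking is that the hard-coded Content-Length of '278' matches the body actually being sent. A minimal sketch, assuming placeholder credentials (the `email` and `password` variables are hypothetical), that URL-encodes the fields with the built-in `urlencode` and derives the length from the resulting string:

```matlab
% Placeholder credentials (hypothetical values for illustration only)
email    = 'myemail';
password = 'mypassword';

% URL-encode each field so reserved characters such as '@' or '&' are escaped
queryString = ['email=' urlencode(email) ...
               '&password=' urlencode(password) ...
               '&login=Log+In'];

% After encoding the body is plain ASCII, so characters and bytes coincide
contentLength = num2str(numel(queryString));

% This value could then replace the hard-coded '278', e.g.
% header3 = http_createHeader('Content-Length', contentLength);
disp(contentLength)
```

Whether a mismatched Content-Length contributes to the error here is not established; this only removes one possible inconsistency from the request.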

Finally, bring it all together for the urlread2 function:

[output,extras] = urlread2('https://www.fitbit.com/login','post',queryString,combined_header);

The following response is embedded within the HTML:

'The owner of this website (www.fitbit.com) has banned your access based on your browser''s signature (2659bb18cf10354e-ua21).'

Possible problem 1:

It may well be that I'm passing in the headers incorrectly; however, when I mimic the headers via Firefox the page works correctly. Any advice on this would be greatly appreciated.

Possible problem 2:

I think the problem may be down to cookies, since neither urlread2 nor any other function in Matlab supports cookies. If this is the case, does anyone have any suggestions on how to tackle this?

dgoverde
Dan
  • Have you Googled your error message? [This question/answer](http://stackoverflow.com/q/24913699/2278029) seems to indicate that the issue is with your User Agent string. – horchler Jan 31 '16 at 19:04
  • I don't believe it is related to the User Agent; I have validated the agent settings in developer mode via Firefox. – Dan Feb 01 '16 at 11:36
  • Calling it an issue with the User Agent string is a bit of a simplistic interpretation of the linked question, which should give you other ideas about the problem. Fundamentally, CloudFlare is refusing to take the User Agent you're declaring at face value, which is correct because that is not the UA you're using. Depending on their exact strategy, you might be able to change your request so it does a more believable impersonation of Firefox, or get accepted declaring a different User Agent, or you might have no way of accessing the site without running their JavaScript-based check. – Will Feb 03 '16 at 10:24
  • @Dan - if you're using `urlread2` and trying to access a site that uses HTTPS, you might have to deal with the problem that I encountered myself. I just posted a [question/answer](http://stackoverflow.com/questions/35447683/matlab-how-to-get-urlread2-to-work-with-https/35447684#35447684) describing how I managed to get `urlread2` to work with HTTPS. – dgoverde Feb 17 '16 at 04:05

1 Answer

The problem isn't your User Agent. I was able to verify that by trying a handful of User Agent values that should have worked. Instead, the problem is what you described as Problem 2. In other words, CloudFlare requires your HTTP header to contain a valid cookie name/value pair.

This is the line of the urlread2 output that tells me that is the case:

<div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>

To see what cookies fitbit.com is using, add the View Cookies Add-on to Firefox. By my count, the login page sets 36 cookies, and my guess is that you will be barred entry if you're missing at least some of them. One thing you could do is take the cookie values from your browser and manually add them to your HTTP header as name/value pairs, but it would be better to let the website set your cookies in a PHP script. This Stack Overflow post describes how that would work: "How can I scrape website content in PHP from a website that requires a cookie login?" Not easy, but definitely not impossible. Let me know if you need any more help.
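For the manual route, here is a rough sketch of turning cookie strings into a single Cookie header value. The cookie names and values below are made up for illustration, and the input is assumed to be in the usual Set-Cookie form (name=value followed by attributes such as Path); only built-in MATLAB functions are used:

```matlab
% Hypothetical Set-Cookie values; real ones would come from the browser
% or from the server's response headers
setCookie = {'sessionid=abc123; Path=/; HttpOnly', ...
             'csrftoken=xyz789; Path=/; Secure'};

% Keep only the leading name=value part of each entry (text before the first ';')
pairs = cellfun(@(s) strtrim(strtok(s, ';')), setCookie, 'UniformOutput', false);

% A Cookie request header carries the pairs separated by '; '
cookieValue = strjoin(pairs, '; ');

% The result could be added alongside the other headers in the question, e.g.
% header12 = http_createHeader('Cookie', cookieValue);
disp(cookieValue)
```

The resulting header would be appended to `combined_header` before calling urlread2. Note that this does nothing about cookies set by JavaScript, so it may still not satisfy CloudFlare's check.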

Community
dgoverde