
I have been trying for a few days (using other answers on this site and MathWorks) to get around the crumb that Yahoo Finance adds at the end of a link to download a CSV file. For example, for a CSV with Nasdaq 100 data, a Chrome browser produces the link https://query1.finance.yahoo.com/v7/finance/download/%5ENDX?period1=496969200&period2=1519513200&interval=1d&events=history&crumb=dnhBC8SRS9G (by clicking the "Download Data" button on the corresponding Yahoo Finance page).
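(As an aside, the %5ENDX in that link is just the URL-encoded form of the ^NDX ticker. A quick Python sketch, used here purely for illustration, shows the encoding:)

```python
from urllib.parse import quote, unquote

# '^' is not a safe URL character, so it is percent-encoded as '%5E';
# the ^NDX ticker therefore appears as %5ENDX in the download link.
encoded = quote('^NDX')
print(encoded)            # %5ENDX
print(unquote('%5ENDX'))  # ^NDX
```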

This crumb=dnhBC8SRS9G obviously changes depending on the cookies and user agent, so I have tried to configure MATLAB accordingly to disguise itself as a Chrome browser (copying the cookie and user agent found in Chrome):

useragent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36';

cookie ='PRF=t%3D%255ENDX; expires=Thu, 11-Jun-2020 09:06:31 GMT; path=/; domain=.finance.yahoo.com';

opts = weboptions('UserAgent',useragent,'KeyName','WWW_Authenticate','KeyValue','dnhBC8SRS9G','KeyName','Cookie','KeyValue',cookie)

url = 'https://query1.finance.yahoo.com/v7/finance/download/^NDX?period1=496969200&period2=1519513200&interval=1d&events=history&crumb=dnhBC8SRS9G' ;

response = webread(url,opts)

But no matter what I do (using either webread or the extra function urlread2), I get the response that I am "unauthorized." The MATLAB code above gives me the response:

Error using readContentFromWebService (line 45)
The server returned the status 401 with message "Unauthorized" in response to the request to URL
https://query1.finance.yahoo.com/v7/finance/download/%5ENDX?period1=496969200&period2=1519513200&interval=1d&events=history&crumb=dnhBC8SRS9G.

Error in webread (line 122)
[varargout{1:nargout}] = readContentFromWebService(connection, options);

Error in TEST2 (line 22)
response = webread(url,opts)

Any help would be much appreciated; I just want to get the basics to work, even if it means that I have to manually copy the crumb from the Chrome browser into MATLAB before the first request. (I saw that this was solved in Python, C#, etc., and I followed those solutions as closely as possible, so it should be doable in MATLAB too, right?)

EDIT: If it is of any help, when I run urlread2 instead of webread at the end of my code, i.e.:

[output,extras] = urlread2(url,'GET');
extras.firstHeaders

I get the following output from MATLAB:

ans = 

  struct with fields:

                   Response: 'HTTP/1.1 401 Unauthorized'
     X_Content_Type_Options: 'nosniff'
           WWW_Authenticate: 'crumb'
               Content_Type: 'application/json;charset=utf-8'
             Content_Length: '136'
                       Date: 'Tue, 12 Jun 2018 13:07:38 GMT'
                        Age: '0'
                        Via: 'http/1.1 media-router-omega4.prod.media.ir2.yahoo.com (ApacheTrafficServer [cMsSf ]), http/1.1 media-ncache-api17.prod.media.ir2.yahoo.com (ApacheTrafficServer [cMsSf ]), http/1.1 media-ncache-api15.prod.media.ir2.yahoo.com (ApacheTrafficServer [cMsSf ]), http/1.1 media-router-api12.prod.media.ir2.yahoo.com (ApacheTrafficServer [cMsSf ]), https/1.1 e3.ycpi.seb.yahoo.com (ApacheTrafficServer [cMsSf ])'
                     Server: 'ATS'
                    Expires: '-1'
              Cache_Control: 'max-age=0, private'
  Strict_Transport_Security: 'max-age=15552000'
                 Connection: 'keep-alive'
                  Expect_CT: 'max-age=31536000, report-uri="http://csp.yahoo.com/beacon/csp?src=yahoocom-expect-ct-report-only"'
Public_Key_Pins_Report_Only: 'max-age=2592000; pin-sha256="2fRAUXyxl4A1/XHrKNBmc8bTkzA7y4FB/GLJuNAzCqY="; pin-sha256="2oALgLKofTmeZvoZ1y/fSZg7R9jPMix8eVA6DH4o/q8="; pin-sha256="Gtk3r1evlBrs0hG3fm3VoM19daHexDWP//OCmeeMr5M="; pin-sha256="I/Lt/z7ekCWanjD0Cvj5EqXls2lOaThEA0H2Bg4BT/o="; pin-sha256="JbQbUG5JMJUoI6brnx0x3vZF6jilxsapbXGVfjhN8Fg="; pin-sha256="SVqWumuteCQHvVIaALrOZXuzVVVeS7f4FGxxu6V+es4="; pin-sha256="UZJDjsNp1+4M5x9cbbdflB779y5YRBcV6Z6rBMLIrO4="; pin-sha256="Wd8xe/qfTwq3ylFNd3IpaqLHZbh2ZNCLluVzmeNkcpw="; pin-sha256="WoiWRyIOVNa9ihaBciRSC7XHjliYS9VwUGOIud4PB18="; pin-sha256="cAajgxHlj7GTSEIzIYIQxmEloOSoJq7VOaxWHfv72QM="; pin-sha256="dolnbtzEBnELx/9lOEQ22e6OZO/QNb6VSSX2XHA3E7A="; pin-sha256="i7WTqTvh0OioIruIfFR4kMPnBqrS2rdiVPl/s2uC/CY="; pin-sha256="iduNzFNKpwYZ3se/XV+hXcbUonlLw09QPa6AYUwpu4M="; pin-sha256="lnsM2T/O9/J84sJFdnrpsFp3awZJ+ZZbYpCWhGloaHI="; pin-sha256="r/mIkG3eEpVdm+u/ko/cwxzOMo1bk4TyHIlByibiA5E="; pin-sha256="uUwZgwDOxcBXrQcntwu+kYFpkiVkOaezL0WYEZ3anJc="; includeSubdomains; report-uri="http://csp.yahoo.com/beacon/csp?src=yahoocom-hpkp-report-only"'

And my weboptions output is:

opts = 

  weboptions with properties:

  CharacterEncoding: 'auto'
          UserAgent: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36'
            Timeout: 5
           Username: ''
           Password: ''
            KeyName: ''
           KeyValue: ''
        ContentType: 'auto'
      ContentReader: []
          MediaType: 'application/x-www-form-urlencoded'
      RequestMethod: 'auto'
        ArrayFormat: 'csv'
       HeaderFields: {'Cookie'  'PRF=t%3D%255ENDX; expires=Thu, 11-Jun-2020 09:06:31 GMT; path=/; domain=.finance.yahoo.com'}
  CertificateFilename: '/opt/matlab/r2017a/sys/certificates/ca/rootcerts.pem'
litmus
  • You are using KeyName KeyValue to add multiple headers, however it appears that only one header is being added. I believe that what you really want to use is the Header Fields property of weboptions. Check the documentation for weboptions for more information on headerfields. – Paolo Jun 12 '18 at 10:24
  • Try this: `opts.HeaderFields = {'WWW_Authenticate' 'dnhBC8SRS9G';'Cookie' 'PRF=t%3D%255ENDX; expires=Thu, 11-Jun-2020 09:06:31 GMT; path=/; domain=.finance.yahoo.com'}` – Paolo Jun 12 '18 at 10:32
  • thanks for your reply! I tried it but still no difference :/ – litmus Jun 12 '18 at 13:08
  • 1
    The HeaderFields is empty, so it has not added the headers successfully – Paolo Jun 12 '18 at 13:15
  • Ah, thanks! I will fix that! – litmus Jun 12 '18 at 13:17
  • Try with my code. Also I am not sure about your authentication header, authentication scheme should be one of the [following](http://www.iana.org/assignments/http-authschemes/http-authschemes.xhtml) ? – Paolo Jun 12 '18 at 13:17
  • Ok, I fixed the HeaderFields (see my edit)! And I see your point about the WWW_Authenticate, Yahoo actually replies with a "WWW_Authenticate: 'crumb'" so I should probably change it from 'dnhBC8SRS9G' to 'crumb.' – litmus Jun 12 '18 at 13:26
  • Not quite sure that's going to fix it. Why is authentication required in the first place, do you need to have an account? In that case, are there no keys etc which are given to you for http requests? – Paolo Jun 12 '18 at 13:35
  • Yeah, I realized this too and removed the "WWW_Auth..." since only the cookie is needed. Where would I find the Keys? – litmus Jun 12 '18 at 13:46
  • 1
    It would be under your account settings somewhere, however looking at the [link](https://stackoverflow.com/a/44050039/3390419) you shared it looks like you don't actually need any keys. All you need according to that example is the URL with the crumb value appended (as you already have) and the header cookie/value in weboptions. Also your 'Cookie' perhaps should just be 'cookie', not sure about case sensitivity... – Paolo Jun 12 '18 at 13:50
  • Ok, tried 'cookie,' but no luck.. I've also played around and added/ removed/changed the cookies, still nothing – litmus Jun 12 '18 at 14:12

3 Answers


Okay, I did some playing around with this using curl, and it appears that what you are trying to do is not possible at that specified URL. Worth noting is that the crumb and the cookie change often, so I had to parse the responses of the two GET requests every time I ran the script to get their values.

I'll walk you through my attempt.

  1. GET request and save cookie file.
  2. Parse cookie file for cookie.
  3. Print cookie to file.
  4. GET request and save html.
  5. Parse HTML and obtain crumb.
  6. Form URL.
  7. Form curl request.
  8. Execute request.

The code:

%Get cookie.
command = 'curl -s --cookie-jar cookie.txt https://finance.yahoo.com/quote/GOOG?p=GOOG';
%Execute request.
system(command);
%Read file.
cookie_file = fileread('cookie.txt');
%regexp the cookie.
cookie = regexp(cookie_file,'B\s*(.*)','tokens');
cookie = cell2mat(cookie{1});

%Print cookie to file (for curl purposes only).
file = fopen('mycookie.txt','w');
fprintf(file,'%s',cookie);
fclose(file);

%Get request.
command = 'curl https://finance.yahoo.com/quote/GOOG?p=GOOG > goog.txt';
%Execute request.
system(command);
%Read file.
crumb_file = fileread('goog.txt');
%regexp the crumb.
crumb = regexp(crumb_file,'(?<="CrumbStore":{"crumb":")(.*)(?="},"UserStore":)','tokens');
crumb = crumb{:};

%Form the URL.
url = 'https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1492524105&period2=1495116105&interval=1d&events=history&crumb=';
url = strcat(url,crumb);

%Form the curl command.
command = strcat('curl',{' '},'-v -L -b',{' '},'mycookie.txt',{' '},'-H',{' '},'"User-Agent:',{' '},'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36','"',{' '},'"',url,'"');
command = command{1};
system(command);
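The two regexp calls above do the heavy lifting: one takes everything after the cookie name "B" in the cookie-jar file, the other captures the crumb between the "CrumbStore" and "UserStore" markers of the quote page. For readers more comfortable in another language, here is a sketch of the same parsing logic in Python, run against made-up sample strings shaped like a Netscape cookie-jar line and the CrumbStore fragment (the sample values are for illustration only):

```python
import re

# A sample line in the Netscape cookie-jar format that curl writes:
# domain, flag, path, secure, expiry, name, value (tab-separated).
cookie_line = ".yahoo.com\tTRUE\t/\tFALSE\t1600000000\tB\tabc123&b=3&s=xy"

# Same idea as the MATLAB regexp 'B\s*(.*)': grab everything
# after the cookie name "B".
cookie = re.search(r"B\s*(.*)", cookie_line).group(1)

# A sample fragment of the quote-page HTML containing the crumb.
html = '"CrumbStore":{"crumb":"dSpwQstrQDp"},"UserStore":{}'

# Same idea as the MATLAB lookaround regexp on the saved HTML.
crumb = re.search(r'(?<="CrumbStore":\{"crumb":")(.*?)(?="\})', html).group(1)

print(cookie)  # abc123&b=3&s=xy
print(crumb)   # dSpwQstrQDp
```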

The final curl request:

curl -v -L -b mycookie.txt -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36" "https://query1.finance.yahoo.com/v7/finance/download/^NDX?period1=496969200&period2=1519513200&interval=1d&events=history&crumb=dSpwQstrQDp"

In the final curl request I am using the following flags:

-v: verbosity
-L: follow redirects
-b: use cookie file
-H: user agent header field (tried spoofing it with my browser)

And for every attempt, the response is the following:

 {
    "finance": {
        "error": {
            "code": "Unauthorized",
            "description": "Invalid cookie"
        }
    }
}

I studied the server response: every header value is successfully sent by the client, yet the request always results in the same error. I now suspect that you simply cannot do this anymore, as is explained here. So, as that user points out, you may need to perform the web scraping from a different location. If you find a working URL, you can open a new question and I would be happy to help out.

Paolo
  • Thank you for your testing! The thing that strikes me is that it seems some systems, like MATLAB, cannot properly disguise itself, the way others have been able to do in Python, C#... For now I'm going to leave the question open a bit longer in hopes of drawing some more attention to it – litmus Jun 13 '18 at 12:18
  • @litmus a possible alternative to downloading the .csv is to retrieve the data in Matlab, format it, and write that data in a .csv from Matlab. Is that a possible option for you? – Paolo Jun 14 '18 at 11:27
  • 1
    @litmus I tried curling a different url `https://query1.finance.yahoo.com/v8/finance/chart/AAPL?symbol=AAPL&period1=0&period2=9999999999&interval=1d` and got the response fine, so if creating the .csv is an option I can parse the response and work on that – Paolo Jun 14 '18 at 11:50
  • wow! If it gives *exactly* the same data as the CSV (date, open, high, low, close, adj close and volume) for any selected time period and stock, then yes! (the very reason for me persisting with Yahoo Finance is their low error rate) – litmus Jun 14 '18 at 11:53
  • And I guess I can check your method's historic portions against my earlier collected CSV files using an MD5 hash, or with some MATLAB isequal function – litmus Jun 14 '18 at 11:57
  • 1
    @litmus yes they are *exactly* the same ;) – Paolo Jun 15 '18 at 12:45
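(The v8 chart URL mentioned in the comments above can be assembled programmatically. A small Python sketch for illustration; the query keys — symbol, period1, period2, interval — are exactly those that appear in that URL, and the endpoint itself is Yahoo's and may change:)

```python
from urllib.parse import urlencode

def chart_url(symbol, period1, period2, interval="1d"):
    """Build the v8 chart URL from the comment above."""
    base = "https://query1.finance.yahoo.com/v8/finance/chart/" + symbol
    query = urlencode({"symbol": symbol, "period1": period1,
                       "period2": period2, "interval": interval})
    return base + "?" + query

print(chart_url("AAPL", 0, 9999999999))
# https://query1.finance.yahoo.com/v8/finance/chart/AAPL?symbol=AAPL&period1=0&period2=9999999999&interval=1d
```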

Yahoo has a number of checks to make sure that the request is coming from a web browser. Check out this function, https://www.mathworks.com/matlabcentral/fileexchange/68361-yahoo-finance-data-downloader, which makes Yahoo Finance believe that the request is coming from a browser.

Here are a couple of examples on how this function can be used to download and analyse market data https://github.com/Lenskiy/market-data-functions


The following is a script which downloads the last month's worth of data for the AAPL stock and creates a .csv file named AAPL_14-05-2018_14-06-2018.csv, containing the date, open, high, low, close, adj close and volume information, as found here.

%Choose any ticker.
ticker = 'AAPL'; %'FB','AMZN'...

%Base url (the ticker is inserted into the path below).
url = 'https://query1.finance.yahoo.com/v8/finance/chart/';
%weboption constructor.
opts = weboptions();

%Start retrieving data from today.
today = datetime('now');
today.TimeZone = 'America/New_York';

%Convert dates to unix timestamp.
todayp = posixtime(today);
%Last week.
weekp  = posixtime(datetime(addtodate(datenum(today),-7,'day'),'ConvertFrom','datenum'));
%Last month.
monthp = posixtime(datetime(addtodate(datenum(today),-1,'month'),'ConvertFrom','datenum'));
%Last year.
yearp = posixtime(datetime(addtodate(datenum(today),-1,'year'),'ConvertFrom','datenum'));

%Add ticker to the path and start the query string.
url = strcat(url,ticker,'?symbol=',ticker);

%Construct url, add time intervals. The following url is for last month worth of data.
url = strcat(url,'&period1=',num2str(monthp,'%.10g'),'&period2=',num2str(todayp,'%.10g'),'&interval=','1d');

%Execute HTTP request.
data = webread(url,opts);

%Get data.
dates    = flipud(datetime(data.chart.result.timestamp,'ConvertFrom','posixtime'));
high     = flipud(data.chart.result.indicators.quote.high);
low      = flipud(data.chart.result.indicators.quote.low);
vol      = flipud(data.chart.result.indicators.quote.volume);
open     = flipud(data.chart.result.indicators.quote.open);
close    = flipud(data.chart.result.indicators.quote.close);
adjclose = flipud(data.chart.result.indicators.adjclose.adjclose);

%Create table.
t = table(dates,open,high,low,close,adjclose,vol);

%Format filename: ticker, start date, end date.
namefile = strcat(ticker,'_',char(datetime(monthp,'Format','dd-MM-yyyy','ConvertFrom','posixtime')),...
           '_',char(datetime(todayp,'Format','dd-MM-yyyy','ConvertFrom','posixtime')),'.csv');

%Write table to file.
writetable(t,namefile);
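The date arithmetic above (today, and the same moment a week earlier, as Unix timestamps for period1/period2) can be sanity-checked in any language. Here is a sketch of the same computation in Python, using a fixed UTC date for reproducibility rather than the America/New_York zone the MATLAB code sets; note that MATLAB's addtodate with 'month'/'year' moves by calendar units, which a fixed timedelta does not replicate exactly:

```python
from datetime import datetime, timedelta, timezone

# Mirror the MATLAB posixtime() calls: "now" and seven days earlier,
# both as Unix timestamps suitable for period1/period2.
now = datetime(2018, 6, 14, tzinfo=timezone.utc)  # fixed date for reproducibility
todayp = int(now.timestamp())
weekp = int((now - timedelta(days=7)).timestamp())

print(todayp)          # 1528934400
print(todayp - weekp)  # 604800 seconds in a week
```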

The .csv file output (only last few days displayed):

     dates               open         high        low       close     adjclose    vol
14/06/2018 16:46    191.5500031 191.5700073 190.2200012 190.7599945 190.7599945 10252639
13/06/2018 13:30    192.4199982 192.8800049 190.4400024 190.6999969 190.6999969 21431900
12/06/2018 13:30    191.3899994 192.6100006 191.1499939 192.2799988 192.2799988 16911100
11/06/2018 13:30    191.3500061 191.9700012 190.2100067 191.2299957 191.2299957 18308500
08/06/2018 13:30    191.1699982 192         189.7700043 191.6999969 191.6999969 26656800
07/06/2018 13:30    194.1399994 194.1999969 192.3399963 193.4600067 193.4600067 21347200
06/06/2018 13:30    193.6300049 194.0800018 191.9199982 193.9799957 193.9799957 20933600
05/06/2018 13:30    193.0700073 193.9400024 192.3600006 193.3099976 193.3099976 21566000

Above I get the data for the last month. The commented lines show how this can be adapted for the last week or the last year as well. I can easily adapt the code into a function for you to use with any stock and time interval; you would just need to tell me what sort of time intervals you are interested in.

Paolo