I'm trying to run a script that pulls price history from Yahoo Finance. Boris's answer in the thread "wget can't download yahoo finance data any more" works for me about 2 out of 3 times, but fails whenever the crumb extracted from the page has a "\" character in it. The code that sometimes works looks like this:

#!usr/bin/sh
symbol=$1
today=$(date +%Y%m%d)
tomorrow=$(date --date='1 days' +%Y%m%d)
first_date=$(date -d "$2" '+%s')
last_date=$(date -d "$today" '+%s')
wget --no-check-certificate --save-cookies=cookie.txt https://finance.yahoo.com/quote/$symbol/?p=$symbol -O C:/trip/stocks/stocknamelist/crumb.store
crumb=$(grep 'root.*App' crumb.store | sed 's/,/\n/g' | grep CrumbStore | sed 's/"CrumbStore":{"crumb":"\(.*\)"}/\1/')
echo $crumb
fileloc=$"https://query1.finance.yahoo.com/v7/finance/download/$symbol?period1=$first_date&period2=$last_date&interval=1d&events=history&crumb=$crumb"
echo $fileloc
wget --no-check-certificate --load-cookies=cookie.txt $fileloc -O c:/trip/stocks/temphistory/hs$symbol.csv
rm cookie.txt crumb.store

But wget doesn't seem to process that the way I intend either; it seems to interpret the argument as described here: https://askubuntu.com/questions/758080/getting-scheme-missing-error-with-wget Any suggestions on how to pass the $crumb variable to wget so that wget doesn't error out if $crumb has a "\" character in it?

Edited to show the full script. To clarify, I've got Cygwin installed with the wget package. I call the script from the cmd prompt as follows (in this example the script above is named "stocknamedownload.sh", the stock symbol I'm downloading is "A", and the start date is 19800101):

c:\trip\stocks\StockNameList>bash stocknamedownload.sh A 19800101

This script seems to work fine - unless the crumb returned contains a "\" character in it.

JimL
  • `$'"'"`? That's a lot of mess to go through for no reason. – Charles Duffy Oct 10 '17 at 14:44
  • ("for no reason" meaning that adding more literal quotes doesn't in any way substitute for missing syntactic quotes). – Charles Duffy Oct 10 '17 at 14:44
  • That said -- trying to reproduce this from the code you provided, I don't have anything that matches `root.*App` in my `crumb.store`. Can you try to add enough content that others can see the problem themselves? – Charles Duffy Oct 10 '17 at 14:47
  • Added additional content that should help clarify/reproduce. Thanks for looking. – JimL Oct 10 '17 at 22:53
  • (Hmm -- looks like Cygwin doesn't ship `jq`. That's annoying -- and frankly, I'm a bit surprised; it's been in widespread use for long enough now that I'd have expected them to pick it up). – Charles Duffy Oct 10 '17 at 23:46
  • BTW, I notice `C:`-style paths. Does that mean you're using a Windows-native wget executable rather than the cygwin-provided one? – Charles Duffy Oct 10 '17 at 23:48
  • I searched for and added the wget cygwin package. Wget is executing for me ok (the script is working *most* of the time). I was able to add the jq package as well, though after 15 minutes of browsing the tutorial, I haven't been able to follow how it's going to change the result of my $crumb variable as it is used on the wget line. – JimL Oct 11 '17 at 01:10

2 Answers

You are adding quotes to the value of the variable instead of quoting the expansion. You are also trying to use tools that don't know what JSON is to process JSON; use jq.

wget --no-check-certificate \
     --save-cookies=cookie.txt \
     "https://finance.yahoo.com/quote/$symbol/?p=$symbol" \
     -O C:/trip/stocks/stocknamelist/crumb.store


# Something like this; it's hard to reverse engineer the structure
# of crumb.store from your pipeline.
crumb=$(jq '.CrumbStore.crumb' crumb.store)
echo "$crumb"

fileloc="https://query1.finance.yahoo.com/v7/finance/download/$symbol?period1=$first_date&period2=$last_date&interval=1d&events=history&crumb=$crumb"
echo "$fileloc"
wget --no-check-certificate \
     --load-cookies=cookie.txt "$fileloc" \
     -O "c:/trip/stocks/temphistory/hs$symbol.csv"
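
The difference between quoting the expansion and putting quote characters into the value can be seen without touching the network. The following is only a minimal sketch: the URL and the `count_args` helper are made up for illustration, and the value deliberately contains a space so that word splitting is visible.

```shell
#!/usr/bin/env bash
# Hypothetical URL, for illustration only; the space makes word splitting visible.
url='https://example.com/download?a=1&crumb=ab cd'

# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Quote characters stored *inside* the value are ordinary data. They do not
# protect an unquoted expansion from word splitting.
quoted_in_value="\"$url\""
count_args $quoted_in_value   # prints 2

# Syntactic quotes around the expansion keep the value as a single argument.
count_args "$url"             # prints 1
```

In the first call the command would also receive the literal `"` characters as part of its arguments, which is consistent with the "Scheme missing" style of error from wget.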
chepner
  • Is your answer intended for java implementation? Trying to implement as a shell script with cygwin. Thanks for looking. – JimL Oct 10 '17 at 22:56
  • @JimL, `jq` is not a Java application -- it's intended for use from shell. See https://stedolan.github.io/jq/ – Charles Duffy Oct 10 '17 at 23:18
  • (Also, while I'm not sure that this will work as-is -- a copy of the `crumb.store` would help to validate -- the general approach of using a program that understands JSON to decode a JSON string has a rational relationship to solving your immediate problem; if you have a literal `\"` substring, that's a JSON encoding of the literal `"`; if you have `jq` decoding a string, it'll decode that literal as part of it). – Charles Duffy Oct 10 '17 at 23:25
  • 1
    ...actually, if you want that unescaping, it would need to be `jq -r`, not bare `jq`. – Charles Duffy Oct 10 '17 at 23:26
  • This is my first foray into Unix style regular expressions (installed cygwin just to try to get a working solution), so I could use a little more hand holding on this. I installed the jq package and naively tried to convert my crumb variable to json style by adding the line `crumb=jq($crumb)`. – JimL Oct 11 '17 at 00:56
  • Looking through the tutorial on jq, it looks like it's supposed to do very similar things to grep and sed. I don't believe my issue is actually with the $crumb variable. The echo of $crumb is giving the expected result whether there is a "\" in it or not. My problem (I believe) is in how wget is interpreting the variable when there's a "\" in it. The wget call seems to convert the "\" to %5C. – JimL Oct 11 '17 at 01:04
  • @chepner, ...so, I've gotten a look at the data -- it's a HTML document; not well-formed JSON as a whole by any means. – Charles Duffy Oct 11 '17 at 02:13
  • @JimL, `echo` is not trustworthy. Use `declare -p crumb` rather than `echo "$crumb"` to get an accurate description of the variable's contents. – Charles Duffy Oct 11 '17 at 02:17
  • @JimL, the other thing is that when `sed` returns `\"`, the **actual data** is just `"` with no backslash preceding it. If that backslash gets sent back to the server, that's a problem. Using a compliant JSON decoder (like `jq`) will prevent that from happening. – Charles Duffy Oct 11 '17 at 02:25

The following implementation appears to work 100% of the time -- I'm unable to reproduce the claimed sporadic failures:

#!/usr/bin/env bash

set -o pipefail

symbol=$1
today=$(date +%Y%m%d)
tomorrow=$(date --date='1 days' +%Y%m%d)
first_date=$(date -d "$2" '+%s')
last_date=$(date -d "$today" '+%s')

# store complete webpage text in a variable
page_text=$(curl --fail --cookie-jar cookies \
  "https://finance.yahoo.com/quote/$symbol/?p=$symbol") || exit

# extract the JSON used by JavaScript in the page
app_json=$(grep -e 'root.App.main = ' <<<"$page_text" \
           | sed -e 's@^root.App.main = @@' \
                 -e 's@[;]$@@') || exit

# use jq to extract the crumb from that JSON
crumb=$(jq -r \
          '.context.dispatcher.stores.CrumbStore.crumb' \
          <<<"$app_json" | tr -d '\r') || exit

# Perform our actual download
fileloc="https://query1.finance.yahoo.com/v7/finance/download/$symbol?period1=$first_date&period2=$last_date&interval=1d&events=history&crumb=$crumb"
curl --fail --cookie cookies "$fileloc" >"hs$symbol.csv"
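
To see why the JSON-aware step matters, the extraction can be tried offline. This is a sketch under assumptions: `sample_page` is a made-up one-line excerpt (the real page is far larger), the crumb value is invented, and `raw_crumb` / `decoded_crumb` are hypothetical helper names. It assumes `jq` is installed and skips the jq step otherwise.

```shell
#!/usr/bin/env bash
# Hypothetical one-line excerpt of the page; the crumb value is invented.
sample_page='root.App.main = {"context":{"dispatcher":{"stores":{"CrumbStore":{"crumb":"ab\u002Fcd"}}}}};'

# Text-only extraction: returns the JSON escape sequence verbatim.
raw_crumb() {
  sed -e 's/.*"crumb":"\([^"]*\)".*/\1/'
}

# JSON-aware extraction: strip the JavaScript wrapper, then let jq decode
# the string. -r prints the decoded value without surrounding quotes.
decoded_crumb() {
  sed -e 's@^root\.App\.main = @@' -e 's@;$@@' \
    | jq -r '.context.dispatcher.stores.CrumbStore.crumb'
}

printf 'sed only: %s\n' "$(printf '%s\n' "$sample_page" | raw_crumb)"
if command -v jq >/dev/null 2>&1; then
  printf 'jq -r:    %s\n' "$(printf '%s\n' "$sample_page" | decoded_crumb)"
fi
```

With this sample, the sed-only result is `ab\u002Fcd`, while `jq -r` yields `ab/cd`. Sending the undecoded form in the URL would put a literal backslash (encoded as `%5C`) into the crumb parameter.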

Note that the tr -d '\r' is only necessary when using a native-Windows jq mixed with an otherwise native-Cygwin set of tools.
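
The effect of a stray carriage return can be checked in isolation. The sketch below simulates a tool that emits Windows line endings using `printf`; the crumb value `abc123` is made up.

```shell
#!/usr/bin/env bash
# Simulate output that ends in CRLF, as a native-Windows tool might produce.
# Command substitution strips the trailing newline but keeps the carriage return.
crumb_with_cr=$(printf 'abc123\r\n')
crumb_clean=$(printf 'abc123\r\n' | tr -d '\r')

echo "${#crumb_with_cr}"   # prints 7: the invisible \r is still part of the value
echo "${#crumb_clean}"     # prints 6
```

Because `echo` shows both values identically on most terminals, checking the length (or using `declare -p` in bash) is a more reliable way to spot the extra character before it ends up in the URL.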

Charles Duffy