NOTE: For more answers related to this, please see Special Characters in Google Calculator

I noticed when grabbing the return value for a Google Calculator calculation, the thousands place is separated by a rather odd character. It is not simply a space.

Let's take the example of converting $4,000 USD to GBP.

If you visit the following Google link:

http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp

You'll note that the response is:

{lhs: "4000 U.S. dollars",rhs: "2 497.81441 British pounds",error: "",icc: true}

This looks reasonable, and the thousands place appears to be separated by a whitespace character.

However, if you enter the following into your command line:

curl -s "http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp"

You'll note that the response is:

{lhs: "4000 U.S. dollars",rhs: "2?498.28243 British pounds",error: "",icc: true}

That question mark (?) is a replacement character. What is going on?

AppleScript returns a different replacement character:

{lhs: "4000 U.S. dollars",rhs: "2†498.28243 British pounds",error: "",icc: true}

From other sources I am also getting:

{lhs: "4000 U.S. dollars",rhs: "2�498.28243 British pounds",error: "",icc: true}

It turns out that � is the proper Unicode replacement character, U+FFFD (decimal 65533).

Can anyone give me insight into what Google is passing me?

spex

3 Answers

It's a non-breaking space, U+00A0. It's to ensure that the number won't get broken at the end of a line.

Google does, however, declare the correct encoding (UTF-8) in the response headers:

Content-Type: text/html; charset=UTF-8

so ...

  • if it comes out as a normal space (U+0020) instead (Firefox does that when copying, stupidly enough), then the application converts certain characters to lookalikes, maybe to fit into some restricted code page (ASCII perhaps).
  • if there is a question mark, then it was correctly read as Unicode, but some part of the processing uses a legacy character set that doesn't contain that character, so it gets replaced.
  • if there is a replacement character � (U+FFFD), then it was likely read as UTF-8, converted into a legacy character set that contains the character (e.g. Latin 1) and then re-interpreted as UTF-8, where a lone A0 byte is not a valid sequence.
  • if there is a totally different character, such as your dagger (†), then I'd guess the response is read correctly as Unicode, gets converted to a character set that contains the character and is then re-interpreted in yet another character set. A quick look at the Mac Roman code page reveals that A0 indeed maps to †.
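
The four cases in the list above can be reproduced offline. What follows is only a minimal Python sketch, not a description of what curl, AppleScript or any browser does internally. The sample string and variable names are made up for illustration, and ASCII stands in for "a legacy character set that lacks the character".

```python
# Hypothetical sample: the number as Google sends it, with U+00A0 as separator.
NBSP = "\u00a0"
original = f"2{NBSP}498.28243"

# Case 1: lookalike substitution, NBSP swapped for an ordinary space (U+0020).
as_plain_space = original.replace(NBSP, " ")

# Case 2: conversion to a character set that lacks NBSP (ASCII is assumed here);
# the encoder substitutes "?" for the character it cannot represent.
as_question_mark = original.encode("ascii", errors="replace").decode("ascii")

# Case 3: encoded as Latin 1 (NBSP becomes the single byte 0xA0), then misread
# as UTF-8. A lone 0xA0 is not valid UTF-8, so the decoder emits U+FFFD.
latin1_bytes = original.encode("latin-1")
as_replacement_char = latin1_bytes.decode("utf-8", errors="replace")

# Case 4: the same Latin 1 bytes misread as Mac Roman, where 0xA0 is the dagger.
as_dagger = latin1_bytes.decode("mac_roman")

for label, value in [
    ("plain space", as_plain_space),
    ("question mark", as_question_mark),
    ("replacement char", as_replacement_char),
    ("dagger", as_dagger),
]:
    # ascii() shows non-ASCII characters as escapes, so nothing depends on
    # the terminal's own encoding.
    print(f"{label}: {ascii(value)}")
```

Printing with `ascii()` is deliberate: it sidesteps exactly the kind of terminal re-encoding that caused the confusion in the first place.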

Needless to say, some part of whatever you use to process that response seems to be horribly broken with regard to Unicode. Something I'd hope wouldn't really happen that often in this millennium, but apparently it still does.


I figured out what it was by fiddling around in PowerShell a bit:

PS Home:\> $wc = new-object net.webclient
PS Home:\> $x = $wc.downloadstring('http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp')
PS Home:\> [char[]]$x|%{"$_ - " + +$_}
...
" - 34
2 - 50
  - 160
4 - 52
9 - 57
8 - 56
. - 46
2 - 50
8 - 56
2 - 50
4 - 52
...

Also a quick look at the response headers revealed that the encoding is set correctly.
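
The same kind of inspection works in other languages too. As a rough Python equivalent of the PowerShell one-liner above (the function name and the sample string are hypothetical; a real check would use the decoded response body instead):

```python
def dump_code_points(text):
    """Return (character, code point) pairs for every character in text."""
    return [(ch, ord(ch)) for ch in text]


# Hypothetical excerpt of the rhs value, with the separator written as an escape.
sample = '"2\u00a0498.28243'

for ch, code in dump_code_points(sample):
    # Same "character - number" layout as the PowerShell output.
    print(f"{ch} - {code}")
```

A separator that prints as blank but reports 160 rather than 32 is the non-breaking space.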

Joey
  • Thank you. How did you determine this? – spex Oct 10 '12 at 20:46
  • I added a note as to the how. But that's fairly basic stuff, actually. – Joey Oct 10 '12 at 20:52
  • I really appreciate the thorough response. I have learned a lot. – spex Oct 10 '12 at 20:54
  • `curl -s "http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp" | iconv -t UTF8 {lhs: "4000 U.S. dollars",rhs: "2 iconv: (stdin):1:33: cannot convert` – spex Oct 10 '12 at 21:28
  • I have no actual idea what's happening there, since Unix essentially just passes bytes around (i.e. arbitrary binary data), which sometimes happen to be in some system-wide defined encoding. So it *may* be that there is quite a bit of conversion already going on when `curl` prints text to its output stream. Results might vary depending on your language and encoding settings, terminal settings, `curl` settings and/or build options, etc. – Joey Oct 10 '12 at 21:55
  • The problem was that I was not providing a proper "from" (-f) to iconv. It turns out my terminal was encoding output in ISO-8859-1 and I needed to convert to UTF-8. This could be done with `iconv -f ISO-8859-1 -t UTF-8`. Again, thanks for your help. – spex Oct 15 '12 at 23:26
  • Ah; no clue of `iconv` options. I only could interpret the result I saw. And reading that I still think Unix' idea of how encodings need to be implemented is braindead (as is the Windows console, probably both for historical reasons) ;-) – Joey Oct 16 '12 at 05:09

According to my tests with curl in Terminal on OS X, changing the character encoding under International in the Terminal preferences, the response is encoded as ISO Latin 1.

When I set the encoding to UTF-8, I get "2?498.28243"

When I set the encoding to MacRoman, I get "2†498.28243"

First solution: use a user agent from any browser (Safari on OS X 10.6.8 in this example)

curl -s -A 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.48 (KHTML, like Gecko) Version/5.1 Safari/534.48' 'http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp'

Second solution: use iconv

curl -s 'http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp' | iconv -t utf8 -f iso-8859-1
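
For comparison, the same re-encoding step can be sketched in Python. This assumes, as the Terminal tests above suggest, that the bytes received without a browser user agent are ISO-8859-1. The function name and the sample bytes are hypothetical stand-ins for the real response.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8, like `iconv -f iso-8859-1 -t utf8`."""
    # Every byte value is a valid ISO-8859-1 character, so this decode cannot fail.
    return raw.decode("iso-8859-1").encode("utf-8")


# Hypothetical fragment of the response: 0xA0 is the Latin 1 non-breaking space.
raw = b'rhs: "2\xa0498.28243 British pounds"'
converted = latin1_to_utf8(raw)

# In UTF-8 the non-breaking space becomes the two-byte sequence 0xC2 0xA0.
print(converted)
```

Because every byte is valid ISO-8859-1, this conversion never raises an error, which also means it will silently produce garbage if the input was in fact already UTF-8.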
jackjr300
  • I had tried to use iconv, but only entered the -t (to) and not the proper -f (from). Thank you. Your iconv solution is better than the one I attempted and is my current solution. – spex Oct 12 '12 at 21:39
  • I have found that the following outputs valid HTML in the answer portion of the JSON: `echo -en $(curl -s 'http://www.google.com/ig/calculator?hl=en&q=QUERY') > ~/temp.html` where the -e for echo interprets escapes, -n suppresses the echo newline, and QUERY represents a url encoded query. – spex Oct 15 '12 at 23:20

Try

set myUrl to quoted form of "http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp"
set xxx to do shell script "curl " & myUrl & " | sed 's/[†]/,/'"
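
Matching the dagger only works while the bytes happen to be displayed as Mac Roman. If the text has been decoded correctly, so that the separator really is U+00A0, a more general approach is to drop all whitespace from the numeric part. A minimal Python sketch of that idea follows; the function name is hypothetical, and it assumes the rhs value starts with the number, as in the responses shown in the question.

```python
import re


def parse_amount(rhs: str) -> float:
    """Parse the leading number of an rhs value such as '2\u00a0498.28243 British pounds'."""
    # In Python 3 str patterns, \s matches Unicode whitespace, including U+00A0.
    match = re.match(r"[\d\s.]+", rhs)
    if match is None:
        raise ValueError(f"no leading number in {rhs!r}")
    # str.split() with no arguments also splits on U+00A0, so joining the
    # pieces removes both ordinary and non-breaking spaces.
    return float("".join(match.group().split()))


print(parse_amount("2\u00a0498.28243 British pounds"))
```

This does not handle the `?`, `�` or `†` variants; those are symptoms of a decoding problem that is better fixed upstream than patched here.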
adayzdone
  • That works just fine to "fix" the character in AppleScript, although I still don't know why Google is returning this special character or what exactly it is. – spex Oct 10 '12 at 20:40