
I have been wondering how it is possible to scrape information from Wikipedia. For example, I have a list of world cities and want to obtain their approximate latitude and longitude. Take Miami as an example: when I run curl -s https://en.wikipedia.org/wiki/Miami | grep -E '(latitude|longitude)', somewhere in the HTML there is a tag pair like the one below.

<span class="latitude">25°46′31″N</span> <span class="longitude">80°12′31″W</span>

I know I can extract it with some regex, but I speak very poor regexish. Can someone help me with this?
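For completeness, the kind of brittle one-liner the question asks for could look like this (a sketch that depends entirely on the exact markup shown above; the answers below explain why an HTML parser is the better tool):

curl -s 'https://en.wikipedia.org/wiki/Miami' \
  | grep -oE '<span class="(latitude|longitude)">[^<]+</span>' \
  | head -n 2 \
  | sed -E 's/<[^>]+>//g'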

Gilles Quénot

2 Answers


With xidel and XPath:

$ xidel -se '
    concat(
        (//span[@class="latitude"]/text())[1],
        " ",
        (//span[@class="longitude"]/text())[1]
    )
' 'https://en.wikipedia.org/wiki/Miami'

Output

25°46′31″N 80°12′31″W
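Since the question mentions a whole list of cities, a minimal loop sketch (assuming a file cities.txt with one article title per line, already URL-encoded, and the same span classes on every page):

while IFS= read -r city; do
    printf '%s ' "$city"
    xidel -se '
        concat(
            (//span[@class="latitude"]/text())[1],
            " ",
            (//span[@class="longitude"]/text())[1]
        )
    ' "https://en.wikipedia.org/wiki/$city"
done < cities.txt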

Or, with saxon-lint:

saxon-lint --html --xpath '<XPATH EXP>' <URL>
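For example, substituting the same XPath as above (an untested sketch; the --html and --xpath flags are the ones shown in the command template):

saxon-lint --html --xpath 'concat((//span[@class="latitude"])[1], " ", (//span[@class="longitude"])[1])' 'https://en.wikipedia.org/wiki/Miami'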

If you prefer better-known tools:

# download the page once
curl -s 'https://en.wikipedia.org/wiki/Miami' > Miami.html
# convert the HTML to well-formed XML in place (sponge is from moreutils)
xmlstarlet format -H Miami.html 2>/dev/null | sponge Miami.html
# query it with XPath
xmlstarlet sel -t -v '<XPATH EXP>' Miami.html
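Filling in the placeholder for the Miami page might look like this (assuming the span classes from the question survive the HTML-to-XML conversion):

xmlstarlet sel -t -v '(//span[@class="latitude"])[1]' -o ' ' -v '(//span[@class="longitude"])[1]' -n Miami.html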

Not mentioned above, but regexes are not the right tool for parsing HTML.

Gilles Quénot

You can't parse HTML with regex. Please use an HTML parser like xidel instead:

$ xidel -s "https://en.wikipedia.org/wiki/Miami" -e '
  (//span[@class="geo-dms"])[1],              (: degrees/minutes/seconds :)
  (//span[@class="geo-dec"])[1],              (: signed decimal degrees  :)
  (//span[@class="geo"])[1],                  (: plain "lat; lon" pair   :)
  replace((//span[@class="geo"])[1],";","")   (: same, without the ";"   :)
'
25°46′31″N 80°12′31″W
25.775163°N 80.208615°W
25.775163; -80.208615
25.775163 -80.208615

Take your pick.
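If the goal is to feed the numbers into a script, the plain geo variant splits cleanly into two shell variables (a bash sketch, assuming the same page structure):

read -r lat lon < <(xidel -s "https://en.wikipedia.org/wiki/Miami" -e 'replace((//span[@class="geo"])[1],";","")')
echo "latitude=$lat longitude=$lon"

Note that the longitude comes out negative (-80.208615) because west longitudes are signed in the decimal form.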

Reino