Extract multiple attributes from tag using shell

Question

I'm trying to extract the 2 attributes "lat" and "lon" from a file with the following format:

<trkpt lat="38.8577288" lon="-9.0997973"/>
<trkpt lat="38.8576367" lon="-9.1000557"/>
<trkpt lat="38.8575259" lon="-9.1006374"/>
...

and get the following output:

-9.0997973,38.8577288
-9.1000557,38.8576367
-9.1006374,38.8575259

(Yes, the lat/lon pair is inverted on purpose.)

I don't know much about regex, but looking around on the web, this is all I was able to achieve:

grep 'lat="[^"]*"' doc.txt | grep -no 'lat="[^"]*"'

output:
1:lat="38.8577288"
2:lat="38.8576367"
3:lat="38.8575259"

I'm not sure how to proceed from here. Thanks in advance for your help.

It seems that you are getting `lat` in both commands, you are not asking for the `lon` at all ? — Ibrahim Najjar, Oct 19 '13 at 12:43

score 1 · Answer 1 · edited May 23 '17 at 12:24

Using XPath and bash (you shouldn't use regex to parse HTML or XML!):

If you don't have xmllint already, install libxml2.

# {1..3} covers the three sample points; widen the range for a longer file
for i in {1..3}; do
    lat=$(xmllint --html --xpath "string(//trkpt[$i]/@lat)" file.xml)
    lon=$(xmllint --html --xpath "string(//trkpt[$i]/@lon)" file.xml)
    echo "$lon,$lat"
done 2>/dev/null

(Remove --html if your file is a complete, valid XML document.)

See RegEx match open tags except XHTML self-contained tags
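For a file with an arbitrary number of points, a parser can visit every trkpt element instead of indexing a fixed range. The following is only a sketch in Python's standard library, assuming the input is well-formed XML. The <root> wrapper and the lon_lat_pairs helper are hypothetical names introduced for illustration, and an element missing either attribute would raise an error here.

```python
import xml.etree.ElementTree as ET


def lon_lat_pairs(xml_text):
    """Yield 'lon,lat' for every trkpt element, however many there are."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Compare the local name so a namespaced tag (as in GPX) also matches.
        if elem.tag.rsplit('}', 1)[-1] == 'trkpt':
            yield elem.get('lon') + ',' + elem.get('lat')


# The lines in the question have no enclosing element, so they are wrapped
# in a hypothetical <root> to form a well-formed document.
fragment = '''<trkpt lat="38.8577288" lon="-9.0997973"/>
<trkpt lat="38.8576367" lon="-9.1000557"/>
<trkpt lat="38.8575259" lon="-9.1006374"/>'''

for pair in lon_lat_pairs('<root>' + fragment + '</root>'):
    print(pair)
```

If the file is already a complete XML document, its text can be passed to lon_lat_pairs directly, without the wrapper.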

answered Oct 19 '13 at 19:56

Gilles Quénot


Owch. You also shouldn't assume that there are exactly three points in there. – Matthias Urlichs Jan 22 '17 at 18:59

score 0 · Answer 2 · answered Oct 19 '13 at 17:36

Assuming the format remains in this order, it'll only take one pass.

Find:                           Replace:
.+lat="(.+?)".*lon="(.+?)".+    $2,$1

The two capture groups match lat and lon in that order and grab what's inside the quotes. The leading and trailing .+ consume the rest of the line, so the replacement discards it.
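The Find/Replace pair above is editor syntax. As a rough illustration, the same substitution can be tried in Python, where the replacement is written \2,\1 rather than $2,$1. The swap_line helper is a hypothetical name, and the sample lines are the ones from the question.

```python
import re

# Same pattern as the Find column above.
PATTERN = re.compile(r'.+lat="(.+?)".*lon="(.+?)".+')


def swap_line(line):
    """Return 'lon,lat' for one trkpt line; other lines pass through unchanged."""
    return PATTERN.sub(r'\2,\1', line)


sample = [
    '<trkpt lat="38.8577288" lon="-9.0997973"/>',
    '<trkpt lat="38.8576367" lon="-9.1000557"/>',
    '<trkpt lat="38.8575259" lon="-9.1006374"/>',
]

for line in sample:
    print(swap_line(line))
```

Because the pattern requires lat to appear before lon, a line with the attributes in the other order is left untouched.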

score 0 · Accepted Answer · answered Oct 19 '13 at 19:37

Try using Python like so:

python -c 'import re; open("dest", "w").write("\n".join([lat + "," + lon for lat, lon in re.findall("""<trkpt lat="([-0-9\.]+)" lon="([-0-9\.]+)"/>""", open("source").read())]))'

where dest is the path to the output file that receives the comma-separated lat and lon values, and source is the path to the input file containing the XML-style tags. (This is meant for use in a Linux shell.) Note that I've assumed the format of the input tags is very consistent. As written, each line comes out as lat,lon; swap the two names in lat + "," + lon to get the lon,lat order shown in the question.

The regex in there is <trkpt lat="([-0-9\.]+)" lon="([-0-9\.]+)"/>.
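To see what that regex captures, here is a small sketch run against the three sample lines from the question. The names TRKPT, pairs, and lon_lat_lines are illustrative only. It also puts lon first, which is the order the question asks for.

```python
import re

# The regex from the answer, written as a raw string.
TRKPT = re.compile(r'<trkpt lat="([-0-9\.]+)" lon="([-0-9\.]+)"/>')

sample = '''<trkpt lat="38.8577288" lon="-9.0997973"/>
<trkpt lat="38.8576367" lon="-9.1000557"/>
<trkpt lat="38.8575259" lon="-9.1006374"/>'''

# findall returns one (lat, lon) tuple per matching tag, in document order.
pairs = TRKPT.findall(sample)

# Putting lon first gives the order requested in the question.
lon_lat_lines = [lon + ',' + lat for lat, lon in pairs]
print('\n'.join(lon_lat_lines))
```

Each group accepts only digits, dots, and minus signs, so a tag with extra attributes or different spacing would simply not match.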

If you don't have a Linux shell handy, or you'd prefer to run a Python script or work interactively, the same logic reads more clearly as a script:

#!/usr/bin/env python

# use the regex module
import re

# read in the whole input file
with open('source') as in_file:
    text = in_file.read()

# find every (lat, lon) pair; the raw string passes the pattern to re unchanged
matches = re.findall(r'<trkpt lat="([-0-9.]+)" lon="([-0-9.]+)"/>', text)

# make new file lines by combining lat and lon from matches
out_lines = [lat + ',' + lon for lat, lon in matches]

# join the lines and write them to the new file, one pair per line
with open('dest', 'w') as out_file:
    out_file.write('\n'.join(out_lines))

This solution worked great for me. Thanks to all for your help. — pascal, Oct 20 '13 at 08:01
@MatthiasUrlichs OP's document may or may not be XML. The portion shown *appears* XML-compliant, but the rest of the document may or may not be. For example, Apache configuration files have elements that would appear to be XML if shown on their own, but an XML parser wouldn't work on the document as a whole, since the rest of it isn't XML-like at all. — Pi Marillion, Jan 23 '17 at 00:16
