I'm trying to extract the 2 attributes "lat" and "lon" from a file with the following format:

<trkpt lat="38.8577288" lon="-9.0997973"/>
<trkpt lat="38.8576367" lon="-9.1000557"/>
<trkpt lat="38.8575259" lon="-9.1006374"/>
...

and get the following output:

-9.0997973,38.8577288
-9.1000557,38.8576367
-9.1006374,38.8575259

(Yes, the lat/lon pair is inverted on purpose.)

I don't know much about regex, but looking around on the web, this is all I was able to achieve:

grep 'lat="[^"]*"' doc.txt | grep -no 'lat="[^"]*"'

output:
1:lat="38.8577288"
2:lat="38.8576367"
3:lat="38.8575259"

I'm not sure how to proceed from here. Thanks in advance for your help.

Deduplicator
pascal

3 Answers

1

Using bash and xmllint (you shouldn't use regex to parse HTML or XML!)

If you don't have xmllint already, install libxml2.

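# {1..3} covers the three sample lines; widen the range to match your file
for i in {1..3}; do
    # string() returns the attribute value of the i-th trkpt element
    lat=$(xmllint --html --xpath "string(//trkpt[$i]/@lat)" file.xml)
    lon=$(xmllint --html --xpath "string(//trkpt[$i]/@lon)" file.xml)
    echo "$lon,$lat"
# xmllint reads file.xml by name, so the loop needs no stdin redirect;
# stderr is silenced to hide xmllint's parser warnings
done 2>/dev/null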

(Remove --html if your file is a complete, well-formed XML document.)
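
For comparison, the same parser-based idea can be sketched in Python with the standard library's xml.etree.ElementTree. This is only a sketch under one assumption: the input is a bare run of trkpt elements like the sample in the question, with no XML declaration and no namespace. The function name lon_lat_lines and the wrapping root element are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def lon_lat_lines(fragment):
    """Return one "lon,lat" string per trkpt element in the fragment."""
    # The sample has no single root element, so wrap it in a
    # hypothetical <root> to make it well-formed before parsing.
    root = ET.fromstring("<root>" + fragment + "</root>")
    # Read both attributes by name, so their order in the tag is irrelevant.
    return [pt.get("lon") + "," + pt.get("lat") for pt in root.iter("trkpt")]


sample = """<trkpt lat="38.8577288" lon="-9.0997973"/>
<trkpt lat="38.8576367" lon="-9.1000557"/>
<trkpt lat="38.8575259" lon="-9.1006374"/>"""

print("\n".join(lon_lat_lines(sample)))
```

If the file is a complete XML document, parse it directly with ET.parse instead of wrapping it. Namespaced tags would then need the namespace in the element name.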


See RegEx match open tags except XHTML self-contained tags

Community
Gilles Quénot
0

Assuming the attributes always appear in this order (lat, then lon), a single find-and-replace pass is enough.

Find:                           Replace:
.+lat="(.+?)".*lon="(.+?)".+    $2,$1

The two capture groups look for lat and lon in that order and grab what's within the quotes. The leading and trailing .+ consume the rest of the line, so the replacement discards everything except the two captured values.
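
To try the find/replace pair outside an editor, here is a minimal Python sketch of the same substitution. It assumes every line has the shape shown in the question. Python writes backreferences as \2 and \1 rather than $2 and $1, and the names PATTERN and swap_lat_lon are invented for the example.

```python
import re

# Same pattern as the Find column. "." never matches a newline here,
# so each match stays within a single line of the input.
PATTERN = re.compile(r'.+lat="(.+?)".*lon="(.+?)".+')


def swap_lat_lon(text):
    """Replace each matching line with "lon,lat"; other lines are left as-is."""
    return PATTERN.sub(r"\2,\1", text)


sample = """<trkpt lat="38.8577288" lon="-9.0997973"/>
<trkpt lat="38.8576367" lon="-9.1000557"/>
<trkpt lat="38.8575259" lon="-9.1006374"/>"""

print(swap_lat_lon(sample))
```

Lines that do not match the pattern pass through unchanged, so it is worth checking the output for leftover tags.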

0

Try using Python like so:

python -c 'import re; open("dest", "w").write("\n".join([lon + "," + lat for lat, lon in re.findall(r"""<trkpt lat="([-0-9.]+)" lon="([-0-9.]+)"/>""", open("source").read())]))'

where dest is the path to the output file that will contain the comma-separated lon,lat pairs (inverted, as the question asks), and source is the path to the input file containing the XML-style tags. (This is meant for use in a Linux shell.) Note that I've assumed the format of the input tags is very consistent.

The regex in there is <trkpt lat="([-0-9.]+)" lon="([-0-9.]+)"/>.

If you don't have a Linux shell handy, or you'd prefer to run a Python script or work interactively, use the following longer version instead:

#!/usr/bin/env python

# use the regex module
import re

# read in the file
with open('source') as in_file:
    text = in_file.read()

# find (lat, lon) pairs using regex
matches = re.findall(r'<trkpt lat="([-0-9.]+)" lon="([-0-9.]+)"/>', text)

# make new file lines, putting lon before lat as the question asks
out_lines = [lon + ',' + lat for lat, lon in matches]

# output to new file, one pair per line
with open('dest', 'w') as out_file:
    out_file.write('\n'.join(out_lines))
Pi Marillion
  • This solution worked great for me. Thanks to all for your help. – pascal Oct 20 '13 at 08:01
  • Don't use regexps to parse XML! – Matthias Urlichs Jan 22 '17 at 18:58
  • @MatthiasUrlichs OP's document may or may not be XML. The portion shown *appears* XML-compliant, but the rest of the document may or may not be. For example, Apache configuration files have elements that would appear to be XML if shown on their own, but an XML parser wouldn't work on the document as a whole, since the rest of it isn't XML-like at all. – Pi Marillion Jan 23 '17 at 00:16