9

I'm having a lot of trouble converting an XML file to a CSV in Python. I've looked at many forums, tried both lxml and xmlutils.xml2csv, but I can't get it to work. It's GPS data from a Garmin GPS device.

Here's what my XML file looks like, shortened of course:

<?xml version="1.0" encoding="utf-8"?>
<gpx xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="TC2 to GPX11 XSLT stylesheet" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd">
  <trk>
      <name>2013-12-03T21:08:56Z</name>
      <trkseg>
          <trkpt lat="45.4852855" lon="-122.6347885">
              <ele>0.0000000</ele>
              <time>2013-12-03T21:08:56Z</time>
          </trkpt>
          <trkpt lat="45.4852961" lon="-122.6347926">
              <ele>0.0000000</ele>
              <time>2013-12-03T21:09:00Z</time>
          </trkpt>
          <trkpt lat="45.4852982" lon="-122.6347897">
              <ele>0.2000000</ele>
              <time>2013-12-03T21:09:01Z</time>
          </trkpt>
      </trkseg>
  </trk>
</gpx>

There are several trk tags in my massive XML file, but I can manage to separate them out -- they represent different "segments" or trips on the GPS device. All I want is a CSV file that plots something like this:

LAT         LON         TIME         ELE
45.4...     -122.6...   2013-12...   0.00...
...         ...         ...          ...

Here's the code I have so far:

## Call libraries
import csv
from xmlutils.xml2csv import xml2csv

inputs = "myfile.xml"
output = "myfile.csv"

converter = xml2csv(inputs, output)
converter.convert(tag="WHATEVER_GOES_HERE_RENDERS_EMPTY_CSV")

This is another alternative code. It merely outputs a CSV file with no data, just the headers lat and lon.

import csv
import lxml.etree

x = '''
<?xml version="1.0" encoding="utf-8"?>
<gpx xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="TC2 to GPX11 XSLT stylesheet" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd">
<trk>
  <name>2013-12-03T21:08:56Z</name>
  <trkseg>
    <trkpt lat="45.4852855" lon="-122.6347885">
      <ele>0.0000000</ele>
      <time>2013-12-03T21:08:56Z</time>
    </trkpt>
    <trkpt lat="45.4852961" lon="-122.6347926">
      <ele>0.0000000</ele>
      <time>2013-12-03T21:09:00Z</time>
    </trkpt>
    <trkpt lat="45.4852982" lon="-122.6347897">
      <ele>0.2000000</ele>
      <time>2013-12-03T21:09:01Z</time>
    </trkpt>
  </trkseg>
</trk>
</gpx>
'''

with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(('lat', 'lon'))
    root = lxml.etree.fromstring(x)
    for trkpt in root.iter('trkpt'):
        row = trkpt.get('lat'), trkpt.get('lon')
        writer.writerow(row)

How do I do this? Please realize I'm a novice, so a more comprehensive explanation would be super awesome!

oatmilkyway
  • 429
  • 1
  • 6
  • 17
  • And what did you get as the output? Is `xmlutils` not reading the `lat` and `lon` attributes properly? Describe how the output differs from what you expected. A quick glance at the documentation indicates that the tag should probably be `trkpt`. – ChrisP Dec 21 '13 at 00:40
  • When I run the program with `trkpt` as the indicated tag, the program completes without any error messages and produces an empty .csv file. I don't know if xmlutils isn't reading the `lat` or `lon` properly because I don't know how the program works, plus no error message. I expected the output to produce at a minimum `time` and `ele`, but it didn't. – oatmilkyway Dec 21 '13 at 00:47
  • Did I not explain it clearly enough? I don't get how xmlutils or any of these work. The documentation is poor. I can do the example files they provide, but with the tags/items embedded within the elements (i.e., `lat` and `lon` within `trkpt`), I just don't get it. I've been sitting at the computer for about 12 hours on this one stupid point. – oatmilkyway Dec 21 '13 at 01:05

3 Answers3

22

This is a namespaced XML document. Therefore you need to address the nodes using their respective namespaces.

The namespaces used in the document are defined at the top:

xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"
xmlns="http://www.topografix.com/GPX/1/1"

So the first namespace is mapped to the short form tc2, and would be used in an element like <tc2:foobar/>. The last one, which doesn't have a short form after the xmlns, is called the default namespace, and it applies to all elements in the document that don't explicitely use a namespace - so it applies to your <trkpt /> elements as well.

Therefore you would need to write root.iter('{http://www.topografix.com/GPX/1/1}trkpt') to select these elements.

In order to also get time and elevation, you can use trkpt.find() to access these elements below the trkpt node, and then element.text to retrieve those elements' text content (as opposed to attributes like lat and lon). Also, because the time and ele elements also use the default namespace you'll have to use the {namespace}element syntax again to select those nodes.

So you could use something like this:

NS = 'http://www.topografix.com/GPX/1/1'
header = ('lat', 'lon', 'ele', 'time')

with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    root = lxml.etree.fromstring(x)
    for trkpt in root.iter('{%s}trkpt' % NS):
        lat = trkpt.get('lat')
        lon = trkpt.get('lon')
        ele = trkpt.find('{%s}ele' % NS).text
        time = trkpt.find('{%s}time' % NS).text

        row = lat, lon, ele, time
        writer.writerow(row)

For more information on XML namespaces, see the Namespaces section in the lxml tutorial and the Wikipedia article on XML Namespaces. Also see GPS eXchange Format for some details on the .gpx format.

Lukas Graf
  • 30,317
  • 8
  • 77
  • 92
1

Apologies for using already-made tools here, but this did the job with your data :

  1. Convert XML to JSON : http://convertjson.com/xml-to-json.htm
  2. Take that JSON and convert JSON to CSV : https://konklone.io/json/

It worked like a charm with your data.

ele,time,_lat,_lon
0.0000000,2013-12-03T21:08:56Z,45.4852855,-122.6347885
0.0000000,2013-12-03T21:09:00Z,45.4852961,-122.6347926
0.2000000,2013-12-03T21:09:01Z,45.4852982,-122.6347897

So for coding, I reckon XML > JSON > CSV may be a good approach. You many find the relevant scripts pointed to in those links.

Nikhil VJ
  • 5,630
  • 7
  • 34
  • 55
  • after converting the file from xml to json and then using the json to convert to csv. Inside the webpage the data is shown in a dataframe but once I download the csv file, everything is shown only in rows –  Dec 02 '19 at 08:48
  • @ebe are you opening the csv file in simple notepad in windows? It may be not recognizing the line-breaks - typical linux-windows problem. Try opening the csv in excel, LibreOffice Calc or notepad++ or VScode or something. – Nikhil VJ Dec 02 '19 at 11:57
  • I think its because excel 365. I have another question. Do you happen to know any code that does the same as konklone website using python. The website seems to understand the structure of the json and then put it in table form. Im looking for something similar but just in python code form. Either way I upvoted your answer. thank u –  Dec 02 '19 at 12:37
  • @ebe if you're looking for json to csv conversion then search for that - there should be plenty of options. If it's a flat json array (or list of dicts in python terms) then a straighforward way is to convert it to pandas dataframe (`df = pd.DataFrame(dict1)`) and then write out a csv (`df.to_csv('t1.csv')`). If it's more hierarchical though then flattening has to be done first. `flatten_json` is a python package available. – Nikhil VJ Dec 03 '19 at 14:53
1

I wrote gpxcsv exactly for this case, and to handle gpx extension fields that other converters omit.

As easy as:

from gpxcsv import gpxtolist
import pandas as pd

df = pd.DataFrame(
    pxtolist('myfile.gpx'))

for a dataframe, or a command line tool exists to just create a .csv or .json file, preserving as many columns in the trackpoint as it finds using the tags as the column names.

Source code of the project on github.

Marcos
  • 71
  • 3