3

I am working on an XQuery library for getting simple geospatial information from GPS files (it's called GPXQuery and available at GitHub). GPX files often contain tracks of GPS coordinates, and can get quite big. My biggest test file has 20'000 points in it. GPX is very simple:

<gpx version="1.1" xmlns="http://www.topografix.com/GPX/1/1">
  <trk>
    <name>Berkeley Test Walk #1</name>
    <trkseg>
      <trkpt lon="-122.26794633083045" lat="37.878523925319314">
        <ele>78.4000015258789</ele>

There is a long sequence of <trkpt> elements, representing all the recorded GPS coordinates. I want to be able to process at least 100'000, hopefully more.

My first slightly complicated function calculates the distance of a recorded GPS track. The math does not matter here. The problem is that I run into stack issues. For my 20'000 point example, my standard Saxon setup already stops. I am sure that this could be fixed with more generous memory allocation, but I am wondering if something more fundamental may be going on.

My function should qualify for tail recursion optimization, but that's a bit hard to tell and may differ from product to product. Here's the function(s), and they are invoked by gpxquery:trk-distance($GPX/gpx:gpx)[1] to get the distance of the first GPX track in a given GPX document $GPX:

module namespace gpxquery = "https://github.com/dret/GPXQuery";

declare namespace xsd = "http://www.w3.org/2001/XMLSchema";
declare namespace math = "http://www.w3.org/2005/xpath-functions/math";
declare namespace gpx = "http://www.topografix.com/GPX/1/1";
declare variable $gpxquery:earth-radius := 6371000.0;

declare function gpxquery:trk-distance($gpx as element(gpx:gpx))
    as xsd:float*
{
    for $trk in 1 to count($gpx/gpx:trk)
        return sum(gpxquery:trk-distance-recurse($gpx/gpx:trk[$trk]/gpx:trkseg/gpx:trkpt))
};

declare function gpxquery:trk-distance-recurse($trkpts as element(gpx:trkpt)*)
    as xsd:float*
{
    if ( count($trkpts) le 1 )
      then 0
      else (
          gpxquery:distance-between-points($trkpts[1]/@lat, $trkpts[1]/@lon, $trkpts[2]/@lat, $trkpts[2]/@lon) ,
          gpxquery:trk-distance-recurse($trkpts[position() gt 1])
      )
};

declare function gpxquery:distance-between-points($lat1 as xsd:float, $lon1 as xsd:float, $lat2 as xsd:float, $lon2 as xsd:float)
    as xsd:float
{
    let $dlat  := ($lat2 - $lat1) * math:pi() div 180
    let $dlon  := ($lon2 - $lon1) * math:pi() div 180
    let $rlat1 := $lat1 * math:pi() div 180
    let $rlat2 := $lat2 * math:pi() div 180
    let $a     := math:sin($dlat div 2) * math:sin($dlat div 2) + math:sin($dlon div 2) * math:sin($dlon div 2) * math:cos($rlat1) * math:cos($rlat2)
    let $c     := 2 * math:atan2(math:sqrt($a), math:sqrt(1-$a))
    return xsd:float($c * $gpxquery:earth-radius)
};

Is there anything I should/could do differently in terms of code structure and algorithm to avoid these memory issues for bigger files? Or does this look like a reasonable approach for the general problem, and I whoever is using the library simply has to make sure that the runtime environment can deal with the requirements of deeply nested recursive calls?

Any feedback from people working with recursive functions and running into similar issues would be greatly appreciated.

dret
  • 531
  • 3
  • 7
  • Looks good to me. Basically, you answer your question yourself: It depends on the product, it is not always easy to apply tail recursion. I just run it within BaseX and it also does not apply tail recursion. If you are interested in using BaseX I am sure we would try our best to improve BaseX and apply tail recursion as well - Best would be to write to the mailing list if you are interested. – dirkk Dec 21 '15 at 22:34
  • thanks, @dirkk! you nailed the issue regarding tail recursion that's it's not something you can rely on. so what i am wondering is: if i wanted to rewrite this to behave more predictable, is there a recursive way of doing it, or would there be some non-recursive alternative to avoid the stack problems? – dret Dec 21 '15 at 22:48

2 Answers2

4

Saxon doesn't identify this function as tail recursive, because it applies an operator (the comma operator) to the result of the recursion, and any processing applied to the result disqualifies it as a tail call.

Interestingly if you rewrite it as an XSLT named template then it probably will qualify as tail recursive. That's because XSLT named templates (in Saxon) are natively evaluated in "push" mode - they write their results sequentially to an output sequence - which means that the "," operation is in effect implicit. With a bit of effort one could probably devise a similar strategy for functions.

But I'm trying to understand why this recursion is even necessary.

As far as I can see the algorithm you are implementing is to take a sequence $S and compute something along the lines of

f($S[1], $S[2]) + f($S[2], $S[3]) + f($S[3], $S[4]) ...

that is, you're applying a function to successive pairs of adjacent values and then computing the sum of these function applications.

You could write this as

sum(for-each-pair($S, tail($S), $f))

where $f is the function to be applied.

Or in more conventional XQuery 1.0 style you could write something like

sum(
 for $i in 1 to count($S)-1
 return f($S[i], $S[i+1])
)
Michael Kay
  • 156,231
  • 11
  • 92
  • 164
  • 1
    thanks a lot for explaining tail recursion that way. also, you're right about avoiding recursion in this case anyway. the code now looks like shown below, runs much faster, and can probably handle much bigger datasets. – dret Dec 22 '15 at 01:17
  • Please click on the tick-mark next to the question to mark it as accepted, so later visitors can see that the question is solved. – Michael Kay Dec 22 '15 at 09:13
0

the non-recursive version: shorter, faster, and less memory required:

declare function gpxquery:trk-distance($gpx as element(gpx:gpx))
    as xsd:double*
{
    for $trk in 1 to count($gpx/gpx:trk)
    let $trkpts := $gpx/gpx:trk[$trk]/gpx:trkseg/gpx:trkpt
    return sum(
        for $i in 1 to count($trkpts)-1
        return gpxquery:haversine($trkpts[$i]/@lat, $trkpts[$i]/@lon, $trkpts[$i+1]/@lat, $trkpts[$i+1]/@lon)
    )
};

declare function gpxquery:haversine($lat1 as xsd:double, $lon1 as xsd:double, $lat2 as xsd:double, $lon2 as xsd:double)
    as xsd:double
{
    (: This is the Haversine formula as described by http://stackoverflow.com/questions/365826/calculate-distance-between-2-gps-coordinates :)
    let $dlat  := ($lat2 - $lat1) * math:pi() div 180
    let $dlon  := ($lon2 - $lon1) * math:pi() div 180
    let $rlat1 := $lat1 * math:pi() div 180
    let $rlat2 := $lat2 * math:pi() div 180
    let $a     := math:sin($dlat div 2) * math:sin($dlat div 2) + math:sin($dlon div 2) * math:sin($dlon div 2) * math:cos($rlat1) * math:cos($rlat2)
    let $c     := 2 * math:atan2(math:sqrt($a), math:sqrt(1-$a))
    return xsd:double($c * 6371000.0)
};
dret
  • 531
  • 3
  • 7