0

I have a KML file with a list of destinations along with their coordinates. There are 40+ destinations in this file. I am trying to parse the coordinates from it. In the file they appear between "<coordinates>" and "</coordinates>", so finding them shouldn't be the hard part, but I can't seem to get a full result. What I mean is, it cuts off -94. (or any negative float) from the beginning and prints the rest of it.

#!/usr/bin/python3.5

import re

def main():

    results = []
    with open("file.kml","r") as f:
        contents = f.readlines()

    if f.mode == 'r':
        print("reading file...")
        for line in contents:
            coords_match = re.search(r"(<coordinates>)[+-]?\d+\.\d+|\d+\,\-?\d+\.\d+|\d+(?=</coordinates)",line)
            if coords_match:
                coords_matchh = coords_match.group()
                print(coords_matchh)

Here are some of the results I get:

3502969,38.8555497
7662462,38.8583916
6280323,38.8866337
3655059,39.3983001

This is the format in the file, if it makes a difference:

<coordinates>
  -94.5944738,39.031411,0
</coordinates>

If I modify this line and remove (<coordinates>) from the beginning of the pattern:

coords_match = re.search(r"[+-]?\d+\.\d+|\d+\,\-?\d+\.\d+|\d+(?=</coordinates)",line)

these are the results I get instead:

-94.7662462
-94.6280323
-94.3655059

This is essentially the result I want:

-94.7662462,38.8583916
-94.6280323,38.8866337
-94.3655059,39.3983001

andyADD
  • Using regular expressions to parse XML is not a good idea. Have you considered using an actual XML parser? [Here is the one built in to Python](https://docs.python.org/3.7/library/xml.etree.elementtree.html) – Kendas Aug 11 '19 at 07:19
  • see https://stackoverflow.com/questions/13712132/extract-coordinates-from-kml-batchgeo-file-with-python – georg Aug 11 '19 at 07:19

3 Answers

1

While using an actual parser is the way to go, as @Kendas suggested in the comments, you could try findall instead of search:

>>> import re
>>> s = """<coordinates>
...   -94.5944738,39.031411,0
... </coordinates>"""
>>> re.findall(r'[+-]?\d+\.\d+|\d+\,\-?\d+\.\d+|\d+(?=</coordinates)', s)
['-94.5944738', '39.031411']
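
To see why the original per-line search drops the leading "-94.", here is a small sketch that runs both patterns from the question on a single line of the file. The sample line is taken from the question; the variable names are only illustrative. Because the file is read line by line, the "<coordinates>" tag never sits on the same line as the numbers, so the first alternative cannot match and the second alternative picks up the digits after the decimal point.

```python
import re

# One line of the file as it appears when read with readlines()
line = "  -94.5944738,39.031411,0"

with_tag = r"(<coordinates>)[+-]?\d+\.\d+|\d+\,\-?\d+\.\d+|\d+(?=</coordinates)"
without_tag = r"[+-]?\d+\.\d+|\d+\,\-?\d+\.\d+|\d+(?=</coordinates)"

# The tag is not on this line, so only the second alternative can match,
# and it starts at the first run of digits that is followed by a comma.
truncated = re.search(with_tag, line).group()

# Without the tag, the first alternative matches the longitude and
# search() stops after that single match.
first_only = re.search(without_tag, line).group()

# findall() keeps going and returns both numbers, which can be joined.
joined = ",".join(re.findall(without_tag, line))

print(truncated)
print(first_only)
print(joined)
```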
help-ukraine-now
1

You can also use BeautifulSoup to get the coordinates, since this is XML/HTML parsing.

from bs4 import BeautifulSoup

text = """<coordinates>
              -94.5944738,39.031411,0
            </coordinates>
            <coordinates>
              -94.59434738,39.032311,0
            </coordinates>
            <coordinates>
              -94.523444738,39.0342411,0
            </coordinates>"""
soup = BeautifulSoup(text, "lxml")
coordinates = soup.find_all('coordinates')

for tag in coordinates:
    # Drop the trailing altitude field (",0") and keep "lon,lat"
    print(tag.text.strip().rsplit(',', 1)[0])

Output:

-94.5944738,39.031411
-94.59434738,39.032311
-94.523444738,39.0342411
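
If installing a third-party package is not an option, the same idea can be sketched with the standard library's xml.etree.ElementTree, which was suggested in the comments. This is a minimal sketch under some assumptions: the sample document, the placemark names, and the helper name extract_lon_lat are made up for illustration, and each <coordinates> element is assumed to hold a single "lon,lat,alt" tuple (as for a Point). The tag comparison ignores the namespace, since the namespace used by the actual file is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real file may differ in structure.
KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Stop A</name>
      <Point><coordinates>
        -94.5944738,39.031411,0
      </coordinates></Point>
    </Placemark>
    <Placemark>
      <name>Stop B</name>
      <Point><coordinates>
        -94.59434738,39.032311,0
      </coordinates></Point>
    </Placemark>
  </Document>
</kml>"""


def extract_lon_lat(kml_text):
    """Return a list of (longitude, latitude) float tuples."""
    root = ET.fromstring(kml_text)
    pairs = []
    for elem in root.iter():
        # Tags look like "{namespace}coordinates"; compare the local name only.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "coordinates" and elem.text:
            lon, lat = elem.text.strip().split(",")[:2]
            pairs.append((float(lon), float(lat)))
    return pairs


pairs = extract_lon_lat(KML)
for lon, lat in pairs:
    print(f"{lon},{lat}")
```

For a file on disk, ET.parse("file.kml").getroot() could replace the ET.fromstring call.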
Ankur Sinha
1

An XML parser is overkill if you just want to extract simple and well-delimited data.

The main thing is to use a simpler regular expression, and to search over the whole file. Focus on capturing everything between tags:

import re

# Read the whole file at once so a match can span several lines
with open("file.kml","r") as f:
    contents = f.read()
coords_match = re.findall(r'<coordinates>(.*?)</coordinates>', contents, re.DOTALL)

This will return a list of matches. Every item in this list will look something like this:

'\n  -94.5944738,39.031411,0\n  '

So for every item, you need to:

  1. strip off the whitespace
  2. right split on the last ","
  3. discard the second result.

So you do this:

results = [c.strip().rsplit(',', 1)[0] for c in coords_match]

That gives you a list of the wanted strings.

If you actually want to use the numbers, I would convert them to floats (using a nested comprehension):

results = [tuple(float(f) for f in c.strip().split(',')[:2]) for c in coords_match]

That will give you a list of 2-tuples of floats.

A demonstration in IPython:

In [1]: import re                                                                                        

In [2]: text = """<coordinates> 
   ...:               -94.5944738,39.031411,0 
   ...:             </coordinates> 
   ...:             <coordinates> 
   ...:               -94.59434738,39.032311,0 
   ...:             </coordinates> 
   ...:             <coordinates> 
   ...:               -94.523444738,39.0342411,0 
   ...:             </coordinates>"""                                                                    

In [3]: coords_match = re.findall(r'<coordinates>(.*?)</coordinates>', text, re.DOTALL)                  
Out[3]: 
['\n              -94.5944738,39.031411,0\n            ',
 '\n              -94.59434738,39.032311,0\n            ',
 '\n              -94.523444738,39.0342411,0\n            ']

In [4]: results1 = [c.strip().rsplit(',', 1)[0] for c in coords_match]                                   
Out[4]: ['-94.5944738,39.031411', '-94.59434738,39.032311', '-94.523444738,39.0342411']

In [5]: results2 = [tuple(float(f) for f in  c.strip().split(',')[:2]) for c in coords_match]            
Out[5]: 
[(-94.5944738, 39.031411),
 (-94.59434738, 39.032311),
 (-94.523444738, 39.0342411)]

Edit: If you want to save the data as JSON, then it is probably best to use the conversion to floats, because that result can be converted directly to JSON:

In [6]: import json

In [7]: print(json.dumps(results2, indent=2))                                                            
[
  [
    -94.5944738,
    39.031411
  ],
  [
    -94.59434738,
    39.032311
  ],
  [
    -94.523444738,
    39.0342411
  ]
]
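
If a CSV file is the target rather than JSON, the same list of 2-tuples can be written with the standard csv module. This is only a sketch: the helper name coords_to_csv and the column headers are assumptions, and the sample data is the first two tuples from the demonstration above. It writes to an in-memory buffer so the result is easy to inspect.

```python
import csv
import io


def coords_to_csv(pairs):
    """Serialize (longitude, latitude) tuples as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["longitude", "latitude"])  # assumed column names
    writer.writerows(pairs)
    return buf.getvalue()


results2 = [(-94.5944738, 39.031411), (-94.59434738, 39.032311)]
csv_text = coords_to_csv(results2)
print(csv_text)
```

To write to disk instead, the same writer calls can be used on a file opened with open("coords.csv", "w", newline=""), where the file name is again just a placeholder.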
Roland Smith
  • Thank you for your answer. I actually had a similar RE with .*? but I didn't include the (), so I was heading in the right direction. I actually want to put the destination along with the coordinates into a CSV file so it's easier to grab what I need based on "," position. I won't be using the actual coordinates in Python; it's a project that will put the coordinates into a JavaScript file via Python. – andyADD Aug 11 '19 at 13:38
  • @andyADD If you want to put it in a javascript file, you can convert the list of 2-tuples directly to JSON. See updated answer. – Roland Smith Aug 11 '19 at 18:04