
I have a bunch of large XML documents that contain geospatial information (KML, if anyone is interested) that are arranged in the following way:

<Placemark><SimpleData name="species">Unique number</SimpleData> ... coordinates</Placemark>

I would like to list all species IDs for which the total number of characters between the Placemark tags exceeds a given threshold - 1,000,000. The following AWK script indicates which lines are breaking the limit:

for kmlfile in *.kml; do
    echo "Processing $kmlfile"
    awk -- '/<Placemark>/,/<\/Placemark>/ { if (length() > 10000) { printf("Line %d has %d characters\n", NR, length()); } }' "$kmlfile"
done

but I do not know how to make it display the species ID instead of the line number. Any ideas how to do it in AWK, Python, or anything else to your liking?

Here is a snippet of what the document looks like:

<Document xmlns="http://www.opengis.net/kml/2.2">
    <Folder><name>Export_Output02</name>
        <Placemark>
            <Style><LineStyle><color>ff0000ff</color></LineStyle><PolyStyle><fill>0</fill></PolyStyle></Style>
            <ExtendedData><SchemaData schemaUrl="#Export_Output02">
                <SimpleData name="species">1312</SimpleData>
                <SimpleData name="area">7848012</SimpleData>
                <SimpleData name="irrep_area">0.00000012742</SimpleData>
                <SimpleData name="groupID">2</SimpleData>
            </SchemaData></ExtendedData>
            <MultiGeometry>
                <Polygon>
                    <outerBoundaryIs>
                        <LinearRing>
                            <coordinates>-57.843052746056827,-33.032934004012787 -57.825312079170494,-33.089724736921667 -57.888494029914156,-33.073777852969904 -57.843052746056827,-33.032934004012787</coordinates>
                        </LinearRing>
                    </outerBoundaryIs>
                </Polygon>
                <Polygon>
                    <outerBoundaryIs>
                        <LinearRing>
                            <coordinates>-57.635769389832561,-33.032934004012787 -57.618028722946228,-33.089724736921667 -57.681210673689904,-33.073777852969904 -57.635769389832561,-33.032934004012787</coordinates>
                        </LinearRing>
                    </outerBoundaryIs>
                </Polygon>
            </MultiGeometry>
        </Placemark>
    </Folder>
</Document>

And an example of a whole file: link to GDrive.

[Edit] I should add that this particular limit on the number of characters in a "Placemark" is imposed by Google Fusion Tables. Each Placemark describes a particular feature on a map, and there can be many of those on a map. If any Placemark breaks the 1M character limit, then the conversion to a Fusion Table will fail.

Lukasz Tracewski
    Why are you counting characters, when your data is in XML? Are arbitrary characters relevant in any way? Can't you use XML extraction and test for the contents of an element, for example? With an XML method it would be quite easy to count the number of characters in a certain element. In the case you are actually selecting several files, why not test file size? (it seems to me that the difference between 10000 and the few characters outside the `` is very small and would be irrelevant.) Perhaps you should describe what you are trying to achieve. – helderdarocha Jun 15 '14 at 12:38
  • Thanks @helderdarocha for looking again into my questions. Google fusion tables are not accepting "Placemarks" longer than one million of characters. For this reason I need to find all Placemarks that are exceeding this limit and perform some manual preprocessing on them. Since I have 6000 files, I don't want to go one by one and check which 'specie' violated the limit. How can I extract "Placemark"? I tried following examples like this one [link](http://stackoverflow.com/questions/10475654/extract-elements-from-xml-file-using-python), but I am always getting an empty list. – Lukasz Tracewski Jun 15 '14 at 12:56
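Regarding the empty list from the linked ElementTree examples: the most likely culprit is that KML declares a default namespace (`http://www.opengis.net/kml/2.2`), so bare tag names like `'Placemark'` match nothing. A minimal sketch of a namespace-aware lookup, using a trimmed copy of the snippet from the question:

```python
import xml.etree.ElementTree as ET

# KML elements live in this namespace; without it, findall('Placemark')
# returns an empty list.
KML_NS = {'kml': 'http://www.opengis.net/kml/2.2'}

sample = '''<Document xmlns="http://www.opengis.net/kml/2.2">
  <Folder><name>Export_Output02</name>
    <Placemark>
      <ExtendedData><SchemaData schemaUrl="#Export_Output02">
        <SimpleData name="species">1312</SimpleData>
      </SchemaData></ExtendedData>
    </Placemark>
  </Folder>
</Document>'''

root = ET.fromstring(sample)
for placemark in root.findall('.//kml:Placemark', KML_NS):
    for simple in placemark.findall('.//kml:SimpleData', KML_NS):
        if simple.get('name') == 'species':
            print(simple.text)  # prints 1312
```

The same namespace prefix trick works with `ElementTree.parse()` on a file.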

1 Answer


I came up with a crude Python script that does the job. Surely it is not the nicest approach, so if you have a better one I will be very happy to see it. Besides, the way I extract the species ID is quite ugly - suggestions on how to make it prettier are also most welcome.

import glob
from collections import namedtuple
Placemark = namedtuple('Placemark', 'found no_characters specie_id end_idx')


def GetPlacemark(input_file, start):
    # Search for the full tags, so the 'Placemark' inside the previous
    # closing tag is not matched again on the next pass.
    start_idx = input_file.find('<Placemark>', start)
    end_idx = input_file.find('</Placemark>', start)
    if start_idx == -1 or end_idx == -1:
        return Placemark(False, -1, -1, -1)
    no_characters = end_idx - start_idx
    specie_name_idx = input_file.find('species', start_idx, end_idx)
    specie_id_start_idx = input_file.find('>', specie_name_idx)
    specie_id_end_idx = input_file.find('<', specie_id_start_idx)
    # Slice the argument, not the global 'data'
    specie_id = int(input_file[specie_id_start_idx+1:specie_id_end_idx])
    return Placemark(True, no_characters, specie_id, end_idx)

path_to_kml = glob.glob('*.kml')
for kml_file in path_to_kml:
    print 'Processing ' + kml_file
    with open(kml_file, "r") as myfile:
        data = myfile.read().replace('\n', '')

    placemarks = []
    current_idx = 0

    while True:
        mark = GetPlacemark(data, current_idx)
        if mark.found:
            placemarks.append(mark)
            current_idx = mark.end_idx + 1
        else:
            break

    for placemark in placemarks:
        if placemark.no_characters > 1000000:
            print 'Specie %d has %d characters' % (placemark.specie_id, placemark.no_characters)
    print 'Done\n'
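If the string searching ever proves too fragile, a namespace-aware sketch with the standard library's `xml.etree.ElementTree` might look like the following (untested against the real 6000 files; `iterparse` streams each document, so large KMLs stay cheap on memory, and the serialized length from `tostring` is only an approximation of the raw character count):

```python
import glob
import xml.etree.ElementTree as ET

# Clark-notation prefix for the KML 2.2 default namespace
KML = '{http://www.opengis.net/kml/2.2}'

def oversized_placemarks(kml_path, limit=1000000):
    """Yield (species_id, length) for each Placemark whose serialized
    size exceeds the limit."""
    # iterparse fires an 'end' event once a Placemark subtree is complete
    for event, elem in ET.iterparse(kml_path):
        if elem.tag == KML + 'Placemark':
            length = len(ET.tostring(elem))
            if length > limit:
                species = None
                for simple in elem.iter(KML + 'SimpleData'):
                    if simple.get('name') == 'species':
                        species = simple.text
                yield species, length
            elem.clear()  # free the subtree we just processed

if __name__ == '__main__':
    for kml_file in glob.glob('*.kml'):
        print('Processing ' + kml_file)
        for species, length in oversized_placemarks(kml_file):
            print('Specie %s has %d characters' % (species, length))
```

This also sidesteps the manual index arithmetic for the species ID entirely.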
Lukasz Tracewski