I have a bunch of large XML documents containing geospatial information (KML, if anyone is interested), arranged in the following way:
<Placemark><SimpleData name="species">Unique number</SimpleData> ... coordinates</Placemark>
I would like to list all species IDs for which the total number of characters between the Placemark tags exceeds a given threshold of 1,000,000. The following AWK script reports which lines inside a Placemark exceed a length limit:
for kmlfile in *.kml; do
echo "Processing $kmlfile"
awk -- '/<Placemark>/,/<\/Placemark>/ { if (length() > 10000) { printf("Line %d has %d characters\n", NR, length()); } }' "$kmlfile"
done
but I do not know how to make it display the species ID instead of the line number. Any ideas how to do this in AWK, Python, or anything else to your liking?
Here is a snippet showing what the document looks like:
<Document xmlns="http://www.opengis.net/kml/2.2">
<Folder><name>Export_Output02</name>
<Placemark>
<Style><LineStyle><color>ff0000ff</color></LineStyle><PolyStyle><fill>0</fill></PolyStyle></Style>
<ExtendedData><SchemaData schemaUrl="#Export_Output02">
<SimpleData name="species">1312</SimpleData>
<SimpleData name="area">7848012</SimpleData>
<SimpleData name="irrep_area">0.00000012742</SimpleData>
<SimpleData name="groupID">2</SimpleData>
</SchemaData></ExtendedData>
<MultiGeometry>
<Polygon>
<outerBoundaryIs>
<LinearRing>
<coordinates>-57.843052746056827,-33.032934004012787 -57.825312079170494,-33.089724736921667 -57.888494029914156,-33.073777852969904 -57.843052746056827,-33.032934004012787</coordinates>
</LinearRing>
</outerBoundaryIs>
</Polygon>
<Polygon>
<outerBoundaryIs>
<LinearRing>
<coordinates>-57.635769389832561,-33.032934004012787 -57.618028722946228,-33.089724736921667 -57.681210673689904,-33.073777852969904 -57.635769389832561,-33.032934004012787</coordinates>
</LinearRing>
</outerBoundaryIs>
</Polygon>
</MultiGeometry>
</Placemark>
</Folder>
</Document>
And an example of a whole file: link to GDrive.
[Edit] I should add that this particular limit on the number of characters in a "Placemark" is imposed by Google Fusion Tables. Each Placemark describes a particular feature on a map, and there can be many of them on one map. If any Placemark exceeds the 1M character limit, the conversion to a fusion table will fail.
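To make the check concrete, here is a rough Python sketch of what I am after: measure the text between each pair of Placemark tags and report the species ID when it is over the limit. The function name `oversized_species` and the two regular expressions are my own illustration. It assumes Placemark elements are never nested and have no attributes (as in the sample above), that each file fits in memory, and that the limit is counted in characters of the raw text between the tags, which I have not verified against Fusion Tables.

```python
import re
import sys
from pathlib import Path

# Assumption: <Placemark> has no attributes and is never nested,
# as in the sample document above.
PLACEMARK_RE = re.compile(r"<Placemark>(.*?)</Placemark>", re.DOTALL)
SPECIES_RE = re.compile(r'<SimpleData name="species">([^<]*)</SimpleData>')


def oversized_species(kml_text, limit=1_000_000):
    """Return (species_id, char_count) for each Placemark longer than limit."""
    results = []
    for match in PLACEMARK_RE.finditer(kml_text):
        body = match.group(1)  # raw text between the Placemark tags
        if len(body) > limit:
            species = SPECIES_RE.search(body)
            species_id = species.group(1).strip() if species else "unknown"
            results.append((species_id, len(body)))
    return results


if __name__ == "__main__":
    # Each command-line argument is treated as the path of a KML file.
    for name in sys.argv[1:]:
        text = Path(name).read_text(encoding="utf-8")
        for species_id, size in oversized_species(text):
            print(f"{name}: species {species_id} has {size} characters")
```

Saved under a hypothetical name such as find_big_placemarks.py, it would be run as `python find_big_placemarks.py *.kml`. Matching on the raw text rather than parsing the XML keeps the count close to what is actually in the file, at the cost of breaking if the tags are ever written differently.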