I've got a huge list of addresses in a KML file, and I'm struggling to strip out everything except the content inside the <address></address> tags.

Here is a sample of the XML:

    <Placemark>
        <styleUrl>#icon-ci-1</styleUrl>
        <name>PIGGLY WIGGLY COOGLE #276 B/C </name>
        <ExtendedData>
            <Data name='Address'>
                <value>309 E OAK ST MCRAE GA31055   </value>
            </Data>
        </ExtendedData>
        <description><![CDATA[Address: 309 E OAK ST MCRAE GA31055   ]]></description>
        <address>309 E OAK ST MCRAE GA31055   </address>
    </Placemark>
    <Placemark>
        <styleUrl>#icon-ci-1</styleUrl>
        <name>THE CORNER STORE INC          </name>
        <ExtendedData>
            <Data name='Address'>
                <value>1998 DAYTON BLVD CHATTANOOGA TN37415   </value>
            </Data>
        </ExtendedData>
        <description><![CDATA[Address: 1998 DAYTON BLVD CHATTANOOGA TN37415   ]]></description>
        <address>1998 DAYTON BLVD CHATTANOOGA TN37415   </address>
    </Placemark>
    <Placemark>
        <styleUrl>#icon-ci-1</styleUrl>
        <name>KAMBOI #2                     </name>
        <ExtendedData>
            <Data name='Address'>
                <value>4901 BONNY OAKS DR CHATTANOOGA TN37416   </value>
            </Data>
        </ExtendedData>
        <description><![CDATA[Address: 4901 BONNY OAKS DR CHATTANOOGA TN37416   ]]></description>
        <address>4901 BONNY OAKS DR CHATTANOOGA TN37416   </address>
    </Placemark>

Does anyone know how a regex could be used to extract the data in this form?

        309 E OAK ST MCRAE GA31055
        1998 DAYTON BLVD CHATTANOOGA TN37415
        4901 BONNY OAKS DR CHATTANOOGA TN37416

Thank you in advance!

WebMW

1 Answer

Regexen aren't the tool of choice for this job. You might get by with a powerful regex engine (e.g. the one in Perl or PHP), but you will most likely fail on edge cases, and the solution is guaranteed to be a nightmare to maintain.
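To make the edge-case point concrete, here is a small Python sketch comparing two regexes with a real XML parser. The sample string is made up for the illustration (the attribute and the `&amp;` entity do not appear in the question's data); it only shows the kind of input that tends to trip a pattern up.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: the tag carries an attribute and the text an entity.
SAMPLE = (
    "<root><Placemark>"
    '<address id="a1">12 MAIN ST &amp; 3RD AVE</address>'
    "</Placemark></root>"
)

# Naive pattern: only matches a bare <address> tag, so it finds nothing here.
naive = re.findall(r"<address>(.*?)</address>", SAMPLE)

# Looser pattern: tolerates attributes, but returns the entity undecoded.
loose = re.findall(r"<address[^>]*>(.*?)</address>", SAMPLE)

# An XML parser copes with the attribute and decodes the entity.
parsed = [el.text for el in ET.fromstring(SAMPLE).iter("address")]

print(naive)   # []
print(loose)   # ['12 MAIN ST &amp; 3RD AVE']
print(parsed)  # ['12 MAIN ST & 3RD AVE']
```

Each such case can be patched into the pattern, but every patch makes it longer and harder to maintain.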

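Better to use the command-line tools XMLStarlet and sed. Depending on the installation, the XMLStarlet binary is called xml or xmlstarlet. The XPath below assumes the Placemark elements sit directly under a root element named root; adjust it to match your file: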

    xml sel -t -c "/root/Placemark/address" in.xml >temp.xml
    sed 's%</address>%\n%g;s%<address>%%g' temp.xml > out.xml

Explanation:

  • The first command extracts exactly the address elements from the XML and writes them to the temp file.
  • The second command deletes the address tags, replacing each closing tag with a newline so that every address ends up on its own line.
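For comparison, the same two steps (select the address elements, then emit one per line) can be sketched in Python with the standard library only. This is not part of the commands above: `extract_addresses` and `SAMPLE` are names made up for the illustration, and the `<root>` wrapper is an assumption that mirrors the XPath used above.

```python
import xml.etree.ElementTree as ET


def extract_addresses(xml_text):
    """Return the stripped text of every <address> element, in document order."""
    root = ET.fromstring(xml_text)
    addresses = []
    for element in root.iter():
        # Drop a namespace prefix such as "{http://www.opengis.net/kml/2.2}".
        local_name = element.tag.rsplit("}", 1)[-1]
        text = (element.text or "").strip()
        if local_name == "address" and text:
            addresses.append(text)
    return addresses


# Assumed input: two placemarks from the question, wrapped in <root>.
SAMPLE = """<root>
    <Placemark>
        <name>PIGGLY WIGGLY COOGLE #276 B/C </name>
        <address>309 E OAK ST MCRAE GA31055   </address>
    </Placemark>
    <Placemark>
        <name>THE CORNER STORE INC          </name>
        <address>1998 DAYTON BLVD CHATTANOOGA TN37415   </address>
    </Placemark>
</root>"""

for line in extract_addresses(SAMPLE):
    print(line)
```

Matching on the local tag name rather than a fixed path means the sketch does not care how deeply the Placemark elements are nested or whether the file declares a default namespace.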

Alternative solution:

Basically, XMLStarlet is a set of convenience shorthands for processing XML files by means of XSLT (this remark isn't meant to diminish its author's work at all!). Therefore the first step can be replaced by running an XSLT processor on the input file using the following XSLT:

    <?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output omit-xml-declaration="yes" indent="no"/>
        <xsl:template match="/">
            <xsl:copy-of select="/root/Placemark/address"/>
        </xsl:template>
    </xsl:stylesheet>

There are a number of XSLT processors available (cf. this SO question). On most Linux distributions, xsltproc should be available out of the box.

collapsar
  • Awesome! Thank you for the detailed answer. Perhaps the easiest way I just found is to convert the KML to XML and convert the XML to CSV using an online generator. Thank you for the advice and explanations for more solutions. – WebMW Feb 12 '15 at 22:26