Updates after sample code.

Solution: as provided by BeniBela. He figured out what I failed to make clear: it has to be command line, not necessarily regex. He offered up this solution:

xpath -e '//Placemark[contains(description, "Iron")]'

as promised:

       |
      ( )
     /   \
    _______
   |   _   |
   |  | |  |  All must enter and pay homage! (Shrine of BeniBela)

Problem: I need some form of command line regex to accomplish the following: in one file containing a set of Placemarks, detect the Placemarks which include a keyword (in this case Iron) in a contained CDATA section, without grabbing Placemarks which do not have the keyword. (All data from <Placemark> to </Placemark> needs to be captured.)

Explanation:

Two code samples are given below, one showing three full placemarks, two of which are useless to me, the third of which I want. The second code sample shows just the one I am interested in.

I need to extract the valid Placemark from the data file (which contains hundreds of placemarks) and append it into another file. I will then merge this file into a properly formatted KML later. The data sets are from the US Geological Survey and are very large.

The idea here is to recover placemarks for mines which are extracting a given kind of Ore (Iron for this example), and create a specialized KML (Keyhole Markup Language) file for display in a Google Earth type application.

sample1 (Multiple data with one valid entry):

<Placemark>
<name>
Las Antos Prospect</name>
<Snippet>
Record 10005251</Snippet>
<description>
<![CDATA[<p>
Record <a href="http://mrdata.usgs.gov/mrds/show.php?labno=10005251">
10005251</a>
 of the <a href="http://mrdata.usgs.gov/mrds/">
Mineral Resources Data System</a>
</p>
<table border='1' padding='3' cellspacing='0'>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
oper_type</th>
<td>
Unknown</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
dev_stat</th>
<td>
Occurrence</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
ore</th>
<td>
Limestone</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
model</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod1</th>
<td>
Limestone, General</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod2</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod3</th>
<td>
</td>
</tr>
</table>
]]>
</description>
<styleUrl>
#defaultStyleMap</styleUrl>
<Point>
<altitudeMode>
relativeToGround</altitudeMode>
<coordinates>
-64.88273,-24.87527,0</coordinates>
</Point>
</Placemark>
<Placemark>
<name>
Unnamed Occurence</name>
<Snippet>
Record 10005252</Snippet>
<description>
<![CDATA[<p>
Record <a href="http://mrdata.usgs.gov/mrds/show.php?labno=10005252">
10005252</a>
 of the <a href="http://mrdata.usgs.gov/mrds/">
Mineral Resources Data System</a>
</p>
<table border='1' padding='3' cellspacing='0'>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
oper_type</th>
<td>
Unknown</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
dev_stat</th>
<td>
Occurrence</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
ore</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
model</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod1</th>
<td>
Iron</td>                        ######################Iron here makes it valid
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod2</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod3</th>
<td>
</td>
</tr>
</table>
]]>
</description>
<styleUrl>
#defaultStyleMap</styleUrl>
<Point>
<altitudeMode>
relativeToGround</altitudeMode>
<coordinates>
-64.81607,-24.67527,0</coordinates>
</Point>
</Placemark>
<Placemark>
<name>
Merced I  Quarry</name>
<Snippet>
Record 10005254</Snippet>
<description>
<![CDATA[<p>
Record <a href="http://mrdata.usgs.gov/mrds/show.php?labno=10005254">
10005254</a>
 of the <a href="http://mrdata.usgs.gov/mrds/">
Mineral Resources Data System</a>
</p>
<table border='1' padding='3' cellspacing='0'>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
oper_type</th>
<td>
Unknown</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
dev_stat</th>
<td>
Producer</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
ore</th>
<td>
Limestone</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
model</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod1</th>
<td>
Limestone, General</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod2</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod3</th>
<td>
</td>
</tr>
</table>
]]>
</description>
<styleUrl>
#ProducerStyleMap</styleUrl>
<Point>
<altitudeMode>
relativeToGround</altitudeMode>
<coordinates>
-65.46052,-24.9586,0</coordinates>
</Point>
</Placemark>

The above sample contains two Placemarks which I have no use for, bracketing one which I need to extract.

Sample 2 (Showing just a 'valid' entry): (The capture would need to grab all of this)

<Placemark>
<name>
Unnamed Occurence</name>
<Snippet>
Record 10005252</Snippet>
<description>
<![CDATA[<p>
Record <a href="http://mrdata.usgs.gov/mrds/show.php?labno=10005252">
10005252</a>
 of the <a href="http://mrdata.usgs.gov/mrds/">
Mineral Resources Data System</a>
</p>
<table border='1' padding='3' cellspacing='0'>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
oper_type</th>
<td>
Unknown</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
dev_stat</th>
<td>
Occurrence</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
ore</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
model</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod1</th>
<td>
Iron</td>                        ######################Iron here makes it valid
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod2</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod3</th>
<td>
</td>
</tr>
</table>
]]>
</description>
<styleUrl>
#defaultStyleMap</styleUrl>
<Point>
<altitudeMode>
relativeToGround</altitudeMode>
<coordinates>
-64.81607,-24.67527,0</coordinates>
</Point>
</Placemark>

Update 1:

I got this to work in a regex tester, but I am still working on how to get it into grep et al.

<Placemark>\n<name>\n.*</name>\n<Snippet>\n.*\n<description>\n(?:(?:.*\n){48}.*Iron.*\n|(?:.*\n){41}.*Iron.*\n|(?:.*\n){35}.*Iron.*\n)(?:.*\n){3,16}\]\]>\n</description>\n(?:.*\n){8,12}</Placemark>
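The pattern above depends on fixed line counts, which is fragile when fields are missing or vary. A looser alternative, sketched here in Python rather than grep, is to match one complete block at a time with a non-greedy pattern and then keep only the blocks that mention the keyword. The names `PLACEMARK_RE` and `extract_placemarks` and the shortened sample data are made up for illustration, and this assumes Placemarks are never nested and that the keyword does not appear elsewhere in the block (for example in the name).

```python
import re

# Non-greedy: each match stops at the first closing tag, so one match
# never runs across several Placemarks. DOTALL lets "." match newlines.
PLACEMARK_RE = re.compile(r"<Placemark>.*?</Placemark>", re.DOTALL)


def extract_placemarks(text, keyword):
    """Return every complete Placemark block that contains keyword."""
    return [block for block in PLACEMARK_RE.findall(text) if keyword in block]


# Shortened stand-in for the USGS data (hypothetical, for illustration only).
sample = """<Placemark>
<name>
Las Antos Prospect</name>
<description>
<![CDATA[<td>
Limestone</td>
]]>
</description>
</Placemark>
<Placemark>
<name>
Unnamed Occurence</name>
<description>
<![CDATA[<td>
Iron</td>
]]>
</description>
</Placemark>
"""

hits = extract_placemarks(sample, "Iron")
print("\n".join(hits))
```

Because the blocks are copied as text rather than parsed, the CDATA sections come through unchanged, which is convenient when the result is appended to another file.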
Jase
  • For those who are concerned about such things, this data is from a database released Public Domain by the United States Geological Survey, and is an exact copy paste duplication (not including the leading four spaces required for proper display) – Jase Oct 04 '13 at 06:41
  • And what have you tried yet to solve your problem? – Zsolt Botykai Oct 04 '13 at 06:58
  • LOL...I've been pounding my head against this for two days using varieties of regexes. Everything I have tried so far has 'grabbed' the first occurrence of `<Placemark>` all the way through to the word Iron. I have yet to figure out how to get a regex that grabs only the `<Placemark>` which contains the word 'Iron'. If I can solve that I am halfway home. The biggest issue is that the tags (both XML and HTML) have differing content in many places due to missing or variant data in the original USGS database. I should add that I am but a babe when it comes to regexes. – Jase Oct 04 '13 at 08:22
  • [You can't parse XML with regex](http://stackoverflow.com/a/1732454/7552) – glenn jackman Oct 04 '13 at 10:46
  • @glenn: Prior to BeniBela's solution, I had the answer nearly worked out using regexes. You mistake 'parse' for 'copy'. No need for me to parse it; I don't care what it contains, or if it is formed correctly, I only care about what it pertains to. – Jase Oct 04 '13 at 11:40

1 Answer

That is trivial with XPath instead of regex:

/Placemark[contains(description, "Iron")]

(or /*/Placemark[contains(description, "Iron")] if your xml contains a (required) root element)
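For anyone without an XPath command line tool at hand, the same filter can be sketched with Python's standard library. ElementTree's path support has no `contains()`, so the substring test is done in Python instead. The function name `placemarks_with`, the `Document` root element, and the tiny sample are assumptions for illustration; the sketch also assumes the elements carry no XML namespace, as in the question's samples.

```python
import xml.etree.ElementTree as ET


def placemarks_with(xml_text, keyword):
    """Return Placemark elements whose <description> text contains keyword."""
    root = ET.fromstring(xml_text)
    matches = []
    for placemark in root.iter("Placemark"):
        # CDATA content is exposed as ordinary text of <description>.
        description = placemark.findtext("description", default="")
        if keyword in description:
            matches.append(placemark)
    return matches


# Hypothetical miniature document with a root element wrapping the Placemarks.
doc = """<Document>
<Placemark><name>A</name><description><![CDATA[<td>Limestone</td>]]></description></Placemark>
<Placemark><name>B</name><description><![CDATA[<td>Iron</td>]]></description></Placemark>
</Document>"""

found = placemarks_with(doc, "Iron")
for placemark in found:
    print(ET.tostring(placemark, encoding="unicode"))
```

Note that the serializer writes the former CDATA content as escaped text (`&lt;` and so on), which is equivalent XML but looks different from the input.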

BeniBela
  • Almost perfect. xpath (a bug, I think) is outputting all `<` symbols as their HTML equivalent `&lt;`, but it found them all! Easy fix: have sed alter the symbols! – Jase Oct 04 '13 at 11:20
  • As I said in the question (or implied I guess) it just needed to be command line! – Jase Oct 04 '13 at 11:23