Updates after sample code.
Solution: as provided by BeniBela He figured out what I failed to make clear...It has to be command line, not necessarily regex, and offered up this solution:
xpath -e '//Placemark[contains(description, "Iron")]'
as promised:
|
( )
/ \
_______
| _ |
| | | | All must enter and pay homage! (Shrine of BeniBela)
Problem: I need some form of command line regex to accomplish the following: Detect in one file of a set of Placemarks, Placemarks which include a keyword (in this case Iron) in a contained CDATA tag. without grabbing Placemarks which do not have the keywod. (All data from <Placemark>
to </Placemark>
needs to be captured.)
Explanation:
Two code samples are given below, one showing three full placemarks, two of which are useless to me, the third of which I want. The second code sample shows just the one I am interested in.
I need to extract the valid Placemark from the data file (which contains hundreds of placemarks) and append it into another file. I will then merge this file into a properly formatted KML later. The data sets are from the US Geological Survey and are very large.
The idea here is to recover placemarks for mines which are extracting a given kind of Ore (Iron for this example), and create a specialized KML (Keyhole Markup Language) file for display in a Google Earth type application.
sample1 (Multiple data with one valid entry):
<Placemark>
<name>
Las Antos Prospect</name>
<Snippet>
Record 10005251</Snippet>
<description>
<![CDATA[<p>
Record <a href="http://mrdata.usgs.gov/mrds/show.php?labno=10005251">
10005251</a>
of the <a href="http://mrdata.usgs.gov/mrds/">
Mineral Resources Data System</a>
</p>
<table border='1' padding='3' cellspacing='0'>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
oper_type</th>
<td>
Unknown</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
dev_stat</th>
<td>
Occurrence</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
ore</th>
<td>
Limestone</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
model</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod1</th>
<td>
Limestone, General</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod2</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod3</th>
<td>
</td>
</tr>
</table>
]]>
</description>
<styleUrl>
#defaultStyleMap</styleUrl>
<Point>
<altitudeMode>
relativeToGround</altitudeMode>
<coordinates>
-64.88273,-24.87527,0</coordinates>
</Point>
</Placemark>
<Placemark>
<name>
Unnamed Occurence</name>
<Snippet>
Record 10005252</Snippet>
<description>
<![CDATA[<p>
Record <a href="http://mrdata.usgs.gov/mrds/show.php?labno=10005252">
10005252</a>
of the <a href="http://mrdata.usgs.gov/mrds/">
Mineral Resources Data System</a>
</p>
<table border='1' padding='3' cellspacing='0'>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
oper_type</th>
<td>
Unknown</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
dev_stat</th>
<td>
Occurrence</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
ore</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
model</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod1</th>
<td>
Iron</td> ######################Iron here makes it valid
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod2</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod3</th>
<td>
</td>
</tr>
</table>
]]>
</description>
<styleUrl>
#defaultStyleMap</styleUrl>
<Point>
<altitudeMode>
relativeToGround</altitudeMode>
<coordinates>
-64.81607,-24.67527,0</coordinates>
</Point>
</Placemark>
<Placemark>
<name>
Merced I Quarry</name>
<Snippet>
Record 10005254</Snippet>
<description>
<![CDATA[<p>
Record <a href="http://mrdata.usgs.gov/mrds/show.php?labno=10005254">
10005254</a>
of the <a href="http://mrdata.usgs.gov/mrds/">
Mineral Resources Data System</a>
</p>
<table border='1' padding='3' cellspacing='0'>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
oper_type</th>
<td>
Unknown</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
dev_stat</th>
<td>
Producer</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
ore</th>
<td>
Limestone</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
model</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod1</th>
<td>
Limestone, General</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod2</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod3</th>
<td>
</td>
</tr>
</table>
]]>
</description>
<styleUrl>
#ProducerStyleMap</styleUrl>
<Point>
<altitudeMode>
relativeToGround</altitudeMode>
<coordinates>
-65.46052,-24.9586,0</coordinates>
</Point>
</Placemark>
The above sample contains two Placemarks which I have no use for, bracketing one which I need to extract.
Sample 2 (Showing just a 'valid' entry): (The capture would need to grab all of this)
<Placemark>
<name>
Unnamed Occurence</name>
<Snippet>
Record 10005252</Snippet>
<description>
<![CDATA[<p>
Record <a href="http://mrdata.usgs.gov/mrds/show.php?labno=10005252">
10005252</a>
of the <a href="http://mrdata.usgs.gov/mrds/">
Mineral Resources Data System</a>
</p>
<table border='1' padding='3' cellspacing='0'>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
oper_type</th>
<td>
Unknown</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
dev_stat</th>
<td>
Occurrence</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
ore</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
model</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod1</th>
<td>
Iron</td> ######################Iron here makes it valid
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod2</th>
<td>
</td>
</tr>
<tr valign='top'>
<th align='right' bgcolor='#ddffee'>
commod3</th>
<td>
</td>
</tr>
</table>
]]>
</description>
<styleUrl>
#defaultStyleMap</styleUrl>
<Point>
<altitudeMode>
relativeToGround</altitudeMode>
<coordinates>
-64.81607,-24.67527,0</coordinates>
</Point>
</Placemark>
Update 1:
I got this to work in a regex tester, but I am still working on how to get it into grep et.al.
<Placemark>\n<name>\n.*</name>\n<Snippet>\n.*\n<description>\n(?:(?:.*\n){48}.*Iron.*\n|(?:.*\n){41}.*Iron.*\n|(?:.*\n){35}.*Iron.*\n)(?:.*\n){3,16}\]\]>\n</description>\n(?:.*\n){8,12}</Placemark>