I am hoping the experts here can help me with a regular expression that I can use in MATLAB to extract certain sections of a very large data file (262 MB, 4588786 lines and 252498496 characters excluding blanks).
I have the following text as input.
text = ['<node id="1672189900" lat="48.2212788" lon="11.4783959" version="6" timestamp="2015-05-03T23:00:27Z" changeset="30762503" uid="145231" user="woodpeck_repair">'...
'<tag k="ref" v="14839"/>'...
'<tag k="power" v="sub_station"/>'...
'<tag k="operator" v="Isar-Amperwerke"/>'...
'</node>'...
'<node id="298991549" lat="52.651949" lon="10.267974" version="9" timestamp="2009-03-26T12:53:35Z" changeset="860721" uid="13203" user="bahnpirat">'...
'<tag k="ref" v="105"/>'...
'<tag k="power" v="tower"/>'...
'</node>'...
'<node id="309209822" lat="47.9339823" lon="11.1047609" version="1" timestamp="2008-11-01T19:21:22Z" changeset="651519" uid="39150" user="account_deleted_1011"/>'...
'<node id="309209824" lat="47.9342688" lon="11.1048045" version="1" timestamp="2008-11-01T19:21:22Z" changeset="651519" uid="39150" user="account_deleted_1011"/>'...
'<node id="309245115" lat="48.074924" lon="11.6531406" version="6" timestamp="2014-02-03T21:13:35Z" changeset="20361115" uid="8748" user="ToniE">'...
'<tag k="power" v="substation"/>'...
'<tag k="source" v="survey"/>'...
'<tag k="operator" v="Energieversorgung Ottobunn"/>'...
'</node>'...
'<node id="309424891" lat="52.5676698" lon="13.0440382" version="4" timestamp="2015-03-08T19:18:44Z" changeset="29337113" uid="2149159" user="bergaufsee">'...
'<tag k="power" v="substation"/>'...
'</node>'];
I need to extract the three nodes that contain the tag <tag k="power" v="sub(_)?station"/>
i.e. for each such tag I need the enclosing node, from its opening <node ...> line down to the last line before </node>. These should be my three matches.
Match 1:
'<node id="1672189900" lat="48.2212788" lon="11.4783959" version="6" timestamp="2015-05-03T23:00:27Z" changeset="30762503" uid="145231" user="woodpeck_repair">'...
'<tag k="ref" v="14839"/>'...
'<tag k="power" v="sub_station"/>'...
'<tag k="operator" v="Isar-Amperwerke"/>'
Match 2:
'<node id="309245115" lat="48.074924" lon="11.6531406" version="6" timestamp="2014-02-03T21:13:35Z" changeset="20361115" uid="8748" user="ToniE">'...
'<tag k="power" v="substation"/>'...
'<tag k="source" v="survey"/>'...
'<tag k="operator" v="Energieversorgung Ottobunn"/>'
Match 3:
'<node id="309424891" lat="52.5676698" lon="13.0440382" version="4" timestamp="2015-03-08T19:18:44Z" changeset="29337113" uid="2149159" user="bergaufsee">'...
'<tag k="power" v="substation"/>'
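To make the tag condition concrete, here is a minimal sketch of the tag test on its own, assuming the optional underscore is the only variation between the two spellings. The names sample_tags, tag_pattern and has_tag are illustrative and not part of my actual code. It only checks single tag strings, so it is not the node-level expression I am looking for.

```matlab
% Illustrative tag strings copied from the sample input above.
sample_tags = {'<tag k="power" v="sub_station"/>', ...
               '<tag k="power" v="substation"/>', ...
               '<tag k="power" v="tower"/>'};

% "sub_?station" accepts both "substation" and "sub_station".
tag_pattern = '<tag k="power" v="sub_?station"/>';

% With 'once', regexp returns the start index of the first match for each
% cell, or [] when there is no match, so a non-empty result means the tag matched.
has_tag = ~cellfun(@isempty, regexp(sample_tags, tag_pattern, 'once'));
```

Here has_tag is a logical array with one entry per string in sample_tags.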
With my limited knowledge and some help, I have this expression
substation_nodes = regexp(text, '(<node.*?\">(.|\n)*?)(?=<\/node>)','match');
to achieve this result, but it does not restrict the matches to nodes that contain the required tag. Instead, it gives me all the nodes that have tags.
I have tried modifying the above expression a lot but to no avail. I would be very grateful if someone could please help me find the required regular expression.
Thanks in advance!