I am hoping the experts here can help me with a regular expression that I can use in MATLAB to extract certain sections of a very large data file (262 MB, 4588786 lines, and 252498496 characters excluding blanks).

I have the following text as input.

text = ['<node id="1672189900" lat="48.2212788" lon="11.4783959" version="6" timestamp="2015-05-03T23:00:27Z" changeset="30762503" uid="145231" user="woodpeck_repair">'...
             '<tag k="ref" v="14839"/>'...
             '<tag k="power" v="sub_station"/>'...
             '<tag k="operator" v="Isar-Amperwerke"/>'...
        '</node>'...
        '<node id="298991549" lat="52.651949" lon="10.267974" version="9" timestamp="2009-03-26T12:53:35Z" changeset="860721" uid="13203" user="bahnpirat">'...
             '<tag k="ref" v="105"/>'...
             '<tag k="power" v="tower"/>'...
        '</node>'...
        '<node id="309209822" lat="47.9339823" lon="11.1047609" version="1" timestamp="2008-11-01T19:21:22Z" changeset="651519" uid="39150" user="account_deleted_1011"/>'...
        '<node id="309209824" lat="47.9342688" lon="11.1048045" version="1" timestamp="2008-11-01T19:21:22Z" changeset="651519" uid="39150" user="account_deleted_1011"/>'...
        '<node id="309245115" lat="48.074924" lon="11.6531406" version="6" timestamp="2014-02-03T21:13:35Z" changeset="20361115" uid="8748" user="ToniE">'...
             '<tag k="power" v="substation"/>'...
             '<tag k="source" v="survey"/>'...
             '<tag k="operator" v="Energieversorgung Ottobunn"/>'...
        '</node>'...
        '<node id="309424891" lat="52.5676698" lon="13.0440382" version="4" timestamp="2015-03-08T19:18:44Z" changeset="29337113" uid="2149159" user="bergaufsee">'...
             '<tag k="power" v="substation"/>'...
        '</node>'];

I need to extract the three nodes that contain the tag <tag k="power" v="sub(_)?station"/>, i.e. I need the few lines above and below this tag that belong to the same node. These should be my three matches:

Match 1:

'<node id="1672189900" lat="48.2212788" lon="11.4783959" version="6" timestamp="2015-05-03T23:00:27Z" changeset="30762503" uid="145231" user="woodpeck_repair">'...
             '<tag k="ref" v="14839"/>'...
             '<tag k="power" v="sub_station"/>'...
             '<tag k="operator" v="Isar-Amperwerke"/>'

Match 2:

'<node id="309245115" lat="48.074924" lon="11.6531406" version="6" timestamp="2014-02-03T21:13:35Z" changeset="20361115" uid="8748" user="ToniE">'...
             '<tag k="power" v="substation"/>'...
             '<tag k="source" v="survey"/>'...
             '<tag k="operator" v="Energieversorgung Ottobunn"/>'

Match 3:

'<node id="309424891" lat="52.5676698" lon="13.0440382" version="4" timestamp="2015-03-08T19:18:44Z" changeset="29337113" uid="2149159" user="bergaufsee">'...
             '<tag k="power" v="substation"/>'

With my limited knowledge and some help, I have come up with this expression:

substation_nodes = regexp(text, '(<node.*?\">(.|\n)*?)(?=<\/node>)','match');

to achieve this result, but it does not restrict the matches to nodes that contain the required tag. It returns every node that has any tags.

I have tried modifying the above expression a lot but to no avail. I would be very grateful if someone could please help me find the required regular expression.

Thanks in advance!

EyesOfÖzil
  • That's XML, so use an XML parser. Never parse XML with regex. – Biffen Nov 05 '15 at 19:17
  • Hey! The problem is that I have never worked with XML, and I don't have the option of using XML parsers either. I just have this input and I need to perform this task with MATLAB. Would you say it is not possible to achieve this using MATLAB? – EyesOfÖzil Nov 05 '15 at 19:23
  • @EyesOfÖzil MATLAB has an [XML parser](http://www.mathworks.com/help/matlab/import_export/importing-xml-documents.html) – sco1 Nov 05 '15 at 19:25
  • @Biffen reminds me of this old favourite: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – rayryeng Nov 05 '15 at 20:32
  • @rayryeng I would have included a link to that one if it wasn't about HTML, rather than XML. – Biffen Nov 06 '15 at 07:11
  • @Biffen @excaza @rayryeng Hey guys, thanks for pointing out that `xml` and `regex` don't go hand in hand. I have a few questions about using `xml` parsing in MATLAB. My xml file is **262 MB** with **4588786 lines**, and I have **16 GB RAM**. Can I use `xmlread()` (which builds a DOM model) in MATLAB without running into Java heap issues, even after increasing the MATLAB Java heap to the maximum? Is there any other way? If there are memory issues and I can't combine `regex` with `xml`, I am out of ideas. Any help appreciated. Thanks! – EyesOfÖzil Nov 08 '15 at 01:56

1 Answer

  1. My first suggestion would be to learn to use MATLAB's built-in xmlread() function.

  2. If you really want to do this with code, I would parse it as a text file:

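    function context = getTagWithContext(filename, tagstr)
    % Return the first line containing tagstr plus the two lines before it.
    context = '';                        % stays empty if tagstr is never found
    fid = fopen(filename, 'r');          % fopen (not open) returns a file identifier
    if fid == -1, error('Could not open %s', filename); end
    context1 = fgetl(fid);
    context2 = fgetl(fid);
    while true
        line = fgetl(fid);
        if ~ischar(line), break; end     % fgetl returns -1 at end of file

        if ~isempty(strfind(line, tagstr))
            context = [context1 context2 line];
            break;
        else
            % slide the two-line context window forward
            context1 = context2;
            context2 = line;
        end
    end
    fclose(fid);
    end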

You can add lines of context in the obvious way, including reading some following lines if you wish, and a little error checking would probably be good too.
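As one way to extend this idea from a fixed number of context lines to whole nodes, here is a minimal sketch rather than a tested solution. It assumes every XML element sits on its own line (as in the question's sample) and that an opening `<node ...>` never shares a line with its `</node>`. The demo file, `sample`, `blocks`, and the state flags are names made up for illustration; for the real data, `filename` would simply point at the large file.

```matlab
% Demo input: nodes trimmed from the question, one element per line.
sample = {
    '<node id="1672189900" lat="48.2212788" lon="11.4783959">'
    '  <tag k="ref" v="14839"/>'
    '  <tag k="power" v="sub_station"/>'
    '</node>'
    '<node id="298991549" lat="52.651949" lon="10.267974">'
    '  <tag k="power" v="tower"/>'
    '</node>'
    '<node id="309209822" lat="47.9339823" lon="11.1047609"/>'
    '<node id="309424891" lat="52.5676698" lon="13.0440382">'
    '  <tag k="power" v="substation"/>'
    '</node>'};
filename = [tempname '.osm'];        % hypothetical stand-in for the real file
fid = fopen(filename, 'w');
fprintf(fid, '%s\n', sample{:});
fclose(fid);

% Stream the file and keep only node blocks that contain the power tag.
blocks  = {};      % one char vector per matching node (without </node>)
current = {};      % lines of the node currently being collected
inNode  = false;   % true between an opening <node ...> and its </node>
wanted  = false;   % true once the power tag was seen in the current node
fid = fopen(filename, 'r');
while true
    line = fgetl(fid);
    if ~ischar(line), break; end     % fgetl returns -1 at end of file

    if ~inNode
        % Opening tags end in ">" but not "/>"; self-closing nodes have no tags.
        if ~isempty(regexp(line, '<node\s[^>]*[^/]>', 'once'))
            inNode  = true;
            wanted  = false;
            current = {line};
        end
    elseif ~isempty(strfind(line, '</node>'))
        if wanted
            blocks{end+1} = strjoin(current, sprintf('\n')); %#ok<AGROW>
        end
        inNode = false;
    else
        current{end+1} = line; %#ok<AGROW>
        if ~isempty(regexp(line, '<tag k="power" v="sub_?station"/>', 'once'))
            wanted = true;
        end
    end
end
fclose(fid);
delete(filename);                    % remove the demo file again
```

On the demo input, `blocks` ends up with two entries, one for the `sub_station` node and one for the `substation` node. Only one node is held in memory at a time, which is the point of reading line by line instead of loading the whole file.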

With a file of this size, speed should not be an issue for a simple line-by-line scan. Sometimes, simple and clunky is better than pulling your hair out over a regexp. You also know exactly what you'll get with this approach. With a complicated regexp, where you're essentially trying to grab arbitrary blocks of text around a known expression, you can sometimes get strange results.
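For comparison, if a regular expression is still wanted for text that is already in memory, a pattern along the following lines is one option. This is a sketch under assumptions, not a recommendation: `xmlText`, `notClose`, and `pattern` are illustrative names, the sample is trimmed from the question's data, and the pattern assumes the power tag is written exactly as `<tag k="power" v="..."/>`. The `(?:(?!</node>).)*?` piece stops a match from running past the end of its own node, which is what the original expression was missing.

```matlab
% Shortened in-memory sample in the same layout as the question's text.
xmlText = ['<node id="1672189900" lat="48.2212788" lon="11.4783959">' ...
           '<tag k="ref" v="14839"/>' ...
           '<tag k="power" v="sub_station"/>' ...
           '</node>' ...
           '<node id="298991549" lat="52.651949" lon="10.267974">' ...
           '<tag k="power" v="tower"/>' ...
           '</node>' ...
           '<node id="309209822" lat="47.9339823" lon="11.1047609"/>' ...
           '<node id="309424891" lat="52.5676698" lon="13.0440382">' ...
           '<tag k="power" v="substation"/>' ...
           '</node>'];

% Lazily consume characters, but never step over a closing </node>.
notClose = '(?:(?!</node>).)*?';

pattern = ['<node\s[^>]*[^/]>' ...                  % opening tag that is not self-closing
           notClose ...                             % anything before the power tag
           '<tag k="power" v="sub_?station"/>' ...  % the tag of interest
           notClose '(?=</node>)'];                 % rest of the node, excluding </node>

substation_nodes = regexp(xmlText, pattern, 'match');
```

Here `substation_nodes` holds two entries, and the `tower` node and the self-closing node are skipped. The lookahead is evaluated at every character, so this may be slow on a 262 MB string, and it illustrates how much extra machinery the regexp route needs.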

gariepy