
I need to find specific names (i.e., a few names matching several regexps) and, for each one, hide its corresponding value wherever it appears in the XML, using an XML parsing library (Twig? libXML? other?).
The regexp part is not for parsing but for selecting which nodes to edit (i.e., I need to parse with an XML-aware library, and then replace values only in nodes whose name matches a specific, complex regexp).

Deep apologies for not providing code attempts :( Even though I did try to adapt some of the answers (e.g. https://stackoverflow.com/a/11482566/1841533, and quite a few others), I am too new to Perl to have come up with something that would 1) browse the file at any depth and 2) look for names that match a regexp. Posting my meager attempts would only narrow the direction of the discussion (i.e., I really want to avoid the XY problem: if I showed my existing attempts and they were corrected, the requirements below would NOT be met, as my attempts completely lacked either "at any depth" or "name matching a regexp"...)

**If you need sample code (I can totally understand that...), please don't read further.** (or just a bit, to see why I don't provide any)
If, however, you can just read the 3 XML examples below and the 4 bullet points that follow, describing what I need to do to them (or better, everything after the 'What I need' line), and provide me with a "template" script (i.e., a few lines of Perl, if possible using twig or libXML), I'll forever be in your debt ^^.
[I do take a lot of time to help many people on various SE sites... and I often wish they posted sample code. So I understand why many people will downvote this, or just won't answer, or will feel frustrated. But I can't manage to produce sample code here without "warping" what I need too much and creating an XY problem, hence I preferred posting what I need instead of what I tried...]

What I need

I have many XML files with different structures.

In the following, "someName" could be any of several different strings, among which I need to find only those matching a (complex) regexp.
Once I find one (or several) matches, "someValue" will be the associated value, which I want to replace with a generic string.

The xmls are quite simple, but they still have several different structures:

For example sometimes the XML could contain

...
   <sometag  name=someName  value=someValue>  
...

(someName or someValue could be within quotes or not)

or

...
   <someName>someValue</someName>  
...

or even another form:

... 
   <someothertag   someName=someValue>
...
  • someValue could be within quotes or not when it comes after a "=", depending on the XML
  • someName could also be within quotes or not when it is written as name=someName
  • someName changes in each file, but I want to find those matching a specific complex regexp (for example: /\(abc\)|\([^xyz]*def\)|..../ , i.e. the regexp could be quite complex)
  • for those "someName" that match the regexp, and only if they match, I want to replace the corresponding "someValue" with a generic string, for example "hidden". (someValue itself can change in each file. But whatever it is (i.e., it can match ".*"), I want to replace it with the new value "hidden")

The depth of the tags can also vary from file to file (so I need generic parsing).

I'm sorry, but I cannot find how to do that, as every example I found here is for a specific tag or a specific structure, and from them I couldn't grasp how to use twig or libXML for a more generic approach... (I am very, very new to Perl!)

I have trouble finding where to place the regexp, and even how to parse several XML files and look for the name at any level within each one.

Any hint on how to do this is welcome!

Update: I'm trying hard to come up with a reasonable first try... but I think that by the time I come up with one, I can delete this question. Right now I'm trying to grok https://stackoverflow.com/a/11482566/1841533, but it is NOT what I need. I need to modify that example to 1) open any file (instead of providing the XML directly, as in that answer), 2) use "findnodes" to find any tag whose name (the tag name, not its corresponding value) matches a regexp (and not some fixed "string"), and 3) once I find those tag names, change the corresponding value to "hidden".

Olivier Dulac
  • You should post your attempts to get help on what you did wrong. – TLP Nov 08 '13 at 16:05
  • @TLP: you're right, and I usually would, but here it would only show that I'm so new to Perl that I am not even parsing the file ... :/ I tried to use existing answers to do it, but I end up not parsing "everywhere" but just in a specifically named someName at a specific depth. hence my question. – Olivier Dulac Nov 08 '13 at 16:30
  • Also, a more detailed example XML would be nice. Post a real-life structure, but change the values and names if you are uncomfortable with it. – simbabque Nov 08 '13 at 16:30
  • Btw, it's not a good idea to try to parse XML with regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – simbabque Nov 08 '13 at 16:33
  • http://meta.stackexchange.com/q/156810/162416 is a good thing to read. – TLP Nov 08 '13 at 16:38
  • @simbabque: I am not using regexps to parse the XML. I want to parse with twig (or libXML, etc.) and then use it to change the value for some names matching a regexp. That's very different. – Olivier Dulac Nov 08 '13 at 16:57
  • @TLP: indeed, I'm re-reading it right now. "If your question doesn't include code, are you sure it shouldn't?": unfortunately, yes... I could show code that completely misses what I'm trying to do, some using "twig", some using "libXML" :( If I had one that almost worked, I definitely would post it to seek corrections on it. But posting the ones I tried would only constrain answers and direct efforts in a specific direction (i.e., I'd receive answers working only at a specific depth, etc., because the code I tried, based on several answers here, worked that way too). I'm sorry about that :( – Olivier Dulac Nov 08 '13 at 17:00
  • Regex and XML will end in tears. Isn't XSLT a better option? – Rubens Farias Nov 08 '13 at 17:43
  • @TLP I updated the intro of the question to try to address the comments. I'm not "trying to avoid" posting code; I really think it would create an XY problem. I prefer to state what I need rather than show what I tried, as I **know** that what I tried so far is not the right direction: I couldn't modify it to 1) search at any depth within the XML and 2) select only node names matching a regexp. (In a few weeks, when I become proficient in Perl, it will hopefully be laughably easy, but right now I don't know how. I need pointers. I am also searching the web and starting to read more on twig/libXML.) – Olivier Dulac Nov 08 '13 at 17:43
  • @RubensFarias: I am fully aware that I should not use regexps to parse the XML! (I read the funny top answer on SO ^^) I don't want that at all. I edited my question to put it more clearly: I need to parse the XML (using a library! twig, libXML, or whatever else is best), find at any depth the nodes whose name matches some regexp, and replace the corresponding value only for the nodes that matched. – Olivier Dulac Nov 08 '13 at 17:47
  • @OlivierDulac : Just post up what you have, even if it is broken and wrong - that should open the floodgates to some good answers. Folks around here tend to go the extra mile to show you nifty tips on how things should be done or could be done. – Zaid Nov 08 '13 at 18:17
  • @Zaid: really, it would point to the wrong direction. I *know* that what I have is not just syntactically, but also semantically incorrect. Ie, it is NOT trying to do what I want it to do. I need what I posted above, but my tries are not doing that right now... I'll need to understand libXML or twig enough to know how to change my tries to attempt the above. *then* I'll post something (or probably won't need to any longer ^^). Posting now would be doing a **XYproblem** post... – Olivier Dulac Nov 08 '13 at 18:55

1 Answer

There is an example in the documentation for XML::LibXML::XPathContext for finding all nodes whose names match a given regex:

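use XML::LibXML;
use XML::LibXML::XPathContext;

# Custom XPath function: given a node list and a regex,
# return only the nodes whose name matches the regex.
my $perlmatch = sub {
    die "Not a nodelist"
        unless $_[0]->isa('XML::LibXML::NodeList');
    die "Missing a regular expression"
        unless defined $_[1];

    my $nodelist = XML::LibXML::NodeList->new;
    my $i = 0;
    while(my $node = $_[0]->get_node($i)) {
        $nodelist->push($node) if $node->nodeName =~ $_[1];
        $i ++;
    }

    return $nodelist;
};

# $node is the document or element to search (not defined in this excerpt).
my $xc = XML::LibXML::XPathContext->new($node);
$xc->registerFunction('perlmatch', $perlmatch);
# '//*' selects every element at any depth; perlmatch then filters by name.
my @nodes = $xc->findnodes('perlmatch(//*, "foo|bar")');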

The function perlmatch allows you to find nodes like this:

<someName>someValue</someName>
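For that element form, here is a minimal sketch of the whole round trip (parse, select by name, replace the value). It is a simpler variant that filters the result of `//*` directly in Perl instead of registering an XPath function. The sample XML, the element names, and the pattern in `$name_re` are all made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample input and pattern, for illustration only.
my $xml = <<'XML';
<root>
  <group>
    <password>secret1</password>
    <user>alice</user>
    <deep><apiKey>secret2</apiKey></deep>
  </group>
</root>
XML
my $name_re = qr/^(?:password|apiKey)$/;

my $doc = XML::LibXML->load_xml(string => $xml);

# '//*' selects every element at any depth; the regex test happens in Perl.
for my $node ($doc->findnodes('//*')) {
    next unless $node->nodeName =~ $name_re;
    $node->removeChildNodes;      # drop the old content of the matched element
    $node->appendText('hidden');  # and put the generic string in its place
}

print $doc->toString;
```

To work on a file rather than a string, `load_xml(location => $file)` should be the only change needed on the input side.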

The key line in the function is:

$nodelist->push($node) if $node->nodeName =~ $_[1];

This takes an XML::LibXML::Node and evaluates the given regex against the node's name. With some modification, you could match against the value of the name attribute or search the attributes list for a match. I'll leave that as an exercise for the reader, but the following method should get you started:

$node->attributes();
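As a rough sketch of that exercise, the loop below handles the two attribute forms from the question. It assumes well-formed XML, which means attribute values must be quoted (an XML parser will reject unquoted ones). The sample document, the attribute names, and `$name_re` are hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample input and pattern, for illustration only.
my $xml = <<'XML';
<root>
  <sometag name="password" value="secret1"/>
  <sometag name="user" value="alice"/>
  <nested><someothertag apiKey="secret2" color="blue"/></nested>
</root>
XML
my $name_re = qr/^(?:password|apiKey)$/;

my $doc = XML::LibXML->load_xml(string => $xml);

for my $elem ($doc->findnodes('//*')) {
    # Form 1: <sometag name="someName" value="someValue">
    my $name = $elem->getAttribute('name');
    if (defined $name && $name =~ $name_re && $elem->hasAttribute('value')) {
        $elem->setAttribute('value', 'hidden');
    }

    # Form 2: <someothertag someName="someValue">
    for my $attr ($elem->attributes) {
        # attributes() can also return namespace declarations; skip those.
        next unless $attr->isa('XML::LibXML::Attr');
        $attr->setValue('hidden') if $attr->nodeName =~ $name_re;
    }
}

print $doc->toString;
```

For real files, one option is to loop over `@ARGV`, load each with `load_xml(location => $file)`, and write the result back with `$doc->toFile($file)`.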
ThisSuitIsBlackNot
  • thanks a lot! That looks exactly like what I needed to know. I'll whip up a script based on your indication, and will come back with more questions or to give a checkmark ^^ – Olivier Dulac Nov 12 '13 at 11:06