Exact string search in XML files?

Question

I need to search a set of XML files (all named pom.xml, including those in subfolders) for exactly the following text sequence, so that I get an alert if somebody adds any text, or even a blank, inside it:

     <!--
     | Startsection
     |-->         
    <!-- 
     | Endsection
     |-->

I'm running the following Python script, but it still doesn't match exactly: I also get an alert when only part of the text is present:

import re
import os
from os.path import join
comment=re.compile(r"<!--\s+| Startsection\s+|-->\s+<!--\s+| Endsection\s+|-->")
tag="<module>"

for root, dirs, files in os.walk("."):

    if "pom.xml" in files:
        p=join(root, "pom.xml") 
        print("Checking",p)
        with open(p) as f:
            s=f.read()
        if tag in s and comment.search(s):
            print("Matched",p)

UPDATE #3

I expect to print the content of the <module> tag if it exists between |--> and <!-- in the searched sequence:

 <!--
 | Startsection
 |-->         
 <!-- 
 | Endsection
 |-->

For instance, after "Matched" and the file name, it should also print "example.test1" in the case below:

     <!--
     | Startsection
     |-->         
       <module>example.test1</module>
     <!-- 
     | Endsection
     |-->

UPDATE #4

I should be using the following:

import re
import os
from os.path import join
comment=re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->", re.MULTILINE)
tag="<module>"

for root, dirs, files in os.walk("/home/temp/test_folder/"):
 for skipped in ("test1", "test2", ".repotest"):
    if skipped in dirs: dirs.remove(skipped)

 if "pom.xml" in files:
    p=join(root, "pom.xml") 
    print("Checking",p)
    with open(p) as f:
       s=f.read()
       if tag in s and comment.search(s):
          print("The following files are corrupted ",p)

UPDATE #5

import re
import os
import xml.etree.ElementTree as etree 
from bs4 import BeautifulSoup 
from bs4 import Comment

from os.path import join
comment=re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->", re.MULTILINE)
tag="<module>"

for root, dirs, files in os.walk("myfolder"):
 for skipped in ("model", "doc"):
    if skipped in dirs: dirs.remove(skipped)

 if "pom.xml" in files:
    p=join(root, "pom.xml") 
    print("Checking",p)
    with open(p) as f:
       s=f.read()
       if tag in s and comment.search(s):
          print("ERROR: The following file are corrupted",p)



bs = BeautifulSoup(open(p), "html.parser")
# Extract all comments
comments=soup.find_all(string=lambda text:isinstance(text,Comment))
for c in comments:
    # Check if it's the start of the code
    if "Start of user code" in c:
        modules = [m for m in c.findNextSiblings(name='module')]
        for mod in modules:
            print(mod.text)

Please don't parse XML with regular expressions. It's a terrible idea and it makes experienced programmers weep. Try [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) or its underlying library [lxml](https://pypi.python.org/pypi/lxml) – Adam Smith, Aug 17 '16 at 23:32
I'm thinking of storing the exact sequence in an external file. How can I implement that? Can you help me with this? Thanks! – user2961008, Aug 17 '16 at 23:36
@AdamSmith, ...the difficulty here is that they want to find a comment, so it's not something that actually shows up in a DOM tree. – Charles Duffy, Aug 18 '16 at 12:31
BTW, when creating a new question closely linked to an old one (in this case, a Python-rather-than-shell instance of http://stackoverflow.com/questions/38958403/find-xml-files-non-containing-a-specific-comment-from-shell/38961603) it's considered good form to include a link, and describe explicitly what distinguishes them. – Charles Duffy, Aug 18 '16 at 12:33
Sorry, but I think the question is different: it's about printing the content of a tag if it exists between the comment tags. Please see the last update/example. Thanks! – user2961008, Aug 18 '16 at 12:34
@CharlesDuffy comments can be parsed in both XPath and XSLT with the [`comment()`](http://stackoverflow.com/questions/784745/accessing-comments-in-xml-using-xpath) function. – Parfait, Aug 18 '16 at 12:48
Any help on how to implement update #3 in the code of update #4 without installing additional packages (no BeautifulSoup ...)? Thanks! – user2961008, Aug 18 '16 at 15:20
I'm trying the code of Update #5 on another machine with BeautifulSoup, but I'm still getting this error. Some help, please? Traceback (most recent call last): File "python9.py", line 27, in <module> comments=soup.find_all(string=lambda text:isinstance(text,Comment)) NameError: name 'soup' is not defined – user2961008, Aug 19 '16 at 00:20
I expect to print the content of the <module> tag if it exists between |--> and <!-- – user2961008, Aug 19 '16 at 00:22

score 1 · Accepted Answer · edited May 23 '17 at 12:00

Don't parse an XML file with regular expressions. The best Stack Overflow answer ever explains why.

You can use BeautifulSoup to help with that task.

Look how simple it is to extract something from your code:

from bs4 import BeautifulSoup

content = """
    <!--
     | Start of user code (user defined modules)
     |-->

    <!--
     | End of user code
     |-->
"""

bs = BeautifulSoup(content, "html.parser")
print(''.join(bs.contents))

Of course, you can use your XML file instead of the literal I'm using:

bs = BeautifulSoup(open("pom.xml"), "html.parser")

A small example using your expected input:

from bs4 import BeautifulSoup
from bs4 import Comment

# p is the path to a pom.xml file, e.g. from the os.walk loop in the question
with open(p) as f:
    soup = BeautifulSoup(f, "html.parser")
# Extract all comments
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    # Check if it's the start of the code
    if "Start of user code" in c:
        # Collect the <module> siblings that follow the start comment
        modules = c.find_next_siblings(name='module')
        for mod in modules:
            print(mod.text)

But if your code is always in a module tag, I don't see why you should care about the comments before and after it; you can just find the content of the module tag directly.
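For setups where BeautifulSoup cannot be installed, the same idea can be sketched with the standard library only. This is a minimal sketch, not a tested drop-in for the question's script: it assumes Python 3.8 or newer (for `insert_comments=True`), assumes the marker comments contain the words "Startsection" and "Endsection" as in the question, and uses a made-up `SAMPLE` document and a hypothetical helper name `modules_between_markers`.

```python
import xml.etree.ElementTree as ET


def modules_between_markers(xml_text, start="Startsection", end="Endsection"):
    """Return the text of <module> elements located between the two marker comments."""
    # insert_comments=True (Python 3.8+) keeps comments in the tree as nodes.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)
    found = []
    for parent in root.iter():
        inside = False  # True while we are between the start and end comments
        for child in parent:
            if child.tag is ET.Comment:
                if start in child.text:
                    inside = True
                elif end in child.text:
                    inside = False
            # Strip a "{namespace}" prefix, since pom.xml files usually declare one.
            elif inside and child.tag.rsplit("}", 1)[-1] == "module":
                found.append(child.text)
    return found


# Hypothetical sample input, shaped like the example in the question.
SAMPLE = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modules>
    <module>core</module>
    <!--
     | Startsection
     |-->
    <module>example.test1</module>
    <!--
     | Endsection
     |-->
  </modules>
</project>"""

print(modules_between_markers(SAMPLE))  # ['example.test1']
```

Inside the question's os.walk loop, the file content `s` could be passed to this helper and a non-empty result reported as a modified section.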

edited May 23 '17 at 12:00 by Community · answered Aug 18 '16 at 09:58 by dfranca

Is it possible, for the cases we print because they match, to also print the content written between |--> and <!--? – user2961008 Aug 18 '16 at 11:18
Yes, you can call .text or .find; refer to the documentation for a complete overview of the BS API: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ – dfranca Aug 18 '16 at 11:28
Thanks! And for the previous question? Any tips there too? – user2961008 Aug 18 '16 at 11:30
I'm actually interested in printing the content of the "module" tag when it exists between |--> and <!-- – user2961008 Aug 18 '16 at 11:33
To use it in your code, you just need to call the BS constructor with the file you want to parse; then you can iterate over the structure to find the comments you want. The answer here can point you in the right direction: http://stackoverflow.com/questions/33138937/how-to-find-all-comments-with-beautiful-soup – dfranca Aug 18 '16 at 11:33
Thanks daniel, I want to print the content of the XML tag "module" when it is placed between |--> and <!-- – user2961008 Aug 18 '16 at 11:47
To let you know, Im using the final code from: http://stackoverflow.com/questions/39013059/skip-directories-in-a-search-in-python/39013259?noredirect=1#comment65378217_39013259 – user2961008 Aug 18 '16 at 11:49
Can you paste the content of some file with what you're expecting? – dfranca Aug 18 '16 at 11:51
I added an update; please let me know if it's not clear, and thanks so much for your tips! :) – user2961008 Aug 18 '16 at 12:00
OK, I just updated the answer; let me know if it fixes your code. – dfranca Aug 18 '16 at 12:21
The explanation is that I need to search between these comment tags for user-defined modules, so I can see if somebody wrote something. In case they wrote some module, print out its content. Could you please help me embed it into the last update's code so I can test? Thanks so much for your teaching and help! – user2961008 Aug 18 '16 at 12:27
Embedding should be very simple; the code is pretty straightforward. Just copy it into your code, replacing the part that opens and searches through the content of the file, and test it. – dfranca Aug 18 '16 at 12:32
Unfortunately I got an IndentationError and also: Traceback (most recent call last): File "python_script_8.py", line 19, in <module> from bs4 import BeautifulSoup ImportError: No module named bs4... Some final help to embed it? Thanks so much! – user2961008 Aug 18 '16 at 12:43
You have to install BeautifulSoup before using it: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup – dfranca Aug 18 '16 at 12:45
Daniel, unfortunately I'm not allowed to install additional packages, nor do I have root privileges... Is there some workaround? Thanks!! – user2961008 Aug 18 '16 at 13:05
You don't need root privileges if you're using a Python virtual environment. – dfranca Aug 18 '16 at 13:15
Unfortunately, on openSUSE my current network looks restricted... zypper install python-beautifulsoup-3.0.8.1-8.2.noarch: Root privileges are required for installing or uninstalling packages. – user2961008 Aug 18 '16 at 13:19
Does somebody know a workaround to achieve the UPDATE and LAST UPDATE without installing any additional package? Thanks!! – user2961008 Aug 18 '16 at 13:49
You can try using xml.etree, that's a standard library: https://docs.python.org/2/library/xml.etree.elementtree.html – dfranca Aug 18 '16 at 21:16
Danielfranca, thanks for the hint. I have been reading and am still trying to take it all in. I would appreciate seeing how it really fits into my code so I can test ASAP... Can somebody show me how to use it in the last UPDATE #4 from the first comment? I'm now stuck. Thanks!! :)) – user2961008 Aug 18 '16 at 22:09

user2592704 · Answer 2 · 2016-08-18T15:24:37.130

0

The "|()" characters must be escaped, also add re.MULTILINE to the regex.

comment=re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->", re.MULTILINE)
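To see why the escaping matters, here is a small sketch comparing the unescaped pattern from the question with the escaped one from its UPDATE #4. The `unrelated` and `SAMPLE` strings are made-up test inputs. An unescaped `|` is the alternation operator, so the first pattern matches as soon as any single alternative (for example `<!--` followed by whitespace) appears anywhere in the file.

```python
import re

# Made-up input: the two marker comments with only whitespace between them.
SAMPLE = """<!--
 | Startsection
 |-->
<!--
 | Endsection
 |-->"""

# Unescaped: each "|" splits the pattern into alternatives,
# so any one fragment is enough for a match.
unescaped = re.compile(r"<!--\s+| Startsection\s+|-->\s+<!--\s+| Endsection\s+|-->")

# Escaped: "\|" is a literal pipe, so the whole sequence must be present.
escaped = re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->")

unrelated = "<!-- some other comment -->"

print(bool(unescaped.search(unrelated)))  # True: a partial, unwanted match
print(bool(escaped.search(unrelated)))    # False
print(bool(escaped.search(SAMPLE)))       # True
```

Note that re.MULTILINE only changes how `^` and `$` behave, so it should make no difference for a pattern like this one that uses neither.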

Edit: you can also put newline characters in your regex: \n

Arbitrary (or no) whitespace would be: \s*

You can find more information on Python regexes here: https://docs.python.org/2/library/re.html
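As an illustration of `\s*`, the sketch below tolerates any amount of whitespace inside the marker comments and captures whatever sits between them, so the caller can decide how strict to be. The helper name `text_between_markers` and the two sample strings are hypothetical, and the marker wording is assumed to be the one from the question.

```python
import re

# "\s*" allows arbitrary (or no) whitespace; "(.*?)" with re.DOTALL captures
# everything between the two markers, including newlines.
SECTION = re.compile(
    r"<!--\s*\|\s*Startsection\s*\|-->(.*?)<!--\s*\|\s*Endsection\s*\|-->",
    re.DOTALL,
)


def text_between_markers(xml_text):
    """Return the text between the marker comments, or None if they are missing."""
    match = SECTION.search(xml_text)
    return None if match is None else match.group(1)


# Made-up inputs: one untouched section, one with a module added.
empty = "<!--\n | Startsection\n |-->\n\n   <!--\n | Endsection\n |-->"
edited = ("<!--\n | Startsection\n |-->\n"
          "  <module>example.test1</module>\n"
          "<!--\n | Endsection\n |-->")

print(text_between_markers(empty).strip() == "")  # True: only whitespace inside
print(text_between_markers(edited).strip())       # <module>example.test1</module>
```

If even an extra blank line or space should raise an alert, the captured text could be compared with the exact expected whitespace string instead of being stripped first.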

edited Aug 18 '16 at 15:24 · answered Aug 17 '16 at 23:38 by user2592704

Great, thanks! That's a good solution, but is it possible to make it more restrictive? For instance, if someone inserts an ENTER between the 3rd and 4th lines? I would also like to cover that case if possible – user2961008 Aug 17 '16 at 23:42
Any tips on how to do what the previous comment describes? – user2961008 Aug 18 '16 at 07:20
Is it possible to also detect an ENTER between lines 3 and 4 of this input? I can only detect when there is a character more or less; I would also like to detect spaces or TABs. Thanks! :)) – user2961008 Aug 18 '16 at 09:36
