-2

I am trying to extract the text between two tags, e.g. <example>text</example>. I found a post that does this with a regular expression; however, when I try to use it in Python I am forced to escape characters.

original regex : run = re.findall("(?<=(<runs>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )+?(?=(</runs>))", text)

Full code:

from bs4 import BeautifulSoup

# text comes from a text file, but there is too much data to post it all here
text = """<os>Windows Vista or Windows 7</os><filename>AS_ENGINE.EXE</filename><header_size>240</header_size><atime>2019-04-28T13:34:34Z</atime><runs>1</runs><filenames><file>
<os>Windows Vista or Windows 7</os><filename>CHRMSTP.EXE</filename><header_size>240</header_size><atime>2019-04-28T13:15:32Z</atime><runs>2</runs><filenames>
<os>Windows Vista or Windows 7</os><filename>RUNDLL32.EXE</filename><header_size>240</header_size><atime>2019-04-28T13:07:35Z</atime><runs>1</runs><filenames><file>"""

soup = BeautifulSoup(text, "lxml")

for x in soup.find_all("runs"):
    print("Original ", x)

for x in soup.find_all("dir"):
    print("Original ", x)

for x in soup.find_all("filename"):
    print("Original ", x)

I then want to write certain tags to a CSV file:


import csv

fieldnames = ["File Name", "Number of runs", "File Path"]
with open("C:\\ProgramData\\processed\\winprefetch.csv", "w", newline="", encoding="utf8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(fieldnames)
    # filename, numberofruns and file stand for the values pulled from the tags above
    writer.writerow([filename, numberofruns, file])

Ac3

2 Answers

3

Parsing XML with regex is a fragile approach. Beautiful Soup, a third-party Python library for parsing HTML and XML (installed as beautifulsoup4; the "lxml" parser used below also needs the lxml package), performs this task reliably:

from bs4 import BeautifulSoup

text = '<filename>MPSIGSTUB.EXE</filename><header_size>240</header_size><atime>2019-04-28T13:34:33Z</atime><runs>1</runs><filenames><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CNTDLL.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CKERNEL32.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CAPISETSCHEMA.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CKERNELBASE.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CLOCALE.NLS</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSOFTWAREDISTRIBUTION\x5CDOWNLOAD\x5CINSTALL\x5CMPSIGSTUB.EXE</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CADVAPI32.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CMSVCRT.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CSECHOST.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CRPCRT4.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CVERSION.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CCRYPTBASE.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CTEMP\x5CMPSIGSTUB.LOG</file></filenames><volume><path>\x5CDEVICE\x5CHARDDISKVOLUME1</path><creation>2019-04-28T22:00:18Z</creation><serial_number>84c53be0</serial_number><dirnames><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5C$EXTEND</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSOFTWAREDISTRIBUTION</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSOFTWAREDISTRIBUTION\x5CDOWNLOAD</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSOFTWAREDISTRIBUTION\x5CDOWNLOAD\x5CINSTALL</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CTEMP</dir></dirnames></volume>'

soup = BeautifulSoup(text, "lxml")

print(soup.find("runs").text)

for x in soup.find_all("dir"):
    print(x) # or x.text if you're only interested in the element contents

Output:

1
<dir>\DEVICE\HARDDISKVOLUME1\$EXTEND</dir>
<dir>\DEVICE\HARDDISKVOLUME1\WINDOWS</dir>
<dir>\DEVICE\HARDDISKVOLUME1\WINDOWS\SOFTWAREDISTRIBUTION</dir>
<dir>\DEVICE\HARDDISKVOLUME1\WINDOWS\SOFTWAREDISTRIBUTION\DOWNLOAD</dir>
<dir>\DEVICE\HARDDISKVOLUME1\WINDOWS\SOFTWAREDISTRIBUTION\DOWNLOAD\INSTALL</dir>
<dir>\DEVICE\HARDDISKVOLUME1\WINDOWS\SYSTEM32</dir>
<dir>\DEVICE\HARDDISKVOLUME1\WINDOWS\TEMP</dir>
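
If only the element contents are needed, for instance to build CSV rows, `.text` can be read from each tag that `find_all` returns. The following is a minimal sketch rather than a drop-in solution: `sample` is a shortened, hypothetical stand-in for the real data, the built-in "html.parser" is swapped in so that lxml is not required, and an in-memory buffer replaces the real output file. It also assumes every record contains both tags, since values are paired by position.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, shortened sample with two records
sample = (
    "<filename>AS_ENGINE.EXE</filename><runs>1</runs>"
    "<filename>CHRMSTP.EXE</filename><runs>2</runs>"
)

# "html.parser" ships with Python; "lxml" can be used instead if installed
soup = BeautifulSoup(sample, "html.parser")

# find_all returns a list of tags, so read .text from each element
filenames = [tag.text for tag in soup.find_all("filename")]
runs = [tag.text for tag in soup.find_all("runs")]

# Pair the values by position (assumes each record has both tags)
rows = list(zip(filenames, runs))

buffer = io.StringIO()  # stand-in for a file opened with newline=""
writer = csv.writer(buffer)
writer.writerow(["File Name", "Number of runs"])
writer.writerows(rows)
print(buffer.getvalue())
```

For records where a tag may be missing, pairing by position would misalign the columns, so iterating over one parent element per record would be the safer design.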
ggorlen
  • This is great @ggorlen! Is there a way to get just the number or directory between the tags, so that ```<runs>1</runs>``` returns ```1```? – Ac3 May 05 '19 at 20:41
  • Use `.text` to access the text of an element. – ggorlen May 05 '19 at 20:45
  • It doesn't work when I use ```print(soup.find_all("runs").text)```. I get the following error: ```AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?``` – Ac3 May 05 '19 at 21:34
  • `find_all` returns a list. Use `[x.text for x in soup.find_all("runs")]` or write a `for` loop over the list that `find_all` returns as I've done in the above example. – ggorlen May 05 '19 at 21:43
  • Well, you've completely changed your question, invalidating the answers here. I don't recommend that; you should roll back the edit and ask a new question about BeautifulSoup using the code that's causing your new problems. – ggorlen May 05 '19 at 22:35
0

Try this:

import re
text ="<filename>MPSIGSTUB.EXE</filename><header_size>240</header_size><atime>2019-04-28T13:34:33Z</atime><runs>1</runs><filenames><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CNTDLL.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CKERNEL32.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CAPISETSCHEMA.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CKERNELBASE.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CLOCALE.NLS</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSOFTWAREDISTRIBUTION\x5CDOWNLOAD\x5CINSTALL\x5CMPSIGSTUB.EXE</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CADVAPI32.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CMSVCRT.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CSECHOST.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CRPCRT4.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CVERSION.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CCRYPTBASE.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CTEMP\x5CMPSIGSTUB.LOG</file></filenames><volume><path>\x5CDEVICE\x5CHARDDISKVOLUME1</path><creation>2019-04-28T22:00:18Z</creation><serial_number>84c53be0</serial_number><dirnames><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5C$EXTEND</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSOFTWAREDISTRIBUTION</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSOFTWAREDISTRIBUTION\x5CDOWNLOAD</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSOFTWAREDISTRIBUTION\x5CDOWNLOAD\x5CINSTALL</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CTEMP</dir></dirnames></volume>"
#regx
find = re.findall("(?<=(<runs>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]\"'+–/\/®°⁰!?{}|`~]| )+?(?=(</runs>))", text)
print(find)

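You were pretty close. The problem was the double quote (") inside the character class: in a double-quoted Python string it has to be escaped as \", otherwise it ends the string early. Also, I think the regex could be simplified, although I don't know the details of your problem. For instance: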

import re
text ="<filename>MPSIGSTUB.EXE</filename><runs>0</runs>asdf<runs>1</runs>"
#regx
matches = re.finditer("<runs>(.*?)</runs>", text)
for match in matches:
    print(match.group(1))
# output: 
# 0
# 1
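
The same idea can be wrapped in a small helper that works for any tag name. This is only a sketch under stated assumptions: `text_between` is a hypothetical name, and the pattern assumes tags without attributes and without nesting of the same tag. A raw f-string avoids the quote and backslash escaping issues, and `re.DOTALL` lets `.` match newlines inside an element.

```python
import re


def text_between(tag, text):
    """Return the contents of every <tag>...</tag> pair found in text."""
    # re.escape protects against regex metacharacters in the tag name
    name = re.escape(tag)
    # Non-greedy (.*?) stops at the first closing tag
    pattern = rf"<{name}>(.*?)</{name}>"
    # With a single capture group, findall returns a list of strings
    return re.findall(pattern, text, flags=re.DOTALL)


text = "<filename>MPSIGSTUB.EXE</filename><runs>0</runs>asdf<runs>1</runs>"
print(text_between("runs", text))      # ['0', '1']
print(text_between("filename", text))  # ['MPSIGSTUB.EXE']
```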
Cam
  • Hi @Cam, your example does work, but when I try to apply it to my code it doesn't return the correct result. I will update my code above to show you the full picture. – Ac3 May 05 '19 at 18:57