
I'm using Python 3, Linux Mint and Visual Studio Code.

I have some code that reads a directory and prints the names of the XML files it finds, like so:

persistence_security_dcshadow_4742.xml
Network_Service_Guest_added_to_admins_4732.xml
spoolsample_5145.xml
LM_Remote_Service02_7045.xml
DE_RDP_Tunneling_4624.xml

I'm trying to figure out how to keep only the integers from these filenames after the read script has run, i.e., discard all the text so that only the numbers remain. I tried using regular expressions via the re module but didn't have much luck.

Shaido

2 Answers


This is not the most robust of solutions, but if the data is always exactly in this form, you can split on underscores, take the last element, then split that on the period, and take the first element:

>>> line = "persistence_security_dcshadow_4742.xml"
>>> line.split("_")[-1].split(".")[0]
'4742'

Then, if you need it as a number rather than a string, parse the result with int.

You may want to add some error handling unless you know the data is clean.
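
As a rough sketch of what that error handling could look like (the function name `number_from_filename` is made up for illustration, and it assumes the number always follows the last underscore), the split can be wrapped so that unexpected names give None instead of raising:

```python
def number_from_filename(filename):
    """Return the trailing number in names like 'foo_4742.xml', or None."""
    # Take the text after the last underscore, then drop the extension.
    stem = filename.rsplit("_", 1)[-1].split(".")[0]
    try:
        return int(stem)
    except ValueError:
        # The piece was not a clean integer (e.g. 'readme.xml' gives 'readme').
        return None


print(number_from_filename("persistence_security_dcshadow_4742.xml"))  # 4742
print(number_from_filename("readme.xml"))  # None
```

Returning None is only one option; depending on the data, you might prefer to log the bad name or let the exception propagate.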

Carcigenicate

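Use a regex with [0-9]+ to find every run of digits, then take the last match (a name like LM_Remote_Service02_7045.xml contains more than one run):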

import re

regex = r'[0-9]+'

xmls = [
    'persistence_security_dcshadow_4742.xml',
    'Network_Service_Guest_added_to_admins_4732.xml',
    'spoolsample_5145.xml',
    'LM_Remote_Service02_7045.xml',
    'DE_RDP_Tunneling_4624.xml',
]

for xml in xmls:
    matches = re.findall(regex, xml)
    number = matches[-1]
    print(number)
> 4742
> 4732
> 5145
> 7045
> 4624
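
One caveat with `matches[-1]` is that it raises an IndexError when a filename contains no digits at all. A possible stricter variant, sketched here under the assumption that the number always sits directly before the `.xml` extension, anchors the pattern to the end of the name and returns None when nothing matches. The names `TRAILING_ID` and `trailing_id` are hypothetical:

```python
import re

# Assumed pattern: a run of digits immediately followed by a final ".xml".
TRAILING_ID = re.compile(r'(\d+)\.xml$')


def trailing_id(filename):
    """Return the digits just before '.xml' as a string, or None."""
    match = TRAILING_ID.search(filename)
    return match.group(1) if match else None


for name in ['LM_Remote_Service02_7045.xml', 'notes.xml']:
    print(trailing_id(name))
```

With this pattern the `02` in `Service02` is skipped because it is not followed by `.xml`.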

UPDATE

If you want to print the numbers only after all the files have been read, you can create a function that takes a list of XML filenames and returns the corresponding number for each one:

import re

def xmls_to_numbers(xmls):
    regex = r'[0-9]+'
    numbers = []
    for xml in xmls:
        matches = re.findall(regex, xml)
        number = matches[-1]
        numbers.append(number)
    return numbers


xmls = [
    'persistence_security_dcshadow_4742.xml',
    'Network_Service_Guest_added_to_admins_4732.xml',
    'spoolsample_5145.xml',
    'LM_Remote_Service02_7045.xml',
    'DE_RDP_Tunneling_4624.xml',
]

print(xmls_to_numbers(xmls))

> ['4742', '4732', '5145', '7045', '4624']

Amine Messaoudi
  • This looks great! Although I shortened it in the question, there are 150+ xml files which print! Would there be a method similar to the one you have written that is more suitable for bulk extraction of integers? The script prints the xml files to the terminal – Number_S1x Feb 01 '21 at 16:40
  • I see. I guess you want to hide the numbers from the terminal. Is that correct? – Amine Messaoudi Feb 01 '21 at 16:42
  • Not quite. Just having it so that after the script has read out the xml files, like you did in your writings, that only the integers remain and the text is discarded. I can post my script if that would help! – Number_S1x Feb 01 '21 at 16:44
  • Yes that would help – Amine Messaoudi Feb 01 '21 at 16:47
  • `for root, dirs, files in os.walk('/home/user/CI5235_K1915147_Sam/evtx_logs/'): for file in files: if file.endswith('.xml'): print(file)` – Number_S1x Feb 01 '21 at 16:50
  • I'm not sure why it posts the code like that. The above code is what i'm using to search a directory to see whether an xml file exists if so print it – Number_S1x Feb 01 '21 at 16:50
  • Do you want to print the numbers only after reading the files? I mean after the for loop? – Amine Messaoudi Feb 01 '21 at 16:54
  • Exactly that @Amine Messaoudi After it's read the files, print the numbers found within the filenames. So extracting the integers – Number_S1x Feb 01 '21 at 16:56
  • Please see my updated answer. – Amine Messaoudi Feb 01 '21 at 17:05
  • that makes sense now! So would I have to list each xml file individually, in the xmls variable underneath the function? – Number_S1x Feb 01 '21 at 17:09
  • Thank you so much for your help with this so far!! – Number_S1x Feb 01 '21 at 17:09
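
On the question in the comments of whether each XML file has to be listed by hand: probably not. A minimal sketch, assuming the `os.walk` loop from the comment above and the same `[0-9]+` regex from the answer, collects the filenames while walking the directory and extracts the numbers in the same pass. The function name `numbers_from_directory` is made up for illustration:

```python
import os
import re


def numbers_from_directory(directory):
    """Walk `directory` and return the last digit run of each .xml filename."""
    numbers = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.xml'):
                matches = re.findall(r'[0-9]+', file)
                if matches:  # skip names that contain no digits
                    numbers.append(matches[-1])
    return numbers


if __name__ == '__main__':
    # Path taken from the comment above; change it to your own directory.
    print(numbers_from_directory('/home/user/CI5235_K1915147_Sam/evtx_logs/'))
```

This prints one list once the walk has finished, so it should scale to 150+ files without any hand-written list.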