Removing Extra characters from XML code with python or PowerShell

Question

I have many XML files saved in a structure like below.

#$Dummy$#<MAIN location='Loc-01'>
--- Other tags & Elements ---
</MAIN>

Notice the characters #$Dummy$# at the beginning. They are inserted on purpose, to keep intruders from parsing and reading the data. Apart from this, the rest of the data is pure XML, and the files are saved with the .xml extension. I know how to parse XML with ElementTree.

In this case ElementTree throws the error below:

ParseError: not well-formed (invalid token): line 1, column 2

At present we open each file in a text editor and remove the characters manually. How can I remove them with Python or PowerShell? There are thousands of files to parse.

By definition, XML files are well-formed. If those characters reside outside nodes like that, that is not an XML file. Consider checking how such materials were generated, hopefully with a compliant DOM library. — Parfait, Aug 29 '19 at 20:59

mklement0 · Accepted Answer · 2019-08-30T16:35:05.427

In this simple case text processing via regular expressions sounds like the right approach, as in the following PowerShell solution (by definition you cannot parse your files as XML as-is, given the extraneous text before the well-formed XML):

Get-ChildItem -Filter *.xml | ForEach-Object {
  $file = $_.FullName
  (Get-Content -Raw $file) -creplace '^#\$Dummy\$#' | Set-Content -NoNewLine $file
}

Important: Set-Content uses a default character encoding, irrespective of the original input file's encoding; in Windows PowerShell, that is the active ANSI code page; more sensibly, it is BOM-less UTF-8 in PowerShell Core. Use the -Encoding parameter as needed.

Get-ChildItem -Filter *.xml returns all *.xml files in the current folder; tweak this command as needed; see Get-ChildItem's help.
Get-Content -Raw $file reads the entire file into memory as a single string; see Get-Content's help.
-creplace case-sensitively (c) matches the literal string #$Dummy$# (escaped for the regex as #\$Dummy\$#, because $ otherwise has special meaning) at the very start (^) of the input and implicitly replaces it with the empty string (since no replacement operand is given), which effectively removes it.
- For more information about PowerShell's -replace operator, see this answer.
Set-Content writes the (possibly modified) string back to $file. -NoNewLine (PSv5+) prevents an extra newline from getting appended.
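
Since the question also asks about Python, here is a rough equivalent of the steps above. It is only a sketch: the names strip_marker and clean_folder are made up for illustration, and it assumes the files use an ASCII-compatible encoding such as UTF-8 or an ANSI code page (not UTF-16). It works on raw bytes so that the rest of the file, including its encoding, is written back unchanged.

```python
from pathlib import Path

MARKER = b"#$Dummy$#"        # the marker from the question, as bytes
UTF8_BOM = b"\xef\xbb\xbf"   # optional byte-order mark some editors add


def strip_marker(data: bytes) -> bytes:
    """Return data without a leading MARKER; an optional UTF-8 BOM is kept."""
    bom = UTF8_BOM if data.startswith(UTF8_BOM) else b""
    body = data[len(bom):]
    if body.startswith(MARKER):
        body = body[len(MARKER):]
    return bom + body


def clean_folder(folder: str) -> int:
    """Rewrite each *.xml file in folder that starts with MARKER; return how many changed."""
    changed = 0
    for path in Path(folder).glob("*.xml"):
        original = path.read_bytes()
        cleaned = strip_marker(original)
        if cleaned != original:
            # Only touch files that actually contained the marker.
            path.write_bytes(cleaned)
            changed += 1
    return changed
```

Calling clean_folder(".") would process the current folder, like the Get-ChildItem call above. Try it on a copy of the files first, because it overwrites them in place.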

@mklement() this code works only when the `#$Dummy$#` characters are at the beginning. But, if it is at the end. it doesn't work. how to modify in such case? — Tommy, Aug 30 '19 at 11:34
@Tommy: Glad to hear it; good point re extra newline: you need to use `Set-Content -NoNewLine` to prevent that (PowerShell v5+ only) - I've updated the answer. — mklement0, Aug 30 '19 at 16:24
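
Regarding the case where the marker sits at the end of the file: one possible approach, sketched here in Python, is a regex with two alternatives, one anchored at the start and one at the end. The function name strip_edge_markers is hypothetical, and the sketch assumes the marker should be removed only at the edges, never inside the XML body.

```python
import re

MARKER = "#$Dummy$#"

# re.escape makes "$" literal. \A and \Z anchor to the very start and end of
# the string; the lookahead lets trailing whitespace follow an end marker
# without removing that whitespace.
EDGE_MARKER = re.compile(
    r"\A{0}|{0}(?=\s*\Z)".format(re.escape(MARKER))
)


def strip_edge_markers(text: str) -> str:
    """Remove MARKER at the start and/or end of text, leaving the middle alone."""
    return EDGE_MARKER.sub("", text)
```

For example, strip_edge_markers("<MAIN/>#$Dummy$#\n") gives "<MAIN/>\n", while a marker in the middle of the text is left in place.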

score 0 · Answer 2 · answered Aug 29 '19 at 20:05

You could use something like this in Python. If your character patterns are simple, you can use the str.replace method as below; if not, you may have to import a regular-expression module such as re to complete the task. This also assumes all the files are in one directory.

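    import os

    path = "/directory"
    # Placeholder substrings to remove; for the question this would be ["#$Dummy$#"].
    bad_chars = ["( )", " )( "]
    for a_file in os.listdir(path):
        # os.listdir returns bare names, so join them with the directory.
        with open(os.path.join(path, a_file), 'r+') as file:
            text = file.read()
            for chars in bad_chars:
                # str.replace returns a new string and needs both old and new values.
                text = text.replace(chars, "")
            # Rewind and truncate so the cleaned text overwrites the old content.
            file.seek(0)
            file.write(text)
            file.truncate()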
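
If the files should stay as they are on disk, another option is to strip the marker in memory and hand the remaining text straight to ElementTree. This is a minimal sketch; parse_marked_xml is a hypothetical helper name, and it assumes the marker appears only at the very start of the text.

```python
import xml.etree.ElementTree as ET

MARKER = "#$Dummy$#"


def parse_marked_xml(text: str) -> ET.Element:
    """Drop a leading MARKER, if present, and parse the rest as XML."""
    if text.startswith(MARKER):
        text = text[len(MARKER):]
    # ET.fromstring raises ET.ParseError if the remainder is not well-formed.
    return ET.fromstring(text)
```

With the sample from the question, parse_marked_xml("#$Dummy$#<MAIN location='Loc-01'></MAIN>") returns the MAIN element, and its get("location") call yields "Loc-01".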
