
I want to strip the XML markup from more than 100 XML files, leaving only the text, and I want to use PowerShell. Here is one sample file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="../../../helpproject.xsl" ?>
<topic template="Default" lasteditedby="liliya" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="../../../helpproject.xsd">

<title translate="true">Passwörter verwalten</title>

<body>
<header>
  <para styleclass="Heading1"><text styleclass="Heading1" translate="true">Passwörter verwalten</text></para>
</header>
<para styleclass="Normal"><table styleclass="container" rowcount="3" colcount="2" style="width:970px;">
  <tr style="vertical-align:top">
    <td style="width:50%;">
      <para styleclass="H1"><text styleclass="H1" translate="true">Passwörter verwalten</text></para>
    </td>
    <td style="width:50%;">
      <para styleclass="Image"><image src="manage_passwords.PNG" scale="100.00%" styleclass="Image"><title translate="true">Passwörter verwalten</title></image></para>
    </td>
  </tr>
</table></para>
<para styleclass="txt"/>

In Notepad++, after replacing the regexes <.+?> and ^\s+ with nothing, I see just the text!

With this script I copy the originals (to leave them unchanged) to a single folder, and then I just want to eliminate the XML tags:

Get-ChildItem -Path "C:\Users\cas\Documents\Wurzel_XML\" -Recurse |
Where-Object Name -like "*.xml" | 
Copy-Item -Destination "C:\Users\cas\Documents\check_xml\"

$newText = ($newText -replace "<.*?>", "").trim()|?{$_ -ne ''} 
Get-ChildItem -Path "C:\Users\cas\Documents\check_xml\" |
    Set-Content -Value $newText

But after that all the files are completely empty?

I previously tried

$newText = ($newText -replace "(?ms)^\s+<.*?</.*?>", "")
Get-ChildItem -Path "C:\Users\cas\Documents\check_xml\" |
    Set-Content -Value $newText

with the same result.
What am I doing wrong with that regex?
Thanks in advance,
Gooly

gooly
  • It's not clear what you're trying to accomplish. Can you provide examples of what your input looks like, and what you expect your output to look like? Properly-structured "XML files" contain "XML code"; if you remove the XML code from an XML file, you _will_ have an empty file. – Jeff Zeitlin Jan 31 '18 at 13:21
  • Regular expressions **are not** the appropriate tool to edit XML content. – axiac Jan 31 '18 at 13:31
  • Completely agree, it's [a terrible idea](https://stackoverflow.com/a/1732454/712649) – Mathias R. Jessen Jan 31 '18 at 13:32
  • *No matter how many times we say it, they won't stop coming every day... every hour even. It is a lost cause...* --- This link gets posted 100s of times every day, yet we still see an **endless** stream of "how can I parse XML with regex" questions :( – Tom Lord Jan 31 '18 at 13:35
  • Maybe StackOverflow should add a prompt, that asks all users mentioning "regex" and "xml/html" in a question to read that answer before clicking "submit". – Tom Lord Jan 31 '18 at 13:37
  • I am translating the text of the xml-files, I am not editing the xml-code! I need to proofread the text that will later appear elsewhere. The text is just 20%-25% of the size of the xml-files. So I just need the pure text to be able to read what others will later read. They don't see the xml-code either! But if you don't like the idea, just help me do that crazy thing - deal? – gooly Jan 31 '18 at 13:39
  • I tried to add xml-code but I wasn't able! The site knows css html javascript but no xml – gooly Jan 31 '18 at 13:40
  • @gooly gladly, but as Jeff Zeitlin mentioned we'll need a bit more information - please show us a sample input and output at least. If the xml doesn't render, please [leave it in your question anyways, and we'll help you edit it](https://stackoverflow.com/questions/48543131/eliminate-the-xml-code-from-xml-files) - don't post it in comments – Mathias R. Jessen Jan 31 '18 at 13:41
  • @TomLord where is that link - I haven't seen any – gooly Jan 31 '18 at 13:47
  • @gooly once again, please [edit the full xml into your question](https://stackoverflow.com/questions/48543131/eliminate-the-xml-code-from-xml-files#) - select it all and press `Ctrl + K` to format as code – Mathias R. Jessen Jan 31 '18 at 13:49
  • @JeffZeitlin: Well, in Notepad++ I just do two regex replacements: <.+?> and ^\s+ and I see the pure text - the problem is that I don't want to do it separately for more than 100 xml-files! – gooly Jan 31 '18 at 13:51
  • @gooly for the third time: please [add the sample xml to your question](https://stackoverflow.com/posts/48543131/edit) if you want qualified help – Mathias R. Jessen Jan 31 '18 at 13:56
  • That's not a valid xml document – Mathias R. Jessen Jan 31 '18 at 14:00
  • The files I am dealing with look like that. From one of them I took the first part, deleted some parts in the middle and left the last part. – gooly Jan 31 '18 at 14:08
  • So you mean to say that actually all of your XML files are invalid? Because what you posted most definitely is not valid XML. – Ansgar Wiechers Jan 31 '18 at 14:13
  • @gooly The link was in the comment directly above mine. It's also shown in the top right of this page, under the sub-heading: "Linked". *Do not use regex to edit XML.* – Tom Lord Jan 31 '18 at 15:55

1 Answer

Do Not Use Regular Expression Processing To Parse HTML, XHTML, or XML

PowerShell has cmdlets for processing XML, and the techniques for using them have been discussed in many places (see this Google search). If you read your files as structured XML and then use the Select-Xml cmdlet with appropriate XPath queries, you can extract the information you need reliably, provided that your XML is well-formed in the first place.
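
As a minimal sketch of that approach (the sample document and the variable names are made up for illustration and are not taken from the asker's files), the XPath `//text()` selects every text node, so no element names need to be known in advance:

```powershell
# Hypothetical, well-formed stand-in for one help topic.
$sample = @'
<topic template="Default">
  <title translate="true">Passwörter verwalten</title>
  <body>
    <para styleclass="Normal"><text translate="true">Some body text</text></para>
    <para styleclass="txt"/>
  </body>
</topic>
'@

# //text() matches every text node in the document, whatever its parent element is called.
$lines = Select-Xml -Content $sample -XPath '//text()' |
    ForEach-Object { $_.Node.Value.Trim() } |   # the text of each node, without surrounding whitespace
    Where-Object { $_ -ne '' }                  # drop nodes that held only whitespace

$lines
```

Each remaining entry in `$lines` is one run of text from the document, which is roughly what the two Notepad++ replacements leave behind. This only works if the input parses as XML; a truncated file will raise an error instead of returning partial text.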

Jeff Zeitlin
  • All the examples deal with the nodes (structure, info, tabulars, ...) - but this is not what I need - I just want to eliminate the xml-stuff from the xml file - I want to treat them as simple text files, as they (or their copies) aren't used as xml-files at all. – gooly Jan 31 '18 at 14:45
  • Then you're not understanding what XML is, or how it works - because you appear to want to extract CONTENT from nodes that have it - which is possible when processing XML with `Select-XML` - look at the MS Docs link I gave you in the answer. One of the examples specifically shows how to do it. – Jeff Zeitlin Jan 31 '18 at 14:49
  • Do you mean this one: Select-Xml -Content $Xml -XPath "//edition" | foreach {$_.node.InnerXML} ? Well I don't know the name of all nodes of more than 100 xml-files and without it I am asked to enter the name of a node :( – gooly Jan 31 '18 at 15:08
  • Yes, that's the example that I was thinking of. I haven't checked in detail, but I'm quite sure there's a way to select any element that has content, which is what you'd want to do. – Jeff Zeitlin Jan 31 '18 at 15:18
  • @gooly `([xml](Get-Content 'C:\path\to\your.xml')).DocumentElement.InnerText` will give you the raw text of an XML file, without the need to know any particular node name. Provided your XML is actually valid, that is. – Ansgar Wiechers Jan 31 '18 at 15:38
  • Thanks - but how can I loop through more than 100 xml-files, replacing their content by: ([xml](Get-Content 'C:\path\to\your.xml')).DocumentElement.InnerText ? – gooly Jan 31 '18 at 17:13
  • That is a completely different question but essentially ***foreach($file in Get-ChildItem PATHTOFOLDER){}*** – EBGreen Jan 31 '18 at 17:20
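
Putting the last few comments together, a loop over a folder might look like the sketch below. It is an illustration under assumptions, not a tested solution for the asker's files: the scratch folder, the two sample files and the `.txt` output naming are all hypothetical, and `XmlDocument.Load` is used in place of `[xml](Get-Content ...)`.

```powershell
# Hypothetical scratch folder with two tiny well-formed topics; point $folder at the real copies instead.
$folder = Join-Path ([System.IO.Path]::GetTempPath()) ("check_xml_" + [guid]::NewGuid().ToString('N'))
New-Item -ItemType Directory -Path $folder | Out-Null
Set-Content -Path (Join-Path $folder 'a.xml') -Value '<topic><title>First topic</title><body><para>Some text</para></body></topic>'
Set-Content -Path (Join-Path $folder 'b.xml') -Value '<topic><title>Second topic</title><body><para/></body></topic>'

foreach ($file in Get-ChildItem -Path $folder -Filter '*.xml') {
    # Load the file as XML; Load() honours the encoding named in the XML declaration.
    $doc = New-Object System.Xml.XmlDocument
    $doc.Load($file.FullName)

    # Write the text to a .txt file next to the source, so the XML copy itself stays intact.
    $target = [System.IO.Path]::ChangeExtension($file.FullName, '.txt')
    Set-Content -Path $target -Value $doc.DocumentElement.InnerText -Encoding UTF8
}
```

`InnerText` concatenates the text of all child nodes without separators, so headings and paragraphs run together in the output; selecting the `//text()` nodes and joining them with newlines would keep one text run per line. Loading through `XmlDocument.Load` rather than `Get-Content` is a deliberate choice here: Windows PowerShell 5.1 reads files without a byte-order mark in the system ANSI code page, which would garble characters such as the "ö" in the sample.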