
How can I accomplish something like this:

PS /home/nicholas/powershell> 
PS /home/nicholas/powershell> $date=(Get-Date | ConvertTo-Xml)                                         
PS /home/nicholas/powershell> 
PS /home/nicholas/powershell> $date

xml                            Objects
---                            -------
version="1.0" encoding="utf-8" Objects

PS /home/nicholas/powershell> 
PS /home/nicholas/powershell> $date.OuterXml
<?xml version="1.0" encoding="utf-8"?><Objects><Object Type="System.DateTime">12/12/2020 2:43:46 AM</Object></Objects>
PS /home/nicholas/powershell> 

but, instead, reading in a file?


How do I load/import/read/convert an XML file using ConvertTo-Xml for parsing with Select-Xml using XPath?

PS /home/nicholas/powershell> 
PS /home/nicholas/powershell> $xml=ConvertTo-Xml ./bookstore.xml
PS /home/nicholas/powershell> 
PS /home/nicholas/powershell> $xml                              

xml                            Objects
---                            -------
version="1.0" encoding="utf-8" Objects

PS /home/nicholas/powershell> 
PS /home/nicholas/powershell> $xml.InnerXml                     
<?xml version="1.0" encoding="utf-8"?><Objects><Object Type="System.String">./bookstore.xml</Object></Objects>
PS /home/nicholas/powershell> 
PS /home/nicholas/powershell> $xml.OuterXml                     
<?xml version="1.0" encoding="utf-8"?><Objects><Object Type="System.String">./bookstore.xml</Object></Objects>
PS /home/nicholas/powershell> 
PS /home/nicholas/powershell> cat ./bookstore.xml

<?xml version="1.0"?>
<!-- A fragment of a book store inventory database -->
<bookstore xmlns:bk="urn:samples">
  <book genre="novel" publicationdate="1997" bk:ISBN="1-861001-57-8">
    <title>Pride And Prejudice</title>
    <author>
      <first-name>Jane</first-name>
      <last-name>Austen</last-name>
    </author>
    <price>24.95</price>
  </book>
  <book genre="novel" publicationdate="1992" bk:ISBN="1-861002-30-1">
    <title>The Handmaid's Tale</title>
    <author>
      <first-name>Margaret</first-name>
      <last-name>Atwood</last-name>
    </author>
    <price>29.95</price>
  </book>
  <book genre="novel" publicationdate="1991" bk:ISBN="1-861001-57-6">
    <title>Emma</title>
    <author>
      <first-name>Jane</first-name>
      <last-name>Austen</last-name>
    </author>
    <price>19.95</price>
  </book>
  <book genre="novel" publicationdate="1982" bk:ISBN="1-861001-45-3">
    <title>Sense and Sensibility</title>
    <author>
      <first-name>Jane</first-name>
      <last-name>Austen</last-name>
    </author>
    <price>19.95</price>
  </book>
</bookstore>

PS /home/nicholas/powershell> 

Creating the xml file within the REPL console itself works as expected:

How to parse XML in Powershell with Select-Xml and Xpath?

  • `$xml = [xml]( Get-Content .\bookstore.xml -raw ); $xml | Select-Xml YourXPath` – zett42 Dec 12 '20 at 10:47
  • @zett42 No, don't use `Get-Content` and cast the result to XML. This is the single most common error I see when people read XML in PowerShell. Use `$doc = New-Object xml; $doc.Load('path.to.xml');`. This deals with file encodings properly. Using `Get-Content` happily mangles your data. – Tomalak Dec 12 '20 at 10:53
  • @Tomalak Even with `Get-Content -raw`? – zett42 Dec 12 '20 at 11:01
  • @zett42 Yeah, even then. See my answer for the gist of it. – Tomalak Dec 12 '20 at 11:08
  • @Tomalak Got it. Probably just got lucky because most XML documents are UTF-8 encoded, which happens to be the default encoding used by `Get-Content`. – zett42 Dec 12 '20 at 11:18
  • @zett42 Nowadays. Earlier versions of PS defaulted to whatever "ANSI" default encoding your system had, in Europe/the US likely Windows-1252. `Get-Content` pays attention to the BOM, so it will recognize UTF-16 unaided, but UTF-8 downloaded from the Internet usually has no BOM. And `Get-Content` will continue to butcher "foreign" single-byte encodings. Ultimately, it really is luck when it works. And it's entirely unnecessary to rely on luck with XML when transparent encoding detection is a fundamental part of the spec. – Tomalak Dec 12 '20 at 11:25

1 Answer

Properly reading an XML document in PowerShell works like this:

$doc = New-Object xml
$doc.Load( (Convert-Path bookstore.xml) )
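The Convert-Path call matters because .NET methods such as XmlDocument.Load know nothing about PowerShell's current location, and the two working directories are not kept in sync (this is discussed in the comments below and in PowerShell issue #3428). The following is only a small sketch of that mismatch; it uses the system temp directory as an arbitrary place to switch to, and the variable names are illustrative.

```powershell
# Remember where we are, then move PowerShell to another directory.
$before = Get-Location
Set-Location ([System.IO.Path]::GetTempPath())

# PowerShell's location has changed; the .NET process directory usually has not.
$psLocation  = (Get-Location).Path
$netLocation = [Environment]::CurrentDirectory

# Convert-Path resolves a relative path against PowerShell's location and
# returns a full native path that can be passed safely to .NET methods.
$full = Convert-Path .

"PowerShell location : $psLocation"
".NET directory      : $netLocation"
"Convert-Path .      : $full"

# Go back to where we started.
Set-Location $before
```

If the first two lines of output differ, a relative path passed straight to $doc.Load() would be resolved against the wrong directory.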

XML can come in numerous file encodings, and using the XmlDocument.Load method makes sure the file is read properly without prior knowledge of the encoding.

Not reading a file with the correct encoding will result in mangled data or errors except in very basic or very lucky cases.

The often-seen method of using Get-Content and casting the resulting string to [xml] is the wrong way of dealing with XML for this very reason. So don't do that.
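A minimal sketch of the difference, assuming a throwaway file in the temp directory (the file name and the sample content are made up for this demo). The file declares ISO-8859-1 and contains one non-ASCII character. What Get-Content makes of it depends on the PowerShell version and system defaults, so no particular output is claimed for that line.

```powershell
# "Brontë", built from a code point so this script's own encoding doesn't matter.
$name = "Bront$([char]0xEB)"
$text = "<?xml version=`"1.0`" encoding=`"iso-8859-1`"?><author>$name</author>"

# Write the document as ISO-8859-1 bytes (hypothetical demo file).
$path  = Join-Path ([System.IO.Path]::GetTempPath()) 'latin1-sample.xml'
$bytes = [System.Text.Encoding]::GetEncoding('iso-8859-1').GetBytes($text)
[System.IO.File]::WriteAllBytes($path, $bytes)

# XmlDocument.Load picks up the encoding from the XML declaration.
$doc = New-Object xml
$doc.Load($path)
$loaded = $doc.author

# Get-Content decodes with PowerShell's default encoding and ignores the declaration.
$viaGetContent = ([xml](Get-Content -LiteralPath $path -Raw)).author

"Load:        $loaded"
"Get-Content: $viaGetContent"
"Load matches the original text: $($loaded -eq $name)"

Remove-Item -LiteralPath $path
```

Where the default encoding is UTF-8, the Get-Content line will typically show a replacement character instead of the ë. Where the default happens to be a compatible single-byte code page, it may look fine, which is exactly the "lucky case".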

You can get a correct result with Get-Content, but that requires

  1. Prior knowledge of the file encoding (e.g. Get-Content bookstore.xml -Encoding UTF8)
  2. Hard-coding the file encoding into your script (meaning it will break if the XML encoding ever changes unexpectedly)
  3. Limiting yourself to the very few file encodings that Get-Content supports (XML supports more)

It means you put yourself in a position where you have to manually think about and solve a problem that XML has been specifically designed to automatically handle for you.

Doing things correctly with Get-Content is a lot of unnecessary extra work and limitations. And doing things incorrectly is pointless when doing it right is so easy.
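If the two-step form feels too long for interactive use, it can be wrapped in a small function. The sketch below is not a built-in cmdlet: the name Import-XmlDocument and its parameter are hypothetical, and it assumes PowerShell 5 or later for `[xml]::new()`.

```powershell
# Hypothetical helper (not part of PowerShell): loads an XML file via
# XmlDocument.Load so that the encoding is detected from the file itself.
function Import-XmlDocument {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, ValueFromPipeline, ValueFromPipelineByPropertyName)]
        [Alias('FullName')]
        [string] $Path
    )
    process {
        $doc = [xml]::new()
        # Convert-Path turns a relative or PowerShell-drive path into a full native path.
        $doc.Load((Convert-Path -LiteralPath $Path))
        $doc
    }
}
```

With this defined, `$doc = Import-XmlDocument ./bookstore.xml` and `Get-Item ./bookstore.xml | Import-XmlDocument` should both return the loaded document.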


Examples, after loading $doc as shown above:

$doc.bookstore.book

prints a list of <book> elements and their properties

genre           : novel
publicationdate : 1997
ISBN            : 1-861001-57-8
title           : Pride And Prejudice
author          : author
price           : 24.95

genre           : novel
publicationdate : 1992
ISBN            : 1-861002-30-1
title           : The Handmaid's Tale
author          : author
price           : 29.95

genre           : novel
publicationdate : 1991
ISBN            : 1-861001-57-6
title           : Emma
author          : author
price           : 19.95

genre           : novel
publicationdate : 1982
ISBN            : 1-861001-45-3
title           : Sense and Sensibility
author          : author
price           : 19.95

$doc.bookstore.book | Format-Table

prints the same thing as a table

genre publicationdate ISBN          title                 author price
----- --------------- ----          -----                 ------ -----
novel 1997            1-861001-57-8 Pride And Prejudice   author 24.95
novel 1992            1-861002-30-1 The Handmaid's Tale   author 29.95
novel 1991            1-861001-57-6 Emma                  author 19.95
novel 1982            1-861001-45-3 Sense and Sensibility author 19.95

$doc.bookstore.book | Where-Object publicationdate -lt 1992 | Format-Table

filters the data

genre publicationdate ISBN          title                 author price
----- --------------- ----          -----                 ------ -----
novel 1991            1-861001-57-6 Emma                  author 19.95
novel 1982            1-861001-45-3 Sense and Sensibility author 19.95

$doc.bookstore.book | Where-Object publicationdate -lt 1992 | Sort publicationdate | select title

sorts and prints only the <title> field

title                
-----                
Sense and Sensibility
Emma

There are many more ways of slicing and dicing the data; it all depends on what you want to do.
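Since the question asks about Select-Xml and XPath, here is a sketch of the same kind of queries done that way. To keep it self-contained, the sample document is loaded from an inline string with LoadXml; with a real file you would load $doc as shown at the top of this answer. The variable names are illustrative.

```powershell
# Inline copy of the sample data so the example runs on its own.
$xmlText = @'
<bookstore xmlns:bk="urn:samples">
  <book genre="novel" publicationdate="1997" bk:ISBN="1-861001-57-8">
    <title>Pride And Prejudice</title>
    <author><first-name>Jane</first-name><last-name>Austen</last-name></author>
    <price>24.95</price>
  </book>
  <book genre="novel" publicationdate="1992" bk:ISBN="1-861002-30-1">
    <title>The Handmaid's Tale</title>
    <author><first-name>Margaret</first-name><last-name>Atwood</last-name></author>
    <price>29.95</price>
  </book>
  <book genre="novel" publicationdate="1991" bk:ISBN="1-861001-57-6">
    <title>Emma</title>
    <author><first-name>Jane</first-name><last-name>Austen</last-name></author>
    <price>19.95</price>
  </book>
  <book genre="novel" publicationdate="1982" bk:ISBN="1-861001-45-3">
    <title>Sense and Sensibility</title>
    <author><first-name>Jane</first-name><last-name>Austen</last-name></author>
    <price>19.95</price>
  </book>
</bookstore>
'@
$doc = New-Object xml
$doc.LoadXml($xmlText)

# Titles of all books whose author's last name is Austen.
$austenTitles = Select-Xml -Xml $doc -XPath '//book[author/last-name="Austen"]/title' |
    ForEach-Object { $_.Node.InnerText }

# XPath compares the attribute numerically here, unlike a string-typed property.
$oldTitles = Select-Xml -Xml $doc -XPath '//book[@publicationdate < 1992]/title' |
    ForEach-Object { $_.Node.InnerText }

# Namespaced attributes need a prefix-to-URI mapping passed via -Namespace.
$isbns = Select-Xml -Xml $doc -XPath '//book/@bk:ISBN' -Namespace @{ bk = 'urn:samples' } |
    ForEach-Object { $_.Node.Value }

$austenTitles
$oldTitles
$isbns
```

Select-Xml returns wrapper objects; the matched element or attribute is in each result's Node property.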

Code Doggo
Tomalak
  • but, now it's allononlineoftextandiskindahardtoread. How do I print it out nicely? – Nicholas Saunders Dec 12 '20 at 11:07
  • @Nicholas What do you want to print out nicely? Values from the XML? The XML itself? What's the overall goal you want to achieve? – Tomalak Dec 12 '20 at 11:14
  • it would be convenient to pretty print the raw xml (as with xmllint) if that's built-in to powershell. see also https://stackoverflow.com/q/65264292/4531180 for ultimate goal. (the printing of xml would just be for convenience.) – Nicholas Saunders Dec 12 '20 at 11:19
  • I would add that one gets away with the `Get-Content` method most of the time only because most XML documents are UTF-8 encoded, which happens to be the default encoding used by `Get-Content`. Of course this is bad "programming by chance" and should be avoided. I guess most people are using it because they like one-liners. So if we could provide a one-liner for the correct method, this could encourage more people to use it. – zett42 Dec 12 '20 at 11:25
  • @Nicholas PowerShell runs on top of .NET. Anything that can be done with .NET can be done with PowerShell, give or take. Pretty-printing XML is certainly possible, but I doubt that that's what you really need. You want to work with the contained values somehow, and outputting a nice table of values is both easier and more useful than printing out an indented XML tree. I'll add an example to my answer. – Tomalak Dec 12 '20 at 11:31
  • It's embarrassing how many tutorials and even highly up-voted, top-ranked SO answers promote the wrong `Get-Content` method. E. g. https://stackoverflow.com/a/18509715/7571258 – zett42 Dec 12 '20 at 11:49
  • see also: https://www.powershellmagazine.com/2013/08/19/mastering-everyday-xml-tasks-in-powershell/ – Nicholas Saunders Dec 12 '20 at 11:51
  • @zett42 It really depends on whether you care about writing correct code or not. What's the best one-liner, the fastest loop, the quickest corner to cut worth when the result is incorrect? People are not using `Get-Content` because it's a one-liner, but because they don't care (or know) about encodings, because it has always worked on their machine, and because it's all over the Internet and they've just copy-pasted it like the rest of their code. ;) But `$doc = New-Object xml; $doc.Load($path)` (or `$doc = [xml]::new(); $doc.Load($path)`) fits on one line, so there's that. – Tomalak Dec 12 '20 at 12:02
  • @zett42 And yeah, it is embarrassing how regularly people get this one wrong. It's a hopeless fight, really, just as hopeless as trying to spread the word that regex cannot handle HTML and that every minute trying to do it anyway is wasted. There are just too many bad examples out there. – Tomalak Dec 12 '20 at 12:04
  • A pipable one-liner would be preferable for me. `Select-Xml -Path` comes close but strangely enough seems to have the same encoding issue as `Get-Content` (just tested with a "windows-1251" encoded XML file, containing cyrillic letters, which is no problem for `[xml]::Load()` method). – zett42 Dec 12 '20 at 15:52
  • @zett42 That's amazing! I've never tried it, but `Select-Xml` actually messes this up (I tried, my PS Version is 5.1.18362). This is an actual bug in PowerShell, and an embarrassing one, too. – Tomalak Dec 12 '20 at 16:15
  • ...regarding the "pipe-ability" - in a script, I don't think it's a huge drawback to have one more line. Directly on the command line for one-offs... I'd call it a minor inconvenience. Overall, knowing the rules is necessary in order to know when you can break them. – Tomalak Dec 12 '20 at 16:29
  • I've created a bug report for the `Select-Xml` encoding problem: https://github.com/PowerShell/PowerShell/issues/14404 – zett42 Dec 12 '20 at 18:48
  • @mklement0 Can `XmlDocument.Load()` handle PowerShell drive specifications? – Tomalak Dec 28 '20 at 07:52
  • No, .NET APIs know nothing about PowerShell drives (and the PowerShell engine doesn't try to translate them for method calls). The bigger problem is the lack of synchronization of working dirs. between PowerShell and .NET - see https://github.com/PowerShell/PowerShell/issues/3428 - which alone necessitates passing _full_ paths to .NET methods, and `Convert-Path` is the right tool for that, due to resolving to a _native_ path - can I suggest you update your answer accordingly? – mklement0 Dec 28 '20 at 12:50
  • @mklement0 Ah, got it. That's exactly the reason why I've used `Resolve-Path`, interesting to learn that it's the wrong tool. Go ahead and edit! – Tomalak Dec 28 '20 at 13:16
  • @zett42, doing what is undoubtedly the right and most robust thing here is indeed so cumbersome - and obscure - that people will keep taking the `[xml] (Get-Content -Raw ...)` shortcut, unless we provide a PowerShell-idiomatic alternative that is both robust and convenient: please see https://github.com/PowerShell/PowerShell/issues/14505 – mklement0 Dec 28 '20 at 13:52