
I would like to retrieve HTML from the clipboard via the command line and am struggling to get the encoding right.

For instance, if you open a command prompt or WSL, copy the text ⇧Shift+⭾TAB, and run:

powershell.exe Get-Clipboard

The correct text is retrieved (⇧Shift+⭾TAB).

But if you then try to retrieve the clipboard as HTML:

powershell.exe "Get-Clipboard -TextFormatType html"

The following text is retrieved:

...â‡§Shift+â­¾TAB...

This seems to be an encoding confusion on the part of the Get-Clipboard cmdlet. How can I work around this?


Edit: As @Zilog80 indicates in the comments, the actual encoding of the text does not match the encoding it is assumed to have. I can rectify this in Ruby, for instance, using:

out = `powershell.exe Get-Clipboard -TextFormatType html`
puts out.encode('cp1252').force_encoding('utf-8')

Any idea how to achieve the same on the command line?
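To make the diagnosis concrete, the mis-decoding can be reproduced without a clipboard. The following is a minimal Ruby sketch, assuming the cmdlet decodes the UTF-8 bytes as Windows-1252 (cp1252). The helper names `garble` and `repair` are hypothetical and exist only for illustration:

```ruby
# Simulate the suspected bug: UTF-8 bytes wrongly decoded as Windows-1252.
def garble(text)
  text.dup.force_encoding('Windows-1252').encode('UTF-8')
end

# Undo it: map each character back to its single byte, then
# reinterpret the resulting byte sequence as UTF-8.
def repair(text)
  text.encode('Windows-1252').force_encoding('UTF-8')
end

original = "\u21E7Shift+\u2B7ETAB" # the sample text from the question
garbled  = garble(original)

# The first character (U+21E7) turns into three characters, one per UTF-8 byte.
puts garbled.codepoints.first(3).map { |c| format('U+%04X', c) }.join(' ')
puts repair(garbled) == original
```

This prints `U+00E2 U+2021 U+00A7` followed by `true`: the three bytes 0xE2 0x87 0xA7 were each read as a separate Windows-1252 character, and the round trip restores the original.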

Christopher Oezbek
  • `â‡§Shift+â­¾TAB` is the [UTF-8 encoding](https://design215.com/toolbox/utf8-3byte-characters.php) of `⇧Shift+⭾TAB`: `â‡§` is the UTF-8 byte sequence _0xE2 0x87 0xA7_ for "⇧", and likewise `â­¾` is _0xE2 0xAD 0xBE_ for "⭾". Normally, it's not a problem to have UTF-8 characters directly in an HTML stream. The output of `Get-Clipboard -TextFormatType html` is an array of strings that includes HTML strings, so this is expected. I guess you were expecting the PowerShell console output of this string array to render the UTF-8 HTML strings in your CLI's current code page? – Zilog80 Jun 10 '21 at 09:14
  • Let me turn the question around: Why wouldn't I expect it? If I pipe the output to a file, the file can't be opened correctly. If I use `iconv`, I don't know which source encoding to tell it. – Christopher Oezbek Jun 10 '21 at 09:21
  • I see, the problem comes from the fact that it's a _fragment_ of an HTML page/stream, so you don't have the `` to give you the corresponding encoding of the HTML _fragment_. I guess the best way would be to get it from the `SourceURL` string provided in the array, if the _fragment_ doesn't include it. – Zilog80 Jun 10 '21 at 09:29
  • This is indeed a shortcoming of `Get-Clipboard`. The HTML format is [documented](https://learn.microsoft.com/windows/win32/dataxchg/html-clipboard-format) to support only UTF-8, so the cmdlet should interpret it as such. As of PowerShell 7.1 the cmdlet has no `-TextFormatType` parameter at all, so it's unlikely to get fixed there, while the linked UserVoice page for the PS 5 branch seems to be inactive. – Jeroen Mostert Jun 10 '21 at 10:29
  • I'm speculating as to the encoding PowerShell is going to be using when decoding the data, but it's probably whatever the system default ANSI encoding is. In that case `[Text.Encoding]::UTF8.GetString([Text.Encoding]::Default.GetBytes((Get-Clipboard -TextFormatType Html -Raw)))` will recode the text, since we may assume UTF-8, but with the caveat that if the default ANSI encoding does not cover all code points from 0-255, some characters might get lost. Fortunately Windows-1252 (the most common default) does cover all code points. – Jeroen Mostert Jun 10 '21 at 10:33
  • @Jeroen: `As of PowerShell 7.1 the cmdlet has no -TextFormatType parameter at all` => Any indication how to retrieve RTF and HTML from Clipboard in the future? – Christopher Oezbek Jun 10 '21 at 10:58
  • That would be an issue for the [PowerShell repo](https://github.com/PowerShell/PowerShell), which covers 7+. Obviously this particular way of handling the clipboard is specific to Windows, which is why it's not strange that there are no provisions for it in the platform-agnostic cmdlet (yet). – Jeroen Mostert Jun 10 '21 at 11:24
  • @JeroenMostert: Do you want to post `[Text.Encoding]::UTF8.GetString([Text.Encoding]::Default.GetBytes((Get-Clipboard -TextFormatType Html -Raw)))` as an answer for me to accept? Is `-Raw` necessary? – Christopher Oezbek Jun 10 '21 at 12:34
  • `-Raw` preserves newlines in the original, which may or may not be necessary depending on what you do with it, but is certainly cleaner -- without it the lines have to be concatenated by PowerShell again before getting passed to `GetBytes`, which is at the very least wasteful, and at worst screws up the format. – Jeroen Mostert Jun 10 '21 at 12:40

1 Answer


This is indeed a shortcoming of Get-Clipboard. The HTML format is documented to support only UTF-8, regardless of the source encoding of the page, so the cmdlet should interpret it as such, but it doesn't.

I'm speculating as to the encoding PowerShell is going to be using when decoding the data, but it's probably whatever the system default ANSI encoding is. In that case

[Text.Encoding]::UTF8.GetString([Text.Encoding]::Default.GetBytes( `
    (Get-Clipboard -TextFormatType Html -Raw) `
)) 

will recode the text, but with the caveat that if the default ANSI encoding does not cover all code points from 0-255, some characters might get lost. Fortunately Windows-1252 (the most common default) does cover all code points.
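The caveat can be made explicit in code. The following is a minimal Ruby sketch (Ruby being the language the question already uses), not the cmdlet's actual logic. The helper name `recode_as_utf8` is hypothetical, and the default of Windows-1252 is an assumption about the system's ANSI code page. It returns `nil` instead of silently losing characters when the text cannot be mapped back to single bytes, and it tolerates empty input such as an empty clipboard:

```ruby
# Hypothetical helper: undo a wrong single-byte decoding of UTF-8 text.
# `assumed` is the code page the text is believed to have been decoded with.
# Returns '' for nil/empty input and nil when no lossless repair is possible.
def recode_as_utf8(text, assumed = 'Windows-1252')
  return '' if text.nil? || text.empty? # nothing to recode

  bytes = text.encode(assumed)          # raises if a character has no byte in `assumed`
  utf8  = bytes.force_encoding('UTF-8') # reinterpret the same bytes as UTF-8
  utf8.valid_encoding? ? utf8 : nil     # reject byte runs that are not valid UTF-8
rescue Encoding::UndefinedConversionError
  nil
end

p recode_as_utf8("\u00E2\u2021\u00A7") == "\u21E7" # garbled arrow is restored
p recode_as_utf8(nil)                              # empty input yields ''
p recode_as_utf8("\u21E7")                         # already correct text: no repair
```

Returning `nil` rather than replacing unmappable characters keeps a wrong guess about the code page visible to the caller, who can then try another encoding or keep the text as it is.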

Jeroen Mostert
  • One nitpick: This does not work if the Clipboard is empty: `Exception calling "GetBytes" with "1" argument(s): "Array cannot be null. Parameter name: chars"` – Christopher Oezbek Jun 10 '21 at 14:31
  • May I suggest adding that tests with sites using a non-UTF-8 charset, such as `https://indiechina.com/` (GB2312, Simplified Chinese), confirm that the HTML fragment in the clipboard is _'Default'_-encoded anyway? That could be useful for cases other than the clipboard. – Zilog80 Jun 10 '21 at 14:35
  • @Zilog80: I'm not sure what you mean by "other cases than the clipboard" -- the fact that the encoding is UTF-8 is specifically tied to the clipboard HTML format and nothing else. I already mention this is independent of the site's encoding. I don't know which encoding PowerShell uses to retrieve the format (I really don't feel like decompiling and reverse engineering it), but obviously it cannot depend on the site either if the clipboard format already doesn't depend on it. – Jeroen Mostert Jun 10 '21 at 14:46
  • @ChristopherOezbek: I can't test right now, but you could see if using `((Get-Clipboard -TextFormatType Html -Raw) + @())` in the inner call fixes this (it should force the result to an empty object array on null, which is then treated as an empty byte array). – Jeroen Mostert Jun 10 '21 at 14:50
  • @JeroenMostert As this is a question about the clipboard's HTML format, mentioning tests that merely verify what the documentation states is probably not useful. I suggested "other cases than the clipboard" because a [recent question](https://stackoverflow.com/q/67415761/3641635) about the output of external commands under PowerShell has the same trouble: 'Default' encoding for the output in some cases. There was also an unanswered six-year-old question on the input side, which I guess is related. So it seems to me that it may be useful for others to know that 'Default' may apply in cases other than the clipboard. – Zilog80 Jun 10 '21 at 15:07