If you want to remove all characters that fall outside the ASCII range (Unicode code-point range U+0000 - U+007F):
# Removes any non-ASCII characters from the LHS string,
# which includes the problematic hidden control characters.
'S0841488.JPG0608201408.21' -creplace '\P{IsBasicLatin}'
The solution uses -creplace, the case-sensitive variant[1] of the regex-based -replace operator, with the negated form (\P) of the Unicode block name IsBasicLatin, which refers to the ASCII sub-range of Unicode. In short: \P{IsBasicLatin} matches any non-ASCII character, and since no replacement string is specified, each match is effectively removed; because -creplace invariably replaces all matches in the input string, all non-ASCII characters are removed.
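As a minimal sketch of the behavior just described, the following builds the sample string with the two invisible marks spelled out as explicit [char] casts, so that they are visible in the source code. The variable names are illustrative, and the position of the marks (between 06082014 and 08.21) is an assumption about the input.

```powershell
# Make the invisible characters explicit (assumed position: before '08.21').
$rlm = [char] 0x200F   # RIGHT-TO-LEFT MARK
$lrm = [char] 0x200E   # LEFT-TO-RIGHT MARK
$original = 'S0841488.JPG06082014' + $rlm + $lrm + '08.21'

# No replacement operand: every match is replaced with the empty string.
$cleaned = $original -creplace '\P{IsBasicLatin}'

$original.Length   # 27
$cleaned.Length    # 25
$cleaned           # S0841488.JPG0608201408.21
```

Comparing the two lengths is a quick way to confirm that exactly two characters were removed, even though the strings look identical on screen.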
You can verify that this effectively removes the (invisible) LEFT-TO-RIGHT MARK, U+200E, and RIGHT-TO-LEFT MARK, U+200F, characters from your string with the help of the Debug-String function, which is available as an MIT-licensed Gist:
# Download and define the Debug-String function.
# NOTE:
# I can personally assure you that doing this is safe, but
# you should always check the source code first.
irm https://gist.github.com/mklement0/7f2f1e13ac9c2afaf0a0906d08b392d1/raw/Debug-String.ps1 | iex
# Visualize the existing non-ASCII-range characters
'S0841488.JPG0608201408.21' | Debug-String -UnicodeEscapes
# Remove them and verify that they're gone.
'S0841488.JPG0608201408.21' -replace '\P{IsBasicLatin}' | Debug-String -UnicodeEscapes
The above yields the following:
S0841488.JPG06082014`u{200f}`u{200e}08.21
S0841488.JPG0608201408.21
Note the visualization of the invisible control characters as `u{200f} and `u{200e} in the original input string, and how they are no longer present after applying the -replace operation.
In PowerShell (Core) 7+ (but not Windows PowerShell), such Unicode escape sequences can also be used in expandable strings, i.e. inside double-quoted string literals (e.g., "Hi`u{21}" expands to verbatim Hi!) - see the conceptual about_Special_Characters help topic.
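If you'd rather not download a helper function, the following sketch (PowerShell 7+ only, because it relies on the escape sequences just mentioned) builds the sample string with explicit escapes and lists the non-ASCII characters using built-in features. The variable names are illustrative, the position of the two marks is an assumption, and the loop inspects UTF-16 code units, so a character outside the BMP would be reported as two surrogate values. In Windows PowerShell you would have to construct the string with casts such as [char] 0x200F instead.

```powershell
# Requires PowerShell (Core) 7+: `u{...} escapes aren't recognized in Windows PowerShell.
$original = "S0841488.JPG06082014`u{200f}`u{200e}08.21"

# List every UTF-16 code unit above U+007F in U+XXXX notation.
$nonAscii = foreach ($ch in $original.ToCharArray()) {
  if ([int] $ch -gt 0x7F) { 'U+{0:X4}' -f [int] $ch }
}
$nonAscii   # U+200F, U+200E

# After the removal, no character outside the ASCII range should remain.
$cleaned = $original -creplace '\P{IsBasicLatin}'
$stillHasNonAscii = $cleaned -cmatch '\P{IsBasicLatin}'
$stillHasNonAscii   # False
```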
[1] See this answer for an explanation of why case-sensitive matching must be used.
Despite the operator being case-sensitive, the \P{L} construct, which is inherently case-insensitive because the Unicode general category L comprises letters of either case, still excludes lowercase letters too (whereas \P{Lu} / \P{Ll} would only exclude uppercase / lowercase letters).
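To illustrate the difference between these category constructs, here is a small sketch using a made-up sample string ('aB1-' is purely illustrative). With -creplace, whatever the negated construct does not exclude gets removed:

```powershell
$sample = 'aB1-'

# \P{L} excludes letters of either case, so only non-letters are removed.
$keepLetters = $sample -creplace '\P{L}'
$keepLetters   # aB

# \P{Lu} excludes only uppercase letters; everything else is removed.
$keepUpper = $sample -creplace '\P{Lu}'
$keepUpper     # B

# \P{Ll} excludes only lowercase letters; everything else is removed.
$keepLower = $sample -creplace '\P{Ll}'
$keepLower     # a
```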