3

I have some problems using vlookup in Excel. I have seen the issue, but I haven't got the solution yes.

I have a huge amount of lines in a txt file, these lines contains Unicode characters.

Example: This line: 'S0841488.JPG06082014‏‎08.21' contains those two unicode characters: U+200F U+200E 'S0841488.JPG06082014 U+200F U+200E 08.21.

Please tell me how to remove these unicode characters using Powershell.

mklement0
  • 382,024
  • 64
  • 607
  • 775
Larry
  • 31
  • 1
  • 3
  • This is a common question. See https://stackoverflow.com/questions/61792387/powershell-find-non-ascii-characters-in-text-file – js2010 Apr 23 '23 at 13:20

2 Answers2

10

If you want to remove all characters that fall outside the ASCII range (Unicode code-point range U+0000 - U+007F):

# Removes any non-ASCII characters from the LHS string,
# which includes the problematic hidden control characters.
'S0841488.JPG06082014‏‎08.21' -creplace '\P{IsBasicLatin}'

The solution uses -creplace, the case-sensitive variant[1] of the regex-based -replace operator, with the negated form (\P) of the Unicode block name IsBasicLatin, which refers to the ASCII sub-range of Unicode. In short: \P{IsBasicLatin} matches any non-ASCII character, and since no replacement string is specified, effectively removes it; combined with -creplace invariably replacing all matches in the input string, all non-ASCII characters are removed.


You can verify that this effectively removes the (invisible) LEFT-TO-RIGHT MARK, U+200E and RIGHT-TO-LEFT MARK, U+200F characters from your string with the help of the Debug-String function, which is available as an MIT-licensed Gist:

# Download and define the Debug-String function.
# NOTE: 
#  I can personally assure you that doing this is safe, but you
#  you should always check the source code first.
irm https://gist.github.com/mklement0/7f2f1e13ac9c2afaf0a0906d08b392d1/raw/Debug-String.ps1 | iex


# Visualize the existing non-ASCII-range characters
'S0841488.JPG06082014‏‎08.21' | Debug-String -UnicodeEscapes

# Remove them and verify that they're gone.
'S0841488.JPG06082014‏‎08.21' -replace '\P{IsBasicLatin}' | Debug-String -UnicodeEscapes

The above yields the following:

S0841488.JPG06082014`u{200f}`u{200e}08.21
S0841488.JPG0608201408.21

Note the visualization of the invisible control characters as `u{200f} and `u{200e} in the original input string, and how they are no longer present after applying the -replace operation.

In PowerShell (Core) 7+ (but not Windows PowerShell), such Unicode escape sequences can also be used in expandable strings, i.e. inside double-quoted string literals (e.g., "Hi`u{21}" expands to verbatim Hi!) - see the conceptual about_Special_Characters help topic.


[1] See this answer for an explanation of why case-sensitive matching must be used.
Despite the operator being case-sensitive, the inherently case-insensitive \P{L} regex block-name construct still excludes lowercase letters too (whereas \P{Lu} / \P{Ll}would only exclude uppercase / lowercase letters).

mklement0
  • 382,024
  • 64
  • 607
  • 775
0

If they are written as you say as 'U+200F' and 'U+200E' so basic replace will do the work. Only required to make escape character before +. Because it has a special meaning in regex.

Regex explanation:

\s - Means whitespace.

? - It maybe exists maybe not.

\ - Before + is an escape character.

[EF] - It means letter E or letter F.

('|.$) - single-quote or dot at the end.

# Path to your file and new file where no unicode characters.
$Path = "C:\JustForExample\FileThatContainsThisLines.txt"
$NewPath = "C:\JustForExample\FileThatContainsThisLines2.txt"

# Getting content of file.
$content = Get-Content -Path $Path

# Removing unicode characters.
$newContent = $content -replace "\s?U\+200[EF]\s?"

# For removing quot and dot at the end.
$newContent = $newContent -replace "('|\.$)" 

# Saving content to new Path.
# If you need to replace them inside the file, so just change NewPath to Path.
$newContent | Set-Content -Path $NewPath

So replace simple should remove what is a match to that pattern.

Puzo
  • 95
  • 3
  • Hi Puzo Thank you - but - I´m sorry, it doesn´t work as intended. I use this one URL for viewing the Unicode characers: https://www.soscisurvey.de/tools/view-chars.php - and after running the script, the characters haven´t been removed. The result is: S0841488.JPG06082014‏‎ U+200F U+200E 08.21. – Larry Jul 10 '21 at 13:35
  • Hmmm... My main issue because I can't reproduce it to check why this is not working. Maybe you can try that convert all content to UTF8, and then they become visible. $MyRawString = Get-Content -Raw $Path $Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False [System.IO.File]::WriteAllLines($NewPath, $MyRawString, $Utf8NoBomEncoding) Can you share some example file? – Puzo Jul 10 '21 at 16:13
  • The desire is to remove the actual Unicode characters from the strings - something like `U+200F` is just a _visualization_ of those characters, which themselves are _invisible_. If you copy and paste the string following the phrase "This line: " from the question, you'll get a string that contains these hidden characters. – mklement0 Jul 10 '21 at 19:32
  • Removing all characters that fall outside the ASCII range using this command: $newContent = $content -replace '\P{IsBasicLatin}' Though, it creates a new error: ALL 'I' (capital i) is removed. – Larry Jul 14 '21 at 13:55
  • On a meta note, @Larry: Unless you @-mention me (as I have just done with your username), I won't be notified of follow-up comments. If you comment on _my_ answer (which would have been better, and any follow-up comments should be posted there), you wouldn't have needed the @-mentioning. – mklement0 Jul 16 '21 at 13:59
  • @Larry, re the `i` / `I` problem: you've hit on a serious bug in _Windows PowerShell_, since corrected in PowerShell (Core) - please see my updated answer. – mklement0 Jul 16 '21 at 13:59