How to substitute accented characters when renaming text files using first line in PowerShell

Question

I'm trying to batch rename plain text files using the first line of each file. I want to keep only alphanumeric characters, and with your help I'm almost there. The only issue is that accented characters like é or á should either be replaced with their unaccented counterparts, e and a (the text is in Spanish), or be kept in the name as they are, not removed. This is what I'm using right now:

Get-ChildItem *.txt | Rename-Item -NewName {
    $firstLine = ($_ | Get-Content -TotalCount 1) -replace '[^a-z0-9 ]'
    '{0}.txt' -f $firstLine
}

Thank you. If possible, please let me know if there is a way to keep the symbol "?" too.

Since `?` is a wildcard symbol in many shells, one really shouldn't try to use it in a file name. (In MacOS' HFS+ it is a valid character.) — vonPryz, Mar 30 '23 at 20:01

Santiago Squarzon · Answer 1 · 2023-03-31T00:45:28.800


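The approach is similar to the one used in this answer: you can use the String.Normalize method before your regex replacement. Normalizing to FormD splits each accented letter into its base letter plus a separate combining mark, and the regex then removes the mark while keeping the base letter.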

As for not removing ?, you can simply add it to the character range: [^a-z0-9 ?], however it is an invalid character for file names in Windows, thus not used in the renaming code snippet for this answer. You can use [IO.Path]::GetInvalidFileNameChars() to get the list of invalid characters for your OS.

Get-ChildItem *.txt | Rename-Item -NewName {
    $firstLine = ($_ | Get-Content -TotalCount 1 -Encoding utf8).
        Normalize([Text.NormalizationForm]::FormD) -replace '[^a-z0-9 ]'

    '{0}.txt' -f $firstLine
}

Example:

$string = 'áÁéÉñÑ?'
$string.Normalize([Text.NormalizationForm]::FormD) -replace '[^a-z0-9 ?]'

# Outputs:
# aAeEnN?
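As a side sketch of the [IO.Path]::GetInvalidFileNameChars() idea mentioned above, the invalid characters can be turned into a regex character class and stripped in one step. The function name Remove-InvalidFileNameChar is hypothetical and not part of the original answer; the result depends on the platform, since Windows reports many more invalid characters than Unix-like systems do.

```powershell
# Hypothetical helper (an assumption for illustration, not from the original answer):
# removes every character that the current platform reports as invalid in a file name.
function Remove-InvalidFileNameChar {
    param([string] $Name)

    # Join the invalid characters into one string, then escape it so that
    # characters such as '\' are safe inside a regex character class.
    $invalid = [IO.Path]::GetInvalidFileNameChars() -join ''
    $pattern = '[{0}]' -f [regex]::Escape($invalid)

    # -replace with no replacement operand removes every match.
    $Name -replace $pattern
}

# On Windows this drops '?' and ':'; on Unix-like systems both are valid and stay.
Remove-InvalidFileNameChar 'Que hora es?: parte 1'
```

This could be chained after the Normalize and -replace steps if you decide to allow extra symbols in the first regex.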

Worth noting, default Get-Content encoding will be problematic in Windows PowerShell:

Default Uses the encoding that corresponds to the system's active code page (usually ANSI).

Hence the need for -Encoding utf8. Newer PowerShell versions don't have this problem, as they default to utf8NoBOM.
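To illustrate why the wrong encoding turns accented letters into "A", here is a minimal sketch that simulates the misread in memory. It assumes ISO-8859-1 as a stand-in for the ANSI code page (for these two bytes it agrees with Windows-1252), and it builds é from its code point so the example does not depend on how the script file itself is saved.

```powershell
# 'é' (U+00E9), built from its code point to avoid script-file encoding issues.
$text = [string][char]0xE9

# UTF-8 stores 'é' as two bytes: 0xC3 0xA9.
$utf8Bytes = [Text.Encoding]::UTF8.GetBytes($text)

# Assumption: ISO-8859-1 stands in for the ANSI code page that Windows PowerShell
# would use by default. Each byte becomes its own character: U+00C3 and U+00A9.
$misread = [Text.Encoding]::GetEncoding('iso-8859-1').GetString($utf8Bytes)

# U+00C3 decomposes to 'A' plus a combining tilde; the regex drops the tilde
# and the second character, leaving only 'A'.
$wrong = $misread.Normalize([Text.NormalizationForm]::FormD) -replace '[^a-z0-9 ]'

# The correctly decoded text decomposes to 'e' plus a combining acute accent.
$right = $text.Normalize([Text.NormalizationForm]::FormD) -replace '[^a-z0-9 ]'

$wrong   # A
$right   # e
```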

edited Mar 31 '23 at 00:45

answered Mar 30 '23 at 20:09

Santiago Squarzon


The fact that all symbols and accented characters get substituted by an "A" may be related to encoding? – eera5607 Mar 30 '23 at 20:22
@eera5607 make sure your `.ps1` script is saved using utf8 with BOM encoding. Other than that, will need more details on where you're seeing the issue – Santiago Squarzon Mar 30 '23 at 20:24

I believe for normalization to work correctly you also need to add `-Encoding UTF8` to `get-content` – markalex Mar 30 '23 at 20:29

@markalex good point but it has nothing to do with normalization. It has to do with `Get-Content` in PowerShell 5.1 not using the right encoding to read files.. – Santiago Squarzon Mar 30 '23 at 20:32
I mean, since it reads something like `Â¿Ã¡` instead of `éá`, it gets normalized to `AA` and not `ea`. But you're totally right, it's not normalization's fault. – markalex Mar 30 '23 at 20:34

markalex · Accepted Answer · 2023-03-30T20:23:31.450

All you need to do is add á and é to the exclusion list of your replacement, and they will be preserved:

Get-ChildItem *.txt | Rename-Item -NewName {
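    # '^' anchors the second replacement so '.txt' is appended only once;
    # an unanchored '.*' would also match the empty string at the end of the input.
    ($_ | Get-Content -TotalCount 1 -Encoding UTF8) -replace '[^a-z0-9éá ]', '' -replace '^.*', '$0.txt'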
}

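As for ?, it is not a valid character for file names in Windows, so I don't see the point. But you can always chain multiple replacements and replace it with something that is allowed. Note that ? has to be added to the exclusion list of the first replacement, otherwise it is already gone before the second one runs. Like so:

"asd we'wea?gke é or á? to b" -replace '[^a-z0-9éá ?]', '' -replace '\?', '!!!!'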

mklement0 · Answer 3 · 2023-03-30T21:27:15.880

Santiago Squarzon's helpful answer shows you how to transform accented letters - such as é - to their unaccented form, such as e, causing them to be covered by the a-z regex range expression.

As for preserving accented characters as-is (which you state is acceptable too):

In lieu of a-z you can use \p{Ll}, which matches any Unicode lowercase letter and therefore accented ones too (see the list of all Unicode categories).
By virtue of -replace being case-insensitive, uppercase letters are implicitly considered as well:

Get-ChildItem *.txt | Rename-Item -NewName {
  $firstLine = 
    ($_ | Get-Content -TotalCount 1 -Encoding utf8) -replace '[^\p{Ll}0-9 ]'
  '{0}.txt' -f $firstLine
}

Note: I'm using -Encoding utf8, as in the other answers, to read your file, which is only necessary in Windows PowerShell if your file happens to be UTF-8-encoded but without a BOM.

A simplified example:

# -> 'aÄ éE 42'; that is, all letters and digits were preserved.
'a-Ä é/E 42' -replace '[^\p{Ll}0-9 ]'
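To make the reliance on case-insensitivity visible, here is a small sketch contrasting -replace with its case-sensitive variant -creplace. It also uses \p{L} (any Unicode letter) as a more explicit alternative; that swap is a suggestion for illustration, not something the answer above prescribes.

```powershell
$sample = 'a-Ä é/E 42'

# Case-insensitive (default): \p{Ll} effectively covers uppercase letters too.
$insensitive = $sample -replace '[^\p{Ll}0-9 ]'

# Case-sensitive: only true lowercase letters survive, so 'Ä' and 'E' are removed.
$lowerOnly = $sample -creplace '[^\p{Ll}0-9 ]'

# \p{L} matches letters of any case, so the result no longer depends on case-insensitivity.
$anyLetter = $sample -creplace '[^\p{L}0-9 ]'

$insensitive   # aÄ éE 42
$lowerOnly     # a é 42
$anyLetter     # aÄ éE 42
```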

Thanks, @Santiago. I deliberately left out `?` after reading eera5607's comment in response to vonPryz's comment about how using `?` in _file names_ is best avoided (if even supported, which varies by platform). — mklement0, Mar 30 '23 at 22:15
