
I want to detect invalid characters inside a .csv file. Currently I can only catch every character that is not English. Is there any way to catch all invalid characters while still allowing English and German ones?

The following code finds the characters that are not English letters.

$path = "product.csv"

$a = Get-Content $path | Select-String -AllMatches -Pattern "[^\x00-\x79]" | Select-Object LineNumber,Line,@{Name='String';Expression={$_.Matches.Value}}
$b = $a.count

$a
Write-Host "Total:  $b"

All German characters that occur in people's names should count as valid characters.

Manuel Batsching
Yong Cai
  • If you want to check for invalid characters in a file path, check out [GetInvalidFileNameChars()](https://stackoverflow.com/questions/23066783/how-to-strip-illegal-characters-before-trying-to-save-filenames) – Olaf Reitz Oct 20 '17 at 09:12
  • Sorry, I forgot to mention that this Get-Content call reads the content of the .csv file, not the file name. – Yong Cai Oct 20 '17 at 09:15
  • Is it intentional, that you allow the characters "[]" but not "{}"? – Manuel Batsching Oct 20 '17 at 09:50
  • What's an "invalid character" anyway? The concept makes no sense. What are you trying to achieve? – Tomalak Oct 20 '17 at 10:15
  • @Tomalak Hi, thanks for the question. The values inside the pattern are wrong; they are just an example. This script scans through a .csv file that contains common attributes such as a person's name and a date, so I want to eliminate symbols such as ! and *, which are considered invalid characters for a person's name, address, etc. – Yong Cai Oct 23 '17 at 01:12
  • Read ["Falsehoods Programmers Believe About Names"](https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/). Also, if somebody entered `"What should I write here?"` into the name field, it does not become valid just because you eliminate the question mark. The set of characters that you consider "invalid" ("invalid in German" even more so!) will be wrong. Filtering out "invalid" characters will not make your data any better. Logical conclusion: this is pointless. Stop wasting your time on it. – Tomalak Oct 23 '17 at 05:32

1 Answer


The easiest way would be to add the hex escapes for the German-specific characters to your character class. The characters you are looking for are:

 ß \xdf
 Ü \xdc
 ü \xfc
 Ä \xc4
 ä \xe4
 Ö \xd6
 ö \xf6

So your new match group would be:

-Pattern "[^\x00-\x79\xdf\xdc\xfc\xc4\xe4\xd6\xf6]"
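Dropped into the script from the question, the extended pattern could look like the sketch below. The sample file, its contents, and the `-Encoding UTF8` switch are assumptions made for illustration (the question does not say how `product.csv` is encoded). The sample rows are built from `[char]` codes so the script does not depend on how the script file itself is saved.

```powershell
# Hypothetical sample data, written to a temp file so the sketch is self-contained.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'product-sample.csv'
$sample = @(
    'Name,City'
    "J$([char]0xFC)rgen Gro$([char]0xDF),K$([char]0xF6)ln"   # Jürgen Groß,Köln
    "Jos$([char]0xE9) Pe$([char]0xF1)a,Madrid"               # José Peña,Madrid
)
Set-Content -Path $path -Value $sample -Encoding UTF8

# Range from the question plus the seven German-specific characters.
$pattern = '[^\x00-\x79\xdf\xdc\xfc\xc4\xe4\xd6\xf6]'

# @(...) forces an array so that .Count is reliable even for a single hit.
$a = @(Get-Content -Path $path -Encoding UTF8 |
    Select-String -AllMatches -Pattern $pattern |
    Select-Object LineNumber, Line, @{Name = 'String'; Expression = { $_.Matches.Value }})
$b = $a.Count

$a
Write-Host "Total:  $b"
```

With this sample, only the row containing é and ñ should be reported; the German row passes because ü, ß, and ö are now part of the allowed set.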

Edit:

As an alternative to matching characters by their code points you could also use the actual characters in your match pattern:

-Pattern "[^a-zA-ZäÄöÖüÜß]"

It's easier to read, and it also doesn't accept all the non-human-readable control characters between \x00 and \x21 that the range above lets through.
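One thing to keep in mind with the letters-only class is that it also flags digits, spaces, and separators such as the commas in a CSV row. The sketch below compares it with an extended class; which extra characters to allow (`0-9`, space, `,`, `.`, `-`) is purely an assumption here, and the sample line is made up. If the script contains literal umlauts, saving it as UTF-8 with a BOM is the safer choice for Windows PowerShell 5.1.

```powershell
# Pattern from the answer: anything that is not a Latin letter, umlaut, or eszett.
$lettersOnly = '[^a-zA-ZäÄöÖüÜß]'
# Hypothetical extension that also accepts digits, spaces, and a few separators.
$extended = '[^a-zA-Z0-9äÄöÖüÜß ,.\-]'

# Made-up sample row.
$line = 'Müller, Jörg,1984!'

# [regex]::Matches returns every match; collect the matched characters into arrays.
$flaggedByLettersOnly = @([regex]::Matches($line, $lettersOnly) | ForEach-Object { $_.Value })
$flaggedByExtended    = @([regex]::Matches($line, $extended)    | ForEach-Object { $_.Value })

"Letters only flags $($flaggedByLettersOnly.Count) characters"
"Extended flags $($flaggedByExtended.Count) characters: $($flaggedByExtended -join '')"
```

For this sample, the letters-only class reports the commas, the space, and the digits as well as the `!`, while the extended class reports only the `!`.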

Manuel Batsching
  • Hi Manuel Batsching, I searched online and found more than 7 German characters ([link](https://www.alt-codes.net/german_alt_codes/)). So is the match group you provided complete, or are there more German characters that need to be added? And may I know how you got the hexadecimal values? Is there a resource with the full list? I may later need to check for other languages such as Chinese and Korean. Thanks – Yong Cai Oct 23 '17 at 01:22
  • Why the hex escaping? – Tomalak Oct 23 '17 at 05:21
  • @YongCai As a German I can assure you that these 7 extra characters that you found are not used in the German alphabet. – Manuel Batsching Oct 23 '17 at 14:16
  • @Tomalak The hex escaping is not necessary, but as the OP used it, I was looking for a solution as close to the OP's code as possible. – Manuel Batsching Oct 23 '17 at 14:18
  • @ManuelBatsching May I know where to get the hexadecimal values of these characters? Any links? – Yong Cai Oct 24 '17 at 04:58
  • The Windows Character Map will give you the Unicode code points (in hex). Alternatively you can use PowerShell: `'{0:X2}' -f [int][char]'ä'`. – Manuel Batsching Oct 24 '17 at 09:09
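The `-f` one-liner from the last comment can be looped over a whole set of characters to build the character class instead of looking each value up by hand. This is only a sketch: the variable names are made up, and it emits four-digit `\uXXXX` escapes rather than the two-digit `\xXX` form used in the answer, since `\x` takes exactly two hex digits and would not cover code points above U+00FF (relevant for Chinese or Korean). It assumes the script file is saved in an encoding PowerShell reads correctly, such as UTF-8 with a BOM.

```powershell
# Characters that should count as valid in addition to the range from the question.
$allowed = 'ßÜüÄäÖö'.ToCharArray()

# Turn each character into a \uXXXX regex escape, e.g. ß -> \u00df.
$escapes = foreach ($c in $allowed) { '\u{0:x4}' -f [int]$c }

# Assemble the negated character class.
$pattern = '[^\x00-\x79' + (-join $escapes) + ']'
$pattern

# Quick check: -match returns $true when the string contains a flagged character.
'Köln' -match $pattern
'Peña' -match $pattern
```

The first check should return `False` (ö is allowed) and the second `True` (ñ is not in the set).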