While searching, I found a similar topic: "Powershell find non-ASCII characters in text file".

My function below does not handle all cases, for example an apostrophe or other special characters:

function IsStringDiacritic {
    param (
        [parameter(Mandatory = $True)][string]$String
    )
    
    If ($String -as [System.Net.Mail.MailAddress]) {
        $String = $String.Split('@')[0]
    }

    Return [bool]($String -cmatch '[^\x20-\x7F]')
}

Above is the function I made, but it does not give me what I need.

I want to pass first.last to the function and have it return true if the string contains a diacritic, and false otherwise.

My function can also handle an email address: it tests whether the string is an address and, if so, splits off the part before the @. That is not the primary concern, though.

I think I need a regex that will look at first.last or first last, but I am not sure how to cover all the possibilities.

Any better ideas?

dcaz
  • If you need to find diacritics, why use a pattern that matches any char other than printable ASCII? Use `-match '\p{M}'` – Wiktor Stribiżew Oct 25 '21 at 19:28
  • Could have something here but none of my testing with your answer is working. @WiktorStribiżew – dcaz Oct 25 '21 at 19:31
  • @WiktorStribiżew refers to the literal definition of _diacritic_, which is a mark such as the COMBINING DIAERESIS ([`U+0308`](http://www.fileformat.info/info/unicode/char/308)), which _modifies another character_, typically to form an _accented character_, which is probably what you meant. `\p{M}` matches such a mark. However, given that such marks are rarely used _as separate `[char]` instances_ (Unicode code units) in .NET, that is rarely useful (see next comment). – mklement0 Oct 26 '21 at 03:51
  • Background: Unicode has multiple _equivalent_ ways of representing an accented character: as a _single_ code point that represents the _composed_ form, e.g. `ä` (code point `0xe4`), or in _decomposed_ form: `'a'` followed by the aforementioned diaeresis (`'a' + [char] 0x308`). Both _render_ the same, and `-eq` recognizes the equivalence, but `-match` does not. Given that the composed, single-code-point form of accented characters is far more common than the decomposed two-code-point form, matching against `\p{M}` is rarely useful, as it only matches the diacritic alone, as a separate code unit. – mklement0 Oct 26 '21 at 03:53
  • It seems my entire question was wrong and my understanding of the rules is not complete. What I needed was to find if an email address had ```[!#$%^&*(`/?,'' äöüßÄÖÜ)]``` After all the help I got here I made the following. ```function IsStringSpecialCharacters { param ( [parameter(Mandatory = $True)][string]$String ) $String -match '[!#$%^&*(`/?,'' äöüßÄÖÜ)]' }``` This returns true if an email address has any of the special characters that I defined. – dcaz Oct 26 '21 at 14:02
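
The composed-versus-decomposed distinction raised in the comments can be worked around by normalizing the string before matching. The following is a minimal sketch, assuming that "diacritic" means any Unicode combining mark; `Test-Diacritic` is a hypothetical name that does not appear in the thread.

```powershell
# Hypothetical helper (not from the thread): reports whether a string
# contains a combining mark, in either composed or decomposed input.
function Test-Diacritic {
  param (
    [Parameter(Mandatory)]
    [string]$Text
  )

  # FormD splits a composed character such as U+00E4 into its base letter
  # followed by the combining mark U+0308, so that \p{M} can see the mark.
  $decomposed = $Text.Normalize([System.Text.NormalizationForm]::FormD)

  $decomposed -match '\p{M}'
}

Test-Diacritic ('b' + [char] 0xE4 + 'r')   # composed form
Test-Diacritic ('a' + [char] 0x308)        # decomposed form
Test-Diacritic "o'brien"                   # apostrophe only
```

Letters that have no canonical decomposition, such as `ß`, contain no combining mark after normalization, so this sketch does not flag them.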

2 Answers

It looks like your true intent wasn't to find characters with diacritics, but to ensure that a given name - either specified in isolation or as the username part of an email address (the part before @) - is composed only of the following:

  • lowercase ASCII-range (English) letters, i.e. a through z
  • a . or space, if any, to separate the name components.

A PowerShell-idiomatic solution is to define a Test-Name function that indicates whether a given name is valid:

function Test-Name {

  param (
      [Parameter(Mandatory)]
      [string]$Name
  )
  
  $Name -cmatch '^[a-z]+(?:[. ][a-z]+)?(?:@.+)?$'

}

Calling Test-Name with, for example, foo.bar, foo bar, foo.bar@example.org, or foobar@example.org yields $true, whereas föo.bär, Foo.bar, foo-bar, and .foobar yield $false.

Note:

  • If uppercase English letters are also acceptable, replace -cmatch with -match.

  • To allow additional separator characters, add them to the [. ] character set; e.g., to include - and _, use [. _-] (place - first or last, so that it isn't interpreted as part of a range of characters, such as in [a-z])

  • (?:@.+)? matches everything starting with @, if present (but places no constraint on what follows the @ other than having to comprise at least one character).

  • Note how the entire string is matched to ensure that a name doesn't start or end with a . or space, and that only one separator is present.

    • If you also want to allow, say, three name components (e.g. 'foo.bar.baz'), use the following regex:
      • ^[a-z]+(?:[. ][a-z]+){0,2}(?:@.+)?$
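
The variations described in the notes above can be combined into one function. This is only a sketch: `Test-NameEx` and its `-MaxComponents` parameter are hypothetical names introduced for illustration, and the separator set `[. _-]` is the example set from the note on additional separators.

```powershell
# Hypothetical variant of Test-Name: allows '.', ' ', '_' and '-' as
# separators and a configurable maximum number of name components.
function Test-NameEx {
  param (
    [Parameter(Mandatory)]
    [string]$Name,

    # Maximum number of name components; 3 permits 'foo.bar.baz'.
    [ValidateRange(1, 10)]
    [int]$MaxComponents = 3
  )

  # One component is always required; each further one needs a separator.
  $extra = $MaxComponents - 1
  $pattern = '^[a-z]+(?:[. _-][a-z]+){0,' + $extra + '}(?:@.+)?$'

  $Name -cmatch $pattern
}

Test-NameEx 'foo.bar.baz'                # three components
Test-NameEx 'foo_bar-baz@example.org'    # mixed separators, email form
Test-NameEx 'foo.bar.baz.qux'            # four components exceed the default
Test-NameEx 'foo.bar' -MaxComponents 1   # separator not allowed here
```

The first two calls should yield `$true` and the last two `$false`. As in the original function, nothing after the `@` is validated.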
mklement0
  • Since this topic was closed due to my many errors in not following or understanding the rules, I have tried to define what I want. It seems my question should have been: how do I find out whether an email string has characters like ```!#$%^&*(`) ```? The answer above works, but not for my needs. – dcaz Oct 26 '21 at 03:05
  • @dcaz, even that isn't clearly defined. What, precisely, defines the unwanted characters? If you can enumerate them easily, use a character set (`[...]`), e.g. `[!#$%^&*(\`)]`. However, it may be better to specify those characters that you want to _allow_, as shown in this answer. If `[a-z]` is not enough, add extra characters _there_, such as `[a-z0-9]` in order to also allow decimal digits. Again, to also allow uppercase characters, use `-match` instead of `-cmatch`. – mklement0 Oct 26 '21 at 03:19

The answer that seems to work for now is as follows.

function IsStringDiacritic {
    param (
        [parameter(Mandatory = $True)][string]$String
    )
    
    If ($String -as [System.Net.Mail.MailAddress]) {
        $String = $String.Split('@')[0]
    }
    
    If ($String -like '*.*') {
        $String = $String.Replace('.', '')
    }

    $String = $String.Trim()
    
    Return [bool]($String -cmatch '[^a-z]')
}

My original function was not handling all cases, for example an apostrophe or the other special characters that someone could use in error or on purpose. My function so far seems to tell me whether the string contains any character that is not a to z. Its limitation is that it only covers the English alphabet.
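
When a check like this fails, it can help to see which characters caused the failure. The following is a rough sketch along the same lines; `Get-InvalidNameCharacter` is a hypothetical name, and unlike the function above it always splits at `@` rather than first testing for a valid mail address.

```powershell
# Hypothetical companion function: lists the characters that are
# neither a lowercase ASCII letter nor the '.' separator.
function Get-InvalidNameCharacter {
  param (
    [Parameter(Mandatory)]
    [string]$String
  )

  # Keep only the part before '@' (assumption: always split, no address test).
  $local = $String.Split('@')[0].Trim()

  # [regex]::Matches is case-sensitive by default, like -cmatch.
  [regex]::Matches($local, '[^a-z.]') |
    ForEach-Object { $_.Value } |
    Select-Object -Unique
}

Get-InvalidNameCharacter "o'brien.smith@example.org"   # emits the apostrophe
Get-InvalidNameCharacter 'first.last'                  # emits nothing
```

An empty result means the name part consists only of a to z and dots.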

dcaz