I have 20,000+ records to deal with. Multiple passes like those below are fine, unless of course all of it can be done in one super-efficient regex?

Sample records:

ABBEY Chantelle - 08.11.1995 - A

ANAND Toni-Grace - 04.09.1999 - A

ADCOCK ALVEY James - 12.04.1992 - C

ADLINGTON-JONES Robin Jacob Sebastian - 15.02.1999 - B

AFZAL Kiera - 25.04.2000 - B

AHMED Nisar Abu Ben Adhem - 16.08.2002 - C

AIRE-DEANE Christopher-James - 06.01.1997 - B

AL-MISRI Yaqoob - 23.07.2004 - C

ASTER Lily-May - 01.04.2010 - B

McQUEEN Stephen - 02.02.2001 - A

Desired output:

ABBEY¬Chantelle¬08.11.1995¬A

ANAND¬Toni-Grace¬04.09.1999¬A

ADCOCK ALVEY¬James¬12.04.1992¬C

ADLINGTON-JONES¬Robin¬Jacob¬Sebastian¬15.02.1999¬B

AFZAL¬Kiera¬25.04.2000¬B

AHMED¬Nisar¬Abu¬Ben¬Adhem¬16.08.2002¬C

AIRE-DEANE¬Christopher-James¬06.01.1997¬B

AL-MISRI¬Yaqoob¬23.07.2004¬C

ASTER¬Lily-May¬01.04.2010¬B

McQUEEN¬Stephen¬02.02.2001¬A

First Pass:

  • Find: ^([A-Z]{2,20}-[A-Z]{2,20}) ([A-Za-z]{1,20}) - ([0-9]{2}.[0-9]{2}.[0-9]{4}) - ([A|B|C])$

  • Replace: \1¬\2¬\3¬\4

  • Result:

    AL-MISRI¬Yaqoob¬23.07.2004¬C

Second Pass:

  • Find: ^([A-Z]{2,20}) ([A-Za-z]{1,20}) - ([0-9]{2}.[0-9]{2}.[0-9]{4}) - ([A|B|C])$

  • Replace: \1¬\2¬\3¬\4

  • Result:

    ABBEY¬Chantelle¬08.11.1995¬A

    AFZAL¬Kiera¬25.04.2000¬B

    McQUEEN Stephen¬02.02.2001¬A

Third Pass:

  • Find: ^([A-Z]{2,20}) ([A-Za-z]{1,20}-[A-Za-z]{1,20}) - ([0-9]{2}.[0-9]{2}.[0-9]{4}) - ([A|B|C])$

  • Replace: \1¬\2¬\3¬\4

  • Result:

    ANAND¬Toni-Grace¬04.09.1999¬A

    ASTER¬Lily-May¬01.04.2010¬B

Fourth Pass:

  • Find: ^([A-Z]{2,20}-[A-Z]{2,20}) ([A-Za-z]{1,20}-[A-Za-z]{1,20}) - ([0-9]{2}.[0-9]{2}.[0-9]{4}) - ([A|B|C])$

  • Replace: \1¬\2¬\3¬\4

  • Result:

    AIRE-DEANE¬Christopher-James¬06.01.1997¬B

But the above regexes can't account for these records:

ADCOCK ALVEY James - 12.04.1992 - C

ADLINGTON-JONES Robin Jacob Sebastian - 15.02.1999 - B

AHMED Nisar Abu Ben Adhem - 16.08.2002 - C

Notes:

All last names appear first, in CAPITALS, and some may be hyphenated. First names (and second or other middle names) come next in Title Case and may be hyphenated too.

Match Case is enabled in Notepad++ during the Search and Replace activity. None of the names contains an apostrophe (e.g. O'KEEFE); they have all been removed.

Even if just the names can be sorted, I can deal with the dates and suffixes separately. Any help would be greatly appreciated, as I'm still a novice at RegEx.

I also apologise in advance if I have missed an existing solution, in case I didn't use the correct tags or terminology during my searches on this site.

I've checked this article; however, it didn't help to resolve my query: Regular expression for first and last name

Gary
Ifte

2 Answers

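Matching names is not easy given all the possibilities, but for the example data you could use a pattern with \G that matches a surname, a forename or the date, then uses \K to keep only the separator that follows (a space, optionally followed by - and more spaces). Replace each match with ¬.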

Use (?-i) or tick the Match case checkmark.

(?-i)(?:^(?:Mc)?[A-Z]+(?:[ -][A-Z]+)*|\G(?!^)[A-Z][a-z]+(?:-[A-Z][a-z]+)*|\d{2}\.\d{2}\.\d{4})\K -?\h*
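
For anyone who wants to run the same conversion outside Notepad++, here is a minimal Python sketch. It is not a port of the pattern above: Python's built-in `re` module supports neither `\G` nor `\K`, so this version matches a whole record, splits the forenames on spaces, and joins the parts with ¬. The names `RECORD` and `convert` are made up for illustration, and the sketch assumes every record follows the layout shown in the question (suffix A, B or C, surname in capitals with an optional "Mc" prefix).

```python
import re

# One record: SURNAME(S) Forename(s) - dd.mm.yyyy - suffix
RECORD = re.compile(
    r"^(?P<surname>(?:Mc)?[A-Z]+(?:[ -][A-Z]+)*)"       # capitals, optional "Mc", space or hyphen joins
    r" (?P<forenames>[A-Z][a-z]+(?:[ -][A-Z][a-z]+)*)"  # Title Case names, space or hyphen joins
    r" - (?P<date>\d{2}\.\d{2}\.\d{4})"                 # date with literal dots
    r" - (?P<suffix>[ABC])$"                            # assumed suffix set, taken from the samples
)

def convert(line, sep="¬"):
    """Return the record with fields joined by sep; unmatched lines are returned unchanged."""
    m = RECORD.match(line)
    if m is None:
        return line
    # Hyphenated forenames stay intact because only spaces are split on.
    parts = [m["surname"], *m["forenames"].split(" "), m["date"], m["suffix"]]
    return sep.join(parts)

print(convert("ADCOCK ALVEY James - 12.04.1992 - C"))
# ADCOCK ALVEY¬James¬12.04.1992¬C
print(convert("AIRE-DEANE Christopher-James - 06.01.1997 - B"))
# AIRE-DEANE¬Christopher-James¬06.01.1997¬B
```

Returning unmatched lines unchanged makes it easy to search the output afterwards for records that still contain " - " and need a closer look.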


The fourth bird
  • Tested in Notepad++ with the following results: The spaces between the Lastnames and Forenames are NOT substituted with the ¬ character – Ifte Jul 08 '20 at 15:52
  • @lfte You have to turn off case insensitivity `(?-i)(?:^(?:Mc)?[A-Z]+(?:[ -][A-Z]+)*|\G(?!^)[A-Z][a-z]+(?:-[A-Z][a-z]+)*|\d{2}\.\d{2}\.\d{4})\K -?\h*` – The fourth bird Jul 08 '20 at 17:10
  • With the addition of the (?-i) in the query, you've cracked it. Thanks for this excellent solution, I'll check the breakdown of the script on Regex101 and document it for my understanding. Much appreciated, saved me hours of work. Thanks a lot for your help. Thanks also to Youyoun for providing a possible solution. Cheers. – Ifte Jul 09 '20 at 07:26
  • You are welcome. The `(?-i)` part is an inline modifier that turns off case-insensitive matching. – The fourth bird Jul 09 '20 at 07:31
  • I'm still very new to RegEx and learning all the time, thanks to excellent educators like yourself, much appreciated. Thanks for sharing your knowledge – Ifte Jul 09 '20 at 07:55
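
The case-sensitivity problem discussed in these comments can be reproduced outside Notepad++. The Python sketch below is only an illustration: it takes the surname part of the pattern and adds a `(?= )` lookahead as a stand-in for the separator that follows in the original (that lookahead and the helper name `surname_match` are assumptions, not part of the answer). With case-insensitive matching, `[A-Z]+` also accepts lowercase letters, so the "surname" swallows the forenames and no ¬ ends up between them.

```python
import re

# Surname part of the answer's pattern; the lookahead for a following space
# is an assumption standing in for the separator matched after \K.
SURNAME = r"^(?:Mc)?[A-Z]+(?:[ -][A-Z]+)*(?= )"

def surname_match(line, ignore_case=False):
    """Return the text the surname pattern matches, or None if it does not match."""
    flags = re.IGNORECASE if ignore_case else 0
    m = re.match(SURNAME, line, flags)
    return m.group() if m else None

line = "ADCOCK ALVEY James - 12.04.1992 - C"
print(surname_match(line))                    # ADCOCK ALVEY
print(surname_match(line, ignore_case=True))  # ADCOCK ALVEY James
```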
This regexp works on almost all names (not McQUEEN, because it's not all caps):

(([A-Z]+[ \-]){1,})(([A-Z][a-z]+[ \-]){1,})\- ([0-9]{2}.[0-9]{2}.[0-9]{4}) - ([A|B|C])

Groups that can be used are \1 \3 \5 \6.

Link for demo: https://regex101.com/r/3LpI54/1

Youyoun
  • The rows marked with ◄- aren't giving the correct results; the substitution character ¬ is not in the correct place:
    ABBEY ¬Chantelle ¬08.11.1995¬A
    ANAND Toni-¬Grace ¬04.09.1999¬A ◄-
    ADCOCK ALVEY ¬James ¬12.04.1992¬C
    ADLINGTON-JONES Robin Jacob ¬Sebastian ¬15.02.1999¬B ◄-
    AFZAL ¬Kiera ¬25.04.2000¬B
    AHMED Nisar Abu Ben ¬Adhem ¬16.08.2002¬C ◄-
    AIRE-DEANE Christopher-¬James ¬06.01.1997¬B ◄-
    AL-MISRI ¬Yaqoob ¬23.07.2004¬C
    ASTER Lily-¬May ¬01.04.2010¬B ◄-
    McQUEEN ¬Stephen ¬02.02.2001¬A
    – Ifte Jul 08 '20 at 15:56