
The example names I am testing against are:

O'Kefe,Shiley
Folenza,Mitchel V
Briscoe Jr.,Sanford Ray
Andade-Alarenga,Blnca 
De La Cru,Feando
Carone,Letca Jo
O'Conor,Mole K
Daeron III,Lawence P
Randall,Jason L
Esquel Mendez,Mara D
Dinle III,Jams E
Coras Sr.,Cleybr E
Hsieh-Krnk,Caolyn E
Graves II,Theodore R

I am trying to capture everything before comma except the roman numbers and Sr.|Jr. suffix. So if the name is like Andade-Alarenga,Blnca I want to capture Andade-Alarenga, but if the name is Briscoe Jr.,Sanford Ray I just want Briscoe.

The code I have tried is:

^((?:(?![JjSs][rR]\.|\b(?:[IV]+))[^,]))

I also tried this one: ^(?!\w+ \A[jr|sr|Jr|Sr].*)\w+| \w+ \w+|'\w+|-\w+$

Regex101 with my code and the example set: https://regex101.com/r/jX5cK6/2

4 Answers


One option is a capturing group with a non-greedy match up to the first comma. Just before the comma, optionally match Jr., Sr., jr. or sr., or a Roman numeral, outside the group.

Then match the comma itself. The value is in capture group 1.

Note that the character class [XVICMD]+ is a broad match that also accepts combinations that are not valid Roman numerals. A stricter pattern for Roman numerals can be found, for example, on this page.

^(\w.*?)(?: (?:[JjSs]r\.|[XVICMD]+\b))?,
  • ^ Start of string
  • ( Capture group 1
    • \w.*? Match a word character, then any character except a newline 0+ times, non-greedy
  • ) Close group 1
  • (?: Non capturing group, starting with a literal space
    • (?: Inner non capturing group
      • [JjSs]r\. Match J, j, S or s followed by r and a literal dot
      • | Or
      • [XVICMD]+\b Match 1+ times any of the listed and a word boundary
    • ) Close group
  • )? Close group and make it optional
  • , Match the comma
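
As a minimal sketch of how the pattern above could be used, here it is in Python with a few of the names from the question. The helper name `surname` is made up for this illustration, and Python is only an assumption since the question does not name a language.

```python
import re

# Pattern from the answer: group 1 holds the surname; the optional
# " Jr." / " Sr." / Roman-numeral suffix is matched outside the group.
PATTERN = re.compile(r"^(\w.*?)(?: (?:[JjSs]r\.|[XVICMD]+\b))?,")


def surname(name):
    """Return capture group 1, or None when the pattern does not match."""
    match = PATTERN.match(name)
    return match.group(1) if match else None


for name in ["Andade-Alarenga,Blnca", "Briscoe Jr.,Sanford Ray",
             "Daeron III,Lawence P", "De La Cru,Feando"]:
    print(name, "->", surname(name))
```

A surname such as Esquel Mendez is kept whole because the M of Mendez is not followed by a word boundary, so it is not taken for a numeral.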

Regex demo
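
To illustrate the remark about [XVICMD]+ being a broad match, here is a rough Python comparison with one commonly used stricter pattern. The stricter pattern is an assumption for illustration and not necessarily the one on the linked page; the names `BROAD` and `STRICT` are made up.

```python
import re

# Broad class from the answer: any run of these letters is accepted.
BROAD = re.compile(r"[XVICMD]+")

# A stricter sketch (assumed, not taken from the answer): thousands, hundreds,
# tens and units in order. The leading lookahead rejects the empty string.
STRICT = re.compile(
    r"(?=[MDCLXVI])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
)

for candidate in ["III", "IV", "IIII", "VX"]:
    print(candidate,
          "broad:", bool(BROAD.fullmatch(candidate)),
          "strict:", bool(STRICT.fullmatch(candidate)))
```

For name suffixes the broad class is usually good enough; the stricter form only matters if invalid runs such as IIII must be left in the surname.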

The fourth bird
  • The regex in the comments solves the problem without needlessly matching roman numerals, commas, and titles. – jrook Oct 17 '19 at 22:56
  • @jrook The question states `I am trying to capture everything before comma except the roman numbers and Sr.|Jr. suffix`. So the pattern matches up to the comma, but the unwanted Roman numerals or Jr., Sr., etc. are matched outside the capturing group. The wanted value is in group 1. – The fourth bird Oct 17 '19 at 22:59

Because of your test on Regex101, I'm assuming your regex engine supports positive lookaheads (this is true for PCRE, JavaScript and Python, for example).

A positive lookahead will enable you to match only what you want, without the need for capturing groups. The full match will be the string you're looking for.

^[\w'\- ]+?(?= ?(?:\b(?:[IVXCMD]*|\w+\.)),)

The part that matches the name is as simple as it gets:

^[\w'\- ]+?

All it does is match one or more of the listed characters (word characters, apostrophes, hyphens and spaces). The final ? makes the quantifier lazy, so the engine matches only as many characters as it needs to.

The important part is this one:

(?= ?(?:\b(?:[IVXCMD]*|\w+\.)),)

After an optional space, it is split into two alternatives by the pipe (|). The first alternative matches Roman numerals (or nothing), and the second matches titles (basically, any word that ends in a .). Finally, the lookahead has to match the comma, because of your requirement.
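
A small Python sketch of the lookahead approach, where the full match (group 0) is the surname and no capturing group is needed. The function name `surname_full_match` is hypothetical.

```python
import re

# Pattern from the answer: the lookahead checks for an optional suffix and the
# comma without consuming them, so the whole match is the surname itself.
LOOKAHEAD = re.compile(r"^[\w'\- ]+?(?= ?(?:\b(?:[IVXCMD]*|\w+\.)),)")


def surname_full_match(name):
    """Return the full match, or None when the pattern does not match."""
    match = LOOKAHEAD.match(name)
    return match.group(0) if match else None


for name in ["O'Kefe,Shiley", "Coras Sr.,Cleybr E",
             "Graves II,Theodore R", "Esquel Mendez,Mara D"]:
    print(name, "->", surname_full_match(name))
```

Because the title alternative is \w+\., any final word ending in a period is treated as a suffix, not only Jr. and Sr.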

Here it is on Regex101

Not a real meerkat

You didn't specify a language, so I used a regex with Java's String.replaceAll() method.

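      String[] names = {
            "O'Kefe,Shiley", "Folenza,Mitchel V", "Briscoe Jr.,Sanford Ray",
            "Andade-Alarenga,Blnca", "De La Cru,Feando", "Carone,Letca Jo",
            "O'Conor,Mole K", "Daeron III,Lawence P", "Randall,Jason L",
            "Esquel Mendez,Mara D", "Dinle III,Jams E", "Coras Sr.,Cleybr E",
            "Hsieh-Krnk,Caolyn E", "Graves II,Theodore R"
      };

      // Remove an optional suffix (I, II, III, Sr. or Jr.), the whitespace in
      // front of it, and everything from the comma onward. The leading \\s*
      // keeps a trailing space from being left on names such as "Briscoe Jr.".
      String suffixAndRest = "\\s*(?:\\b(?:I{1,3}|[SJ]r\\.))?,.*";

      for (String name : names) {
         System.out.println(name + " -> " + name.replaceAll(suffixAndRest, ""));
      }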

Here is a Python solution using re.sub:


    import re

    names = ["O'Kefe,Shiley", "Folenza,Mitchel V", "Briscoe Jr.,Sanford Ray",
             "Andade-Alarenga,Blnca", "De La Cru,Feando", "Carone,Letca Jo",
             "O'Conor,Mole K", "Daeron III,Lawence P", "Randall,Jason L",
             "Esquel Mendez,Mara D", "Dinle III,Jams E", "Coras Sr.,Cleybr E",
             "Hsieh-Krnk,Caolyn E", "Graves II,Theodore R"]

    # Remove an optional suffix (I, II, III, Sr. or Jr.), the whitespace in
    # front of it, and everything from the comma onward.
    suffix_and_rest = re.compile(r"\s*(?:\b(?:I{1,3}|[SJ]r\.))?,.*")

    for name in names:
        print(name, "->", suffix_and_rest.sub("", name))
WJS

You may use

^(?:(?![JS]r\.|\b(?:[XVICMD]+)\b)[^,])+\b(?<!\s)

See the regex demo

Details

  • ^ - start of a string
  • (?:(?![JS]r\.|\b(?:[XVICMD]+)\b)[^,])+ - any character other than a comma ([^,]), one or more times (+), as long as it does not start a Jr. or Sr. sequence or a whole word made up of one or more X, V, I, C, M, D characters
  • \b - a word boundary
  • (?<!\s) - no whitespace immediately to the left is allowed (it is trimming the match)
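
The breakdown above can be tried out with a short Python sketch; the function name `surname_tempered` is made up for this example. Python's re module accepts the fixed-width lookbehind (?<!\s) used here.

```python
import re

# Pattern from the answer: consume non-comma characters one at a time, stopping
# before "Jr.", "Sr." or a whole word of Roman-numeral letters, then back off
# any trailing whitespace via \b(?<!\s).
TEMPERED = re.compile(r"^(?:(?![JS]r\.|\b(?:[XVICMD]+)\b)[^,])+\b(?<!\s)")


def surname_tempered(name):
    """Return the full match, or None when the pattern does not match."""
    match = TEMPERED.match(name)
    return match.group(0) if match else None


for name in ["Briscoe Jr.,Sanford Ray", "Dinle III,Jams E",
             "Hsieh-Krnk,Caolyn E", "De La Cru,Feando"]:
    print(name, "->", surname_tempered(name))
```

One thing to watch for, using a made-up name that is not in the question's data: a surname such as D'Angelo begins with a standalone D, which the lookahead treats as a numeral word, so the pattern does not match that name at all.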
Wiktor Stribiżew