Regex to extract name from a string

Question

I'm trying to use a regular expression to extract the name from a string. The name is always followed by a protocol. The protocols are: ssh, folder, http.

Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r John ssh 0 *
Thu May 23 22:42:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o i Jake folder 0 *
Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o t Steve http 0 *

The expected output would be:

John
Jake
Steve

score 2 · Answer 1 · answered May 24 '19 at 01:29

You can use the following PCRE regex (since you haven't specified a language):

\b[a-zA-Z]+(?=\s+(?:ssh|folder|http))

demo: https://regex101.com/r/t62Ra7/4/

Explanations:

\b start the match from a word boundary
[a-zA-Z]+ match one or more ASCII letters (a-z, A-Z); you might have to generalise this to accept Unicode letters.
(?= lookahead that adds the constraint that the name is followed by one of the protocols
\s+ one or more whitespace characters
(?:ssh|folder|http) non-capturing group for the protocols ssh, folder or http
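
To illustrate the Unicode generalisation mentioned above, here is a minimal JavaScript sketch. It assumes an engine that supports Unicode property escapes (`\p{L}` with the `u` flag). The sample name `Émile` and the helper name `extractName` are made up for illustration. The leading `\b` is dropped because in JavaScript it only treats ASCII characters as word characters, so it would not reliably mark the start of a name that begins with an accented letter. The trailing `\b` is also an addition, not part of the regex above.

```javascript
// Sketch only: \p{L} matches any Unicode letter and requires the `u` flag.
// The trailing \b stops "http" from matching inside a longer word such as "https".
const nameBeforeProtocol = /\p{L}+(?=\s+(?:ssh|folder|http)\b)/u;

// Hypothetical helper: returns the name, or null when no word is followed by a protocol.
function extractName(line) {
  const m = line.match(nameBeforeProtocol);
  return m ? m[0] : null;
}

const lines = [
  "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r John ssh 0 *",
  "Thu May 23 22:42:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o i Émile folder 0 *",
];

for (const line of lines) {
  console.log(extractName(line));
}
```

Because the engine reports the leftmost match, the whole run of letters before the protocol is returned even without a leading boundary.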

score 1 · Answer 2 · edited Jun 20 '20 at 09:12

Try:

\b[A-Za-z]+(?=\s(?=ssh|folder|http))

let regex = /\b[A-Za-z]+(?=\s(?=ssh|folder|http))/g;
let match; // declared up front so the destructuring assignments below also work in strict mode

[match] = "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r John ssh 0 *".match(regex);
console.log(match); //John

[match] = "Thu May 23 22:42:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o i Jake folder 0 *".match(regex);
console.log(match); //Jake

[match] = "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o t Steve http 0 *".match(regex);
console.log(match); //Steve

Regex explanation:

\b defines a word boundary where the match starts

[A-Za-z] matches any ASCII letter, upper or lower case

+ repeats the previous character class one or more times

(?= starts a lookahead (what it matches is not included in the result)

\s a single whitespace character

(?=ssh|folder|http) another lookahead for either ssh, folder or http

Putting it all together, the regex looks for a word that is followed by a space and then one of the following: ssh, folder, or http.
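
The inner lookahead is not strictly needed: inside an outer lookahead, a plain non-capturing group asserts the same thing. A small sketch comparing the two forms on the sample lines (the variable names `nested`, `flat`, `samples` and `results` are illustrative):

```javascript
// Pattern from the answer: a lookahead nested inside another lookahead.
const nested = /\b[A-Za-z]+(?=\s(?=ssh|folder|http))/;
// Assumed simplification: a non-capturing group replaces the inner lookahead.
const flat = /\b[A-Za-z]+(?=\s(?:ssh|folder|http))/;

const samples = [
  "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r John ssh 0 *",
  "Thu May 23 22:42:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o i Jake folder 0 *",
  "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o t Steve http 0 *",
];

// Run both patterns on every line and collect what each one matched.
const results = samples.map((line) => ({
  nested: line.match(nested)[0],
  flat: line.match(flat)[0],
}));

console.log(results);
```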

edited Jun 20 '20 at 09:12

Community

answered May 24 '19 at 00:45

chatnoir


Two points here: you do not need a nested lookahead, and more importantly, `[A-z]` does not behave the way you think it does! It will not match only letters! https://stackoverflow.com/questions/4923380/difference-between-regex-a-z-and-a-za-z – Allan May 24 '19 at 01:31
your regex does accept `J]ohn`, `Ja[ke`, ... >>> https://regex101.com/r/T4gayx/1/ – Allan May 24 '19 at 01:33
Good catch, much appreciated! Fixed now. – chatnoir May 24 '19 at 01:36
`[A-ZA-z]` will still match more than what you want https://regex101.com/r/T4gayx/2/, the correct range is `[A-Za-z]` – Allan May 24 '19 at 02:15
Fixed the typo in the text. The code snippet and demo code were fine though. Again, thanks for the catch! – chatnoir May 24 '19 at 02:17
Visually >>> https://www.cs.cmu.edu/~pattis/15-1XX/common/handouts/ascii.html `[A-z]` accepts every character whose decimal ASCII value lies between `65` and `122`, including `[`, `\`, `]`, `^`, `_` and the backtick (codes 91 to 96), which are not letters. – Allan May 24 '19 at 02:18
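
The range problem discussed in these comments is easy to reproduce. A quick JavaScript sketch: `J]ohn` and `Ja[ke` are the strings mentioned in the comments, while `snake_case` and the variable names are extra inputs made up here.

```javascript
// [A-z] spans ASCII codes 65..122, which includes [ \ ] ^ _ ` (codes 91..96).
const tooWide = /^[A-z]+$/;
// [A-Za-z] covers only the 52 ASCII letters.
const lettersOnly = /^[A-Za-z]+$/;

const inputs = ["John", "J]ohn", "Ja[ke", "snake_case"];

// Test each input against both character classes.
const report = inputs.map((s) => ({
  input: s,
  tooWide: tooWide.test(s),
  lettersOnly: lettersOnly.test(s),
}));

console.log(report);
```

Only `John` passes both checks; the other three inputs are accepted by `[A-z]` and rejected by `[A-Za-z]`.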

WJS · Answer 3 · 2019-05-24T00:57:19.340

Here's how you might do it in Java.

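import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractName {
   public static void main(String[] args) {
      String[] str = {
            "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r John ssh 0 *",
            "Thu May 23 22:42:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o i Jake folder 0 *",
            "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o t Steve http 0 *",
      };

      // The regex is (\w+) (ssh|folder|http); the backslash must be doubled inside a Java string literal
      String pat = "(\\w+) (ssh|folder|http)";
      Pattern p = Pattern.compile(pat);
      for (String s : str) {
         Matcher m = p.matcher(s);
         // find() locates the first occurrence anywhere in the line
         if (m.find()) {
            System.out.println(m.group(1)); // group 1 holds the name
         }
      }
   }
}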

The actual pattern is in the string pat and can be used with other regex engines. It matches a name, followed by a single space, followed by one of the protocols joined with |, and it captures the name in the first capture group. Note that \w also matches digits and underscores, not only letters.
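
Since the pattern is not Java-specific, here is a rough JavaScript sketch of the same capture-group idea. It differs from the Java version in two ways, both of which are additions: it uses named groups (`name` and `protocol` are names chosen here), and it appends `\b` so that a longer word such as `https` is not accepted as `http`.

```javascript
// Same idea as the Java pattern, plus named groups and a trailing \b (both additions).
const pattern = /(?<name>\w+) (?<protocol>ssh|folder|http)\b/;

const logLines = [
  "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r John ssh 0 *",
  "Thu May 23 22:42:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o i Jake folder 0 *",
  "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o t Steve http 0 *",
];

const parsed = logLines.map((line) => {
  const m = pattern.exec(line);
  // m.groups holds the named captures; null means the line did not match.
  return m ? { name: m.groups.name, protocol: m.groups.protocol } : null;
});

console.log(parsed);
```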

score 0 · Answer 4 · edited Jun 20 '20 at 09:12

Another approach would be to use the single letter and the whitespace right before the name as a left boundary, then capture the name's letters in group $1, for example:

\s+[a-z]\s+([A-Z][a-z]+)

We can also add more boundaries to it if necessary.
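
One possible way to add such a boundary is to also require the protocol on the right-hand side of the name. The JavaScript sketch below is an assumption about what "more boundaries" could look like, not part of the original expression; the names `bounded`, `log` and `names` are illustrative.

```javascript
// Left boundary: a single lowercase letter. Right boundary (added here): the protocol.
const bounded = /\s+[a-z]\s+([A-Z][a-z]+)\s+(?:ssh|folder|http)\b/g;

const log = `Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r John ssh 0 *
Thu May 23 22:42:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o i Jake folder 0 *
Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o t Steve http 0 *`;

// matchAll needs the g flag and yields one match object per hit; index 1 is the captured name.
const names = Array.from(log.matchAll(bounded), (m) => m[1]);

console.log(names);
```

With the right-hand boundary in place, a capitalised word that merely follows a single letter, as in ` a Foo bar`, is no longer picked up.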

RegEx

If this expression wasn't desired, it can be modified or changed in regex101.com.

Test

const regex = /\s+[a-z]\s+([A-Z][a-z]+)/gm;
const str = `Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r John ssh 0 *
Thu May 23 22:42:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o i Jake folder 0 *
Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o t Steve http 0 *`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
