Given, for example, a string like this:
random word, random characters##?, some dots. username bob.1234 other stuff

I'm currently using this regex to capture the username (bob.1234):

\busername (.+?)(,| |$)

But my code needs a regex with only one capture group, because Python's re.findall returns a list of tuples rather than a list of strings when the pattern has multiple capture groups. Something like this would almost work, except that it captures "bob" instead of "bob.1234":

\busername (.+?)\b

Does anybody know if there is a way to use the word boundary while ignoring the dot, without using more than one capture group?

NOTES:

  • Sometimes there is a comma after the username
  • Sometimes there is a space after the username
  • Sometimes the string ends with the username
NaturalBornCamper
  • Try `\busername ([^ ,]+)` – Wiktor Stribiżew Feb 25 '18 at 18:25
  • or maybe `username (\S+)\b` – revo Feb 25 '18 at 18:29
  • Both are working actually guys! I didn't think about going a different direction completely Wiktor, good thinking! You guys care to post a solution so I can accept? Not sure who to give it to however, both are perfect – NaturalBornCamper Feb 25 '18 at 18:43
  • If usernames don't end with periods or anything other than letters and numbers then I can post mine as an answer. – revo Feb 25 '18 at 18:49
  • Actually Revo, it also works for usernames ending with an underscore – NaturalBornCamper Feb 25 '18 at 18:57
  • @NaturalBornCamper. Which *specific* set of characters can these usernames contain? – ekhumoro Feb 25 '18 at 19:01
  • I guess Wiktor's solution would be better here since it also accepts usernames ending with a dot. However if you care to post also Revo, I'll add as supporting answer (I think I can still do that), and it might help other people – NaturalBornCamper Feb 25 '18 at 19:02
  • Yes, but wasn't that important to state. Those said will result in an unexpected match. – revo Feb 25 '18 at 19:04
  • @ekhumoro I don't know, that's why I put "and/or" in the question, however I would guess alphanumeric, underscores, dots. Wiktor's solution works anyways and I can customize it if more characters show up in usernames – NaturalBornCamper Feb 25 '18 at 19:05
  • @revo I haven't seen any occurrences of usernames ending with dots, I actually didn't even think about it, shame on me. But I guess it *could* happen – NaturalBornCamper Feb 25 '18 at 19:07
  • @NaturalBornCamper. I was asking because the regexp by revo will match e.g. `username b,,,!@=++&&=4`. – ekhumoro Feb 25 '18 at 19:12
  • He's not validating but matching. These two differ from each other. @ekhumoro – revo Feb 25 '18 at 19:40
  • @revo. The spec says user-names can be followed by a comma, then "other stuff". Given the example `username bob.1234,otherstuff`, your regexp doesn't produce `bob.1234`, because it will match the comma and everything following it up to the next word boundary. – ekhumoro Feb 25 '18 at 19:51
  • Does that mean after a comma there is something other than a space character? @ekhumoro – revo Feb 25 '18 at 19:57
  • @revo. I don't know - that's why I asked the OP to clarify. Your regexp *might* be okay, but it relies on the input being nicely formatted with unambiguous delimiters. If usernames can contain some of the possible delimiters, that will probably complicate things. – ekhumoro Feb 25 '18 at 20:04
  • Before throwing a suggestion I looked at subject string format while reading about cases. That was not catchable that when username ends with a comma it may be followed by immediate non-whitespace characters. Otherwise yes, the regex wouldn't work well. @ekhumoro – revo Feb 25 '18 at 20:22
  • You can't use a negated class, it has to be 1 or more... `[]+`. Then it can't handle end of string. If you use `[]+?` it will only match 1 char. If you use `[]*?` it won't match any characters (because it's at the end). So, classes are out. You can change your regex to this `\busername (.+?)(?:,| |$)` and it will match what you want. Note that when you use a _negated_ class, you introduce a crap load of characters it will match. I'd try not to overthink this... –  Feb 25 '18 at 20:42
  • @sln Why `[]+` can't handle end of string when username meets an end? – revo Feb 25 '18 at 20:51
  • @revo - `[]+` is for illustrating the quantifiers. Example `\busername (.+?)[ ,]+` If you use strictly the negated class instead of `.+?` you match a significant number of characters (which may be ok). –  Feb 25 '18 at 21:03
  • @ekhumoro is right, it's not "always" nicely formatted, it might end with the username, there might be a space immediately after it, there might be a comma then no space after.. but Wiktor's solution works perfect with each sample so far, thanks guys! – NaturalBornCamper Feb 26 '18 at 15:10

1 Answer

The `\busername (.+?)(,| |$)` pattern contains 2 capturing groups, so `re.findall` will return a list of tuples once a match is found. See the `findall` reference:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
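To make the difference concrete, here is a small sketch that runs both patterns from the question against the sample string. The variable names are only illustrative.

```python
import re

s = "random word, random characters##?, some dots. username bob.1234 other stuff"

# Two capturing groups: findall returns one tuple per match,
# with one item for each group (the username and the delimiter).
two_groups = re.findall(r'\busername (.+?)(,| |$)', s)
print(two_groups)  # [('bob.1234', ' ')]

# One capturing group: findall returns a flat list of strings,
# but \b stops the lazy match at the dot.
one_group = re.findall(r'\busername (.+?)\b', s)
print(one_group)  # ['bob']
```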

So, there are three approaches here:

  1. Use a `(?:...)` non-capturing group rather than the capturing one: `re.findall(r'\busername (.+?)(?:,| |$)', s)`. It will consume a `,` or a space, but since only the captured part is returned and no overlapping matches are expected, that is OK.
  2. Use a positive lookahead instead: `re.findall(r'\busername (.+?)(?=,| |$)', s)`. The space or comma will not be consumed; that is the only difference from the first approach.
  3. Turn the `(.+?)(,| |$)` part into a capturing group around a simple negated character class, `([^ ,]+)`, which matches one or more chars other than a space or comma: `re.findall(r'\busername ([^ ,]+)', s)`. It will match up to the end of the string if there is no `,` or space after the username.
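As a quick check, the sketch below runs all three patterns against the three cases listed in the question (space, comma, or end of string after the username). The sample strings are shortened variations of the question's example and are assumptions made for illustration, not real data.

```python
import re

# Hypothetical inputs covering the three cases from the question's notes.
samples = [
    "some dots. username bob.1234 other stuff",   # space after the username
    "some dots. username bob.1234, other stuff",  # comma after the username
    "some dots. username bob.1234",               # string ends with the username
]

patterns = {
    "non-capturing group": r'\busername (.+?)(?:,| |$)',
    "positive lookahead": r'\busername (.+?)(?=,| |$)',
    "negated character class": r'\busername ([^ ,]+)',
}

# Each pattern has exactly one capturing group, so findall
# returns a flat list of strings for every sample.
results = {
    name: [re.findall(pattern, s) for s in samples]
    for name, pattern in patterns.items()
}

for name, found in results.items():
    print(name, found)
```

All three patterns should give `['bob.1234']` for each sample. If usernames can contain other characters or end at other delimiters, the alternation or the character class would need adjusting.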
Wiktor Stribiżew
  • Just FYI: [*I can customize it if more characters show up in usernames*](https://stackoverflow.com/questions/48976902/regex-capture-using-word-boundaries-without-stopping-at-dot-and-or-other-char/48978221#comment84959578_48976902). Yes, it might be necessary once the final specs are formulated for the username pattern. – Wiktor Stribiżew Feb 25 '18 at 20:44
  • Sorry, I always do that usually, forgot this time! ;) – NaturalBornCamper Feb 26 '18 at 19:13