Using Regex to find words with characters that are the same or that are different

Question

I have a list of words such as:

l = """abca
bcab
aaba
cccc
cbac
babb
"""

I want to find the words that have the same first and last character, and that the two middle characters are different from the first/last character.

The desired final result:

['abca', 'bcab', 'cbac']

I tried this:

re.findall('^(.)..\\1$', l, re.MULTILINE)

But it returns all of the unwanted words as well. I thought of using [^...] somehow, but I couldn't figure it out. There's a way of doing this with sets (to filter the results from the search above), but I'm looking for a regex.

Is it possible?

".*" is a fine regex to find characters that are the same or different. :) — user unknown, May 01 '12 at 20:17
You have seven answers all of which answer your question (although two of them say "this is better done by other means than regexes", with which I'm inclined to agree). You should accept one of them. — Gareth McCaughan, May 02 '12 at 09:17

Justin Morgan - On strike · Answer 1 · 2012-05-01T21:23:34.377

3

There are lots of ways to do this. Here's probably the simplest:

re.findall(r'''
           \b          #The beginning of a word (a word boundary)
           ([a-z])     #One letter
           (?!\w*\1\B) #The rest of this word may not contain the starting letter except at the end of the word
           [a-z]*      #Any number of other letters
           \1          #The starting letter we captured in step 2
           \b          #The end of the word (another word boundary)
           ''', l, re.IGNORECASE | re.VERBOSE)

If you want, you can loosen the requirements a bit by replacing [a-z] with \w. That will allow numbers and underscores as well as letters. You can also restrict it to 4-character words by changing the last * in the pattern to {2}.

Note also that I'm not very familiar with Python, so I'm assuming your usage of findall is correct.

edited May 01 '12 at 21:23

answered May 01 '12 at 19:30

Justin Morgan - On strike

30,035
12
80
104

1

-1 the OP has already tried something similar and found it inadequate; you've missed the requirement that the middle letters should *not* match the beginning/ending. – Karl Knechtel May 01 '12 at 20:24
@KarlKnechtel - You're right, I did miss that part. My mistake. I've edited the pattern to account for that. – Justin Morgan - On strike May 01 '12 at 20:46
You'd do him a service by demonstrating the re.VERBOSE flag. It'd also make your answer shorter. – bukzor May 01 '12 at 21:13
@bukzor - Good call. As I said, I'm not too familiar with Python, so please let me know if I got the syntax wrong. – Justin Morgan - On strike May 01 '12 at 21:24

spinlok · Accepted Answer · 2012-05-02T04:08:09.373

3

Edit: fixed to use negative lookahead assertions instead of negative lookbehind assertions. Read comments for @AlanMoore and @bukzor explanations.

>>> [s for s in l.splitlines() if re.search(r'^(.)(?!\1).(?!\1).\1$', s)]
['abca', 'bcab', 'cbac']

The solution uses negative lookahead assertions which means 'match the current position only if it isn't followed by a match for something else.' Now, take a look at the lookahead assertion - (?!\1). All this means is 'match the current character only if it isn't followed by the first character.'

edited May 02 '12 at 04:08

answered May 01 '12 at 19:45

spinlok

3,561
18
27

You should explain what this does, since not everyone knows regex lookaround semantics. – Daenyth May 01 '12 at 20:16
spinlok: you seem to be asserting that the first group doesn't match \1, which seems impossible: `(.)(?<!\1)`. I believe you mean to use a negative *look-ahead* assertion `(?!)`, which is more efficient anyway. – bukzor May 01 '12 at 21:23
As a matter of fact, backreferences are illegal in lookbehinds. Did you actually test this? But @bukzor is right, it works if you change the lookbehinds to lookaheads. – Alan Moore May 01 '12 at 21:39
@AlanMoore, yeah I tested it - copied it straight from my bpython shell. Can you provide a reference for your claim that backreferences are illegal in lookbehinds? – spinlok May 01 '12 at 21:58
1

Not specifically, but regex engines have to treat backreferences as variable-length constructs, and that means they can't be used in lookbehinds (except in .NET and JGSoft, of course). Python *should* flag it as an error when the regex is compiled; that fact that it doesn't is a bug, as discussed [here](http://stackoverflow.com/questions/10279055/impossible-lookbehind-with-a-backreference). – Alan Moore May 01 '12 at 23:03
1

By the way, shouldn't those lookbehinds come *after* the dots, like this? `.(?<!\1).(?<!\1)` The way you have it, the first lookbehind seems to be asserting that the capturing group didn't really match what it just matched. ;) Anyway, I've tried your regex in several online, Python-powered regex testers, and it doesn't work in any of them. – Alan Moore May 01 '12 at 23:09
@AlanMoore: I have all the same objections, but it *does* seem to work, inexplicably. – bukzor May 02 '12 at 00:05
@AlanMoore and bukzor, my logic here is that the first lookbehind applies to the second character. See this example from the python docs: `>>> m = re.search('(?<=-)\w+', 'spam-egg')` `>>> m.group(0)` `'egg'` – spinlok May 02 '12 at 00:22
But *your* first lookbehind is looking at the *first* character--or should be. Instead, it seems to be acting as if the `\1` has zero length, so it backs up zero positions and tries to match something that isn't the same as `\1`. Since it's really looking at the *second* character, it can succeed if that's not the same as the first character. As a commenter said in the other Q/A I linked to, it's acting just like the lookahead `(?!\1)` would. Your regex is only working by accident; if you change those lookbehinds to lookaheads it will work because it's correct. – Alan Moore May 02 '12 at 01:26
@AlanMoore, thanks for the explanation! I've fixed the regex to use lookaheads instead. – spinlok May 02 '12 at 04:09

Karl Knechtel · Answer 3 · 2012-05-02T01:03:18.373

1

To heck with regexes.

[
    word
    for word in words.split('\n')
    if word[0] == word[-1]
    and word[0] not in word[1:-1]
]

edited May 02 '12 at 01:03

answered May 01 '12 at 20:13

Karl Knechtel

62,466
11
102
153

The `any()` is a little overkill. `word[0] in word[1:-1]` has the same effect. – bukzor May 01 '12 at 21:12
Er, yes, let me fix that, that was dumb. Simple is better than complex :) – Karl Knechtel May 02 '12 at 01:02

score 1 · Answer 4 · answered May 01 '12 at 21:10

1

Are you required to use regexes? This is a much more pythonic way to do the same thing:

l = """abca
bcab
aaba
cccc
cbac
babb
"""

for word in l.split():
  if word[-1] == word[0] and word[0] not in word[1:-1]:
     print word

answered May 01 '12 at 21:10

bukzor

37,539
11
77
111

score 1 · Answer 5 · answered May 01 '12 at 21:33

Here's how I would do it:

result = re.findall(r"\b([a-z])(?:(?!\1)[a-z]){2}\1\b", subject)

This is similar to Justin's answer, except where that one does a one-time lookahead, this one checks each letter as it's consumed.

\b
([a-z])  # Capture the first letter.
(?:
  (?!\1)   # Unless it's the same as the first letter...
  [a-z]    # ...consume another letter.
){2}
\1
\b

I don't know what your real data looks like, so chose [a-z] arbitrarily because it works with your sample data. I limited the length to four characters for the same reason. As with Justin's answer, you may want to change the {2} to *, + or some other quantifier.

This is an interesting twist. +1. Any reason you prefer this version other than clarity/brevity? — Justin Morgan - On strike, May 02 '12 at 03:54

score 0 · Answer 6 · answered May 01 '12 at 19:22

0

You can do this with negative lookahead or lookbehind assertions; see http://docs.python.org/library/re.html for details.

answered May 01 '12 at 19:22

Gareth McCaughan

19,888
1
41
62

score 0 · Answer 7 · 2012-05-01T22:01:01.820

Not a Python guru, but maybe this

re.findall('^(.)(?:(?!\1).)*\1$', l, re.MULTILINE)

expanded (use multi-line modifier):

^                # begin of line
  (.)            # capture grp 1, any char except newline
  (?:            # grouping
     (?!\1)         # Lookahead assertion, not what was in capture group 1 (backref to 1)
     .              # this is ok, grab any char except newline
  )*             # end grouping, do 0 or more times (could force length with {2} instead of *)
  \1             # backref to group 1, this character must be the same
$                # end of line

Using Regex to find words with characters that are the same or that are different

7 Answers7