
I am looking for a regex which matches words where the first two letters are equal to the last two letters. An example can clarify the requirement.

Given the following text:

The dodo was one of the sturdiest birds. An educated termite may learn how to operate a phonograph, but it's unlikely. I sense that an amalgam that includes magma will enlighten Papa.

How can I get this output:

answer = [('dodo', 'do'), ('sturdiest', 'st'), ('educated', 'ed'),
          ('termite', 'te'), ('phonograph', 'ph'),
          ('sense', 'se'), ('amalgam', 'am'), ('magma', 'ma'),
          ('enlighten', 'en')]

As you can see the 2 initial characters are the same as the last 2.

My thought is to filter any word that has the length of 4 characters or more, and with the first 2 characters of the word matching the last two.

So far I am up to word that is 4 or more characters.

[A-Za-z]{4,}

I don't need a complete program, I only need the regex.

Francesco

2 Answers

You can use the following regex:

(\w{2})\w*\1

Explanation:

  • (\w{2}) : match any two word characters (letters, digits, or underscore) and store them in capture group 1
  • \w* : match zero or more word characters
  • \1 : backreference; match exactly the two characters that were captured by group 1
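One caveat worth illustrating: the pattern is not anchored to word edges, so on other input it can match inside a longer word. The following is a minimal sketch, not part of the original answer; the `strict` variant with `\b` word boundaries and letters-only classes is an assumption about what should count as a word, and the sample sentence is made up for illustration.

```python
import re

# Unanchored pattern from the answer: may match in the middle of a word.
loose = re.compile(r"(\w{2})\w*\1")

# Hypothetical stricter variant: letters only, anchored with word boundaries.
strict = re.compile(r"\b([A-Za-z]{2})[A-Za-z]*\1\b")

sample = "A banana for the dodo"

loose_hits = [m.group(0) for m in loose.finditer(sample)]
strict_hits = [m.group(0) for m in strict.finditer(sample)]

print(loose_hits)   # ['anan', 'dodo']
print(strict_hits)  # ['dodo']
```

On the text from the question both patterns happen to find the same words, so the difference only shows up on input such as "banana".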


karthik manchala
  • dodo is not on the list, and thanks. Also, how do I get output in the same form as the answer shown? – Lee ChunHong Sep 16 '15 at 16:34
  • @LeeChunHong check the update :) and `\1` or `\2` is the back reference to the 1st or 2nd captured group – karthik manchala Sep 16 '15 at 16:36
  • explain downvote plz? so that i can improve the answer? – karthik manchala Sep 16 '15 at 16:36
  • I didn't downvote you, but I have a suggestion. Get rid of the external parentheses and use something like `'([A-Za-z]{2})[A-Za-z]*\\1'` - or, if you don't mind numbers in your words, `'(\w{2})\w*\\1'`. – Jake Sep 16 '15 at 16:39
  • So with what you have done so far, the result includes all the words except for dodo. How do I print my output exactly as the answer shows? I didn't know what a downvote is, so it is unlikely that I downvoted you. – Lee ChunHong Sep 16 '15 at 16:40
  • You asked for regular expressions only. 'dodo' should be in the list. Show us what you have in your python, then we can make suggestions. – Jake Sep 16 '15 at 16:41
  • dodo is now in the list, but how do I print the output so that it looks exactly like the answer? So far you guys have successfully printed the words that match the requirement, but I need the output to look like the answer to get a match. – Lee ChunHong Sep 16 '15 at 16:46
  • I downvoted because when I saw the answer 1) it did not work and 2) there was no explanation, which means this answer isn't going to be useful to anyone else unless they have an identical problem (which is probably also why the question itself is being downvoted). – user812786 Sep 16 '15 at 16:49
  • r'(([a-zA-Z]{2})\w*\2)', this is the answer. Thank you guys, and can someone please explain to me what that r does? Thanks – Lee ChunHong Sep 16 '15 at 16:55
  • @whrrgarbl thats a valid argument.. i tried to add explanation and working regex.. thanks :) – karthik manchala Sep 16 '15 at 16:56
  • glad to help, undid the vote. @Lee - the r prefix marks the string as a "raw string literal". [This question](http://stackoverflow.com/questions/2081640/what-exactly-do-u-and-r-string-flags-do-in-python-and-what-are-raw-string-l) might be helpful to look at. – user812786 Sep 16 '15 at 16:58
  • @LeeChunHong "The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation." – karthik manchala Sep 16 '15 at 16:58
  • Thank you guys, solid effort all around. However, here is a thought: what if I want a regex that only checks the last 2 characters, instead of using a wildcard to match every single character in a word? – Lee ChunHong Sep 16 '15 at 17:13
  • @LeeChunHong that would still require parsing them and matching 1st and last two characters, so performance wise there would be no difference – karthik manchala Sep 16 '15 at 17:18
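To make the raw-string point from the comments above concrete, here is a small sketch. It only uses standard Python string behaviour; the variable names are made up for illustration.

```python
import re

raw = r"\1"      # two characters: a backslash and '1' (a backreference for re)
plain = "\1"     # one character: the control character chr(1)
escaped = "\\1"  # the same two characters as the raw form, escaped by hand

print(len(raw), len(plain), len(escaped))  # 2 1 2

# Both spellings produce the same pattern text, so they compile identically.
pattern_raw = re.compile(r"([A-Za-z]{2})[A-Za-z]*\1")
pattern_escaped = re.compile("([A-Za-z]{2})[A-Za-z]*\\1")
print(pattern_raw.pattern == pattern_escaped.pattern)  # True
```

This is why Jake's suggestion writes `\\1` inside an ordinary string while the later versions use `r'...'` with a single backslash.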

Using a variant of the regex from karthik manchala's answer, and noting that you want the same output as given in your question, here is a complete code example:

import re

inputText = """The dodo was one of the sturdiest birds.
An educated termite may learn how to operate a phonograph,
but it's unlikely. I sense that an amalgam that includes
magma will enlighten Papa."""

regex = re.compile(r"((\w{2})\w*\2)")
answer = regex.findall(inputText) 
print("answer = {}".format(answer))

Note that in addition to capturing the first two characters with (\w{2}), allowing an arbitrary number of characters in between with \w*, and finally matching the captured pair again at the end with \2, I've surrounded the entire regex with another pair of parentheses, ( ... ). Groups are numbered by their opening parenthesis, which is why the two-character group is now referenced as \2 rather than \1.

With this pattern the entire word is group 1 (\1), whilst the two-character prefix is group 2 (\2). findall finds all occurrences and returns a list of tuples, where each tuple holds the capture groups of one match.
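The group numbering described above can also be made explicit with `finditer`, which yields match objects instead of tuples. This is only a sketch of an alternative, not the code from the answer; restricting the character class to `[A-Za-z]` and adding `\b` word boundaries are assumptions on my part.

```python
import re

text = ("The dodo was one of the sturdiest birds. An educated termite may "
        "learn how to operate a phonograph, but it's unlikely. I sense that "
        "an amalgam that includes magma will enlighten Papa.")

# Group 1 is the whole word, group 2 the repeated two-letter prefix.
pattern = re.compile(r"\b(([A-Za-z]{2})[A-Za-z]*\2)\b")

# Build the (word, prefix) tuples by hand from each match object.
answer = [(m.group(1), m.group(2)) for m in pattern.finditer(text)]
print("answer = {}".format(answer))
```

"Papa" is not matched because "Pa" and "pa" differ in case; passing `re.IGNORECASE` to `re.compile` would include it, which the expected output in the question does not.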

holroy