
I'm looking for a regex to match hyphenated words in Python.

The closest I've managed to get is: '\w+-\w+[-\w+]*'

import re

text = "one-hundered-and-three- some text foo-bar some--text"
hyphenated = re.findall(r'\w+-\w+[-\w+]*', text)

which returns the list ['one-hundered-and-three-', 'foo-bar'].

This is almost perfect except for the trailing hyphen after 'three'. I only want an additional hyphen if it is followed by a 'word', i.e. instead of '[-\w+]*' I need something like '(-\w+)*', which I thought would work but doesn't (it returns ['-three', '']). In other words, I need something that matches |word followed by hyphen followed by word, followed by hyphen_word zero or more times|.

bad_coder
Sixhobbits
  • I don't know what you plan to use this for, but have you considered cases where a trailing or prefixed hyphen is [valid](http://en.wikipedia.org/wiki/Hyphen), like "nineteenth- and twentieth-century" or "investor-owned and -operated"? – Lauritz V. Thaulow Dec 05 '11 at 09:38
  • The main problem in your own expression is the square brackets. They don't group the content together; they create a character class, which is something completely different. – stema Dec 05 '11 at 09:46
  • Thanks for the input, lazyr. I have considered the cases you point out, and they will not pose a problem. Thanks for the clarification, stema. I realised that the square brackets did not group the content, but they resulted in the closest match for what I was attempting to do. – Sixhobbits Dec 05 '11 at 11:55
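
To illustrate the point about square brackets, here is a minimal sketch. The sample strings are chosen purely for illustration. It shows that `[-\w+]*` is a character class matching any run of hyphens, word characters, and literal plus signs, whereas `(?:-\w+)*` repeats the sequence "hyphen followed by word characters".

```python
import re

# Character class: any mix of '-', word characters, and a literal '+'.
char_class = re.compile(r'[-\w+]*')
# Non-capturing group: zero or more repetitions of "hyphen + word chars".
group = re.compile(r'(?:-\w+)*')

class_ok = char_class.fullmatch('-and-three-') is not None    # trailing hyphen accepted
class_plus = char_class.fullmatch('+-+a--') is not None       # '+' is literal inside [...]
group_ok = group.fullmatch('-and-three') is not None          # well-formed tail accepted
group_trailing = group.fullmatch('-and-three-') is not None   # trailing hyphen rejected

print(class_ok, class_plus, group_ok, group_trailing)
```

Only the grouped version rejects the tail that ends in a bare hyphen, which is the behaviour the question asks for.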

1 Answer


Try this:

re.findall(r'\w+(?:-\w+)+',text)

Here we consider a hyphenated word to be:

  • one or more word characters
  • followed by one or more repetitions of:
    • a single hyphen
    • followed by one or more word characters
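
As a quick check, the sketch below runs the pattern on the sample string from the question. It also runs the capturing-group attempt mentioned there. When a pattern contains a capturing group, `re.findall` returns the group's contents rather than the whole match, which is why that attempt produced `['-three', '']`. The variable names are only illustrative.

```python
import re

text = "one-hundered-and-three- some text foo-bar some--text"

# Non-capturing group (?:...): findall returns the whole match.
non_capturing = re.findall(r'\w+(?:-\w+)+', text)

# Capturing group (...): findall returns only the last repetition of the
# group, or '' when the group did not take part in the match.
capturing = re.findall(r'\w+-\w+(-\w+)*', text)

print(non_capturing)  # ['one-hundered-and-three', 'foo-bar']
print(capturing)      # ['-three', '']
```

Note that 'some--text' is not matched at all, because each hyphen must be followed directly by a word character.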
a'r