1

I need to create a regular expression (for program in haskell) that will catch the strings containing "X" and ".", assuming that there are 4 "X" and only one ".". It cannot catch any string with other X-to-dot relations. I have thought about something like

[X\.]{5}

But it catches also "XXXXX" or ".....", so it isn't what I need.

Enri
  • 25
  • 5

4 Answers4

6

That's called permutation parsing, and while "pure" regular expressions can't parse permutations it's possible if your regex engine supports lookahead. (See this answer for an example.)

However I find the regex in the linked answer difficult to understand. It's cleaner in my opinion to use a library designed for permutation parsing, such as megaparsec.

You use the Text.Megaparsec.Perm module by building a PermParser in a quasi-Applicative style using the <||> operator, then converting it into a regular MonadParsec action using makePermParser.

So here's a parser which recognises any combination of four Xs and one .:

import Control.Applicative
import Data.Ord
import Data.List
import Text.Megaparsec
import Text.Megaparsec.Perm

fourXoneDot :: Parsec Dec String String
fourXoneDot = makePermParser $ mkFive <$$> x <||> x <||> x <||> x <||> dot
    where mkFive a b c d e = [a, b, c, d, e]
          x = char 'X'
          dot = char '.'

I'm applying the mkFive function, which just stuffs its arguments into a five-element list, to four instances of the x parser and one dot, combined with <||>.

ghci> parse fourXoneDot "" "XXXX."
Right "XXXX."
ghci> parse fourXoneDot "" "XX.XX"
Right "XXXX."
ghci> parse fourXoneDot "" "XX.X"
Left {- ... -}

This parser always returns "XXXX." because that's the order I combined the parsers in: I'm mapping mkFive over the five parsers and it doesn't reorder its arguments. If you want the permutation parser to return its input string exactly, the trick is to track the current position within the component parsers, and then sort the output.

fourXoneDotSorted :: Parsec Dec String String
fourXoneDotSorted = makePermParser $ mkFive <$$> x <||> x <||> x <||> x <||> dot

    where mkFive a b c d e = map snd $ sortBy (comparing fst) [a, b, c, d, e]
          x = withPos (char 'X')
          dot = withPos (char '.')
          withPos = liftA2 (,) getPosition

ghci> parse fourXoneDotSorted "" "XX.XX"
Right "XX.XX"

As the megaparsec docs note, the implementation of the Text.Megaparsec.Perm module is based on Parsing Permutation Phrases; the idea is described in detail in the paper and the accompanying slides.

Benjamin Hodgson
  • 42,952
  • 15
  • 108
  • 157
  • Wow... Nice work ! However I think it's a bit too much for his simple question ^^ – Gawil Jun 09 '17 at 15:57
  • @Gawil I think it's simpler to write readable code using a library that's fit for purpose than it is to abuse RE lookahead. – Benjamin Hodgson Jun 09 '17 at 15:59
  • Abuse RE lookahead ? Don't over react like that. – Gawil Jun 09 '17 at 16:01
  • @Gawil Apologies. I just meant that regexes, even with lookahead, weren't designed for this, and while it's possible to do it with regexes the resulting expressions aren't exactly simple or readable. – Benjamin Hodgson Jun 09 '17 at 16:07
5

The other answers look quite complicated to me, given that there are only five strings in this language. Here's a perfectly fine and very readable regex for this:

\.XXXX|X\.XXX|XX\.XX|XXX\.X|XXXX\.
Daniel Wagner
  • 145,880
  • 9
  • 220
  • 380
  • 1
    +1, this is a nice and simple answer for the question as posed. But, _caveat lector_, the size of the regex blows up exponentially with the size of the alphabet, so I wouldn't recommend this approach for anything much more complex than this – Benjamin Hodgson Jun 10 '17 at 16:27
2

Are you attached to regex, or did you just end up at regex because this was a question you didn't want to try answering with applicative parsers?

Here's the simplest possible attoparsec implementation I can think of:

parseDotXs :: Parser ()
parseDotXs = do
  dotXs <- count 5 (satisfy (inClass ".X"))
  let (dots,xS) = span (=='.') . sort $ dotXs
  if (length dots == 1) && (length xS == 4) then do
     return ()
  else do
     fail "Mismatch between dots and Xs"

You may need to adjust slightly depending on your input type.

There are tons of fancy ways to do stuff in applicative parsing land, but there is no rule saying you can't just do things the rock-stupid simple way.

Iron Gremlin
  • 407
  • 2
  • 11
0

Try the following regex :
(?<=^| )(?=[^. ]*\.)(?=(?:[^X ]*X){4}).{5}(?=$| )

Demo here

If you have one word per string, you can simplify the regex by this one :
^(?=[^. \n]*\.)(?=(?:[^X \n]*X){4}).{5}$

Demo here

Gawil
  • 1,171
  • 6
  • 13
  • Unfortunately, I have everything mixed, so the second regex won't work. The first one doesn't work with haskell: lexical error in string/character literal at character 'S' – Enri Jun 09 '17 at 16:36
  • @Enri: Oh my bad... I thought Haskell would handle \S. You can replace it easily though, I'll update my answer. – Gawil Jun 12 '17 at 07:18