6

Using grep, I am trying to match lines which are comprised of two characters, one followed repeated followed by the other, but only match when the number of first character occurances is equal to the occurrences of the second character.

As an example, imagine that I can only match two characters like '0' and '1'. Now imagine that if there are n '0' characters then there must be n '1' characters following directly afterward. For example:

  • ''
  • '0011'
  • '000111'
  • '00000000001111111111'

would all match. But:

  • '011'
  • '1100'
  • '110001'

wouldn't match.

I've been playing around with capture groups and trawling through perldoc for more info on grep -P but haven't found any leads to solve my problem - with grep at least.

How could I make a grep command to match strings given these constraints?

EDIT:

  • In this example, the 0s should come before the 1s as per the restriction "following directly afterward"
  • The empty string should also be a match case because by the example restrictions, when there are n 0s there should be n 1s, so with zero 0s there should be zero 1s.
Matthew Brian
  • 105
  • 1
  • 5
  • 3
    `I've been playing around with capture groups and trawling through perldoc for more info on grep -P` Good that you have mentioned that you have tried many things, please do add them in your question to avoid close votes on your question, Thank you. – RavinderSingh13 Feb 27 '21 at 06:06
  • 3
    Showing what you've done/tried helps people to help you. It helps a lot. It makes a big difference – zdim Feb 27 '21 at 06:23
  • 3
    are you sure *1100* must not match ? You speak about first and second character whatever they are, not about some 0 then same number of 1 – bruno Feb 27 '21 at 09:58
  • 2
    Also why the empty line must match ? An empty line does not use 2 characters. Please clarify your question editing it – bruno Feb 27 '21 at 10:19
  • 1
    @bruno 1100 fails in this sort of problem because the 0s have to be before the 1s. And n can be zero, hence an empty string matching. – Shawn Feb 27 '21 at 10:49
  • 2
    @Shawn for me this is not clear at all, the first part of the question is a statement but the second part just an example. The fact an empty line match is also strange and not compatible with the rest. Hope OP will clarify – bruno Feb 27 '21 at 10:52
  • 1
    @bruno Seems perfectly clear from OP's description. Plus it's the canonical example of a context free language that can't be matched by (traditional) regular expressions. Given there have been multiple questions about it in recent days, I'm guessing some CS school's language theory class is going over the topic. – Shawn Feb 27 '21 at 11:00
  • 1
    @Shawn "_Seems perfectly clear from OP's description_" -- no, not to me, it doesn't. They don't specify, nor even mention, any of that; just an example which makes it seem perfectly clear to me that it can be any two characters (of course in either order). I followed the "breadcrumb trail" of links and I see what you mean, but that is _not_ stated in this question. – zdim Feb 27 '21 at 20:36
  • @bruno Though I thought I explained it will enough, to make it more clear I'll add an edit to better clarify that the zeroes should come before the ones, and explain why the empty string matches. Thanks for the comment – Matthew Brian Feb 27 '21 at 20:55

4 Answers4

5

See EDIT below for update to clarifications


Here's a Perl one-liner instead of grep

perl -wne'print if /^((.)\g{-1}+)((.)\g{-1}+)$/ and length $1 == length $3' file

The length comparison of matches is clearly done outside of regex; I don't see that it can be done inside nicely, and I don't see anything wrong with using code which isn't regex :)

This doesn't match with a single character (ab), what wouldn't really make sense and what seems excluded from the question. The anchors (^ and $) make it so that it can match only strings with two characters, what seems to be specified.

That \g{-1} is a relative backreference. It matches the same subpattern that was captured last, which is what we need instead of a simple backreference (\g1).

This is need because \g1 refers to the first capture, the set of parens started first (leftmost), which is the capture of the whole pattern. (We can use \g2 but it is bad practice to count them off.)

This can be made nicer by using named references, but then it would be more elaborate as well.


EDIT   Following the clarifications, whereby it must be 0s first then the same number of 1s, and 0-repetitions count (so an empty line), as well as 1-repetition of course (so 01). This simplifies matters greatly, for just

perl -wne'print if /^(0*)(1*)$/ and length $1 == length $2' file

The 0 and 1 can be made into variables which can be supplied as external arguments, if desired (so it can be any grammar, a and b etc).

It prints as expected on the example input from the question, so on input file

0011

000111
00000000001111111111
01

011
1100
110001

it prints

0011

000111
00000000001111111111
01

(the last empty line in output being the empty line in the middle, after which no more lines match)


That is, without employing tricky features that run code inside regex, which would make it far more complex. If you still wish to play with that see it in perlre and in perlretut.

Or, this can also be done using recursion in regex, with similar (or little lesser?) complexity.

zdim
  • 64,580
  • 5
  • 52
  • 81
  • you match 1100 which is probably right even the OP question says the contrary, but you do not match empty line (supposing not an other error of the OP) – bruno Feb 27 '21 at 10:19
  • @bruno Thanks for commenting ... um, how is `1100` not right? Two characters, twice each ...? (They didn't say anything about whether one of them must come before the other. Their example is _wrong_, as you nicely raised in a comment.) And why would it match an empty line? That cannot possibly be right. – zdim Feb 27 '21 at 10:21
  • yes all of that is strange, I put 2 remarks in the OP question, hoping clarification will be made – bruno Feb 27 '21 at 10:24
  • 1
    @bruno Right, thank you for doing that, and we can hope that they do clarify. It is possible that they indeed mean that `0` must come before `1` but they would _have to_ clearly specify that since it may mean that there are particular characters being sought, or that there is some additional criterion (smaller number first?) ... Without theirs stating this clearly I must assume that it's an error in the example (easy to assume given that there is an empty line in that example) – zdim Feb 27 '21 at 10:28
  • @zdim thanks for the answer, the explanation of Perl features was really good. I thought saying "n '1' characters following directly afterward" would be enough to state in the example that the 1s must come after the 0s. Obviously this wasn't clear enough and I have edited the question to clarify. – Matthew Brian Feb 27 '21 at 21:07
  • This also seems to match cases like 1111 and 00000000 – Matthew Brian Feb 27 '21 at 21:23
  • @MatthewBrian Alright, thank you very much for responses and clarifications. I'll edit to adjust (or to add) shortly and will let you know – zdim Feb 27 '21 at 21:40
  • @MatthewBrian Updated; it's now fairly trivial with fixed characters, which can themselves be supplied to the "program" (it's a program even is it is all on the command-line). It's still a Perl program though, which can't be lifted into `grep`'s PCRE regex since it uses the language to compare the length of matches – zdim Feb 27 '21 at 21:58
3

With your shown samples only, in case you are ok with awk you could try following.

awk 'match($0,/^0+/){num1=RLENGTH;match($0,/1+/);if(num1==RLENGTH){print}}' Input_file

Explanation: Adding detailed explanation for above.

awk '                          ##Starting awk program from here.
match($0,/^0+/){               ##Using match function to match starting zeroes here.
  num1=RLENGTH                 ##Creating num1 here with rlength.
  match($0,/1+/)               ##Matching all ones now.
  if(num1==RLENGTH){ print }   ##Checking condition if num1 is equal to current length then print the line.
}
' Input_file                   ##mentioning Input_file name here.
RavinderSingh13
  • 130,504
  • 14
  • 57
  • 93
3

This awk one line should do the job:

cat file

0011

000111
00000000001111111111
011
1100
11000
awk '/^0*1*$/ && gsub(/0/, "&") == gsub(/1/, "&")' file

0011
000111
00000000001111111111

Or if you want to print numbers that may have 1s followed by 0s then use:

# awk command
awk '/^(0*1*|1*0*)$/ && gsub(/0/, "&") == gsub(/1/, "&")' file

0011
000111
00000000001111111111
1100

gsub function returns number of replacements.


Since you've used grep tag, here is one gnu grep command with -P (PCRE recursive) regex:

grep -P '^(0(?1)?1|1(?1)?0)?$' file

0011
000111
00000000001111111111
1100

grep RegEx Demo

anubhava
  • 761,203
  • 64
  • 569
  • 643
  • 2
    *011* and *110001* must not match for sure. The fact *1100* must not match is probably an error in the OP question. Except for empty line you return the reverse of the expected answer – bruno Feb 27 '21 at 09:59
  • 2
    Thanks @bruno I was printing non-matches earlier, but reversed it now. I am not sure if OP wants `1`s followed by `0`s also but have included that option also in my answer. – anubhava Feb 27 '21 at 10:08
  • 1
    Hopefully this has been clarified in the most recent edit – Matthew Brian Feb 27 '21 at 21:13
1

This is impossible to do using true regular expressions, but it is possible to do with Perl regular expressions thanks to recursion.

/
   ^ (?&BALANCED)?+ \z

   (?(DEFINE)
      (?<BALANCED> 0 (?&BALANCED)?+ 1 )
   )
/x

Terser:

/^((?:0(?1)1)?+)\z/

Demo:

$ perl -nle'print if /^((?:0(?1)1)?+)\z/' <<'.'

0011
000111
00000000001111111111
011
1100
110001
.

0011
000111
00000000001111111111

See Specifying file to process to Perl one-liner.


PCRE also supports recursion. You could therefore use the following:

grep -P '^((?:0(?1)1)?+)$'

Demo:

$ grep -P '^((?:0(?1)1)?+)$' <<'.'

0011
000111
00000000001111111111
011
1100
110001
.

0011
000111
00000000001111111111
ikegami
  • 367,544
  • 15
  • 269
  • 518