Yesterday, my roommate and I discussed a question on Stack Overflow:

How to get the second column from command output?

It asks how to extract the second column from an input stream like this:

1540 "A B"
   6 "C"
 119 "D"

With the top-voted answer,

<some_command> | sed 's/^.* \(".*"$\)/\1/'

the result is exactly what was requested.
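For anyone who wants to reproduce this, here is a minimal sketch. The helper names sample_input and second_column are made up for illustration, and printf stands in for <some_command>, which the original question does not specify.

```shell
# Hypothetical helper: prints the three sample lines shown above.
sample_input() {
  printf '%s\n' '1540 "A B"' '   6 "C"' ' 119 "D"'
}

# The sed command from the top-voted answer, unchanged.
second_column() {
  sed 's/^.* \(".*"$\)/\1/'
}

sample_input | second_column
```

This should print "A B", "C" and "D", one per line.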

But then we noticed that, if regex matching were purely greedy, the pattern ^.*␣ would match 1540 "A␣, which confused my roommate. My hypothesis was that the pattern ^.*␣ has to compromise with the pattern (".*"$); otherwise, the second pattern would match nothing. My roommate was not convinced, so he proposed another example to test, and we tried it.

We ran two experiments. In the first, we added a quote " after the character A, like this:

1540 "A" B"
   6 "C"
 119 "D"

and the same sed command gives this result:

"A" B"
"C"
"D"

In the second, we added a space and a quote (␣") after the A, like this:

1540 "A " B"
   6 "C"
 119 "D"

the result is:

" B"
"C"
"D"

At this point my roommate was even more confused, because his attention was fixed on the second pattern (".*"$). In his mind, that pattern should behave the same way on the two strings 1540 "A" B" and 1540 "A " B", so the second test's result should be "A " B" rather than " B". My reasoning is that, for the first experiment, the pattern ^.*␣ clearly cannot match 1540 "A"␣, because that would leave no match for the second pattern. But for the second experiment, 1540 "A " B", the two choices 1540␣ and 1540 "A␣ both seem reasonable for ^.*␣: the former results from the greed of (".*"$), the latter from the greed of ^.*␣.
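Both experiments can be reproduced with a small sketch like this one. The function name extract is hypothetical; the sed expression is the one quoted above.

```shell
# Hypothetical helper: feeds one line to the sed command from the answer.
extract() {
  printf '%s\n' "$1" | sed 's/^.* \(".*"$\)/\1/'
}

extract '1540 "A" B"'    # first experiment
extract '1540 "A " B"'   # second experiment
```

The first call should print "A" B" and the second " B", matching the results we saw.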

So can anyone give a more specific answer that pins down the key to our confusion? Thanks.

Sughiy
  • I think there is some sort of undefined behavior here... – vp_arth Jul 23 '15 at 09:29
  • @vp_arth Undefined behavior? Sorry, I don't understand. In my opinion, the logic should be fixed even though the semantics are ambiguous for humans. Shouldn't the code always choose the same behavior? – Sughiy Jul 23 '15 at 09:52
  • I just believe it's implementation dependent... I could be wrong here. The second `.*` can't be more greedy than the first one... – vp_arth Jul 23 '15 at 10:34

1 Answer

A .* pattern is greedy in the sense that it will first attempt to match as much as it can, and then backtrack through the string, matching less and less, when necessary. Regular expressions are matched left-to-right, which means the first .*'s greediness will dominate the second's in the case of ambiguity.
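A tiny sketch of that left-to-right priority, using a made-up input where two adjacent groups could split the text in several ways (split_demo is a hypothetical name):

```shell
# Both groups could share "aaa" in four different ways;
# the leftmost a* takes everything and leaves the second one empty.
split_demo() {
  printf '%s\n' "$1" | sed 's/^\(a*\)\(a*\)$/[\1][\2]/'
}

split_demo 'aaa'
```

This should print [aaa][], so the first group got all three characters.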

Let's apply this idea to 1540 "A" B".

Simplified, the regex is:

^.* (".*")$
  • First, try to match the whole string with the first .*.
    • Drat, we need to match a space next!
  • OK, let's try everything up to the last space: 1540 "A".
    • No good. A quote " must follow the space.
  • Well, let's look backwards a bit further... Here, the space after 1540 is followed by a quote.

Then, it will match the rest of the expression and succeed. So the greediest match for the first .* is 1540, and the group matches the rest of the string, "A" B".
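One way to check this split is to capture the first .* as well and print both parts. This is only a sketch: show_split is a hypothetical helper, and the extra pair of parentheses is my addition to the simplified regex.

```shell
# Same simplified regex, but the first .* is captured too,
# so both halves of the match can be printed.
show_split() {
  printf '%s\n' "$1" | sed 's/^\(.*\) \(".*"\)$/first=[\1] group=[\2]/'
}

show_split '1540 "A" B"'
```

This should print first=[1540] group=["A" B"].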

Now let's apply it to 1540 "A " B".

  • First, try to match the whole string with the first .*.
    • Drat, we need to match a space next!
  • OK, let's try everything up to the last space: 1540 "A ".
    • No good. A quote " must follow the space.
  • Well, let's look backwards a bit further... Oh, look! The space after 1540 "A is followed by a quote! We can be slightly greedier than last time.

The greediest match for the first .* is now 1540 "A, and the group will match the rest of the string, " B".
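The two walkthroughs can be mimicked by hand with a small loop. To be clear, this is not how sed is implemented; it is only a model of the steps above, and trace is a hypothetical helper name. It backs up one space at a time from the right and stops at the first position where the rest of the line starts and ends with a quote.

```shell
# Models the backtracking of the first .* in ^.* (".*")$ by hand.
trace() {
  line=$1
  prefix=$line                      # start with the first .* holding everything
  while :; do
    case $prefix in
      *' '*) ;;                     # there is still a space to back up to
      *) printf '%s\n' 'no match'; return 1 ;;
    esac
    prefix=${prefix% *}             # back up to just before the last space
    rest=${line#"$prefix "}         # what remains for (".*")$
    case $rest in
      '"'*'"')                      # starts and ends with a quote: success
        printf '%s\n' "first .* = [$prefix], group = [$rest]"
        return 0 ;;
      *)
        printf '%s\n' "reject [$prefix]: [$rest] is not of the form \".*\"" ;;
    esac
  done
}

trace '1540 "A" B"'
trace '1540 "A " B"'
```

For the first string the loop rejects 1540 "A" and settles on 1540; for the second it rejects 1540 "A " and settles on 1540 "A, one step earlier in the backtracking, which is why the captured groups differ.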

Lynn
  • Thanks, I think I forgot to consider the quote `"` following the space. So in fact the two parts of the regex work together, and the parentheses then capture whatever the second pattern matched? – Sughiy Jul 23 '15 at 11:02