Expressing a match beyond a $ in regex

Question

I have tried the following tests with awk:

Example:
If I have a file that has:

miz[space][space][end-of-line]  
[empty line]  
pel

If I do:

$ cat mul.txt |awk 'sub(/miz\s+/,"misspell")'  
misspell

awk finds the pattern.

But if I remove the 2 spaces from the end of the first line:

miz[end-of-line]  
[empty line]  
pel

I get:
$ cat mul.txt |awk 'sub(/miz\s+/,"misspell")'

I.e. awk does not match.

It seems that there is some subtlety between $ and \s that I fail to understand.
I also cannot find a way to express a regex that matches beyond a $, yet the first snippet works.
Could someone please explain what the issue is here?

Update:
This: $ cat mul.txt |awk 'sub(/miz(\s+|$|^$|^\s+$)+pel/,"misspell")' does not work either

See the answers to a similar question: [can awk patterns match multiple lines?](http://stackoverflow.com/questions/14350856/can-awk-patterns-match-multiple-lines) — , Feb 02 '14 at 16:25
Not familiar with Awk but Silviu Burcea is onto something. Your `\s+` is telling the regex engine that there has to be at least 1 space and possibly more, where `\s*` would accept when there are zero spaces or more. So `$ cat mul.txt |awk 'sub(/miz\s*/,"misspell")'` should do it? — asontu, Feb 02 '14 at 19:38

score 2 · Answer 1 · answered Feb 02 '14 at 16:24

First of all, \s is specific to GNU awk; non-GNU awks don't support it. Coming back to your problem: you can set RS (the Record Separator) to the null byte, like this, and your regex will work in both cases:

 awk 'sub(/miz[[:space:]]/,"misspell")' RS='\0' file

Take note of RS='\0', which sets RS to the null byte.
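
As far as I know, using a null byte as RS is reliable in GNU awk, but other awk implementations may handle it differently. A minimal sketch of an alternative that needs no special RS: collect the whole input in a variable and substitute once in the END block. The variable name `buf` is made up for illustration, and the `printf` merely recreates the second form of the file from the question.

```shell
# Second form of the file: "miz", an empty line, "pel" (no trailing spaces).
# Each record is appended to buf together with the newline awk stripped,
# so the regex can see the line breaks when it runs in END.
printf 'miz\n\npel\n' | awk '
    { buf = buf $0 "\n" }
    END {
        sub(/miz[[:space:]]+pel/, "misspell", buf)
        printf "%s", buf
    }'
# prints: misspell
```

Because the newlines are now part of the string being searched, `[[:space:]]+` can span the empty line between `miz` and `pel`.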

anubhava

`\s is gnu-awk specific and non-gnu awk don't support it` I thought that `\s` is the regex for whitespace. Where does the GNU version fit in here? – Jim Feb 02 '14 at 16:29
I tried your answer. It seems to match just the first line: `$ awk 'sub(/miz[[:space:]]/,"misspell")' RS='\0' mul.txt` I get: `misspell pel` – Jim Feb 02 '14 at 16:30
@Jim `\s` is used by Perl and many other regular expression parsers. However, awk is a POSIX utility, requiring usage of POSIX regular expressions. This means `[[:space:]]` is the proper portable form of a space character class as far as awk, sed, grep, etc. are all concerned. Awk uses POSIX extended regular expressions (EREs), sed uses POSIX basic regular expressions (BREs), and grep uses BREs unless you add the -E option to use EREs instead. – Feb 02 '14 at 16:33
@Jim: The reason you get `misspell\n\n\npel` for the first case and `misspell\npel` for the second is that `[[:space:]]` also matches `\n`, so the newline is replaced too (the same is true of `\s`) – anubhava Feb 02 '14 at 16:38
+1. This: `$ awk 'sub(/miz[[:space:]]+pel/,"misspell")' RS='\0' mul.txt` seems to match exactly. So does `[[:space:]]` also match a `$`? I am not sure how the match happens here – Jim Feb 02 '14 at 16:43
Yes, `[[:space:]]` (or `\s`) matches whitespace: a space, a tab, or a newline. – anubhava Feb 02 '14 at 16:45
@anubhava: But I am asking about the end of line `$` – Jim Feb 02 '14 at 16:54
See, awk processes input line by line. What is your expected output from the 1st and 2nd match? – anubhava Feb 02 '14 at 17:05
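
To make the "line by line" point concrete, here is a small sketch (assuming the default record separator) that prints the length of each record for both forms of the file from the question. The newline that ends each line is the record separator, so it never appears in `$0`.

```shell
# First form: "miz" followed by two spaces, an empty line, "pel".
printf 'miz  \n\npel\n' | awk '{ printf "record %d: length %d\n", NR, length($0) }'
# record 1: length 5
# record 2: length 0
# record 3: length 3

# Second form: the two trailing spaces removed.
printf 'miz\n\npel\n' | awk '{ printf "record %d: length %d\n", NR, length($0) }'
# record 1: length 3
# record 2: length 0
# record 3: length 3
```

In the second form the first record is exactly `miz`, so there is no whitespace character left for `\s+` to consume, and a pattern such as `miz\s+pel` can never match because `miz` and `pel` sit in different records.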
@Jim End of line means before the newline character. The `[:blank:]` character class matches blank characters (tab and space characters). The `[:space:]` character class matches whitespace, which includes `[:blank:]` as well as the newline, carriage return, form feed and vertical tab characters. – Feb 02 '14 at 17:27
@Jim And if you're wondering why I wrote `[:space:]` instead of `[[:space:]]`, it is because `[[:space:]]` matches a character belonging to the `[:space:]` character class. That is, character classes are only recognized inside `[` and `]`. For example, `echo bar | grep '[:blank:]'` outputs a match, but `echo bar | grep '[[:blank:]]'` does not. – Feb 02 '14 at 17:38
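
The same distinction can be sketched with awk itself. This is an illustration only: some grep builds (recent GNU grep, for instance) reject `'[:blank:]'` with a syntax error instead of treating it as a plain bracket expression, and GNU awk may print a warning about the first pattern, which is why stderr is discarded here.

```shell
# '[:blank:]' is an ordinary bracket expression listing the characters
# ':', 'b', 'l', 'a', 'n', 'k'; "bar" contains 'b' and 'a', so it matches.
echo bar | awk '/[:blank:]/' 2>/dev/null
# prints: bar

# '[[:blank:]]' is the character class (space or tab); "bar" contains
# neither, so nothing is printed.
echo bar | awk '/[[:blank:]]/'
```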

Sabuj Hassan · Answer 2 · 2014-02-02T16:27:53.097

0

Use this regex so that it can handle both space and end of line:

/miz([ ]+|\n)/

edited Feb 02 '14 at 16:27

answered Feb 02 '14 at 16:06

Sabuj Hassan

This does not work either: `$ cat mul.txt |awk 'sub(/miz(\s+|$|^$|^\s+$)+pel/,"misspell")'` for the second form of the file – Jim Feb 02 '14 at 16:13
Interestingly, `\s` doesn't work with my awk! Anyway, you can use `\n` instead of `$` in my example. For instance, this works with my awk: `/miz([ ]+|\n)/` – Sabuj Hassan Feb 02 '14 at 16:27
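
One caveat worth checking on your own awk: with the default record separator, a record does not contain its terminating newline, so the `\n` branch can only match when records span several lines (for example after changing RS). A minimal sketch of a per-line variant, assuming the default RS, that anchors on the end of the record with `$` instead:

```shell
# Works for both forms of the file: optional trailing spaces, then end of record.
printf 'miz  \n\npel\n' | awk '{ sub(/miz[ ]*$/, "misspell"); print }'
printf 'miz\n\npel\n'   | awk '{ sub(/miz[ ]*$/, "misspell"); print }'
# each command prints:
# misspell
# (empty line)
# pel
```

This only rewrites the `miz` line. It does not join `miz` and `pel` across the empty line, which still requires reading more than one line into a single string.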
