AWK program using regex to count matching lines

Question

The program is supposed to count the number of lines that begin with a decimal number in parentheses, contain a mix of both upper- and lowercase letters, and end with a period.

I have

BEGIN {x=0}
/^\([0-9[0-9]*) [A-Z][A-z]* [a-z][a-z]* \.$/ {x = x+1}
END{print x}

I have them split across multiple lines because I have been running display(!d) statements for debugging while trying to figure it out. To run it, I use awk -f programName.awk filename.txt. Any help is appreciated.

UPDATE

New code reads

BEGIN{x=0}
/^\([0-9]+\)[A-Za-z]+\.$/{x++}
END{print x}

I use vim EC.awk to edit this. Running awk -f EC.awk EC.txt comes back with 1. EC.txt contains 12 lines, 5 of which should be counted.

INPUT FILE vim EC.txt

(1) Line one, this should count.
(2)Line two. Should also count.
3 should not count..
4 not
(5)Yes.
(6). nope
7 OHHH mann
8 This suck
(9)Oh ya? YOU SUCK.
10 Cheaa
(11) BOI.
(12) WoW MoM. Print mofo.

UPDATED CODE

BEGIN{x=0}
/^\([0-9]+\).*?[A-Za-z]+\.$/{x++}
END{print x}

This gives me 6. I believe it's counting line 11, (11) BOI. Working on printing out the lines to make sure.

Not sure your regex is correct. This `^\([0-9]\)[A-Za-z]*\.$` is a line beginning with a single digit in parenthesis, followed by u/l case letters and end with a period? Change to `^\([0-9]+\)[A-Za-z]*\.$` for many digits. — , Mar 03 '16 at 00:13
Popular question: http://stackoverflow.com/questions/35715258/awk-i-need-to-write-a-one-line-shell-command-that-will-count-all-lines-that http://stackoverflow.com/questions/28687756/shell-command-for-lines-that-have-decimal-number-in-parenthesis-upper-lower-ca — jas, Mar 03 '16 at 00:36
I changed to what you suggested, but I'm still getting the same result. If I put {x=x+1} on the next line, it prints/counts 12, which is the number of lines in my test file, but only 5 of the lines should be counted. If I have the {x=x+1} at the end of the line, it prints/counts 1. — ChrisFocker, Mar 03 '16 at 00:36
can you post your `EC.txt` file? because then i could change the regex according to the input file — riteshtch, Mar 03 '16 at 01:53
What would the expected output be given that input? Which lines do you expect to match? — Ed Morton, Mar 03 '16 at 03:14
What do you hope to gain by being unclear about what you're trying to match, and then giving out specifics piecemeal after repeated queries about what exactly you want to match? I vote to close this as _unclear as to what you're asking_ and for unresponsiveness. — , Mar 03 '16 at 15:55

score 5 · Accepted Answer · edited May 23 '17 at 12:07

For an alternative solution that expresses the intent more simply and clearly and is also locale-aware (doesn't invariably only match ASCII letters), see Ed Morton's helpful answer.

Try the following (POSIX-compliant):

awk '/^\([0-9]+\).*([A-Z].*[a-z]|[a-z].*[A-Z]).*\.$/ { ++x } END { print x+0 }' file

^\([0-9]+\) matches a decimal number in parentheses at the beginning of a line.
\.$ matches a literal period at the end of a line.
.*([A-Z].*[a-z]|[a-z].*[A-Z]).* matches any string in between that:
- Either: contains at least 1 uppercase letter followed by at least 1 lowercase one.
- Or: contains at least 1 lowercase letter followed by at least 1 uppercase one.
- Thus, this expression should match any string containing any mix of lower- and uppercase [ASCII-only] letters, as long as at least 1 uppercase and 1 lowercase letter are present.

As for why your approach didn't work:

Your initial solution attempt, [A-Z][A-z]* [a-z][a-z]*, only matches lines whose first [ASCII] letter on the line is uppercase; in other words: lines where the first letter on the line is lowercase aren't matched.
Your later solution attempt, [A-Za-z]+, uses a single character set, and any one of its characters satisfies it. It therefore also matches lines containing only uppercase or only lowercase letters, which is why line (11) BOI. also matches.
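To make the difference concrete, here is a minimal sketch that runs both patterns over the sample lines from the question. The temporary file, the variable names, and the `LC_ALL=C` prefix (added so that the ranges mean ASCII letters only) are assumptions for illustration, and the later attempt is simplified by dropping its `?`.

```shell
# Sample input copied from the question; the temp file name is arbitrary.
sample=$(mktemp)
cat > "$sample" <<'EOF'
(1) Line one, this should count.
(2)Line two. Should also count.
3 should not count..
4 not
(5)Yes.
(6). nope
7 OHHH mann
8 This suck
(9)Oh ya? YOU SUCK.
10 Cheaa
(11) BOI.
(12) WoW MoM. Print mofo.
EOF

# Single character set: any run of letters satisfies it, so "(11) BOI." is counted too.
loose=$(LC_ALL=C awk '/^\([0-9]+\).*[A-Za-z]+\.$/ { ++x } END { print x+0 }' "$sample")

# Alternation: needs one uppercase and one lowercase letter, in either order.
strict=$(LC_ALL=C awk '/^\([0-9]+\).*([A-Z].*[a-z]|[a-z].*[A-Z]).*\.$/ { ++x } END { print x+0 }' "$sample")

echo "loose=$loose strict=$strict"   # prints: loose=6 strict=5
rm -f "$sample"
```

The extra line in the loose count is (11) BOI., which has no lowercase letter.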

Ed Morton · Answer 2 · 2016-03-03T03:27:18.820

3

idk if this is the expected output or not since you didn't include that in your question, but I just coded what you said in your question ("count the number of lines begin with a decimal number in parenthesis, containing a mix of both upper and lower case letters and end with a period") and added the print so you can see what it matches. Take a look and see if it does what you want:

$ cat tst.awk
/^\([0-9]+\)/ && /[[:upper:]]/ && /[[:lower:]]/ && /\.$/ { print; cnt++ }
END { print cnt+0 }

$ awk -f tst.awk file
(1) Line one, this should count.
(2)Line two. Should also count.
(5)Yes.
(9)Oh ya? YOU SUCK.
(12) WoW MoM. Print mofo.
5

Don't get stuck thinking that the condition part of an awk statement has to be a regexp, like if this was sed or grep, as it doesn't - it can be a compound condition of ands/ors of regexp segments if that's what makes your code simpler and clearer as in this case IMHO.
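Because each part of the compound condition is an ordinary expression, the parts can also be evaluated one at a time to see which one rejects a given line. The following is only a debugging sketch under that assumption; the variable names (`num`, `upper`, `lower`, `dot`) are made up for illustration, and it assumes an awk that supports POSIX character classes.

```shell
# Evaluate the four sub-conditions separately for one problem line.
report=$(printf '%s\n' '(11) BOI.' | awk '{
    # A regex match used as an expression yields 1 (true) or 0 (false).
    num   = ($0 ~ /^\([0-9]+\)/)
    upper = ($0 ~ /[[:upper:]]/)
    lower = ($0 ~ /[[:lower:]]/)
    dot   = ($0 ~ /\.$/)
    printf "num=%d upper=%d lower=%d dot=%d\n", num, upper, lower, dot
}')
echo "$report"   # prints: num=1 upper=1 lower=0 dot=1
```

Here only the lowercase test fails, which is why the line is not counted.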

edited Mar 03 '16 at 03:27

answered Mar 03 '16 at 03:18

Ed Morton


++ for locale-awareness and clarity of intent; I haven't tested, but perhaps you know from experience: in terms of performance, how does this compare to my single-regex solution (assuming both solutions use either `[A-Z]` / `[a-z]` or `[[:upper:]]` / `[[:lower:]]`)? – mklement0 Mar 03 '16 at 03:27
Sorry, no I don't know. I THINK I remember someone doing a test in a different question that showed character classes were slightly faster than equivalent ranges but once you add in the ands/ors and everything else I couldn't guess which solution would be faster. They'll both run in the blink of an eye even for large files anyway though. – Ed Morton Mar 03 '16 at 03:30

riteshtch · Answer 3 · 2016-03-03T05:00:23.597

Your regex tries to match the following text (1 or more digits)<space><1 or more Uppercase><space><1 or more lowercase><space><period>

I think while posting the question you missed the ] after the digits. If you want uppercase followed by lowercase, then you must use your regex; but since you mentioned in your question that it can be a mix of uppercase and lowercase, you will have to use [A-Za-z]+. The + ensures 1 or more, i.e., [a-z]+ is equivalent to [a-z][a-z]*
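A quick way to check that equivalence is to count matches with both spellings on the same input. This is a small sketch; the test strings and variable names are arbitrary, and `LC_ALL=C` is an assumption added so that `[a-z]` means ASCII lowercase only.

```shell
# Four test lines: "abc", "a", an empty line, and "ABC".
input=$(printf '%s\n' 'abc' 'a' '' 'ABC')

# One-or-more written with +.
plus=$(printf '%s\n' "$input" | LC_ALL=C awk '/^[a-z]+$/ { n++ } END { print n+0 }')

# The same thing written as one character followed by zero-or-more.
star=$(printf '%s\n' "$input" | LC_ALL=C awk '/^[a-z][a-z]*$/ { n++ } END { print n+0 }')

echo "plus=$plus star=$star"   # prints: plus=2 star=2
```

Both forms accept "abc" and "a" and reject the empty line and "ABC".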

$ cat file.txt
(1) aBCdadg .
(2) dgshdf .
(3) DFHFH .
xyz
abcd
(56) sdflgkfd .
$ cat prgm.awk 
BEGIN {x=0}

/^\([0-9]+\) [A-Za-z]+ \.$/ {x++}

END {print x}
$ awk -f prgm.awk file.txt 
4
$

And if you want to have 1 or more lowercase chars followed by 1 or more uppercase, then you will have to use this regex:

/^\([0-9]+\) [a-z]+ [A-Z]+ \.$/ {x++}

Edit:

$ cat file.txt 
(1) Line one, this should count.
(2) Line two. Should also count.
3 should not count..
4 not
(5)Yes.
(6). nope
7 OHHH mann
8 This suck
(9) Oh ya? YOU SUCK.
10 Cheaa
(11) BOI.
(12) WoW MoM. Print mofo.
$ cat prgm.awk 
BEGIN {x=0}

/^\([0-9]+\)\s*[A-Za-z0-9., ]+\s*\./{x++}

END {print x}
$ awk -f prgm.awk file.txt 
5
$

Edit 2: Sorry, I was in a hurry to go somewhere and was off my comp for a few hours. Since it's clearer now what you need, I'll just update the answer for completeness.

$ cat prgm.awk 
BEGIN {x=0}

/^\([0-9]+\).*([A-Z].*[a-z]|[a-z].*[A-Z]).*\.$/{x++;print $0}

END {print x}
$ awk -f prgm.awk input_file.txt 
(1) Line one, this should count.
(2) Line two. Should also count.
(5)Yes.
(9) Oh ya? YOU SUCK.
(12) WoW MoM. Print mofo.
5
$

Do mark the question solved by accepting anyone's answer apart from mine :P :)

Edit 3: give others the credit.

Thanks for the help. I changed my program to use the + based on jas's comment with the other Stack Overflow question. Your answer has confirmed that this is the correct way to do it. Thanks for clarifying what the + does. I changed {x=x+1} to {x++}, which I realize is the same thing, and now my code matches yours. I run it and now it prints/counts 0. In theory everything seems/looks right. Frustrating. — ChrisFocker, Mar 03 '16 at 01:38
Ok, so that is printing line 11 BOI. which shouldn't be printed. The condition is a parenthesis with a number inside followed by a line with u/l and ends with a period. — ChrisFocker, Mar 03 '16 at 02:38

dawg · Answer 4 · 2016-03-03T03:49:05.117

1

It is sometimes best to break down the conditions into separate regexes:

Lines begin with a decimal number in parenthesis: /^\([0-9]+\)/ or /^\([[:digit:]]+\)/
Containing upper case letters: /[A-Z]/ or /[[:upper:]]/
Containing lowercase letters: /[a-z]/ or /[[:lower:]]/
End with a period: /\.[ \t]*$/ (the [ \t]* catches trailing spaces if any...)

Now just combine those conditions:

awk '/^\([[:digit:]]+\)/ && /\.[ \t]*$/ && /[[:lower:]]/ && /[[:upper:]]/ { print }' file
(1) Line one, this should count.
(2)Line two. Should also count.
(5)Yes.
(9)Oh ya? YOU SUCK.
(12) WoW MoM. Print mofo.

Then run through wc -l to get the line count:

awk '/^\([[:digit:]]+\)/ && /\.[ \t]*$/ && /[[:lower:]]/ && /[[:upper:]]/ { print }' file | wc -l
5

Or, maintain your own count:

awk '/^\([[:digit:]]+\)/ && /\.[ \t]*$/ && /[[:lower:]]/ && /[[:upper:]]/ { i++ } END{print i}' file
5
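One caveat with keeping your own counter: a variable that was never incremented is empty, so `print i` emits a blank line when nothing matches, while `print i+0` forces a numeric 0. A minimal sketch with a made-up input line that matches nothing:

```shell
# No line starts with "(number)", so the counter i is never set.
blank=$(printf 'no match here\n' | awk '/^\([0-9]+\)/ { i++ } END { print i }')
zero=$(printf 'no match here\n' | awk '/^\([0-9]+\)/ { i++ } END { print i+0 }')

echo "without +0: [$blank]"   # prints: without +0: []
echo "with +0:    [$zero]"    # prints: with +0:    [0]
```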

The issue with your regex:

/^\([0-9]+\).*?[A-Za-z]+\.$/
            ^^                       Any string of characters
                 ^ ^                 Could be 'UPPER' or 'lower'

The .* matches all characters (including spaces) leading up to [A-Za-z]+, which matches a run of upper- and/or lowercase letters but does not tell you whether you have both.

Almost, but that regex does not reject lines that fail to include both upper- and lowercase letters.

edited Mar 03 '16 at 03:49

answered Mar 03 '16 at 03:00

dawg


You should mention that `\s` is gawk-specific and that in some locales `a-z` means `aAbBcC...z` while `A-Z` means `AbBcC...zZ` so `/[a-z]/ && /[A-Z]/` would be satisfied by just the letter `c` for example. Also in the END you need to say `print i+0` so it'll print zero instead of a blank line if there's no matches. – Ed Morton Mar 03 '16 at 03:35
@EdMorton: Say more: What locale would `/[A-Z]/` include lower case? – dawg Mar 03 '16 at 03:40
idk off the top of my head, let me google for one and I'll get back to you. – Ed Morton Mar 03 '16 at 03:42
I haven't found a specific locale yet for ANY set but there's a discussion of the issue in the gawk manual: https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html. I'll keep looking for a bit but this isn't real high on my priority list! – Ed Morton Mar 03 '16 at 03:47
Found one - according to http://teaching.idallen.com/cst8177/13w/notes/000_character_sets.html#character-set-collation-order---lc_collate en_US.utf8 is one such locale. I've never found anything to suggest that `[0-9]` includes anything but digits btw so I tend to still use that vs `[[:digit:]]` for conciseness. – Ed Morton Mar 03 '16 at 03:53
As to `[[:digit:]]` vs `[0-9]` consider [this list](http://stackoverflow.com/a/891741/298607) :-0 – dawg Mar 03 '16 at 04:19
@EdMorton: Your first link tells me that POSIX in 2008 declared the behavior of ranges in locales other than "C" and "POSIX" as _undefined_. In practice, it seems that at least the most common utilities exhibit the _traditional_ behavior with ranges such as `[a-z]` and `[A-Z]` (ASCII only, distinction between upper- and lowercase, as expected), even in locales such as `en_US.UTF-8`. The following commands demonstrate that (nothing matches): `for char in 'c' 'ü'; do grep '[A-Z]' <<<"$char"; sed -n '/[A-Z]/p' <<<"$char"; awk '/[A-Z]/' <<<"$char"; tr -dC 'A-Z' <<<"$char"; done` – mklement0 Mar 03 '16 at 06:39
@mklement0 Maybe LC_COLLATE needs some specific value too? I don't feel that test proves there's no locales that interleave upper and lower case letters. Maybe it does, idk, I'm certainly no expert in the area nor do I have any interest in the subject of locales per se but there's enough web sites out there claiming that those locales do exist that I'd rather just stick to using character classes. If others would prefer to rely on ranges, that's absolutely fine with me. – Ed Morton Mar 03 '16 at 12:43
@EdMorton: Absolutely agreed that the POSIX character classes are generally the best choice. The question is: If and when you need it, can you get traditional, case-specific ASCII-only matching with `[a-z]` and `[A-Z]`, irrespective of the active locale? I think the answer is yes, at least in UTF-8-based locales, according to my quick tests, both with BSD and GNU utilities (tested with all installed UTF-8-based locales, with `LC_ALL` set). The interleaving you mention probably applies to contexts such as the `strcoll()` library functions (I'm no expert either). – mklement0 Mar 03 '16 at 20:27
Then what is the example in https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html about: `echo something1234abc | gawk-3.1.8 '{ sub("[A-Z]*$", ""); print }'` outputs `something1234a` because "the ‘bc’ at the end of ‘something1234abc’ should not normally match ‘[A-Z]*’. This result is due to the locale setting". That question is actually rhetorical as I REALLY don't care about the gory details of locales and I'll keep using character classes and advising people to use them but if others don't then no problem. – Ed Morton Mar 03 '16 at 20:32
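If traditional, ASCII-only range behavior is what you need regardless of the user's environment, one common approach is to pin the locale for that one command. The sketch below assumes an ASCII-based system, where the C (POSIX) locale orders characters by byte value so that `[A-Z]` covers only the 26 ASCII uppercase letters. It says nothing about how other locales behave, and the variable names are arbitrary.

```shell
# With LC_ALL=C, a lowercase "c" is not inside the range A-Z ...
lower_hits=$(printf 'c\n' | LC_ALL=C awk '/[A-Z]/ { n++ } END { print n+0 }')

# ... while an uppercase "C" is.
upper_hits=$(printf 'C\n' | LC_ALL=C awk '/[A-Z]/ { n++ } END { print n+0 }')

echo "c: $lower_hits  C: $upper_hits"   # prints: c: 0  C: 1
```

The character classes `[[:upper:]]` and `[[:lower:]]` remain the locale-aware choice when non-ASCII letters should be recognized.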
