grep command for counting the number of occurrences of a string in a file gives too low a count

Question

I have a 6GB XML file that consists of a single line (verified with wc -l file.xml).

This is the command I'm using: grep -o '<wd:Report_Entry>' file.xml | wc -l, and it outputs 446441. This is supposed to be the right command, as mentioned at https://stackoverflow.com/a/14510665/5524175.

The correct count is 1521620. Surprisingly, this Rust solution gives the right count: count_occurences '<wd:Report_Entry>' file.xml outputs 1521620.

Also, the following command mentioned in this accepted answer also gives 446441.

sed 's/<wd:Report_Entry>/<wd:Report_Entry>\n/g' file.xml | grep -c "<wd:Report_Entry>"

I'm not sure what I'm missing. Do characters like <, > or : need to be escaped? I'm on macOS. This is my grep version.

➜  ~ grep --version
grep (BSD grep, GNU compatible) 2.6.0-FreeBSD

Can you reproduce with a smaller file? E.g. with a file you create manually with a few `<wd:Report_Entry>` here and there? — norbjd, Aug 24 '23 at 09:44
It is a recurrent venue here. People asking for way to parse XML with sed or grep or awk. Dudes, you are basically trying to open a lock with a screwdriver. You may be able to with enough brute-force if you don't mind the resulting mess. — Léa Gris, Aug 24 '23 at 10:33
Your `grep` pattern does not use any special regex characters. To be on the safe side, you could also use the grep option `-F` to disable all regex matching, or do a `grep -o ...|less` and check for empty lines or other anomalies. Perhaps the string you are looking for also occurs in some XML comments, which are ignored by the other methods you tried? — user1934428, Aug 24 '23 at 10:40
This might be related [(SU) Is there a limit for a line length for grep command to process correctly?](https://superuser.com/questions/1703029/). It states that the line-length that GNU grep can handle is limited by the memory of the system. — kvantour, Aug 24 '23 at 10:41
[The Rust answer you linked](https://stackoverflow.com/a/58994637/1745001) gives you a clue - "Grep runs out of memory even on a machine with 768 GB of RAM!". What does `awk -v RS='<wd:Report_Entry>' 'END{print NR-1}' file.xml` **using GNU awk for multi-char RS** (so not the default BSD awk on macOS - you'll have to install GNU awk) output? There's nothing special about `<` or `>` in a regexp by the way, they're just literal characters as long as you don't put a `\` in front of them as that'd turn them into word boundaries in some regexp engines. — Ed Morton, Aug 24 '23 at 11:58
Wow. @EdMorton, your command worked! It outputs `1521620`. grep ran out of memory on my Linux server, but on my Mac it didn't give any error. It should've given some error instead of printing `446441`. — duplex143, Aug 24 '23 at 12:13
Good, I posted a more robust version of that gawk script with an explanation of it as [an answer](https://stackoverflow.com/a/76969187/1745001) now. I expect you could report that grep issue of no error message to the BSD folks or whoever it is who provides that grep. — Ed Morton, Aug 24 '23 at 12:18

Ed Morton · Answer 1 · 2023-08-30T19:26:42.227

3

As mentioned in the Rust answer you linked, "Grep runs out of memory even on a machine with 768 GB of RAM!", so I suspect you're having the same problem.

Using GNU awk for multi-char RS:

awk -v RS='<wd:Report_Entry>' 'END{print (NR ? NR : 1) - (RT ? 0 : 1)}' file

With the above we're counting the number of whatever...<wd:Report_Entry> "records" in the input. The (NR ? NR : 1) is to ensure we don't end up with -1 for an empty input file after the subsequent subtraction. The - (RT ? 0 : 1) is so we don't count the string after the last <wd:Report_Entry> in the input (input foo...<wd:Report_Entry>...bar should report 1, not 2). RT is the GNU awk variable holding the text that matched RS at the end of the last record read, so in the END block it is empty when the input did not end with <wd:Report_Entry>.

Since the above reads each <wd:Report_Entry>-separated string one at a time, it will handle very large files containing multiple <wd:Report_Entry>s better than grep -o '<wd:Report_Entry>', which apparently tries to read the whole input (here a single 6GB line) into memory at once and then look for matches.
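If GNU awk is not available, the same idea (hand the tool many short records instead of one huge line) can be sketched with POSIX tr and grep. This is an untested-at-scale illustration, not something from the original thread: the function name count_tag is hypothetical, and it assumes the file contains enough `>` characters that splitting on them yields short lines, which is typical for XML.

```shell
# count_tag FILE: print the number of <wd:Report_Entry> occurrences in FILE.
# Hypothetical helper; assumes '>' occurs often enough to keep lines short.
count_tag() {
    # Replace every '>' with a newline, so each occurrence of the tag becomes
    # a line ending in '<wd:Report_Entry', then count those lines.
    # '<', ':' and '_' are literal here; only the trailing '$' is an anchor.
    tr '>' '\n' < "$1" | grep -c '<wd:Report_Entry$'
}
```

Calling count_tag file.xml prints the count on standard output. Because tr streams the file and grep only ever sees short lines, neither tool has to hold the whole file in memory.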

edited Aug 30 '23 at 19:26

answered Aug 24 '23 at 12:17

Ed Morton


That giant file has no line breaks, so yeah, grep is slurping it all. Arguably it should support some sort of change to the record separator. – stevesliva Aug 30 '23 at 17:45
Gnu does - `-z` :-). grep doesn't need that because we have awk. – Ed Morton Aug 30 '23 at 17:53
This `-c` is twined with [grep not supporting](https://stackoverflow.com/questions/49268581/counting-occurrences-of-a-specific-number/49268736#49268736) `-co` to give string count. So `grep -ow | wc -l` gives the right number while `grep -cow` gives a number of lines. Sigh. So, *two* features that grep should support. – stevesliva Aug 30 '23 at 18:37
