I've a 6GB XML file that has only one line (verified with wc -l file.xml)

This is the command I'm using: grep -o '<wd:Report_Entry>' file.xml | wc -l. It outputs 446441. This is supposed to be the right command, as mentioned at https://stackoverflow.com/a/14510665/5524175.

The correct count is 1521620. Surprisingly, this Rust solution gives the right count: count_occurences '<wd:Report_Entry>' file.xml gives 1521620.

Also, the following command mentioned in this accepted answer also gives 446441.

sed 's/<wd:Report_Entry>/<wd:Report_Entry>\n/g' file.xml | grep -c "<wd:Report_Entry>"

I'm not sure what I'm missing. Do characters like <, > or : need escaping? I'm on macOS. This is my grep version.

➜  ~ grep --version
grep (BSD grep, GNU compatible) 2.6.0-FreeBSD
duplex143
  • Can you reproduce with a smaller file? E.g. with a file you create manually with a few `<wd:Report_Entry>` here and there? – norbjd Aug 24 '23 at 09:44
  • This is a recurring theme here: people asking for a way to parse XML with sed, grep or awk. Dudes, you are basically trying to open a lock with a screwdriver. You may be able to with enough brute force, if you don't mind the resulting mess. – Léa Gris Aug 24 '23 at 10:33
  • Your `grep` pattern does not use any special regex characters. To be on the safe side, you could also use the grep option `-F` to disable all regex matching, or run `grep -o ...|less` and check for empty lines or other anomalies. Perhaps the string you are looking for also occurs in some XML comments, which are ignored by the other methods you tried? – user1934428 Aug 24 '23 at 10:40
  • This might be related: [(SU) Is there a limit for a line length for grep command to process correctly?](https://superuser.com/questions/1703029/). It states that the line length GNU grep can handle is limited by the memory of the system. – kvantour Aug 24 '23 at 10:41
  • [The Rust answer you linked](https://stackoverflow.com/a/58994637/1745001) gives you a clue: "Grep runs out of memory even on a machine with 768 GB of RAM!". What does `awk -v RS='<wd:Report_Entry>' 'END{print NR-1}' file.xml` **using GNU awk for multi-char RS** (so not the default BSD awk on macOS; you'll have to install GNU awk) output? There's nothing special about `<` or `>` in a regexp by the way, they're just literal characters as long as you don't put a `\` in front of them, as that'd turn them into word boundaries in some regexp engines. – Ed Morton Aug 24 '23 at 11:58
  • Wow. @EdMorton, your command worked! It outputs `1521620`. grep ran out of memory on my Linux server, but on my Mac it didn't give any error. It should've given some error instead of printing `446441`. – duplex143 Aug 24 '23 at 12:13
  • Good, I posted a more robust version of that gawk script with an explanation of it as [an answer](https://stackoverflow.com/a/76969187/1745001) now. I expect you could report that grep issue of no error message to the BSD folks or whoever it is who provides that grep. – Ed Morton Aug 24 '23 at 12:18

1 Answer

As mentioned in the Rust answer you linked, "Grep runs out of memory even on a machine with 768 GB of RAM!", so I suspect you're having the same problem.

Using GNU awk for multi-char RS:

awk -v RS='<wd:Report_Entry>' 'END{print (NR ? NR : 1) - (RT ? 0 : 1)}' file

With the above we're counting the number of whatever...<wd:Report_Entry> "records" in the input. The (NR ? NR : 1) is to ensure we don't end up with -1 for an empty input file after the subsequent subtraction. The - (RT ? 0 : 1) is so we don't count the string after the last <wd:Report_Entry> in the input (input foo...<wd:Report_Entry>...bar should report 1, not 2): RT holds the text that matched RS for the last record read and is empty when that record was ended by end-of-input rather than by a separator.

Since the above is reading each <wd:Report_Entry>-separated string one at a time, it will handle very large files containing multiple <wd:Report_Entry>s better than grep -o '<wd:Report_Entry>', which apparently tries to read the whole input into memory at once and then look for matches.
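
If GNU awk isn't available, the same "never hold the whole line in memory" idea can be sketched with plain tr and grep. This is a swapped-in alternative rather than the gawk method above: count_tag is a hypothetical helper name, and the sketch assumes the tag is written without attributes and that the tag name contains no regex metacharacters.

```shell
# count_tag NAME FILE  (hypothetical helper, not part of any tool)
# Prints how many times the literal start tag <NAME> occurs in FILE.
count_tag() {
  # tr streams its input, so the single huge line is never buffered whole.
  # Every '>' becomes a newline, so each "<NAME>" turns into a short line
  # ending in "<NAME"; grep -c then counts those lines.
  # Assumption: NAME has no regex metacharacters (':' and '_' are fine).
  tr '>' '\n' < "$2" | grep -c "<${1}\$"
}

# Usage:
#   count_tag wd:Report_Entry file.xml
```

Closing tags such as </wd:Report_Entry> are not counted, because the pattern requires < directly before the tag name. An edge case worth knowing: input that ends in a truncated <wd:Report_Entry with no closing > would still be counted.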

Ed Morton
  • That giant file has no line breaks, so yeah, grep is slurping it all. Arguably it should support some sort of change to the record separator. – stevesliva Aug 30 '23 at 17:45
  • GNU grep does - `-z` :-). grep doesn't need that because we have awk. – Ed Morton Aug 30 '23 at 17:53
  • This `-c` is twined with [grep not supporting](https://stackoverflow.com/questions/49268581/counting-occurrences-of-a-specific-number/49268736#49268736) `-co` to give string count. So `grep -ow | wc -l` gives the right number while `grep -cow` gives a number of lines. Sigh. So, *two* features that grep should support. – stevesliva Aug 30 '23 at 18:37