grep all characters including newline

Question

I'm parsing an XML file with

"lalala it's a Sunday {{ Some words here, maybe
a new line }} oh boy"

How would I use grep to get everything between "{{" and "}}", given that grep's . character doesn't match newlines?

Currently I have

grep '{{.*}}'

but it only works on things that are on the same line.

Jesse Cohen · Accepted Answer · 2011-02-20T18:47:55.960

8

One option is to remove the newline and then grep, as in:

 cat myfile | tr -d '\n' | grep '{{.*}}'

But if you say this is an XML file, why not use an XML parser that takes advantage of the file's inherent structure rather than just regexp?

EDIT

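grep's regexps are greedy, so with several {{ }} pairs on one line the match runs from the first {{ to the last }}. For non-greedy matching you can use a Perl regexp:

cat myfile | tr -d '\n' | perl -pe 's/.*?(\{\{.*?\}\})/$1\n/g' | grep '{{'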

This should output one match per line. If you have nested {{ then this will get even more complicated.

edited Feb 20 '11 at 18:47

answered Feb 20 '11 at 17:33

Jesse Cohen

1

It does the cat but now the grep doesn't work - it returns the entire file. What gives? – Rio Feb 20 '11 at 18:01

score 2 · Answer 2 · answered Jun 11 '11 at 02:02

2

This is how I solved that problem:

   grep '{{[\s\S]*}}'


Yuri Barbashov

2

`\s` and `\S` are PCRE extensions, not available in standard grep. – Charles Duffy Feb 21 '13 at 20:40
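
For reference, a hedged sketch of how the `[\s\S]` idea can work when PCRE is available. GNU grep's -P option enables Perl-compatible regexps, and -z makes grep treat the input as NUL-separated records, so a file without NUL bytes is searched as one record. This assumes GNU grep built with PCRE support; the sample file is hypothetical.

```shell
#!/bin/sh
# Hypothetical sample input standing in for the real file.
sample=$(mktemp)
printf '%s\n' 'lalala {{ Some words here, maybe' \
              'a new line }} oh boy {{ second }} end' > "$sample"

# -P: Perl-compatible regexps, so \s and \S are understood.
# -z: records end at NUL instead of newline, so a match may span lines.
# -o: print only the matched text; *? makes the match non-greedy.
# With -z each match is written with a trailing NUL, which tr turns
# back into a newline.
matches=$(grep -Pzo '\{\{[\s\S]*?\}\}' "$sample" | tr '\0' '\n')

printf '%s\n' "$matches"
rm -f "$sample"
```

Without -P, a plain grep would read `[\s\S]` as a bracket expression containing a backslash, "s", and "S".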

score 1 · Answer 3 · edited May 23 '17 at 12:33

1

You can use alternation between mutually exclusive character sets to match truly any character. For example, this command:

grep -E "\{\{([[:digit:]]|[^[:digit:]])+\}\}"

...will match anything (greedily) between the first {{ and last }}.

But as @JesseCohen states, you really, really, really should be parsing XML with an XML parser, not regexps.

edited May 23 '17 at 12:33

Community

answered Feb 20 '11 at 18:03

Phrogz

If you must know, I'm trying to extract parts of a wikipedia dump XML file that contains unstructured data (all of the above potentially contained within ``). So I think the XML parsing is a bit less relevant here. – Rio Feb 20 '11 at 18:24
Wow, I did just that (the wikipedia dump thing). You might find it a lot harder than it seems (at least I did). – Noam Feb 20 '11 at 19:11
Moreover, I think using an XML parser requires loading the whole file at once, and that Wiki dump is HUGE. – Noam Feb 20 '11 at 21:02
@Noam Not if it's a streaming SAX parser, e.g. http://nokogiri.org/Nokogiri/XML/SAX.html – Phrogz Feb 20 '11 at 21:11

score 0 · Answer 4 · answered Dec 07 '18 at 16:06

0

This worked for me:

grep -zo '[[:cntrl:][:print:]]'
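
On its own, that character class matches one character at a time, so a fuller command presumably wraps it in the {{ and }} delimiters. A minimal sketch, assuming GNU grep (for -z) and a made-up sample file; LC_ALL=C is my addition so that the classes cover ASCII only:

```shell
#!/bin/sh
# Hypothetical sample input standing in for the real file.
sample=$(mktemp)
printf '%s\n' 'lalala {{ Some words here, maybe' \
              'a new line }} oh boy' > "$sample"

# -z: records end at NUL, so the newline is ordinary data inside a record.
# [[:cntrl:][:print:]] covers control characters (newline included) plus
# printable characters; in a basic regexp the braces are literal.
# The * is greedy: with several pairs it spans first {{ to last }}.
matches=$(LC_ALL=C grep -zo '{{[[:cntrl:][:print:]]*}}' "$sample" | tr '\0' '\n')

printf '%s\n' "$matches"
rm -f "$sample"
```

In the C locale, non-ASCII bytes fall outside both classes, so a span containing them would not match.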


Peter K
