The double quotes around the pipeline in your question are a syntax error, and the backslashed single quotes should apparently really not have backslashes, but I assume it's just an artefact of a slightly imprecise presentation.
Here's what the code does.
cat document.xml |
This is a useless use of cat
but its purpose is to feed the contents of this file into the pipeline.
grep '<w:t' |
This looks for lines containing the literal string <w:t
(probably the start of a tag in the XML format in the file). The single quotes quote the string so that it is not interpreted by the shell (otherwise the <
would be interpreted as a redirection operator); they are consumed by the shell, and not passed through to grep
.
sed 's/<[^<]*>//g' |
This replaces every pair of open/close brokets with an empty string. The regular expression [^<]*
matches zero or more occurrences of a character which can be anything except <
. If the XML is well-formed, these should always occur in pairs, and so we effectively remove all XML tags.
grep -v '^[[:space:]]*$'
This removes any line which is empty or consists entirely of whitespace.
Because sed
is a superset of grep
, the program could easily be rephrased as a single sed
script. Perhaps the easiest solution for your immediate problem would be to obtain a copy of sed
for your platform.
sed -e '/<w:t/!d' -e 's/<[^<]*>//g' -e '/[^[:space]]/!d' document.xml
I understand quoting rules on Windows may be different; try with double quotes instead of single, or put the script in a file and use sed -f file document.xml
where file
contains the script itself, like this:
/<w:t/!d
s/<[^<]*>//g
/[^[:space]]/!d
This is a rather crude way to extract the CDATA from an XML document, anyway; perhaps some XML processor would be the proper way forward. E.g. xmlstarlet
appears to be available for Windows. It works even if the XML input doesn't have the beginning and ending <w:t>
tags on the same line, with nothing else on it. (In fact, parsing XML with line-oriented tools is a massive antipattern.)