14

I am grepping with a regular expression. In the output, I would like to see only the strings that match my regexp.

In a bunch of XML files (mostly they are single-line files with huge amounts of data in a line), I would like to get all the words that start with MAIL_.

Also, I would like the grep command on the shell to give only the words that matched and not the entire line (which is the entire file in this case).

How do I do this?

I have tried

grep -Gril MAIL_* .
grep -Grio MAIL_* .
grep -Gro MAIL_* .
TRiG
AMM

4 Answers

18

First of all, with the GNU grep that ships with Ubuntu, the -G flag (basic regexp) is the default, so you can omit it. Even better, use extended regexps with -E.

The -r flag means a recursive search through the files in a directory, which is what you need.

You are also right to use the -o flag to print only the matching part of a line. To omit the file names, you will also need the -h flag.

The only mistake is in the regular expression itself: there is no character specification before the *. In a regexp, * repeats the preceding item, so MAIL_* means "MAIL followed by zero or more underscores" rather than "MAIL_ followed by anything". Your command should look like this:

grep -Ehro 'MAIL_[^[:space:]]*' .

Sample output (not recursive):

$ echo "Some garbage MAIL_OPTION comes MAIL_VALUE here" | grep -Eho 'MAIL_[^[:space:]]*'
MAIL_OPTION
MAIL_VALUE
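
To make the difference concrete, here is a small sketch that runs the original pattern and the corrected one against the same sample line. The variable names (sample, broken, fixed) are only illustrative.

```shell
# Sample line from the answer above; variable names are illustrative only.
sample='Some garbage MAIL_OPTION comes MAIL_VALUE here'

# 'MAIL_*' means "MAIL followed by zero or more underscores",
# so only the prefix of each word is reported.
broken=$(printf '%s\n' "$sample" | grep -o 'MAIL_*')

# 'MAIL_[^[:space:]]*' keeps consuming characters until the next whitespace.
fixed=$(printf '%s\n' "$sample" | grep -Eo 'MAIL_[^[:space:]]*')

printf 'original pattern:\n%s\ncorrected pattern:\n%s\n' "$broken" "$fixed"
```

Note that the pattern is quoted in both cases. Left unquoted, MAIL_* can also be expanded by the shell as a filename glob before grep ever sees it, if a matching file exists in the current directory.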
thor
  • Great, that works. One quick question: what do I do if I know the MAIL_* strings appear either as type="MAIL_*" or as >MAIL_*< in the files? Any help on that one? – AMM Aug 06 '10 at 12:48
  • I don't get it. Could you rephrase your question? Do you want to see the characters surrounding your MAIL_XXX strings? That is, do you want to see " and <> in the output of the grep command? – thor Aug 06 '10 at 12:51
  • If your MAIL_* strings can only contain alphabetic characters, then you can change the regexp to 'MAIL_[[:alpha:]]*' – thor Aug 06 '10 at 13:02
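
One possible reading of the question in the comments is "report a MAIL_ word only when it directly follows type=" or >". Under that assumption, a two-stage grep is a minimal sketch: the first stage matches the word together with its leading context, and the second strips the context again. The sample line and variable names are made up for illustration.

```shell
# Made-up sample: two wanted occurrences, plus an element name and a
# loose word that should both be ignored.
sample='<a type="MAIL_ABC_CDE">MAIL_XXX_AAA</a> <MAIL_tag/> plain MAIL_loose'

# Stage 1: keep only matches that directly follow type=" or >.
# Stage 2: drop that leading context so only the word itself remains.
in_context=$(printf '%s\n' "$sample" \
  | grep -Eo '(type="|>)MAIL_[[:alnum:]_]*' \
  | grep -Eo 'MAIL_[[:alnum:]_]*')

printf '%s\n' "$in_context"
```

This is still plain text matching, so it knows nothing about XML structure. It only narrows down which occurrences are reported.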
6

Try the following command

grep -Eo 'MAIL_[[:alnum:]_]*'
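
As a rough illustration of why this character class suits attribute values, the sketch below compares it with a "stop at whitespace" pattern on input shaped like type="MAIL_ABC_CDE". The sample string and variable names are assumptions chosen for the demonstration.

```shell
# Attribute-style input with several underscores per word (illustrative).
sample='type="MAIL_ABC_CDE" type="MAIL_XXX_AAA_AAA"'

# Stopping at whitespace also swallows the closing double quote.
by_space=$(printf '%s\n' "$sample" | grep -Eo 'MAIL_[^[:space:]]*')

# Letters, digits and underscores only: the match ends before the quote,
# however many underscores the word contains.
by_word=$(printf '%s\n' "$sample" | grep -Eo 'MAIL_[[:alnum:]_]*')

printf 'stop at whitespace:\n%s\nalnum and underscore:\n%s\n' "$by_space" "$by_word"
```

As written, the command in this answer reads standard input. To search files recursively it needs the -r flag and a path, and -h if file names should be left out.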
banx
2
grep -o or --only-matching

outputs only the matching text instead of complete lines. The problem could be that your regex is not restrictive enough (it is too greedy) and actually matches the whole file.
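
A small sketch of that effect on a single-line input, using a made-up sample: with -o, a pattern ending in .* still reports everything up to the end of the line, while a restricted character class reports each word separately.

```shell
# Single-line input, as in the files from the question (made-up sample).
sample='<a>MAIL_ONE</a><b>MAIL_TWO</b>'

# '.*' is greedy: one match that runs from the first MAIL_ to end of line.
too_greedy=$(printf '%s\n' "$sample" | grep -o 'MAIL_.*')

# Restricting the allowed characters yields one match per word.
restricted=$(printf '%s\n' "$sample" | grep -o 'MAIL_[[:upper:]_]*')

printf 'greedy:\n%s\nrestricted:\n%s\n' "$too_greedy" "$restricted"
```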

chocolate_jesus
  • The words I want appear in the file like this: type="MAIL_ABC_CDE" type="MAIL_XXX_AAA_AAA" etc. There can be any number of underscores. What regexp should I use? Any ideas? – AMM Aug 06 '10 at 12:42
0

From your comment on thor's answer, it seems you also want to distinguish whether the MAIL_.* text is a text node or an attribute, not just isolate it wherever it appears in the XML document. Grep cannot parse XML; you need a proper XML parser for that.

xmlstarlet is a command-line XML parser. It is packaged in Ubuntu.

Using it on this example file:

$ cat test.xml 
<some_root>
    <test a="MAIL_as_attribute">will be printed if you want matching attributes</test>
    <bar>MAIL_as_text will be printed if you want matching text nodes</bar>
    <MAIL_will_not_be_printed>abc</MAIL_will_not_be_printed>
</some_root>

For selecting text nodes you can use:

$ xmlstarlet sel -t -m '//*' -v 'text()' -n test.xml | grep -Eo 'MAIL_[^[:space:]]*'
MAIL_as_text

And for selecting attributes:

$ xmlstarlet sel -t -m '//*[@*]' -v '@*' -n test.xml | grep -Eo 'MAIL_[^[:space:]]*'
MAIL_as_attribute

Brief explanations:

  • //* is an XPath expression that selects all elements in the document, and text() outputs the value of their child text nodes, so everything except text nodes gets filtered out
  • //*[@*] is an XPath expression that selects all elements that have at least one attribute, and @* then outputs the attribute values
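
As a variation on the commands above, the XPath expression itself can do part of the filtering by selecting only attribute or text nodes that contain MAIL_. The following is a sketch under a few assumptions: the binary is called xmlstarlet (as on Ubuntu), the scratch directory and variable names are made up, and the block simply skips the queries when xmlstarlet is not installed.

```shell
# Scratch copy of the example file from this answer (path is illustrative).
tmp=$(mktemp -d)
cat > "$tmp/test.xml" <<'EOF'
<some_root>
    <test a="MAIL_as_attribute">will be printed if you want matching attributes</test>
    <bar>MAIL_as_text will be printed if you want matching text nodes</bar>
    <MAIL_will_not_be_printed>abc</MAIL_will_not_be_printed>
</some_root>
EOF

if command -v xmlstarlet >/dev/null 2>&1; then
  # Attribute nodes whose value contains MAIL_; grep then isolates the word.
  attrs=$(xmlstarlet sel -t -m '//@*[contains(., "MAIL_")]' -v . -n "$tmp/test.xml" \
    | grep -Eo 'MAIL_[[:alnum:]_]*')
  # Text nodes whose content contains MAIL_.
  texts=$(xmlstarlet sel -t -m '//text()[contains(., "MAIL_")]' -v . -n "$tmp/test.xml" \
    | grep -Eo 'MAIL_[[:alnum:]_]*')
  printf 'attributes: %s\ntext nodes: %s\n' "$attrs" "$texts"
else
  echo 'xmlstarlet not found; skipping the XPath queries' >&2
fi

rm -rf "$tmp"
```

The element named MAIL_will_not_be_printed is skipped in both queries, because element names are neither attribute values nor text nodes.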
Community
Catalin Iacob