11

I'm sure this has been asked but I can't find it so my apologies for redundancy.

I want to use grep or egrep to find every line that contains either ' P ' or ' CA ' and write those lines to a new file. I can easily do it for one string or the other using:

egrep ' CA ' all.pdb > CA.pdb

or

egrep ' P ' all.pdb > P.pdb

I'm new to regex, so I'm not sure of the syntax for "or".

Update: The order of the output lines is important, i.e. I do not want the output grouped by which string was matched. Here is an example of the first few lines of one file:

ATOM      1 N    THR U  27     -68.535  88.128 -17.857  1.00  0.00      1H5  N  
ATOM      2 HT1  THR U  27     -69.437  88.216 -17.434  0.00  0.00      1H5  H  
ATOM      3 HT2  THR U  27     -68.270  87.165 -17.902  0.00  0.00      1H5  H  
ATOM      4 HT3  THR U  27     -68.551  88.520 -18.777  0.00  0.00      1H5  H  
ATOM      5 CA   LYS B 122    -116.643  85.931-103.890  1.00  0.00      2H2B C  
ATOM      6 P    THY J   2     -73.656  70.884  -7.805  1.00  0.00      DNA2 P  
ATOM      8 HB   THR U  27     -68.543  88.566 -15.171  0.00  0.00      1H5  H  
ATOM      9 CA   LYS B 122    -116.643  85.931-103.890  1.00  0.00      2H2B C  
ATOM     10 P    THY J   2     -73.656  70.884  -7.805  1.00  0.00      DNA2 P  
ATOM     11 HB   THR U  27     -68.543  88.566 -15.171  0.00  0.00      1H5  H  
ATOM     12 C    SER D   2     -73.656  70.884  -7.805  1.00  0.00      DNA2 C  
ATOM     13 OP1  SER D   2     -73.656  70.884  -7.805  1.00  0.00      DNA2 O  

and I want the result file for this example to be:

ATOM      5 CA   LYS B 122    -116.643  85.931-103.890  1.00  0.00      2H2B C  
ATOM      6 P    THY J   2     -73.656  70.884  -7.805  1.00  0.00      DNA2 P  
ATOM      9 CA   LYS B 122    -116.643  85.931-103.890  1.00  0.00      2H2B C  
ATOM     10 P    THY J   2     -73.656  70.884  -7.805  1.00  0.00      DNA2 P  
Steven C. Howell
  • this one http://stackoverflow.com/questions/13610642/using-grep-for-multiple-search-patterns – Avinash Raj May 29 '15 at 13:08
  • @AvinashRaj you can only cast a close vote once, true. That's why it is important to select it properly (it happens to me also!). I haven't cast any, so once you reopen it I can vote to close as a dup – fedorqui May 29 '15 at 13:12
  • @AvinashRaj, I apologize for the duplicate question. Perhaps my question will help others find the answer, either here or at the link you shared. Should I delete my question since it is a duplicate, or simply select "That solved my problem!"? – Steven C. Howell May 29 '15 at 15:22
  • 1
    Searching Google for "regex match either string" might also help, showing http://www.regular-expressions.info/alternation.html and http://stackoverflow.com/questions/1188529/how-do-you-match-one-of-two-words-in-a-regular-expression near the top, for example. – Dave Newton May 29 '15 at 15:44

3 Answers

21

You can use grep like this:

grep ' P \| CA ' file > new_file

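The \| operator means "or". In grep's default basic regular expressions a bare | is a literal character, so it has to be escaped to get its special meaning (GNU grep supports this as an extension).

You can avoid the escaping, and group the alternatives, by using extended regular expressions: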

grep -E ' (P|CA) ' file > new_file

In general, I prefer the awk syntax, since it is clearer and easier to extend:

awk '/ P / || / CA /' file

Or, given your sample input, you can use awk to check whether the third column is exactly CA or P:

$ awk '$3=="CA" || $3=="P"' file
ATOM      5 CA   LYS B 122    -116.643  85.931-103.890  1.00  0.00      2H2B C
ATOM      6 P    THY J   2     -73.656  70.884  -7.805  1.00  0.00      DNA2 P
ATOM      9 CA   LYS B 122    -116.643  85.931-103.890  1.00  0.00      2H2B C
ATOM     10 P    THY J   2     -73.656  70.884  -7.805  1.00  0.00      DNA2 P
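Field-based matching relies on whitespace separating the columns. These files are fixed-width, so neighbouring fields can run together (the sample already contains `85.931-103.890`). As a hedged alternative, the sketch below reads the atom name by character position instead. It assumes the name sits in columns 13 to 16, as in the standard PDB layout and in the sample above; adjust the offsets if your files differ. The file name `sample.pdb` is only illustrative.

```shell
# Build a small sample in the same layout as the question's data.
work=$(mktemp -d)
cat > "$work/sample.pdb" <<'EOF'
ATOM      4 HT3  THR U  27     -68.551  88.520 -18.777  0.00  0.00      1H5  H
ATOM      5 CA   LYS B 122    -116.643  85.931-103.890  1.00  0.00      2H2B C
ATOM      6 P    THY J   2     -73.656  70.884  -7.805  1.00  0.00      DNA2 P
ATOM     13 OP1  SER D   2     -73.656  70.884  -7.805  1.00  0.00      DNA2 O
EOF

# Assumption: the atom name occupies columns 13-16.
# Strip the padding spaces, then compare the name exactly.
result=$(awk '{ name = substr($0, 13, 4); gsub(/ /, "", name) }
              name == "CA" || name == "P"' "$work/sample.pdb")
printf '%s\n' "$result"
```

Lines are printed in input order, and names such as `OP1` are not selected because the comparison is exact.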

Test

$ cat file
hello P is here and CA also
but CA appears
nothing here
P CA
$ grep ' P \| CA ' file
hello P is here and CA also
but CA appears
$ grep -E ' (P|CA) ' file
hello P is here and CA also
but CA appears
$ awk '/ P / || / CA /' file
hello P is here and CA also
but CA appears
fedorqui
  • 1
    @stvn your update fails to work if `P` is present at the start. Could you elaborate on your question? – Avinash Raj May 29 '15 at 12:54
  • @AvinashRaj this is not a duplicate of that question. There they ask for AND, whereas here it is OR – fedorqui May 29 '15 at 12:59
  • @fedorqui part of the linked question is the answer to this question :-). Also, he hasn't answered the question I asked above, and he hasn't shown his attempts; there are tons of questions like this on SO. – Avinash Raj May 29 '15 at 13:01
  • @AvinashRaj I disagree. The linked question says "But that's matching lines that contains string1 OR string2.". Here the OP says "... find every line that has either P or CA ". So one is AND, here it is OR. Hence, they are completely different questions and the answers there do not apply here. – fedorqui May 29 '15 at 13:03
  • @AvinashRaj, I do not understand what you mean by "[my] update failed to work if ` P ` present at the start". If you mean my string will exclude instances where `P` is at the beginning of the line, that is my intention. I specifically want the search string to have at least one whitespace character before and after either CA or P, so ' CA ' or ' P '. The exact string is not the important part of the question, but I will update the question to clarify. – Steven C. Howell May 29 '15 at 15:31
  • 1
    @fedorqui, I am glad to learn about awk as it seems an equally good option. – Steven C. Howell May 29 '15 at 15:56
  • @stvn66 nice! `awk` is great if you want to do some complex logic that can get tricky with `grep`. See also my edit to match your sample input, using the field number. – fedorqui May 30 '15 at 10:18
  • @fedorqui, is there perhaps a simple way to renumber the numbers in the second column to increment from 1? If you notice in the example lines that is the case. It is not critical for what I am doing but would mean the output files are more correct. – Steven C. Howell May 31 '15 at 02:49
  • @stvn66 you can say `awk '$3=="CA" || $3=="P" {$2=++i; print}' file`. Note this will break the current fixed-width format, so you can pipe the output to `column -t` to get it aligned again. – fedorqui Jun 01 '15 at 07:18
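
If the fixed-width layout has to survive the renumbering, one option is to rewrite only the serial-number columns instead of assigning to `$2`. A minimal sketch, assuming the serial occupies columns 7 to 11 as in the sample lines of the question; `sample.pdb` and `renumbered.pdb` are names chosen for illustration:

```shell
work=$(mktemp -d)
cat > "$work/sample.pdb" <<'EOF'
ATOM      4 HT3  THR U  27     -68.551  88.520 -18.777  0.00  0.00      1H5  H
ATOM      5 CA   LYS B 122    -116.643  85.931-103.890  1.00  0.00      2H2B C
ATOM      6 P    THY J   2     -73.656  70.884  -7.805  1.00  0.00      DNA2 P
ATOM      8 HB   THR U  27     -68.543  88.566 -15.171  0.00  0.00      1H5  H
EOF

# Keep columns 1-6 and everything from column 12 onward untouched;
# replace only the 5-character serial field with a running counter.
awk '$3 == "CA" || $3 == "P" {
         printf "%s%5d%s\n", substr($0, 1, 6), ++n, substr($0, 12)
     }' "$work/sample.pdb" > "$work/renumbered.pdb"
cat "$work/renumbered.pdb"
```

Because `$0` is never modified, awk does not rebuild the line with single spaces, so no `column -t` pass is needed afterwards.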
0

The following command searches recursively through all files under /path_to_your_dir/ and writes the matches to /tmp/grep.log:

grep 'P|CA' -Er /path_to_your_dir/ > /tmp/grep.log

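If you need a case-insensitive search, replace -Er with -Eri.
In /tmp/grep.log you will see the path of each file and the matching line. Note that 'P|CA' matches those letters anywhere in a line; use ' (P|CA) ' to require the surrounding spaces from the question.
If you need to search only files with a specific extension, write something like: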

grep 'P|CA' -Er --include='*.php' /path_to_your_dir/ > /tmp/grep.log
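
As a self-contained illustration of the recursive form, the sketch below builds a throwaway directory (every path and file name in it is made up for the example), keeps the surrounding spaces from the question in the pattern, and quotes the `--include` glob so the shell does not expand it. It assumes a grep that supports `-r` and `--include`, such as GNU grep.

```shell
dir=$(mktemp -d)
mkdir -p "$dir/run1"
printf '%s\n' 'ATOM      5 CA   LYS B 122' 'ATOM      7 N    THR U  27' > "$dir/run1/a.pdb"
printf '%s\n' 'a note mentioning CA here' > "$dir/run1/notes.txt"

# Write the log outside the searched tree so grep never reads its own output.
log=$(mktemp)
grep -rE --include='*.pdb' ' (P|CA) ' "$dir" > "$log"
cat "$log"
```

Only the matching line from `a.pdb` ends up in the log; `notes.txt` is skipped by the `--include` filter even though it contains ` CA `.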

Hope this helps.

cn007b
0

On macOS Ventura, the following does the trick:

grep -e ' CA ' -e ' P ' all.pdb > CA.pdb

From the grep man page:

-e pattern, --regexp=pattern Specify a pattern used during the search of the input: an input line is selected if it matches any of the specified patterns. This option is most useful when multiple -e options are used to specify multiple patterns, or when a pattern begins with a dash (‘-’).
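
When the list of strings grows, the same "any of the specified patterns" behaviour is available by reading the patterns from a file with `-f`, one per line. A small sketch; `patterns.txt`, `sample.pdb` and `selected.pdb` are names chosen for illustration, and `-F` is added on the assumption that the strings should be matched literally rather than as regular expressions:

```shell
work=$(mktemp -d)
cat > "$work/sample.pdb" <<'EOF'
ATOM      4 HT3  THR U  27     -68.551  88.520 -18.777  0.00  0.00      1H5  H
ATOM      5 CA   LYS B 122    -116.643  85.931-103.890  1.00  0.00      2H2B C
ATOM      6 P    THY J   2     -73.656  70.884  -7.805  1.00  0.00      DNA2 P
ATOM      8 HB   THR U  27     -68.543  88.566 -15.171  0.00  0.00      1H5  H
EOF

# One pattern per line; the leading and trailing spaces are part of each pattern.
printf '%s\n' ' CA ' ' P ' > "$work/patterns.txt"

# -F: fixed strings, -f: read patterns from a file.
grep -F -f "$work/patterns.txt" "$work/sample.pdb" > "$work/selected.pdb"
cat "$work/selected.pdb"
```

The selected lines keep the order they had in the input file, regardless of the order of the patterns.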

Raghuram