I am trying to use Unix's grep to search for specific sequences within files. The files are usually very large (~1 GB) and consist of 'A's, 'T's, 'C's, and 'G's. They span many, many lines, with each line being a word of 60ish characters. The problem is that when I search for a specific sequence within these files, grep returns matches that occur on a single line, but not matches that span a line break (the pattern has a line break somewhere in the middle). For example:

Using

$ grep -i -n "GACGGCT" grep3.txt 

To search the file grep3.txt (I put the target 'GACGGCT's in double stars)

GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCT
CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTG**GA
CGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
CACCAGGCCAGCTCAGGCCACCCCTTCCCCAGTCA
CCCCCCAAGAGGTGCCCCAGACAGAGCAGGGGCCA
GGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC

Returns

3:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
8:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC

So, my problem here is that grep does not find the GACGGCT that spans the end of line 2 and the beginning of line 3.

How can I use grep to find target sequences that may or may not include a linebreak at any point in the string? Or how can I tell grep to ignore linebreaks in the target string? Is there a simple way to do this?

Lev Levitsky
Jason G
  • How do you know where the sequences start and stop? For example, can a sequence be only 40 characters and then break after the 40-character sequence? If you ignore line breaks then grep will just return the whole file as a single found entry. – Mark Meyer Sep 19 '12 at 18:31
  • +1 to the comment above; also, the `grep` results seem rather meaningless, as they represent random parts of a sequence (unless the whole file is a single sequence). – Lev Levitsky Sep 19 '12 at 18:33
  • If the file contains a single string you could combine the lines by removing the \n, e.g. with `tr -d '\n' < inputfile > tempfile` – wildplasser Sep 19 '12 at 19:13
  • Is the question then 'does this file contain the target sequence?', or do you really need to see some context for the line the data is embedded in? If you are just trying to find files with the target sequence, use @wildplasser's technique to "flatten" out the file. Otherwise, Unix tools (sed, awk, grep) are line-oriented tools. You're making them jump through hoops to process your clunky data. Any chance of fixing the source? Good luck. – shellter Sep 19 '12 at 20:15
  • I do not want to alter the files, nor do I wish to create a new file without the line breaks. I already have hundreds of files that take up terabytes of disk space; duplicating them would not be worth it, and changing them would make them unusable by most programs. I think shellter and NuclearGhost have made it clear from their descriptions that (grep, sed, awk) is/are not the tool(s) I need for this job.... That being said, does anyone know of a unix terminal controlled data mining tool? – Jason G Sep 19 '12 at 20:40
  • Also, I'm a total stackoverflow.com newb. How does one 'upvote' a response? – Jason G Sep 19 '12 at 20:41
  • More accurately, I would like to count the number of times that a particular sequence occurs in a file, and I would like the tool to behave as if there were no linebreaks in the file (even though there are linebreaks), as if consecutive lines were concatenated and the whole file were on one line. – Jason G Sep 19 '12 at 20:50
  • Well, it is not very hard (but not trivial either) to construct a DFA that scans the file, ignoring the line breaks. (f)lex may be a start to construct the DFA. I once posted a script here to generate a flex script to do that (searching for *multiple* (non-overlapping) patterns in one pass) – wildplasser Sep 19 '12 at 20:56
  • http://stackoverflow.com/a/8713849/905902 here is the link. IIRC, flex has the ability to replace the getc() by a user-supplied function, which in your case could be used to skip the embedded newlines (and increment the line counter) – wildplasser Sep 19 '12 at 22:28
  • @JasonG: To upvote, you need at least 15 reputation. See [faq](http://stackoverflow.com/faq#reputation) – AndrewC Sep 19 '12 at 22:47
  • Almost right: to upvote you must be registered and logged in and have at least 15 reputation. – wildplasser Sep 19 '12 at 22:51
  • @potong Did you miss this one? If you can't do it in sed no one can. JasonG : P.S. Welcome to StackOverflow. Please remember to accept the answer that best solves your problem, if any, by pressing the checkmark sign, http://i.imgur.com/uqJeW.png . When you see good Q&A, vote them up by using the gray triangles, http://i.imgur.com/kygEP.png . Note that 'giving' reputation points to others does not mean a deduction to your reputation points (unless you have posted a bounty). – shellter Sep 20 '12 at 02:24
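
The requirement stated in the comments above (count occurrences as if consecutive lines were concatenated, without writing a second copy of the file) can be sketched as a pipeline. This is only a minimal sketch under some assumptions: `count_seq` is a hypothetical helper name, `grep -o` is a GNU/BSD extension rather than POSIX, matches are counted without overlap, and because the stream reaches grep as one very long line, grep may need memory on the order of the file size.

```shell
# count_seq PATTERN FILE (hypothetical helper, not from the thread)
# Count non-overlapping, case-insensitive matches of PATTERN in FILE,
# treating the file as if it contained no line breaks.
count_seq() {
    # tr strips CR/LF on the fly, so no temporary file is created;
    # grep -o prints each match on its own line; wc -l counts them.
    tr -d '\r\n' < "$2" | grep -o -i -- "$1" | wc -l | tr -d ' '
}
```

It could be called as `count_seq GACGGCT grep3.txt`. Nothing is written to disk; the original file is only read.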

2 Answers

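I assume that each of your lines is 60 characters long. In that case the command below should work: it joins the lines, re-wraps the text at 60 characters, and greps the result. Note that a match straddling one of the new 60-character boundaries would still be missed.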

tr '\n' ' ' < grep3.txt | sed -e 's/ //g' -e 's/.\{60\}/&^/g' | tr '^' '\n' | grep -i -n "GACGGCT"

output :

1:GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCTCCAGACCTGGCCCTCCCTGGC
2:AGGAGGAGCCTG**GACGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGCCACCAGG
4:CCAGGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC
user1011046
pcregrep -nM "G[\n]?A[\n]?C[\n]?G[\n]?G[\n]?C[\n]?T" grep3.txt
1:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
2:CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTGGA
CGGCTAGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
6:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
lanes
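
The pcregrep pattern above allows an optional newline between each pair of adjacent characters, which gets tedious to type for longer sequences. A minimal sketch that builds such a pattern from the plain sequence follows; `newline_tolerant` is a hypothetical helper name, and it assumes the target consists only of plain letters (no regex metacharacters).

```shell
# newline_tolerant SEQ (hypothetical helper, not from the thread)
# Print SEQ with an optional newline (\n?) allowed between adjacent characters.
# Assumes SEQ contains only plain letters, no regex metacharacters.
newline_tolerant() {
    # First expression appends a literal \n? after every character;
    # second expression drops the trailing \n? after the last one.
    printf '%s\n' "$1" | sed -e 's/./&\\n?/g' -e 's/\\n?$//'
}
```

With pcregrep installed, it could be used as `pcregrep -n -M -i "$(newline_tolerant GACGGCT)" grep3.txt`.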