
I have a 300 GB text file that contains genomics data with over 250k records. There are some records with bad data, and our genomics program 'PoPoolation' allows us to comment out the "bad" records with an asterisk. Our problem is that we cannot find a text editor that will load the data so that we can comment out the bad records. Any suggestions? We have both Windows and Linux boxes.

UPDATE: More information

The program PoPoolation (https://code.google.com/p/popoolation/) crashes when it reaches a "bad" record, giving us the line number that we can then comment out. Specifically, we get a message from Perl that says "F#€%& Scaffolding". The manual suggests we can just use an asterisk to comment out the bad line. Sadly, we will have to repeat this process many times...

One more thought... Is there an approach that would allow us to add the asterisk to the line without opening the entire text file at once? This could be very useful given that we will have to repeat the process an unknown number of times.

Keith W. Larson
  • Why do you need to open it in a text editor? Surely you're not going to comment all 250k records by hand? Look at using awk or sed. – Joshua Ulrich Jun 03 '13 at 15:47
  • Find a pattern for those bad records and solve the problem with awk or sed, as @Joshua indicates. 250k records to be checked manually mean a lifetime. – fedorqui Jun 03 '13 at 15:50
  • We tried to load the file in Notepad++ and it took over 24 hours to load and was basically useless. – Keith W. Larson Jun 03 '13 at 15:56
  • Just found a thread that can be useful: http://stackoverflow.com/questions/159521/text-editor-to-open-big-giant-huge-large-text-files – fedorqui Jun 03 '13 at 15:58
  • If they are 'records' I would assume the data is in columns. If the file is .txt you could import it to SQL Server and run queries to identify and update your bad records. – AxGryndr Jun 03 '13 at 16:00
  • 350,000,000,000 bytes with 250,000 records is about a megabyte PER LINE. This is not a job for a text file. Get a proper database. Just sayin'. – Spacedman Jun 03 '13 at 16:00
  • This is genomics data... we are currently just hacks at this aspect of bioinformatics. Any suggestions are welcome! – Keith W. Larson Jun 03 '13 at 16:06
  • Is there an option to PoPoolation, or could you ask the developers to consider one, that would skip 'bad' lines rather than terminating? (You may not want to say "crashing", developers are touchy about semantics that way...) – Ben Bolker Jun 03 '13 at 19:53

4 Answers


Based on your update:

One more thought... Is there an approach that would allow us to add the asterisk to the line without opening the entire text file at once? This could be very useful given that we will have to repeat the process an unknown number of times.

Here is an approach: if you know the line number, you can add an asterisk at the beginning of that line with:

sed 'LINE_NUMBER s/^/*/' file

See an example:

$ cat file
aa
bb
cc
dd
ee
$ sed '3 s/^/*/' file
aa
bb
*cc
dd
ee

If you add -i, the file will be updated in place:

$ sed -i '3 s/^/*/' file
$ cat file
aa
bb
*cc
dd
ee

That said, I always think it is better to redirect to another file

sed '3 s/^/*/' file > new_file

so that you keep your original file intact and save the updated version in new_file.
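
Note that sed works line by line, so it never needs to load the whole 300 GB file into memory, although with -i (or a redirection) it still rewrites the entire file once per fix. As a minimal end-to-end sketch, assuming GNU sed and that the line number reported by the Perl error is stored in a shell variable (BADLINE is my own placeholder name):

BADLINE=1234567                     # line number reported when the program crashes (hypothetical value)
sed -i "${BADLINE} s/^/*/" file     # prepend the asterisk to just that line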

fedorqui

If you are required to have a person mark these records manually with a text editor, for whatever reason, you should probably use split to break the file up into manageable pieces.

split -a4 -d -l100000 hugefile.txt part.

This will split the file up into pieces with 100000 lines each. The names of the files will be part.0000, part.0001, etc. Then, after all the files have been edited, you can combine them back together with cat:

cat part.* > new_hugefile.txt
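
If the program keeps reporting line numbers relative to the original file, a little arithmetic tells you which piece to edit. A rough bash sketch, assuming the 100000-line pieces and part.NNNN names produced by the split command above (the variable names are mine):

N=1234567                              # line number reported against the original file
part=$(( (N - 1) / 100000 ))           # zero-based index of the piece containing it
line=$(( (N - 1) % 100000 + 1 ))       # line number within that piece
printf -v piece 'part.%04d' "$part"    # e.g. part.0012
sed -i "${line} s/^/*/" "$piece"       # mark the bad record in place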
Markku K.

The simplest solution is to use a stream-oriented editor such as sed. All you need is to be able to write one or more regular expressions that identify all (and only) the bad records. Since you haven't provided any details on how to identify the bad records, this is as specific as the answer can be.
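
For instance, a sketch of that idea, where BAD_PATTERN is only a stand-in for whatever regular expression actually identifies a bad record:

sed '/BAD_PATTERN/ s/^/*/' hugefile.txt > hugefile.marked.txt   # comment out every matching line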

Jim Garrison

A basic pattern in R is to read the data in chunks, edit, and write the result back out:

fin = file("fin.txt", "r")
fout = file("fout.txt", "w")
while (length(txt <- readLines(fin, n=1000000))) {
    ## txt is now up to 1000000 lines; add an asterisk to problem lines
    ## bad = <create logical vector indicating bad lines here>
    ## txt[bad] = paste0("*", txt[bad])
    writeLines(txt, fout)
}
close(fin); close(fout)

While not ideal, this works on Windows (implied by the mention of Notepad++) and in a language you are presumably familiar with (R). Using sed (definitely the appropriate tool in the long run) would require installing additional software and coming up to speed with sed.

Martin Morgan