
I have run into some very strange data corruption recently. Basically, what I do is:

  1. Transfer some large data (50 files, each around 8 GB) from one server to an HPCC (high-performance computing) cluster using "scp".
  2. Process each line of the input files, then append/write the modified lines to output files. I do this on the HPCC with "qsub -t 1-1000 xxx.sh", that is, submitting all 1000 jobs at the same time. These 1000 jobs use about 4 GB of memory each on average.

The basic format of my script is:

with open(path) as f:
    for line in f:
        pass  # process each line here

or

with open(path) as f:
    lines = f.readlines()  # reads the whole file into memory at once
# process lines

However, the weird part is that, from time to time, I see corruption in some parts of my data.

  1. First, I found that some (not all) of my "input" data was corrupted, so I suspected "scp". I asked some computer people and also posted here, but it seems very unlikely that "scp" would distort the data. I then used "scp" to transfer the data to the HPCC again, and this time the input data was fine. Weird, right? This makes me wonder: could the input data be corrupted by running memory/CPU-intensive programs on it?

  2. If the input data is corrupted, it is natural that the output is corrupted too. So I transferred the input data to the HPCC again, checked that all of it was in good shape, and ran the programs (I should point out: all 1000 jobs together). Most of the output files were good; however, very surprisingly, part of exactly one file was corrupted! I then reran the program on just that file by itself and got good output without any corruption! I'm so confused. After seeing so many weird things, my only conclusion is: maybe running many memory-intensive jobs at the same time harms the data? (But I have run lots of such jobs before, and it seemed OK.)
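Since a re-transferred or re-generated copy of a file turned out fine, one way to see where a bad copy goes wrong is to compare it byte by byte against the good copy. The following is only a sketch under that assumption (both copies are still on disk); the function name `first_difference` and the chunk size are hypothetical choices, not something from my actual scripts.

```python
def first_difference(path_a, path_b, chunk_size=1024 * 1024):
    """Return the byte offset of the first difference, or None if identical."""
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                # Walk the mismatching chunk to find the exact byte.
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        return offset + i
                # No differing byte found: one file is a prefix of the other.
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                return None  # both files ended without a mismatch
            offset += len(chunk_a)
```

Knowing the offset makes it possible to inspect just the region around the damage instead of the whole 8 GB file.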

And by data corruption, I mean something like this:

CTTGTTACCCAGTTCCAAAG9583gfg1131CCGGATGCTGAATGGCACGTTTACAATCCTTTAGCTAGACACAAAAGTTCTCCAAGTCCCCACCAGATTAGCTAGACACAGAGGGCTGGTTGGTGCATCT0/1
gfgggfgggggggggggggg9583gfg1131CCGGAfffffffaedeffdfffeffff`fffffffffcafffeedffbfbb[aUdb\``ce]aafeeee\_dcdcWe[eeffd\ebaM_cYKU]\a\Wcc0/1
CTTGTTACCCAGTTCCAAAG9667gfg1137CCGGATCTTAAAACCATGCTGAGGGTTACAAA1AGAAAGTTAACGGGATGCTGATGTGGACTGTGCAAATCGTTAACATACTGAAAACCTCT0/1
gfgggfgggggggggggggg9667gfg1137CCGGAeeeeeeeaeeb`ed`dadddeebeeedY_dSeeecee_eaeaeeeeeZeedceadeeXbd`RcJdcbc^c^e`cQ]a_]Z_Z^ZZT^0/1

However, it should look like this:

@HWI-ST150_0140:6:2204:16666:85719#0/1
TGGGCTAAAAGGATAAGGGAGGGTGAAGAGAGGATCTGGGTGAACACACAAGAGGCTTAAAGCATTTTATCAAATCCCAATTCTGTTTACTAGCTGTGTGA
+HWI-ST150_0140:6:2204:16666:85719#0/1
gggggggggggggggggfgggggZgeffffgggeeggegg^ggegeggggaeededecegffbYdeedffgggdedffc_ffcffeedeffccdffafdfe
@HWI-ST150_0140:6:2204:16743:85724#0/1
GCCCCCAGCACAAAGCCTGAGCTCAGGGGTCTAGGAGTAGGATGGGTGGTCTCAGATTCCCCATGACCCTGGAGCTCAGAACCAATTCTTTGCTTTTCTGT
+HWI-ST150_0140:6:2204:16743:85724#0/1
ffgggggggfgeggfefggeegfggggggeffefeegcgggeeeeebddZggeeeaeed[ffe^eTaedddc^Oacccccggge\edde_abcaMcccbaf
@HWI-ST150_0140:6:2204:16627:85726#0/1
CCCCCATAGTAGATGGGCTGGGAGCAGTAGGGCCACATGTAGGGACACTCAGTCAGATCTATGTAGCTGGGGCTCAAACTGAAATAAAGAATACAGTGGTA
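The expected layout above is four lines per record: a header starting with `@`, a sequence line, a line starting with `+`, and a quality line as long as the sequence. A rough sketch of a checker for that layout follows. It assumes every record is exactly four lines and that sequences contain only A, C, G, T, or N (allowing N is an assumption on my part); the name `find_bad_records` is hypothetical.

```python
import re

SEQ_RE = re.compile(r"^[ACGTN]+$")  # assumption: N allowed for uncalled bases

def find_bad_records(lines):
    """Yield (record_number, reason) for each malformed 4-line record."""
    record = []
    count = 0
    for raw in lines:
        record.append(raw.rstrip("\n"))
        if len(record) < 4:
            continue
        count += 1
        header, seq, plus, qual = record
        record = []
        if not header.startswith("@"):
            yield count, "header does not start with '@'"
        elif not SEQ_RE.match(seq):
            yield count, "sequence has characters outside ACGTN"
        elif not plus.startswith("+"):
            yield count, "third line does not start with '+'"
        elif len(qual) != len(seq):
            yield count, "quality length differs from sequence length"
    if record:
        # Fewer than four lines were left over at the end of the input.
        yield count + 1, "truncated record at end of input"
```

An open file handle can be passed directly as `lines`, so the file is still read one line at a time. Note that if corruption adds or drops a line, every later record is reported as bad too, so the first reported record is the interesting one.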
  • Two debugging suggestions: 1. If you suspect corruption during transmission, use a utility like `md5sum` or `sha1sum` to generate a checksum of each file before you transmit it. Generate another checksum when you receive the file and ensure it's correct. 2. Since you know the format of your input files (alternating lines of identifiers and DNA sequences that must consist of the regex `[ACGT]+`), verify each line as you read it, before you process it. If the output format is similarly predictable, verify it before you write it. – Adam Liss Dec 11 '11 at 02:43
  • Thanks, this is a lesson for me. I'll definitely check the data format every time before processing it. – LookIntoEast Dec 11 '11 at 02:48
  • I'm just curious: how does such corruption happen? What are the possible causes? – LookIntoEast Dec 11 '11 at 02:48
  • You may need to discover the source of the corruption in order to understand the cause. It can happen during transmission, storage, or processing; there's not enough information yet to know which. Another random thought: how old is the server's storage medium, and has it been checked recently? – Adam Liss Dec 11 '11 at 02:50
  • I mean, take "processing" as an example: my script is the same for all jobs, yet most of them work and only a few produce corrupted output. How can that be explained? Is it like genetic mutation in biology? Or is there some small probability that a computer simply makes errors? – LookIntoEast Dec 11 '11 at 02:57
  • If the error happens during processing, it may be data-dependent. In other words, it may happen only when the input contains a particular sequence. Maybe your program has antibodies to a particular digital protein. :-) – Adam Liss Dec 11 '11 at 03:12
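The checksum suggestion in the comments above can also be done from Python rather than with `md5sum` on the command line. This is a minimal sketch, assuming the same function is run on the source server and on the HPCC and the two digests are compared by hand; the name `file_md5` and the 1 MB chunk size are hypothetical choices.

```python
import hashlib

def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so an 8 GB file never sits in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string is the same hex digest that `md5sum` prints before the file name, so the two can be mixed: run `md5sum` on one machine and this function on the other. Recomputing the digest of an input file after the jobs finish would also show whether the stored copy changed during processing.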
