I have recently been running into some very strange data corruption. Basically, what I do is:
- Transfer some large data (50 files, each around 8GB) from one server to hpcc (high performance computing) using "scp".
- Process each line of the input files, then append/write the modified lines to output files. I do this on hpcc with "qsub -t 1-1000 xxx.sh", which submits all 1000 jobs at the same time. These 1000 jobs use about 4GB of memory each on average.
The basic format of my script is:
    f = open(file)
    for line in f:
        pass  # process the line
or
    f = open(file).readlines()
    # process lines
However, the weird part is that, from time to time, I see corruption in some parts of my data.
At first, I found that some (not all) of my "input" data was corrupted, so I suspected that "scp" was the problem. I asked some computer people and also posted here, but it seems there is very little chance that "scp" can distort the data. I then used "scp" to transfer my data to hpcc again, and this time the input data was fine. Weird, right? This makes me wonder: is it possible for input data to be damaged just by being read by memory/CPU-intensive programs?
If the input data is corrupted, it is natural that the output is corrupted too. So I transferred the input data to hpcc again and checked that all of it was in good shape. I then ran the programs (I should point out that I ran all 1000 jobs together). Most of the output files were good; however, very surprisingly, part of just one file was corrupted! So I ran the program again for this specific file on its own, and got good output without any corruption! I'm so confused. After seeing so many weird things, my only conclusion is: maybe running many memory-intensive jobs at the same time harms the data? (But I have run lots of such jobs before, and it seemed fine.)
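One way to tell whether a copy on hpcc still matches the original, both right after "scp" and again after the jobs have run, is to compare checksums of the two files. The following is only a minimal sketch: the function name `sha256_of_file` and the chunk size are my own choices, not part of the scripts described above.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of the file at `path`.

    The file is read in fixed-size chunks, so an 8GB input never has
    to fit in memory. `chunk_size` is an arbitrary choice here (1 MiB).
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running this on the source server and on hpcc and comparing the two hex strings shows whether the bytes are identical; the `sha256sum` command-line tool prints the same digest. If the digest of an input file changes between two runs on hpcc without anyone rewriting the file, the problem is more likely in storage or in the way the file is read than in "scp".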
By data corruption, I mean something like this:
CTTGTTACCCAGTTCCAAAG9583gfg1131CCGGATGCTGAATGGCACGTTTACAATCCTTTAGCTAGACACAAAAGTTCTCCAAGTCCCCACCAGATTAGCTAGACACAGAGGGCTGGTTGGTGCATCT0/1
gfgggfgggggggggggggg9583gfg1131CCGGAfffffffaedeffdfffeffff`fffffffffcafffeedffbfbb[aUdb\``ce]aafeeee\_dcdcWe[eeffd\ebaM_cYKU]\a\Wcc0/1
CTTGTTACCCAGTTCCAAAG9667gfg1137CCGGATCTTAAAACCATGCTGAGGGTTACAAA1AGAAAGTTAACGGGATGCTGATGTGGACTGTGCAAATCGTTAACATACTGAAAACCTCT0/1
gfgggfgggggggggggggg9667gfg1137CCGGAeeeeeeeaeeb`ed`dadddeebeeedY_dSeeecee_eaeaeeeeeZeedceadeeXbd`RcJdcbc^c^e`cQ]a_]Z_Z^ZZT^0/1
However, it should look like this:
@HWI-ST150_0140:6:2204:16666:85719#0/1
TGGGCTAAAAGGATAAGGGAGGGTGAAGAGAGGATCTGGGTGAACACACAAGAGGCTTAAAGCATTTTATCAAATCCCAATTCTGTTTACTAGCTGTGTGA
+HWI-ST150_0140:6:2204:16666:85719#0/1
gggggggggggggggggfgggggZgeffffgggeeggegg^ggegeggggaeededecegffbYdeedffgggdedffc_ffcffeedeffccdffafdfe
@HWI-ST150_0140:6:2204:16743:85724#0/1
GCCCCCAGCACAAAGCCTGAGCTCAGGGGTCTAGGAGTAGGATGGGTGGTCTCAGATTCCCCATGACCCTGGAGCTCAGAACCAATTCTTTGCTTTTCTGT
+HWI-ST150_0140:6:2204:16743:85724#0/1
ffgggggggfgeggfefggeegfggggggeffefeegcgggeeeeebddZggeeeaeed[ffe^eTaedddc^Oacccccggge\edde_abcaMcccbaf
@HWI-ST150_0140:6:2204:16627:85726#0/1
CCCCCATAGTAGATGGGCTGGGAGCAGTAGGGCCACATGTAGGGACACTCAGTCAGATCTATGTAGCTGGGGCTCAAACTGAAATAAAGAATACAGTGGTA
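
Since the expected layout is so regular, corrupted regions can be located automatically instead of by eye. The sketch below assumes that every record is exactly four lines (an "@" header, the sequence, a "+" line, and a quality string of the same length as the sequence) and that sequences contain only A, C, G, T, or N. The function name `find_bad_fastq_records` is hypothetical.

```python
def find_bad_fastq_records(lines):
    """Yield (record_number, reason) for each 4-line record that breaks the layout.

    `lines` is any iterable of text lines, for example an open file.
    Records are numbered from 1 in the order they are read.
    """
    record = []
    record_number = 0
    for line in lines:
        record.append(line.rstrip("\r\n"))
        if len(record) < 4:
            continue  # keep collecting until a full record is available
        record_number += 1
        header, sequence, plus, quality = record
        record = []
        if not header.startswith("@"):
            yield record_number, "header does not start with '@'"
        elif not plus.startswith("+"):
            yield record_number, "third line does not start with '+'"
        elif len(sequence) != len(quality):
            yield record_number, "sequence and quality lengths differ"
        elif set(sequence) - set("ACGTN"):
            yield record_number, "unexpected character in sequence"
    if record:
        # Fewer than four lines were left over at the end of the input
        yield record_number + 1, "truncated record at end of file"
```

Passing an open file to the function and printing what it yields gives the number of the first damaged record. Because records are grouped strictly four lines at a time, a corruption that adds or drops a line will cause every later record to be reported as well, so the first report is the one that matters.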