
Recently I transferred a set of data from one server to an HPCC (high-performance computing cluster). The commands look like:

scp /folder1/*.fastq.gz xxx@hpcc:/home/
scp /folder2/*.fastq.gz xxx@hpcc:/home/
scp /folder3/*.fastq.gz xxx@hpcc:/home/

I opened several terminals to transfer the data at the same time. In total I have ~50 such fastq.gz files, each around 10 GB. I'm just wondering: is there any possibility that data (especially data this large) will be corrupted when transferred this way?

I ask because the data on the source server is in good shape, while some of the data copied to hpcc is corrupted.

Thanks!

Oscar Foley
LookIntoEast
  • I guess gz means the files are compressed before transmission. Is it the compressed file which becomes corrupted, or is it its content? – kol Dec 09 '11 at 00:39
  • The file is already compressed before transmission (I checked it with `zcat`; it's in good shape). I just transfer the compressed form directly to hpcc, and when I check it there with `zcat`, it's corrupted. – LookIntoEast Dec 09 '11 at 00:44
  • Check `md5sum(1)` output on both endpoints; `md5sum * > /tmp/sums` on the source system, copy `/tmp/sums` to `hpcc` and run `md5sum -c /tmp/sums` to _really_ find out which ones are different. – sarnold Dec 09 '11 at 00:58
  • Just FYI for people looking for answers when scp seems to be copying files wrong: maybe the disk in the destination server has bad blocks, and scp is saving the file to a location that uses those bad blocks. (That is what happened to me; it took me a long time to realize.) – msb Jan 20 '23 at 20:10

2 Answers


I strongly doubt that your data was corrupted in transit by scp(1).

TCP protects each segment with a (weak) 16-bit checksum (a ones'-complement sum, not even a true CRC). Because it is only sixteen bits long, relying on TCP alone for data integrity means that roughly one in every 2^16 corrupted segments will still pass validation. I've long since lost the link (and the math), but vaguely recall that this works out to corrupted data being accepted as correct once every two to four gigabytes across the public Internet -- though those numbers relied on a specific error-introduction rate at the time I read that statistic.
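
As a back-of-the-envelope sketch (the segment size and the "every corrupted segment" framing here are illustrative assumptions, not measurements):

```shell
# A 16-bit checksum has 2^16 possible values, so a corrupted segment
# has about a 1-in-65536 chance of slipping through undetected.
echo $((2 ** 16))                            # 65536
awk 'BEGIN { printf "%.8f\n", 1 / 65536 }'   # per-corrupted-segment escape probability

# Illustrative scale: a 10 GB file split into ~1460-byte TCP payloads
# is on the order of 7 million segments, so even rare corruption
# events add up across a 50-file transfer.
awk 'BEGIN { printf "%d\n", 10 * 1024^3 / 1460 }'
```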

SSH version 2 introduced Message Authentication Codes (MACs) into the protocol. These are negotiated between peers, but I expect the weakest allowed would be HMAC-MD5, which provides a 128-bit cryptographic hash of the data. Cryptographic hashes are far more robust than the cyclic redundancy checks that were common for detecting transmission errors two decades ago, and 128 bits is a significant expansion in checksum size. We might not trust MD5 to resist dedicated attackers these days, but it should be more than sufficient for catching accidental errors in all but the most incredible circumstances.

I would look elsewhere for your corruption -- first and foremost, the destination drives where you stored your data.
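
Echoing the checksum advice in the comments, here is one way to pin down exactly which copies differ (a sketch: the folder names and `hpcc` host come from the question, and the `/tmp/sums` manifest path is an assumption):

```shell
# On the source server: record checksums by basename, so the manifest
# can be verified from /home on the cluster.
(cd /folder1 && md5sum *.fastq.gz)  > /tmp/sums
(cd /folder2 && md5sum *.fastq.gz) >> /tmp/sums
(cd /folder3 && md5sum *.fastq.gz) >> /tmp/sums
scp /tmp/sums xxx@hpcc:/tmp/sums

# On hpcc: re-hash the copies and list any mismatches.
#   cd /home && md5sum -c /tmp/sums

# gzip also stores a CRC-32 of the uncompressed data, so each archive
# can check itself:
#   gzip -t /home/*.fastq.gz && echo "all archives intact"
```

If `md5sum -c` reports failures only on the destination, the corruption happened during or after the copy; if `gzip -t` fails on the source too, the files were already bad.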

sarnold

I know this is an ancient question, but I don't think scp could be responsible either; my guess is a filename collision.

You stated that you had several scp copies running at the same time. The commands pasted above will copy the contents of /folder1, /folder2 and /folder3 into /home. If you had two files with the same filename, e.g.

/folder1/argle.fastq.gz
/folder1/bargle.fastq.gz    
/folder2/argle.fastq.gz

then you'll have a filename collision in /home. Since scp will happily overwrite files at the destination, and I don't think it locks files while it works, copying two different files with the same name to the same place could easily result in a corrupt file.
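
A quick way to check for this before copying (a sketch using the folder names from the question) is to compare basenames across the source directories:

```shell
# Print every filename that appears in more than one source folder;
# those are the files that would clobber each other in /home.
ls /folder1/*.fastq.gz /folder2/*.fastq.gz /folder3/*.fastq.gz \
  | xargs -n1 basename \
  | sort | uniq -d
```

Any name this prints (e.g. `argle.fastq.gz` in the example above) should be renamed or copied into separate destination directories.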

Robert Calhoun