108

I'm looking for the amount of storage in bytes (MB, GB, TB, etc.) required to store a single human genome. I read a few Wikipedia articles about DNA, chromosomes, base pairs and genes, and I have a rough guess, but before disclosing anything I'd like to see how others would approach this issue.

An alternative question would be how many atoms are there in human DNA, but that would be off topic for this site.

I understand that this will be an approximation, so I'm looking for the minimal value that would be able to store DNA of any human.

Elijah
Milan Babuškov
  • As for the number of atoms, this depends on the composition. A and T are smaller molecules than G and C. The structure of the molecule is the beef, though, not its atomic composition, so this isn't really a very useful calculation. (For what it's worth, e.g. the A molecule aka [deoxyadenosine](https://en.wikipedia.org/wiki/Deoxyadenosine) is C10H13N5O3 so 31 atoms.) – tripleee Aug 30 '15 at 09:28
  • See also https://www.biostars.org/p/5514/ – Ondra Žižka Dec 02 '15 at 01:59
  • Except for those by users slayton, Paul Armstrong and rauschen, all other answers given are fundamentally wrong or far from complete. The answers either fail to mention compression methods or explain them poorly. See my answer to clarify the 4 times downsizing of the genome as seen in many answers. – ZF007 Mar 01 '18 at 10:43
  • I'm voting to close this question as off-topic because it is off-topic here, should be on https://bioinformatics.stackexchange.com/ – Chris_Rands Nov 27 '19 at 08:53
  • 938 Megabytes compressed. Here is a [link to a repository](http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/) containing it in a file called: hg38.chromFa.tar.gz – SurpriseDog May 27 '21 at 03:23
  • 1
    Vote to reopen because this is definitely not opinion based – Jonathan Sep 23 '21 at 19:54

11 Answers

86

If you trust such things, here is what Wikipedia claims (from http://en.wikipedia.org/wiki/Human_genome#Information_content):

The 2.9 billion base pairs of the haploid human genome correspond to a maximum of about 725 megabytes of data, since every base pair can be coded by 2 bits. Since individual genomes vary by less than 1% from each other, they can be losslessly compressed to roughly 4 megabytes.
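The arithmetic behind the quoted figure can be checked with a minimal sketch. The function name `raw_size_bytes` is made up for illustration, the 2.9 billion figure is taken from the quote, and a megabyte is assumed to be 10^6 bytes.

```python
def raw_size_bytes(base_pairs: int, bits_per_base: int = 2) -> float:
    """Return the raw storage size in bytes at a fixed number of bits per base."""
    return base_pairs * bits_per_base / 8


haploid_bp = 2_900_000_000  # figure from the Wikipedia quote
size_mb = raw_size_bytes(haploid_bp) / 1_000_000  # decimal megabytes (assumption)
print(f"{size_mb:.0f} MB")  # 725 MB
```

Passing `bits_per_base=8` gives the size of the same sequence stored as one plain-text character per base.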

Oliver Charlesworth
  • 13
    Just to add some biological commentary, "haploid" here means only one copy of each chromosome. The human reference assembly is haploid (and a mosaic of multiple people). An actual individual genome will be diploid (2 copies of each chromosome, except X and Y) but again only variant between the two copies at a small subset of sites. – Alex Stoddard Jan 23 '12 at 19:58
  • 19
    Thought about it for a day, and realized this: If you stored some base case human DNA, any subsequent human's DNA would only need to be stored as the diff between it and the base case. For same sex examples DNA is 99.9% the same. And across sexes it's like 98.5%. – Costa Michailidis May 22 '15 at 15:14
  • 5
    Also worth remembering that not all information is encoded within DNA base pairs; there is also [epigenetic](https://en.wikipedia.org/wiki/Epigenetics) information. – Annarfych Jun 19 '17 at 06:45
  • 2
    this makes sense. base pairs are basically 4-nary. a 4-nary number is 2 bits, so double the size. so that's 5.8 gigabits or 5.8/8 gigabytes which is 0.725 GB or 725 MB. the 'compression' is only possible because you can store a diff against the mapped genome instead of storing your entire genome. – Dave Cousineau Oct 02 '17 at 04:49
  • 1
    @Annarfych This is extremely misleading since epigenetic information is, by definition, not inheritable (despite misguided claims to the contrary in the popular press). *Heritable* information is stored in the DNA only. – Konrad Rudolph Jul 02 '19 at 09:53
  • @KonradRudolph, That is incorrect. Epigenetic information is, by definition, inheritable. – cowlinator Jun 18 '20 at 20:10
  • 1
    @cowlinator Categorically no. And just to cut this short, I did research in epigenetics for my PhD and postdoc. For once I actually know what I’m talking about. – Konrad Rudolph Jun 18 '20 at 20:22
  • @KonradRudolph, so, epigenetics is not "the study of heritable phenotype changes that do not involve alterations in the DNA sequence"? What is epigenetics? (I ask because this is the definition used by wikipedia and merriam-webster) – cowlinator Jun 18 '20 at 21:14
  • 3
    @cowlinator These definitions are … bad. “Heritable” in this case means “heritable” *between dividing mother and daughter cells*, not heritable between multi-cellular organisms and their offspring (that would be *transgenerational* epigenetic inheritance, which exists but is incredibly rare, and most claimed cases of it are based on bad science and are generally not accepted by experts). But the person who wrote that sentence is probably not entirely clear on what they mean, because there’s no excuse for the sentence’s bad phrasing. Check out the “talk” page of the Wikipedia article. – Konrad Rudolph Jun 18 '20 at 21:59
  • Does this also include non-coding dna? – James O'Brien May 25 '22 at 22:37
29

You do not store all the DNA in one stream; most of the time it is stored per chromosome.

A large chromosome takes about 300 MB and a small one about 50 MB.


I think the first reason it is not saved at 2 bits per base pair is that this would make the data harder to work with. Most people would not know how to convert it. And even if a conversion program were provided, a lot of people in large companies or research institutes are not allowed to install programs, need to ask first, or do not know how...

1 GB of storage costs next to nothing, and even a 3 GB download takes only 4 minutes at 100 Mbit/s; most companies have faster connections.

Another point is that the data isn't as simple as you are told.

E.g. the sequencing method invented by Craig Venter was a great breakthrough but has its downsides. It could not resolve long runs of the same base, so it is not always 100% clear whether there are 8 A's or 9 A's. These are things you have to take care of later on...

Another example is DNA methylation, because you can't store this information in a 2-bit representation.

VLAZ
rauschen
  • 2
    +1 from me. However, I have no clue what does "large" or "small" chromosome mean? – Milan Babuškov Jan 23 '12 at 09:54
  • 2
    These numbers don't tally with what Wikipedia says (see the table at http://en.wikipedia.org/wiki/Human_genome#Information_content); I'm not saying you're wrong, but can you explain the discrepancy? – Oliver Charlesworth Jan 23 '12 at 11:25
  • It looks like he is quoting Mbp (millions of base pairs, each base pair being a single position in the genome) rather than MB, which can assume a 2-bit encoding of each position – Alex Stoddard Jan 23 '12 at 20:02
  • 1
    Some of a genome's DNA methylation changes over the lifetime of the organism. Including DNA methylation data for a human genome would be more like a detailed snapshot of a person at a particular moment, rather than a generic description of the individual. Although, the OP didn't specify which they wanted. – cowlinator Jun 18 '20 at 19:53
  • 1
    Why would you store the whole thing for every individual? 99% of DNA is the same between humans so you would only have to store each person's deviations from the average. – SurpriseDog May 27 '21 at 03:17
15

Basically, each base pair takes 2 bits (you can use 00, 01, 10, 11 for T, G, C, and A). Since there are about 2.9 billion base pairs in the human genome, (2 * 2.9 billion) bits = 725 million bytes, which is ~691 megabytes when a megabyte is counted as 2^20 bytes.
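The 691 figure and the 725 figure seen elsewhere describe the same number of bytes in different units. A small sketch, assuming the 2.9 billion base-pair count used above (the variable names are illustrative):

```python
bits = 2 * 2_900_000_000         # 2 bits per base pair
size_bytes = bits // 8           # 725,000,000 bytes

decimal_mb = size_bytes / 10**6  # metric megabytes (10^6 bytes)
binary_mib = size_bytes / 2**20  # binary megabytes, i.e. mebibytes (2^20 bytes)

print(f"{decimal_mb:.0f} MB, {binary_mib:.0f} MiB")  # 725 MB, 691 MiB
```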

I'm no expert, however, the Human Genome page on Wikipedia states the following:

Raw MB:

  • Male (XY): 770MB
  • Female (XX): 756MB

I'm not sure where their variance comes from, but I'm sure you can figure it out.

Paul Armstrong
  • 6
    Realistically, more than 2 bits are required, as there are other bases stored in sequence information (`N`, for example, where data is not mappable and therefore unknown). The IUPAC nucleotide codes include more than the standard four, and this can increase storage overhead. http://www.ebi.ac.uk/2can/tutorials/aa.html – Alex Reynolds Jan 30 '12 at 08:37
  • @AlexReynolds broken link :/ – o0'. May 01 '15 at 13:13
  • 3
    @AlexReynolds @o0' http://www.bioinformatics.org/sms2/iupac.html is a better link for those IUPAC codes. AIUI, a particular genome "scan" needs more than 2 bits due to imprecision, thus `R` for either A or G, `N` for any base, `.` for a gap, etc. If we could read a genome perfectly, it would be just 2 bits per base. – skierpage Jan 12 '17 at 04:37
  • 1
    The X chromosome is single for females. Males have the extra Y chrom. to be coded, which as we all know is distinct from the X chrom. – ZF007 Mar 01 '18 at 10:06
  • It also depends on how you define [Megabyte](https://en.wikipedia.org/wiki/Megabyte): binary 2^20 or metric 10^6 bytes. You use binary, so your number is lower. – il--ya Jul 06 '18 at 19:01
  • @ZF007 human females have TWO X chromosomes. Males have one X and one Y. – xbello Aug 24 '19 at 14:08
  • @xbello .. do you assume more than 100.000 bp difference between both X's to be relevant to include twice the information or can we assume its <1.000 bp? And thus neglect-able in the discussion?! – ZF007 Aug 24 '19 at 14:49
  • X chromosome carries about 5-6 million variants, so it's safe to say that a difference of 100K between them is probable. If you only want to store a haploid human genome, then yes, you can discard a whole X chromosome along with half the autosomes. But in the real world we have to store each variant AND its zygosity, to have a truly "lossless" storage. – xbello Aug 24 '19 at 16:48
  • ... in such a case we have to compensate for ... 44 more chromosomes. I reckon that it would become a different question because then you need to know if you need to keep in mind real substitutions (only protein level) or/and also RNA-fold level, etc.? So ... if you want to dig into that post a new question and toss a "@". – ZF007 Aug 25 '19 at 06:22
  • Since male DNA is shorter than female’s due to the fact that Y chromosome contains less genes than X (female: XX; male: XY), your calculation of Mb for male and female DNA seems to be swapped. – Alex Dec 04 '21 at 09:48
11

Yes, the minimum storage space needed for whole human DNA is about 770 MB.

However, the 2-bit representation is impractical. It is hard to search through or do computations on it. Therefore, some mathematicians designed more effective ways to store those sequences of bases and use them in search and comparison algorithms. One such example is GARLI.

This application runs on my PC right now, and I have the human genome stored in 1563 MB.

Sagar Patil
6

The human genome contains over 3 billion base pairs. So if you represented each base pair as two bits then it would take over 6.15 × 10⁹ bits or approximately 770 MB.

Tikolu
slayton
  • 1
    bits ~= bytes. 2.9 billion bits is around 350 MB – SDGuero Apr 22 '14 at 23:01
  • 7
    @SDGuero, base-pairs are base 4 not base 2, so you need at least 2 bits to represent a base pair. – slayton Apr 24 '14 at 13:41
  • BS on the bit lingo... each nucleotide base is 1 character and thus 1 byte, regardless of character conversion table (AscII, UTF-8, etc) used; not including 2byte Asian coding. – ZF007 Mar 01 '18 at 10:10
  • 5
    @zf007 Base pairs are represented by the TOKENS of a, c, g and t. A token is not the same as a character. There is no reason a can't be encoded as 00, c as 01, g as 10 and t as 11 – MatBailie Dec 18 '19 at 02:01
  • @MatBailie.. Please elaborate and include the point you want to make in your comment because for now it missing. Have you read my answer that addresses the coding style ('A' as 1 byte or 'ATCG' or any other quadruplet represented in 1 byte) and the requirement of having DNA string humanly readable in fasta files? – ZF007 Dec 18 '19 at 10:11
  • 5
    There's the discrepancy ; you're asserting the need for a human readable file, which is not in the original post. – MatBailie Dec 19 '19 at 11:46
4

I just did it too. The raw sequence is ~700 MB. Using a fixed reference sequence (or a fixed sequence-storage algorithm) and the fact that the changes are about 1%, I calculated ~120 MB with a per-chromosome, sequence-offset, state-delta storage. That's it for the storage.
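One way to read the offset/delta idea is the following toy sketch. It is not the author's actual scheme: the function names and the sample strings are invented for illustration, and it assumes the individual differs from the reference only by substitutions (no insertions or deletions).

```python
def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """Record (offset, base) for every position where sample differs from reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles substitutions, not indels")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_diff(reference: str, delta: list[tuple[int, str]]) -> str:
    """Rebuild the sample from the reference plus the stored differences."""
    bases = list(reference)
    for offset, base in delta:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGT"  # toy stand-in for one reference chromosome
sample = "ACGTACCTACGA"     # toy individual with two substitutions

delta = diff_against_reference(reference, sample)
print(delta)  # [(6, 'C'), (11, 'A')]
assert apply_diff(reference, delta) == sample
```

Only `delta` needs to be stored per individual, as long as everyone shares the same reference.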

betheguest
3

There are 4 nucleotide bases that make up our DNA: A, C, G and T. Each base in the DNA therefore takes up 2 bits. There are around 2.9 billion bases, so that's around 700 megabytes. The weird thing is that this would fill a normal data CD! Coincidence?!?

1

Most answers, except those by users slayton, rauschen and Paul Armstrong, are dead wrong if it's about pure one-to-one storage without compression techniques.

The human genome with 3Gb of nucleotides correspond with 3Gb of bytes and not ~750MB. The constructed "haploid" genome according to NCBI is currently 3436687kb or 3.436687 Gb in size. Check here for yourself.

Haploid = single copy of each chromosome. Diploid = two copies of the haploid set. Humans have 22 unique autosomes x 2 = 44. In males the 23rd pair is X, Y, which makes 46 in total. In females the 23rd pair is X, X, which also makes 46 in total.

For males it would be 23 + 1 chromosomes in data storage on a HDD and for females 23 chromosomes, explaining the small differences mentioned now and then in answers. The X chrom. from males is equal to the X chrom. from females.

Thus loading the genome (23 + 1) into memory is done in parts via BLAST using databases constructed from FASTA files. Zipped or not, nucleotide sequences are hard to compress. Back in the early days one of the tricks used was to replace tandem repeats with a shorter coding (e.g. GACGACGAC becomes "3GAC"; 9 bytes to 4 bytes). The reason was to save hard-drive space (the era of 500 MB-2 GB HDD platters with 7,200 rpm and SCSI connectors). For sequence searching this was also done with the query.
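The tandem-repeat trick can be sketched roughly as follows. This is an illustrative reconstruction, not a historical tool: the function name is hypothetical and the repeat unit length is assumed to be fixed and known in advance.

```python
def compress_tandem_repeats(seq: str, unit_len: int = 3) -> str:
    """Collapse back-to-back repeats of a fixed-length unit into '<count><unit>'."""
    out = []
    i = 0
    while i < len(seq):
        unit = seq[i:i + unit_len]
        count = 1
        # Count how many times the unit repeats directly after itself.
        while (len(unit) == unit_len
               and seq[i + count * unit_len:i + (count + 1) * unit_len] == unit):
            count += 1
        if count > 1:
            out.append(f"{count}{unit}")
            i += count * unit_len
        else:
            out.append(seq[i])  # no repeat here, copy one base and move on
            i += 1
    return "".join(out)


print(compress_tandem_repeats("GACGACGAC"))  # 3GAC
```

A decoder would need to know the same `unit_len` to expand each count back into its repeated unit.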

If "coded nucleotide" storage would be 2-bit per letter then you get for a byte:

A = 00
C = 01
G = 10
T = 11

Only this way do you fully profit from positions 1 to 8 of a byte. For example the combination 00.01.10.11 (as byte 00011011) would then correspond to "ACGT" (and show in a text file as an unrecognizable character). This alone is responsible for the four-fold reduction in file size that we see in other answers. Thus 3.4 Gb will be downsized to 0.85917175 Gb... ~860 MB, including a then-required conversion program (23 kB-4 MB).
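A minimal sketch of this packing, using the A/C/G/T table above. The `pack` and `unpack` helpers are hypothetical names, and the sketch assumes the sequence contains only the four plain bases (no `N` or other IUPAC codes).

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


packed = pack("ACGT")
print(f"{packed[0]:08b}")  # 00011011
assert unpack(packed, 4) == "ACGT"
```

The original length has to be stored alongside the bytes, because a final byte that is only partly filled is padded with zero bits that would otherwise decode as extra `A`s.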

But... in biology you want to be able to read something, so gzip compression is more than enough. Unzipped, you can still read it. If this byte packing were used, the data would become harder to read. That's why FASTA files are plain-text files in reality.

ZF007
  • 2
    You can as well store it as a picture or audio recording, or even video - and it will take terabytes to store. But this is not *required* and *minimal*, as it was asked. – il--ya Jul 06 '18 at 18:51
  • 1
    @il--ya... I'm missing the point you try to make... (I guess you like moving around 250km of TDK tape.. weighing 600kg and takes three hours to rewind)? – ZF007 Jul 09 '18 at 14:11
  • 3
    The point is, that 1 out of 4 base pairs are coded with 2 bits of information. This is how much data is *required* to code it - you cannot code with less. But you may choose to code it in a different way: you may use a whole byte, or draw a picture which takes few kB, or make an audio recording. All this would still allow to store required information, but that would not be *required* or *minimal* coding. You arbitrarily imposed readability criteria (using standard text editor), which is not what was asked in original question. – il--ya Jul 11 '18 at 11:09
  • 1
    That is unfortunately not how it works in biology. The method of communication between scientists is either verbally, paper or textfile-formats that can easily be read from a screen. In the case you have one base-pairs, filling a byte with zeros or ones will suffice. However, there are 4 bases (2 pairs). In a byte you have 4 positions for a basepair and 4 positions that indicate the type of basepair. Data-compression works but humans need readability. A single pixel in RGB code (3 values and an intensity value) uses 32byte. Mere 8 bits for a letter. Thus no point to make it a Mona Lisa, right? – ZF007 Jul 19 '18 at 06:40
  • 11
    ZF007, you missed my point about minimality. The question was: "How much memory would be *required* to store human DNA?" with further detail "...I'm looking for *minimal* value that would be able to store DNA of any human." You are [trying to answer a different question](https://en.wikipedia.org/wiki/Attribute_substitution), namely "How much memory would it take to store human DNA *in a readable form used by biologists to communicate genome data*?" if you compress the readable text data with good compression algorithm, that will bring its size well below 2 bits per basepair. – il--ya Jul 20 '18 at 10:12
  • 1
    as stated by OP ***I'd like to see how others would approach this issue***. Addressing this issue means keeping the information humanly readable without requiring fancy tools to be installed, which is a general NO-GO in Life Sciences. **Minimalism** as you state it il--ya is basically an encode/encrypt/compression operation on the binary code string, which thus becomes unsearchable. Also, from a long-term perspective, continuous compression/decompression actions on a more or less inaccessible dataset slow down a genome/chromosome search tremendously, which costs more money than an extra SSD/HDD or 2 GB of RAM. – ZF007 Jul 02 '19 at 06:26
  • Good answer. That link isn't working at the moment however by the way. – sdanse Mar 10 '20 at 13:31
  • @sdanse ... seems an expired link indeed. It can be found here with some effort if you download the fasta yourselves: https://www.ncbi.nlm.nih.gov/grc/human/data – ZF007 Mar 11 '20 at 00:16
  • @sdanse ... here it is: https://ftp.ncbi.nih.gov/genomes/Homo_sapiens/ARCHIVE/BUILD.37.3/ – ZF007 Mar 11 '20 at 00:35
0

All answers are leaving off the fact that nuDNA is not the only DNA that defines a human genome. mtDNA is also inherited and it contributes an additional 16,500 base pairs to a human genome, bringing it more in line with the Wikipedia guess of 770MB for males, and 756MB for females.

This does not mean that a human genome can easily be stored on a 4 GB USB stick. Bits do not represent information by themselves; it is the combination of bits that represents information. So in the case of nuDNA and mtDNA, the bits are encoded (not to be confused with compressed) to represent proteins and enzymes that in themselves would require many MBs of raw data to represent, especially in terms of functionality.

Food for thought: 80% of the human genome is called "non-coding" DNA, so did you actually really believe that the entire human body and brain can be represented in a mere 151 to 154MBs of raw data?

ar18
-3

There are only 2 types of base pairs: cytosine can only bind to guanine, and adenine can only bind to thymine. So each base pair can be considered a single bit. This means that an entire strand of human DNA, ~3 billion "bits", would be right around ~350 megabytes.

  • 4
    You have 2 types of pairs, and they can be in two directions - so you need two bits for each pair. This is why most posts above write ~700MB, and not 350MB. – Trondster Oct 23 '17 at 07:52
-3

One base -- T, C, A, G (in the base-4 number system: 0, 1, 2, 3) -- is encoded as two bits (not one), so one base pair is encoded by four bits.

  • 2
    Except that bases in a pair complement each other, so don't add any information. So both base and base pair can be encoded with two bits. – il--ya Jul 06 '18 at 18:43
  • If you have an "A" what do you complement it with? "AC" "AG" "AT" are all valid. Likewise, if you have "T" the "TG" "TC" "TA" are valid , So what do you do? – Roger Johansson Nov 01 '18 at 12:40
  • 1
    @RogerJohansson No, only the “AT” base pair is valid in DNA. Likewise for “TA”, “CG” and “GC”. No other base pair combination exists. – Konrad Rudolph Feb 18 '19 at 09:47
  • @KonradRudolph there are at least nine purines (https://en.wikipedia.org/wiki/Purine). All of them can be used to substitute A or G. This would make the solution to OP's question more complex. I agree to keep it simple and stick to A, G, T and C. – ZF007 Jul 02 '19 at 06:49
  • 1
    @ZF007 They exist but they do not occur stably in human genomes and are therefore not relevant for genome storage. Their biological relevance is important only in the context of mutations (and there only transiently) and RNA modifications. In particular (in the context of this answer), genomic data isn’t stored as “base pairs”, it’s stored as a sequence of single bases, and each position can be encoded in two bits. This isn’t theoretical, this is how it’s *actually* done (except that, for most applications, genetic data is stored in (gzipped) ASCII, not bit-compressed). – Konrad Rudolph Jul 02 '19 at 09:49