DNA: The Future of Data Storage
1,000,000,000,000,000,000,000 bytes, or one zettabyte (ZB). With 21 zeros, it's safe to say that's a big number. Now multiply that number by 175.
Based on research from the International Data Corporation (IDC), the result, 175 ZB, is the expected size of the world's data by 2025. According to David Reinsel, senior vice president at IDC, "If one were able to store 175 ZB onto Blu-ray discs, then you'd have a stack of discs that can get you to the moon 23 times."
How can we save such an astonishing amount of data? While hard drive manufacturers are racing to produce more devices with bigger capacity, scientists are looking in a different direction – storing information in one of the most ancient media available: DNA.
What is DNA data storage?
Simply put, DNA data storage is a technology that allows us to store digital information within DNA sequences.
Computers record information in binary, as 0s and 1s. DNA, by comparison, uses four bases: adenine (A), thymine (T), cytosine (C), and guanine (G). By "translating" (transcoding) the bits 0 and 1 into sequences of A, T, C, and G, and "writing" (synthesizing) those bases into DNA molecules, digital information can be stored in DNA and preserved.
Reading the data is relatively simple; one just conducts the procedure in reverse. In other words, sequence the DNA, then translate the bases back into bits.
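As a minimal illustration of the transcoding step, the Python sketch below maps every two bits to one base and reverses the mapping to read the data back. The mapping table (00 to A, 01 to C, 10 to G, 11 to T) and the function names are assumptions chosen for illustration only. Practical codecs are more involved: they typically add error correction and avoid sequences that are hard to synthesize or sequence.

```python
# Hypothetical 2-bit-per-base mapping, chosen only for illustration.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Transcode bytes into a string of A/C/G/T, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(sequence: str) -> bytes:
    """Reverse the transcoding: bases back to bits, bits back to bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"DNA"
strand = encode(message)   # the sequence that would be synthesized
print(strand)
print(decode(strand))      # what sequencing plus decoding would recover
```

In this toy scheme each byte becomes exactly four bases, so a 1 MB file would need about four million bases before any redundancy is added.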
Why use DNA data storage?
Digital information storage faces three obstacles: the capacity, size, and lifespan of storage hardware.
In 2018, more than 2.5 quintillion bytes of data were created every day. At the beginning of 2020, the number of bytes in the digital universe was 40 times greater than the number of stars in the observable universe.
There is just not enough digital capacity to store this data in a way that makes it easy to retrieve. Then there is the problem of the physical storage space required for the hardware itself, and that's before hardware lifespans are considered.
Most hard drives manufactured today won't last more than five years. Flash drives perform slightly better, lasting closer to 10 before their ability to function reliably begins to slip. That means none of these devices will last long enough for you to pass your kids' childhood pictures to your grandkids, let alone to your great-grandkids.
This is where DNA data storage offers unparalleled advantages. According to an estimate by scientists from Harvard University and Johns Hopkins University published in Science, 1 gram of DNA can store as much as 455 exabytes (EB) of data (an exabyte is 1,000 times smaller than a zettabyte, just three zeros shorter).
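To put that density next to the 175 ZB projection quoted earlier, here is a back-of-the-envelope calculation in Python. It simply divides one figure by the other. The variable names are illustrative, and the result ignores real-world overhead such as redundancy, addressing, and error correction, so treat it as a rough lower bound rather than an engineering estimate.

```python
ZETTABYTE = 10**21
EXABYTE = 10**18

world_data_bytes = 175 * ZETTABYTE        # IDC projection for 2025
density_bytes_per_gram = 455 * EXABYTE    # density estimate cited above

# Mass of DNA needed if every gram held the full estimated density.
grams_needed = world_data_bytes / density_bytes_per_gram
print(f"{grams_needed:.1f} g of DNA")
```

Under these assumptions, the world's projected data would fit in well under half a kilogram of DNA.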
DNA is also ubiquitous. As the genetic information carrier found in the cell nucleus of most organisms, it offers abundant options for information storage.
DNA also has longevity. It has a half-life of ~521 years, and it is theorized that the information encoded on it could still be recovered 1.5 million years from now. DNA data storage also requires little maintenance and raises no format compatibility issues when the data is extracted.
What has been achieved so far?
Storing data in DNA has been an area of study since the end of the 20th century. Researchers first successfully stored words in DNA in 1999, an achievement that was eclipsed in 2018, when 200 MB was successfully stored.
In 2017, a bit-to-base transcoding method called "DNA Fountain", reported in Science, achieved "the theoretical maximum for information stored per nucleotide". Researchers from Columbia University stored a total of 2.14 × 10^6 bytes of data (a full computer operating system, a movie, and other files) in short DNA molecules, and retrieved it by DNA sequencing. The lead author, Yaniv Erlich, said this technology essentially encodes files in DNA as very simple Sudoku puzzles.
Today, researchers from BGI-Research have further improved the robustness of the transcoding process by creating a new codec method, the "Yin–Yang" codec system.
Inspired by the traditional Chinese concept, the "Yin–Yang" codec uses two rules to encode two binary bits into one nucleotide, generating DNA sequences from a wide variety of data types. This codec has the advantage of being highly compatible with DNA synthesis and sequencing processes.
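The idea of two rules jointly selecting one nucleotide can be sketched in Python as follows. This is a simplified illustration, not the published codec: the rule tables, the assignment of the names "Yang" and "Yin" to each rule, and the virtual starting base "A" are all assumptions made for this example. One rule uses a bit from the first binary segment to narrow the choice to two candidate bases; the other uses a bit from the second segment, together with the previous base, to pick one of the two.

```python
# Hypothetical rule tables for illustration only.
# "Yang" rule (assumed): each base carries one bit of the first segment.
YANG = {"A": "0", "T": "0", "C": "1", "G": "1"}

# "Yin" rule (assumed): each base carries one bit of the second segment,
# and the mapping depends on the previous base. Within each Yang group
# ({A, T} and {C, G}) the two bases always get different Yin bits, so
# exactly one base satisfies both rules.
YIN = {
    "A": {"A": "0", "T": "1", "C": "0", "G": "1"},
    "T": {"A": "1", "T": "0", "C": "1", "G": "0"},
    "C": {"A": "0", "T": "1", "C": "1", "G": "0"},
    "G": {"A": "1", "T": "0", "C": "0", "G": "1"},
}


def encode(first: str, second: str, start: str = "A") -> str:
    """Merge two equal-length bit strings into one DNA sequence."""
    if len(first) != len(second):
        raise ValueError("both bit segments must have the same length")
    prev, bases = start, []
    for a, b in zip(first, second):
        base = next(n for n in "ATCG" if YANG[n] == a and YIN[prev][n] == b)
        bases.append(base)
        prev = base
    return "".join(bases)


def decode(sequence: str, start: str = "A") -> tuple[str, str]:
    """Recover both bit strings from a DNA sequence."""
    prev, first, second = start, [], []
    for base in sequence:
        first.append(YANG[base])
        second.append(YIN[prev][base])
        prev = base
    return "".join(first), "".join(second)


strand = encode("0110", "1011")
print(strand)
print(decode(strand))
```

Because each base carries one bit from each segment, the sequence is as long as a single segment, which is how two bits end up in one nucleotide.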
Sequencing results show that even after the DNA molecules encoding the stored data are diluted more than 10,000-fold, the average data recovery rate reaches 99.9%, indicating the robustness and reliability of this system.
This codec has also demonstrated that a physical density of 432.2 EB per gram of DNA, close to the maximum, can be achieved by encoding data into yeast cells.
In 2019, TIME magazine awarded the "Best Invention" award to the world's first DNA-based platform for massive digital data storage and computation. This gives us one more reason to believe that DNA, thanks to its unique natural advantages, has great potential in terms of information density, replication and maintenance costs, and service lifespan, and might become a key alternative for future data storage.
A new era of data storage is dawning.