GenData 2020: Data-Centric Genomic Computing

The prevention, treatment, management, and cure of diseases all depend on a fundamental understanding of their causes, processes, and impacts. Fast DNA sequencing technology is emerging as the main driver of innovation for the next decade: high-throughput devices will soon enable reading the whole genome much faster, at higher resolution, and at lower cost, thereby providing the data to answer fundamental biological questions and paving the way for personalized genetic medicine. While genetic sequencing is “mature” (future advances will concern the number and length of sequences produced per unit of time, or the precision of nucleotide identification), a quantum leap is now needed in building the computing infrastructure at the receiving end of DNA sequencing machines. In particular, current genomic data management is struggling with the “initial” problem of storing the data that biologists rapidly produce in their laboratories. A powerful data infrastructure is required to go beyond pure storage and to enable viewing, querying, analyzing, mining, and searching over a collection of genetic data available worldwide. The vision of the GenData 2020 project is that it is now possible to build the abstractions, models, and protocols for supporting a network of genomic data, made available by genome servers located in the major biology laboratories of the world. The huge amount of data and the diversity of platforms and formats pose a major data management challenge: how to model and store genetic data so as to foster their accessibility.

In this project we envision the organization of genetic data through a GenData 2020 data model, expressing the various features that are embedded in the produced biomolecular data, in the correlated phenotypic data that can be extracted from clinical databases, or in the information inferred by applying data analysis methods to them. A specific genomic data fragment, described through its physical format and identified by a URI, corresponds to a given individual (not necessarily human), also identified by a URI. The model will include the post-alignment data formats accepted by current genome browsers; on top of these, we define logical views that will allow powerful data reductions, related to the specific biomolecular experiment, to genomic traits, and to the specific pathology or research protocol that motivated the experiment. The data model will enable the dynamic addition of new data types and their relationships, thereby accommodating the tumultuous evolution of DNA-related research; the genetic data themselves, however, being based on next-generation sequencing, are stable and not going to change. Therefore, we believe that the time is ripe for a major, data-driven paradigm shift in genomic computing, with the objective of linking and organizing genomic data scattered all over the world. This paradigm shift has the potential to give medical science a productivity boost similar to the impact that Web search had in the last decade.
To complete the vision, the data model must be supported by methods for querying, searching, and analyzing genomic data. Genomic data must be related to ontological knowledge capable of explaining biological and clinical phenomena according to accumulated scientific knowledge; they must be integrated with clinical records, extracted from many heterogeneous data sources, which express the phenotype of individuals and are essential for clinical findings; and they must be stored in a way that can be trusted and preserved, tracking their provenance and protecting individual privacy and security.
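
As an illustration only, the core of the data model described above (fragments identified by a URI, each tied to an individual's URI and a physical format, with logical views reducing the data by experiment or pathology) could be sketched as follows. This is a minimal sketch under stated assumptions, not the project's actual model: the class names, field names, the `logical_view` function, the `example.org` URIs, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Individual:
    """An individual (not necessarily human), identified by a URI."""
    uri: str
    species: str


@dataclass(frozen=True)
class GenomicFragment:
    """A genomic data fragment, identified by a URI (hypothetical fields)."""
    uri: str
    physical_format: str          # assumed post-alignment format, e.g. "BAM" or "BED"
    individual_uri: str           # URI of the individual the fragment belongs to
    experiment: str               # biomolecular experiment that produced the data
    pathology: Optional[str] = None


def logical_view(fragments: List[GenomicFragment],
                 experiment: Optional[str] = None,
                 pathology: Optional[str] = None) -> List[GenomicFragment]:
    """Reduce a collection of fragments to those matching the given criteria."""
    return [
        f for f in fragments
        if (experiment is None or f.experiment == experiment)
        and (pathology is None or f.pathology == pathology)
    ]


# Hypothetical sample data; the URIs do not refer to real resources.
MOUSE = Individual(uri="http://example.org/individual/1", species="Mus musculus")
FRAGMENTS = [
    GenomicFragment("http://example.org/fragment/a", "BAM", MOUSE.uri,
                    experiment="ChIP-seq", pathology="leukemia"),
    GenomicFragment("http://example.org/fragment/b", "BED", MOUSE.uri,
                    experiment="RNA-seq", pathology="leukemia"),
    GenomicFragment("http://example.org/fragment/c", "BAM", MOUSE.uri,
                    experiment="ChIP-seq"),
]

chip_seq_leukemia = logical_view(FRAGMENTS, experiment="ChIP-seq",
                                 pathology="leukemia")
```

Keeping the link between fragment and individual as a plain URI, rather than an object reference, reflects the idea that fragments and individuals may live on different genome servers.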

In this setting, the GenData 2020 project brings together nine partners, including the strongest leaders of the Italian data management community, with high international standing and visibility; ten of the participating scientists have an H-index above 30. Although many have worked in the life sciences before, they are primarily known as leaders in their own foundational disciplinary research; they will merge their expertise, with the ambition of producing disruptive innovation at the receiving end of DNA sequencing machines, bringing new perspectives that come from their broader disciplinary experience. They will do so by working in close contact with biologists and clinicians, and by sharing their requirements, goals, methods, and passion. The project will produce foundational research and demonstrators through selected use cases; its success will be measured through publications in high-impact venues and through the standardization of models, languages, and protocols at Web standards bodies.