Tuesday, 27 June 2017

The data are out there! How Repositive can help you find and access human genomic data

The power of data sharing and how I met Repositive at ESHG 2017

I’m part of a small Italian research group working on the genetic basis of psychiatric diseases, and we have a couple of benchtop sequencers to support small NGS-based projects for our institution. However, since the introduction of genomic technologies, the real business in studying the genetic basis of diseases has become GWAS/WES of thousands of samples, an effort well beyond our resources.
I have a magnet above my desk quoting E. Rutherford, to remind me of the winning approach for us: “We haven’t got the money, so we’ve got to think!”. 
Indeed, we soon realized that we can’t afford discovery experiments competing with large consortia, but we may be able to think outside the box and find new ways to aggregate genomic data to investigate specific biological mechanisms.
Luckily for us, and for researchers in general, the genomics field is quite open to data sharing, and publication of raw data is often mandatory for papers involving NGS. As a result, a huge amount of data has accumulated over the years in repositories like SRA, ENA and GEO, and there is almost no platform, method or phenotype that doesn’t have some data ready for you to grab. 

I was quite excited to attend the ESHG 2017 conference in Copenhagen, looking for new smart ideas and possible collaborations to produce good science on a low budget.
During a coffee break, I was walking through the company stands, hoping that the other attendees had left some pastries. Having found one and refilled with sugar, I thought back to the day’s talks on complex diseases and multi-omics integration and started wondering whether we could apply these approaches to our study of the regulatory landscape in schizophrenia. I proposed the idea to my collaborators, wondering if we could put together a pilot study. We concluded that we would need some additional data and decided to explore the repositories once I got back home. So where to go to close the day? I looked at the program of evening corporate satellites and saw a meeting titled “Find the most suitable genomic data repository for your needs”.
This sounded like exactly what we needed: a way to search genomic information rapidly and effectively! So I decided to stop by their stand for a first look and a chat about their service. After a short conversation with their friendly staff and a brief overview of their web portal, I signed up for the T-shirt lottery and their evening meeting. From the beginning, Repositive came across as a friendly and open-minded group of people who believe in the power of data sharing and community effort to boost genomic science. 

The evening talk was a fun and informative overview by Repositive leader Manuel Corpas, who highlighted the main features of their search engine and the community-based approach they implemented to improve dataset information. Using some fun real-world examples, Corpas illustrated their different take on genomic data. With an incredible amount of data rapidly accumulating, they realized that the genomics field could take advantage of the indexing and searching approaches previously applied to the world wide web. Indeed, data are useless if you can’t find them, and effective search engines improve sharing and avoid wasting resources on duplicated efforts. Moreover, this fits perfectly in the genomics field, allowing small research groups (like ours) to develop their ideas by leveraging large community-based efforts, maximizing the scientific value of each dataset.
What’s Repositive?
So what’s Repositive? And what can it do for me? As Corpas summarized during his talk, Repositive wants to be the go-to place for genomic data. In my opinion, the Repositive web service is a bit of a Google search engine, a bit of a hotel-review site and a bit of a Wikipedia. The idea is to provide a centralized search engine where all genomic data repositories are indexed and can be queried with specific terms and filters. Each dataset comes with a description and attached metadata (platform, approach, sample source…). A second aspect is the Wikipedia-like side of Repositive: registered users can contribute to datasets by providing additional details and metadata. Datasets can also be shared and watched to keep track of updates and changes, and users can comment on datasets much as they would review hotels. Finally, users can register their own datasets or request new repositories to be added, as well. 
Since the amount of available genomic data is rapidly growing and datasets are spread across several repositories, the Repositive search engine has the potential to make data retrieval simple, quick and effective. The platform is still in its infancy, but it already includes data from 43 repositories and offers a user-friendly web portal. The available filters allow you to restrict searches by data accessibility, experimental approach and source repository. The community-based features have great potential to improve dataset descriptions, but their impact depends largely on how widely Repositive spreads across the scientific community. 
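To make the idea concrete, here is a minimal Python sketch of the kind of metadata filtering such a search engine performs. The dataset records and field names below are invented for illustration; they are not Repositive’s actual schema or API.

```python
# Hypothetical dataset records, mimicking the kind of metadata
# (approach, access level, source repository) a genomic search engine indexes.
datasets = [
    {"name": "SCZ_RNAseq_cases", "assay": "RNA-seq", "access": "open", "repository": "GEO"},
    {"name": "MDD_WGS_controls", "assay": "WGS", "access": "managed", "repository": "dbGaP"},
    {"name": "NA12878_WGS_run1", "assay": "WGS", "access": "open", "repository": "ENA"},
]

def search(records, assay=None, access=None):
    """Return records matching every filter that is set (None means no filter)."""
    hits = []
    for rec in records:
        if assay is not None and rec["assay"] != assay:
            continue
        if access is not None and rec["access"] != access:
            continue
        hits.append(rec)
    return hits

# Restrict the search to open-access whole-genome data.
open_wgs = search(datasets, assay="WGS", access="open")
print([rec["name"] for rec in open_wgs])  # only the open-access WGS record remains
```

The real value of a centralized engine, of course, is doing this across dozens of heterogeneous repositories at once, which is exactly where non-standardized metadata becomes the hard part.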

Open questions and possible improvements
To provide an effective data search engine, it is essential to extract the relevant information and make it searchable and filterable. To meet this challenge, they have to deal with non-standardized representations of metadata, and manual curation may be needed to actually provide all this information. This seems like an enormous amount of manual work to deal with… As Corpas underlined, part of Repositive’s success will depend on how many researchers decide to use and contribute to the platform. Broad support from the scientific community would indeed enrich the metadata and give Repositive the influence to push repositories toward standardized representations. 
At present, datasets are listed sample by sample rather than aggregated by project, and the available filters are still limited. Personally, I often want to search for an entire project, like RNA-seq in a specific disease with cases and controls, rather than for single-sample data. Moreover, filters on the number of samples and on the platform used (i.e. HiSeq, bead array, Ion…) are needed to better pinpoint useful data.
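The project-level view I’m wishing for is essentially a group-by over a project identifier. A minimal sketch of that aggregation, with invented record fields for illustration:

```python
from collections import defaultdict

# Hypothetical sample-level records, as a per-sample listing might return them.
samples = [
    {"sample": "S1", "project": "PRJ_MDD", "status": "case"},
    {"sample": "S2", "project": "PRJ_MDD", "status": "control"},
    {"sample": "S3", "project": "PRJ_SCZ", "status": "case"},
]

def by_project(records):
    """Aggregate sample records into one summary entry per project."""
    projects = defaultdict(lambda: {"n_samples": 0, "statuses": set()})
    for rec in records:
        entry = projects[rec["project"]]
        entry["n_samples"] += 1
        entry["statuses"].add(rec["status"])
    return dict(projects)

agg = by_project(samples)
# Keep only projects containing both cases and controls, i.e. case/control studies.
case_control = [p for p, e in agg.items() if {"case", "control"} <= e["statuses"]]
print(case_control)
```

With this kind of summary, a filter like “case/control RNA-seq projects with at least 100 samples” becomes a one-line query instead of a manual trawl through sample lists.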
They are also thinking about centralized management of data access requests and automation of data access, so that for open and some managed-access data Repositive will become a search, click, download platform. That would really make the difference! 

My personal experience with Repositive
Personally, I’m quite a fan of data and knowledge sharing, since it can really speed up research, avoiding redundant efforts and allowing quick and effective hypothesis testing.
Taking advantage of the increasing amount of accessible genomic data, we gradually shifted from production to analysis, trying to integrate our small-scale experiments with big data produced by others, to refine our hypotheses and investigate aspects that were not addressed by the original authors. In this process, we soon discovered that even if the data are out there, they are far from simple to catch! Indeed, they are spread across multiple repositories, several of which are not easily searchable. Moreover, much of the data is under restricted access that requires complicated administrative procedures.

As soon as I got back to the lab after ESHG17, I gave Repositive a try to find out whether it could help me retrieve useful data. My first experiment was an easy one: searching for additional genomic data on the NA12878 reference sample that I could use to develop variant-filtering algorithms. Using Repositive, I found about 40 WES and 1,600 WGS datasets, all open access and spanning different platforms. Good! A lot of material to work on. The second experiment was to find some WGS and transcriptome data for major depressive disorder. Here the task was a little more difficult, since I wanted to retrieve case/control studies and, as I mentioned above, the present Repositive interface lists samples one by one rather than aggregated by project. However, I discovered that the CONVERGE data are available for download (around 11k low-coverage WGS and variant calls), as well as some interesting gene expression data from the GEO repository. I also noted that data from the NIMH genetics repository are missing… I will request to have them added!
Looking around their blog, I also found suggestions and walk-throughs for data access in managed-access systems, which can help newcomers with their first application to dbGaP or similar platforms.
Based on my first quick experience, Repositive can be extremely useful for our analysis projects, making data retrieval easier. However, some improvements (project-based and platform-based filters, for example) are surely needed to allow effective retrieval of useful data.

Happy sharing to everybody!

Wednesday, 15 February 2017

AGBT 2017

Follow us as we fly to AGBT in Miami this week to see all the latest exciting news in the genomics world! Follow #agbt17 or our account @supergecko on Twitter for constant updates!

Sunday, 22 May 2016

ESHG 2016 is running

The annual meeting of the European Society of Human Genetics is currently running in Barcelona. Follow us and other researchers from all across Europe on Twitter (#ESHG2016) as we share amazing new science and technology.

Thursday, 21 April 2016

There are exomes in Brescia!

Finally, after about two years of sequencing, our small lab was able to publish a little contribution to the world of exome sequencing. In the paper we present a detailed evaluation of exome sequencing performance and technical optimizations using the Ion Proton platform with the Hi-Q chemistry.
No more than a drop in the ocean, but we are quite proud of it!

The article is published open access, so anyone interested can take a look!

Amplicon-based semiconductor sequencing of human exomes: performance evaluation and optimization strategies
E. Damiati, G. Borsani, E. Giacopuzzi
Human Genetics, May 2016, Volume 135, Issue 5, pp 499-511

Wednesday, 13 April 2016

Resilience project identifies the first 13 genetic heroes!

The Resilience Project has just published in Nature Biotechnology a new paper on the analysis of genomic data from more than 500k subjects, in search of the so-called "genetic heroes". The authors first aggregated genomic data from various sources, including the 23andMe genotyping database, the 1000G, ESP6500 and UK10K sequencing projects, the Swedish schizophrenia exomes, the CHOP sequencing program and others, to reach a total of 589,306 subjects with genomic data. They then applied strict filtering criteria to identify 13 healthy people bearing pathogenic mutations for severe Mendelian childhood diseases without showing any clinical symptoms.

By analyzing genomic data from these 13 "genetic heroes", the authors are now trying to identify protective variants and understand the molecular mechanisms that have rescued the pathogenic mutations, with the potential to provide useful insight into how to treat the corresponding diseases.

Thursday, 10 March 2016

Recent interesting facts in genomics!

Human genetic knockouts point to a resilient human genome

According to this paper published in Science, the human genome is more resilient than previously expected and can tolerate a certain number of disrupted genes without any observable phenotypic effect. The authors "sequenced the exomes of 3222 British Pakistani-heritage adults with high parental relatedness, discovering 1111 rare-variant homozygous genotypes with predicted loss of gene function (knockouts) in 781 genes. [...] Linking genetic data to lifelong health records, knockouts were not associated with clinical consultation or prescription rate.".
Interested? Read the full paper: "Health and population effects of rare gene knockouts in adult humans with related parents"

Genetic alterations in regulatory elements could predict personal health history

This paper published in PLoS Computational Biology analyzes the impact of personal genetic variants on conserved regulatory elements and how this information could be used to predict health-related traits. By analyzing transcription factor binding sites disrupted by an individual’s variants and then looking for their most significant congregation next to groups of functionally related genes, the authors found that the top enriched function is invariably reflective of medical histories. As the authors state, these "results suggest that erosion of gene regulation by mutation load significantly contributes to observed heritable phenotypes that manifest in the medical history". They also developed a computational test to interpret personal genomes based on their approach, which "promise[s] to shed new light on human disease penetrance, expressivity and the sensitivity with which we can detect them".
Interested? Read the full paper: "Erosion of Conserved Binding Sites in Personal Genomes Points to Medical Histories"

Don't forget about exonic splice-affecting mutations

In this interesting paper in PLoS Genetics, the authors evaluate the prevalence of splice-affecting exonic variants. These variants are often neglected in canonical pipelines searching for causative mutations, even though aberrant splicing can obviously have a major impact on gene function. Using MLH1 as a model gene, the authors found that the frequency of these mutations is higher than expected, suggesting that they deserve more attention in future analyses. Moreover, the paper also provides a comparative evaluation of different in silico prediction algorithms, assessing their performance in classifying splice-affecting variants.
Interested? Read the full paper: "Exonic Splicing Mutations Are More Prevalent than Currently Estimated and Can Be Predicted by Using In Silico Tools"

The health impact of your Neanderthal ancestry

Another interesting story published recently in Science pointed out the influence of Neanderthal ancestry on human health-related traits. The authors analyzed how alleles inherited from Neanderthals affect clinically relevant phenotypes in present-day European populations, finding associations with neurological, psychiatric, immunological and dermatological phenotypes. The results indicate that archaic admixture influences disease risk in modern humans, including the risk of depression, skin lesions resulting from sun exposure, hypercoagulation and tobacco use.
Interested? Read the full paper: "The phenotypic legacy of admixture between modern humans and Neandertals"

A map of the transcriptomic cellular landscape of the visual cortex by single-cell RNA-Seq

This study from Nature Neuroscience used single-cell RNA-Seq on more than 1,600 cells to construct a cellular taxonomy of the primary visual cortex in adult mice. The authors identified 49 transcriptomic cell types displaying specific and differential electrophysiological and axon-projection properties, confirming that single-cell transcriptomic signatures can be associated with specific cellular properties. These results open new perspectives on cell-level organization within brain tissue, first of all on the potential causal relationships between transcriptomic signatures and specific morphological, physiological and functional properties. Another interesting point, as noted by the authors, is to investigate whether "certain transcriptomic differences [are] representative of cell state or activity, rather than cell type".
Interested? Read the full paper: "Adult mouse cortical cell taxonomy revealed by single cell transcriptomics"

Wednesday, 17 February 2016

Asia starts its own population sequencing project

After the success of sequencing projects like 1000G and UK10K, and the recent start of the 100k Genomes Project in the UK and the Precision Medicine Initiative in the US, Asia too now enters the field of population-scale genomics!

Indeed, a new non-profit consortium, GenomeAsia 100k, has been announced with a plan to sequence 100k individuals from populations throughout South, North and East Asia, with the goal of creating phased reference genomes for all major Asian ethnic groups. The sequencing of 100,000 individuals will be combined with microbiome, clinical and phenotype information to allow deeper analysis of diseased and healthy individuals.

The group is led by Nanyang Technological University, with two private companies as partners for sequencing and analysis services: Macrogen and MedGenome.

Read the official news here or the report from GenomeWeb.

Friday, 29 January 2016

Systems biology provides a magic wand for cell reprogramming

This is not exactly genomics, but it is based on genomic data and is so fascinating that I had to report on it!

In a letter that appeared recently in Nature Genetics, Rackham et al. proposed Mogrify, a web-based tool that can predict the minimum set of transcription factors (TFs) needed to reprogram one specific cell type into another. The idea that somatic cells can be reprogrammed across different cell types has been around for a while, but a systematic assessment of the conditions needed for each conversion has never been carried out, mainly due to the time and effort required to test the various combinations of TFs experimentally.

The authors took advantage of the huge amount of data produced by the FANTOM project to compute the specific transcriptional landscape of each cell type, and then developed a method that predicts which TFs should be overexpressed to convert one cell type into another.

With Mogrify it could now be much easier to design reprogramming experiments, an approach with potentially great impact for regenerative medicine.

The idea behind this interesting tool is summarized in Figure 1 from the paper, shown below.

The Mogrify algorithm for predicting transcription factors for cell conversion.

Thursday, 28 January 2016

Illumina gets small with the new MiniSeq and Project Firefly

After shaking up the NGS market a couple of years ago with its large sequencing platforms, able to deliver thousands of human genomes per year, Illumina has now decided to further expand its offer with small solutions as well.

A few days ago at the JP Morgan conference, the company revealed a new benchtop sequencer aimed at small labs and clinical applications based on gene panels. The new machine, called MiniSeq, can produce up to 8 Gb per run with 25M reads, costs "only" ~$50k and promises a cost per run between $200 and $300.
With this move Illumina tries to challenge the Ion Torrent PGM and Thermo Fisher's strong position in the area of small, rapid and cheap sequencing. The new sequencer is based on the two-color SBS technology developed for the NextSeq and HiSeq X sequencers, which reduces machine cost and speeds up sequencing runs.

Here is how it compares to the existing Illumina benchtop sequencers:

In its attempt to fill every corner of the market, Illumina has also announced an even smaller sequencer, developed under the name "Project Firefly". The details released include a 1.2 Gb output, one-"colour" SBS and patterned flow cells, plus a digital-fluidics library-prep module. The machine alone might cost just $15,000, with a cost per run under $200. The new machine will be based on semiconductor technology, with a CMOS sensor collecting light data from multiple simultaneous reactions. Project Firefly will be developed through 2016, with delivery expected in 2017.

More details and interesting considerations are available on the Illumina website and on other omics blogs, like CoreGenomics and OmicsOmics.

Thursday, 1 October 2015

1000G and UK10K publish results of large scale human genome sequencing


Both 1000G and UK10K consortia have recently published the results of their analysis on the variability of human genomes, based on their large scale genomics projects.

In the 1000G papers that appeared in Nature, the consortium described SNV and structural-variant findings based on the phase 3 dataset. 
Citing the abstract, they have analyzed "2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries."

You can find the papers here:
A global reference for human genetic variation (Nature, 2015)

An integrated map of structural variation in 2,504 human genomes (Nature, 2015)

The UK10K consortium also published a detailed description of human genetic variability based on around 10,000 samples, partly low-coverage WGS of control samples and partly high-coverage WES focused on various complex and rare diseases.
Citing the abstract, "Here we describe insights from sequencing whole genomes (low read depth, 7×) or exomes (high read depth, 80×) of nearly 10,000 individuals from population-based and disease collections. In extensively phenotyped cohorts we characterize over 24 million novel sequence variants, generate a highly accurate imputation reference panel and identify novel alleles associated with levels of triglycerides (APOB), adiponectin (ADIPOQ) and low-density lipoprotein cholesterol (LDLR and RGAG1) from single-marker and rare variant aggregation tests. We describe population structure and functional annotation of rare and low-frequency variants, use the data to estimate the benefits of sequencing for association studies, and summarize lessons from disease-specific collections."
They also published an improved haplotype reference panel that can be used to improve imputation of low-frequency and rare variants, and developed an online tool to explore their association results.
The third paper is a first example of the disease-oriented results obtained by the consortium: they identified EN1 as a gene involved in reduced bone density and recurrent fractures.

You can find the papers here:
The UK10K project identifies rare variants in health and disease (Nature, 2015)

Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel (Nature Communications, 2015)