De novo genome assembly plays a key role in computational biology as a contiguous and accurate genome reconstruction acts as a starting point for subsequent functional analyses.
Long-read platforms, complemented by short reads to polish residual errors in the assembled contigs, have made genome assembly feasible for non-model organisms at reasonable cost. However, long repetitive/duplicated regions still limit the assembly to contigs and require scaffolding with an additional long-range platform to increase contiguity, possibly reaching chromosome-level assemblies. Therefore, de novo genome assembly is still a step-by-step process that requires the iterative integration of additional data layers. We start from PacBio HiFi or Oxford Nanopore (ONT) long reads to produce the contigs, then we use Illumina reads for polishing. Despite the improving quality of long reads, we still consider polishing a necessary step to remove InDels, the major cause of imprecise gene annotation because of the frameshift errors they introduce.
Then, we use Hi-C technology and/or optical (Bionano Genomics) and electronic (Nabsys) maps for scaffolding.
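
To illustrate why residual InDels matter for gene annotation, the following minimal sketch translates a short coding sequence before and after a single-base deletion. The sequence and the `translate` helper are hypothetical examples written for this illustration; they are not part of our assembly or polishing workflow. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate a DNA sequence in frame 0; '*' marks a stop codon.

    Trailing bases that do not form a complete codon are ignored.
    """
    seq = seq.upper()
    return "".join(
        CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3)
    )


# Hypothetical coding sequence: ATG GCC GAA ACC TGA
reference = "ATGGCCGAAACCTGA"

# Simulate an unpolished 1-bp deletion at position 4 (0-based).
with_deletion = reference[:4] + reference[5:]

print(translate(reference))      # MAET*
print(translate(with_deletion))  # MAKP
```

In this toy case the deletion shifts the reading frame, so the downstream amino acids change and the original stop codon is lost, which is the kind of error that polishing is meant to prevent.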
The pan-genome represents the entire set of genes within a species,
consisting of a core genome - containing sequences shared between all
individuals of the species - and the ‘dispensable’ genome, present in only some individuals. The idea of a
pan-genome was first conceived for bacterial species in 2005, when the
genomes of six strains of Streptococcus agalactiae were sequenced, revealing a core genome containing 80% of S. agalactiae genes.
In 2013, by de novo assembly of RNA-seq data, we reported that the high polyphenol content of the grapevine cultivar Tannat is conferred primarily by genes that are not shared with the reference genome, paving the way for pan-genome studies in plants and helping to replace the term ‘dispensable’ with ‘accessory’. The discovery that plant varieties/ecotypes can be characterised by sets of proprietary genes, and not only by a proprietary combination of different alleles of the same set of genes, required a tremendous effort. Today, assembling a pan-genome for complex genomes is facilitated by improvements in genome sequencing technologies, particularly long-read sequencing. In collaboration with Roberto Papa, we are currently constructing the bean pan-genome.
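
The core/accessory distinction described above can be sketched with plain set operations. This is only an illustration under simplifying assumptions: each accession is reduced to a set of gene identifiers, and the accession names, gene IDs, and the `split_pan_genome` function are hypothetical. Real pan-genome construction works on assembled sequences and requires orthology or alignment-based clustering, which is not shown here.

```python
from typing import Dict, Set, Tuple


def split_pan_genome(
    gene_sets: Dict[str, Set[str]],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (pan, core, accessory) gene sets for a group of accessions.

    pan       = genes found in at least one accession
    core      = genes found in every accession
    accessory = genes in the pan-genome that are missing from the core
    """
    if not gene_sets:
        raise ValueError("at least one accession is required")
    genomes = list(gene_sets.values())
    pan = set().union(*genomes)
    core = set.intersection(*genomes)
    accessory = pan - core
    return pan, core, accessory


# Hypothetical presence data: accession -> gene identifiers.
example = {
    "accession_A": {"g1", "g2", "g3"},
    "accession_B": {"g1", "g2", "g4"},
    "accession_C": {"g1", "g2", "g3", "g5"},
}

pan, core, accessory = split_pan_genome(example)
print(sorted(core))       # ['g1', 'g2']
print(sorted(accessory))  # ['g3', 'g4', 'g5']
```

Here `g4` and `g5` play the role of genes carried by a single accession, while `g3` is shared by some accessions but not all; both kinds fall in the accessory genome.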