160620 sole nomics v2

BioIn4Next
Bioinformatic platforms for the study of
marine organisms
M. Gonzalo Claros
Dpto Biología Molecular y Bioquímica
Plataforma Andaluza de Bioinformática
Universidad de Málaga
“Microalgae production technologies and applications
to marine fish aquaculture”
El Puerto de Santa María, 20-24 June 2016
IFAPA centro El Toruño
1st Seminar of the Algae4A-B project
http://about.me/mgclaros/
@MGClaros claros@uma.es

BioIn4Next
Aquaculture is becoming a key source of food
2
All of them are non-model organisms

BioIn4Next
Non-model organisms: our expertise
3
http://www.scbi.uma.es/sustainpinedb/
http://www.juntadeandalucia.es/agriculturaypesca/ifapa/soleadb_ifapa/
http://reprolive.eez.csic.es/
http://www.scbi.uma.es/pgc/
http://mejgenvegetal.uco.es/fgb2/gbrowse/Ca/
CicerDB

BioIn4Next
Combinatory strategy
4
None is the best
The best result is obtained combining at least
two different tools for the same analysis
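One possible reading of this combinatory strategy is a simple vote: keep only the results reported by at least two tools. The sketch below assumes each tool returns a set of identifiers; the tool names, the contig identifiers and the `consensus` helper are hypothetical and are not part of the pipeline shown in these slides.

```python
from collections import Counter


def consensus(results_by_tool, min_support=2):
    """Return the items reported by at least `min_support` tools."""
    counts = Counter()
    for items in results_by_tool.values():
        counts.update(set(items))  # each tool votes once per item
    return {item for item, n in counts.items() if n >= min_support}


# Hypothetical output of three tools run on the same analysis
calls = {
    "tool_a": {"contig1", "contig2", "contig3"},
    "tool_b": {"contig2", "contig3", "contig4"},
    "tool_c": {"contig3", "contig5"},
}

supported = consensus(calls)
print(sorted(supported))
```

Raising `min_support` makes the selection stricter; with a value of 1 the function returns the union of all tools.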

BioIn4Next
Picasso: SuperComputing & BioInformatics @ UMA
5
Node types: 7 FAT nodes, computing nodes, THIN nodes, GPU nodes
Storage: hard disks, more disks
Node groups (specifications as listed on the slide):
- 768 cores, 3 TB RAM, 8 GB/core
- 80 cores, 2 TB RAM, >25 GB/core
- 32 GPU, 1 TB RAM, 8 GB/core
- 984 cores, 4 TB RAM, 4 GB/core
Picasso total: 2310 cores, 700 TB disk

BioIn4Next
Our bioinformatic algorithms for non-model organisms
6
Workflow diagram (de novo transcriptome assembly):
- Raw short reads: SeqTrimNext (pre-processing); Oases (pre-assembling, kmer 23 & 47, paired-end + single); CD-HIT (99%); misassembly rejection
- Raw long reads: SeqTrimNext (pre-processing); MIRA (pre-assembling) and EULER-SR (pre-assembling); CAP3 (reconciliation), giving contigs and debris
- BOWTIE 2 (mapping test): mapped contigs and unmapped contigs
- Full-LengtherNext: coding, non-coding, misassemblies
- Rejected material is labelled "#2 Rejected" (twice in the diagram)
- Output: better transcriptome
AutoFlow, a Versatile Workflow Engine Illustrated by Assembling an
Optimised de novo Transcriptome for a Non-Model Species, such as Faba
Bean (Vicia faba)
Running title: AutoFlow, a versatile workflow engine
Pedro Seoane1, Sara Ocaña2, Rosario Carmona3, Rocío Bautista3, Eva Madrid4, Ana M. Torres2, M. Gonzalo Claros1,3,*
1 Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, E-29071, Malaga, Spain
2 Área de Mejora y Biotecnología, IFAPA Centro “Alameda del Obispo”, Apdo 3092, E-14080 Cordoba, Spain
3 Plataforma Andaluza de Bioinformática, Universidad de Málaga, E-29071 Malaga, Spain
4 Institute for Sustainable Agriculture, CSIC, Apdo 4084, E-14080 Cordoba, Spain
* Corresponding author
Manuel Gonzalo Claros Díaz
Departamento de Biología Molecular y Bioquímica,
Facultad de Ciencias, Universidad de Málaga,
E-29071, Malaga (Spain)
Fax: +34 95 213 20 41
Tel: +34 95 213 72 84
E-mail: claros@uma.es

SOFTWARE Open Access
SeqTrim: a high-throughput pipeline for
pre-processing any type of sequence read
Juan Falgueras1, Antonio J Lara2, Noé Fernández-Pozo3, Francisco R Cantón3, Guillermo Pérez-Trabado2,4, M Gonzalo Claros2,3*
Abstract
Background: High-throughput automated sequencing has enabled an exponential growth rate of sequencing
data. This requires increasing sequence quality and reliability in order to avoid database contamination with
artefactual sequences. The arrival of pyrosequencing enhances this problem and necessitates customisable pre-
processing algorithms.
Results: SeqTrim has been implemented both as a Web and as a standalone command line application. Already-
published and newly-designed algorithms have been included to identify sequence inserts, to remove low quality,
vector, adaptor, low complexity and contaminant sequences, and to detect chimeric reads. The availability of
several input and output formats allows its inclusion in sequence processing workflows. Due to its specific
algorithms, SeqTrim outperforms other pre-processors implemented as Web services or standalone applications. It
performs equally well with sequences from EST libraries, SSH libraries, genomic DNA libraries and pyrosequencing
reads and does not lead to over-trimming.
Conclusions: SeqTrim is an efficient pipeline designed for pre-processing of any type of sequence read, including
next-generation sequencing. It is easily configurable and provides a friendly interface that allows users to know
what happened with sequences at every pre-processing stage, and to verify pre-processing of an individual
sequence if desired. The recommended pipeline reveals more information about each sequence than previously
described pre-processors and can discard more sequencing or experimental artefacts.
Background
Sequencing projects and Expressed Sequence Tags (ESTs) are essential for gene discovery, mapping,
functional genomics and for future efforts in genome annotations, which include identification of
novel genes, gene location, polymorphisms and even intron-exon boundaries. The availability of
high-throughput automated sequencing has enabled an exponential growth rate of sequence data,
although not always with the desired quality. This exponential growth is enhanced by the so
called “next-generation sequencing”, and efforts have to be made in order to increase the quality
and reliability of sequences incorporated into databases: up to 0.4% of sequences in nucleotide
databases contain contaminant sequences [1,2]. The situation is even worse in the EST databases,
where vector contamination rate reach 1.63% of sequences [3]. Hence, improved and user friendly
bioinformatic tools are required to produce more reliable high-throughput pre-processing methods.
Pre-processing includes filtering of low-quality sequences, identification of specific features
(such as poly-A or poly-T tails, terminal transferase tails, and adaptors), removal of contaminant
sequences (from vector to any other artefacts) and trimming the undesired segments. There are some
bioinformatic tools that can accomplish individual pre-processing aspects (e.g. TrimSeq, TrimEST,
VectorStrip, VecScreen, ESTPrep [4], crossmatch, Figaro [5]), and other programs that cope with
the complete pre-processing pipeline such as PreGap4 [6] or the broadly used tools Lucy [7,8] and
SeqClean [9]. Most of these require installation, are difficult to configure, environment-specific,
or focused on specific needs (like a design only for ESTs), or require a change in implementation
and design of either the program or the protocols within the laboratory itself.
* Correspondence: claros@uma.es
2 Plataforma Andaluza de Bioinformática, Universidad de Málaga, 29071 Málaga, Spain
Falgueras et al. BMC Bioinformatics 2010, 11:38
http://www.biomedcentral.com/1471-2105/11/38
© 2010 Falgueras et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
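The pre-processing steps listed in the SeqTrim background (adaptor removal, low-quality trimming, poly-A tail detection) can be illustrated with a toy function. This is only a minimal sketch under stated assumptions: the function name `trim_read`, its thresholds and the example sequences are hypothetical, and it does not reproduce the SeqTrim or SeqTrimNext algorithms.

```python
import re


def trim_read(seq, quals, adaptor="", min_qual=20, min_len=20):
    """Toy read trimmer; returns the trimmed sequence or None if rejected.

    seq   : nucleotide string
    quals : list of per-base quality scores, same length as seq
    """
    # 1. Cut the read at the first occurrence of the adaptor, if any
    if adaptor:
        pos = seq.find(adaptor)
        if pos != -1:
            seq, quals = seq[:pos], quals[:pos]
    # 2. Remove low-quality bases from the 3' end
    end = len(seq)
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    seq = seq[:end]
    # 3. Strip a terminal poly-A tail of 10 or more residues
    seq = re.sub(r"A{10,}$", "", seq)
    # 4. Reject reads that became too short to be useful
    return seq if len(seq) >= min_len else None


insert = "ACGTGCTAGCTAGGATCCGATCGT"       # hypothetical 24-nt insert
read = insert + "A" * 12 + "NN"           # poly-A tail plus two bad bases
quals = [30] * 36 + [5, 5]
trimmed = trim_read(read, quals)
print(trimmed)
```

The order of the steps matters: the poly-A tail only becomes terminal after the low-quality bases have been removed.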
A Web Tool to Discover Full-Length Sequences: Full-Lengther
Antonio J Lara1, Guillermo Pérez-Trabado2, David P Villalobos1, Sara Díaz-Moreno1, Francisco R Cantón1, and M Gonzalo Claros3
1 Biología Molecular y Bioquímica, Universidad de Málaga, Campus Universitario de Teatinos, E-29071 Málaga, Spain
2 Arquitectura de Computadores, E.T.S.I. Informática, Campus de Teatinos, E-29071 Málaga, Spain
3 Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, 29071 Málaga (Spain)
Tel: +34 95 213 72 84
Fax: +34 95 213 20 41
Summary. Many Expressed Sequence Tags (EST) sequencing projects produce
thousands of sequences that must be cleaned and annotated. Here it is presented
Full-Lengther, an algorithm that can find out full-length cDNA sequences from EST
data. To accomplish this task, Full-Lengther is based on a BLAST report using a
protein database such as UniProt. Blast alignments will guide to locate protein coding
regions, mainly the start codon. Full-Lengther contains an ORF prediction algorithm
for those cases that do not deploy any alignment in the BLAST output. The
algorithm is implemented as a web tool to simplify its use and portability. This can
be worldwide accessible via http://castanea.ac.uma.es/genuma/full-lengther/.
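The summary mentions an ORF prediction step for sequences without a BLAST alignment. A minimal sketch of what such a fallback could look like is given below: it scans the three forward reading frames for the longest ATG-to-stop ORF. This is an assumption made for illustration only; the function `longest_orf` is hypothetical and is not the published Full-Lengther algorithm, which also relies on BLAST alignments.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end, orf) for the longest ATG...stop ORF found in the
    three forward frames, or None when no complete ORF exists."""
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                end = i + 3                    # include the stop codon
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end, seq[start:end])
                start = None                   # look for the next ORF
    return best


print(longest_orf("CCATGAAATTTGGGTAACC"))
```

A sequence with a start codon but no in-frame stop is not reported here; a fuller implementation would flag it as a putative N-terminal fragment.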

More than one
equivalent tool

BioIn4Next
Choosing the best assembly in non-model organisms
7
Weighted PCA analysis

BioIn4Next
Transcriptome annotation for non-model organisms
8
Workflow diagram (transcriptome annotation):
- Input: better transcriptome
- Full-LengtherNext (including user database) separates artefacts & chimeras from useful transcripts
- Tools applied to the useful transcripts: Sma3s, MREPS, AutoFact, Full-LengtherNext (including TAIR & RefSeq)
- Annotations produced: transcript DESCRIPTION; transcript MODEL ORTHOLOGUE; transcript SSRs; DESCRIPTION, GO, EC, KEGG pathway, InterPro; transcript ORF, STATUS & REFERENCE TRANSCRIPTOME
- Box labelled "OPT"
- Output: ANNOTATED transcriptome ready to import in a database
Full-LengtherNext: A tool for characterisation and testing
de novo transcriptome assemblies of non-model organisms
Pedro Seoane1, Noé Fernández-Pozo1,2, Darío Guerrero-Fernández3, Rocío Bautista3 and M. Gonzalo Claros1,3,*

1 Introduction
New biological technology produces a large amount of sequences in form of
ESTs (Expressed Sequence Tags). These sequences have to be thoroughly annotated
to uncover, for example, their function. Currently, the task of annotating
EST sequences does not keep pace with the rate at which they are generated [1] since:
1. EST sequence annotation is computationally intensive and often returns no results;
2. EST data suffers from inconsistency problems (error rate, contaminant sequences, low complexity regions, etc.);
3. gene identification programs perform inconsistently as they are sensitive to errors.
Recycling

Sma3s: A Three-Step Modular Annotator for Large Sequence Datasets
Antonio Muñoz-Mérida1, Enrique Viguera2, M. Gonzalo Claros3, Oswaldo Trelles1,4 and Antonio J. Pérez-Pulido5,*
1 Integrated Bioinformatics, National Institute for Bioinformatics, University of Málaga, Campus de Teatinos, Spain
2 Cellular Biology, Genetics and Physiology Department, University of Málaga, Campus de Teatinos, Spain
3 Molecular Biology and Biochemistry Department, University of Málaga, Campus de Teatinos, Spain
4 Computer Architecture Department, University of Málaga, Campus de Teatinos, Spain
5 Centro Andaluz de Biología del Desarrollo (CABD, UPO-CSIC-JA), Facultad de Ciencias Experimentales (Área de Genética), Universidad Pablo de Olavide, Sevilla 41013, Spain
*To whom correspondence should be addressed. Tel. +34 954-348-652. Fax. +34 954-349-376.
E-mail: ajperez@upo.es
Edited by Prof. Kenta Nakai
(Received 29 October 2013; accepted 6 January 2014)
Abstract
Automatic sequence annotation is an essential component of modern ‘omics’ studies, which aim to extract
information from large collections of sequence data. Most existing tools use sequence homology to establish
evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a
similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the
correct configuration is critical and can be challenging for non-specialist users. Thus, the development of
robust automatic annotation techniques that generate high-quality annotations without needing expert
knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically
annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s
is composed of three modules that progressively annotate query sequences using either: (i) very similar
homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We
trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity
values of ∼85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide
variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms,
and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s
has already been used in the functional annotation of two published transcriptomes.
Key words: functional annotation; genome annotation; transcriptome annotation; bioinformatic tool
1. Introduction
Sequence annotation is the process of associating biological information to sequences of interest.
Annotations can include the potential function, cellular localization, biological process or
protein structure of a given sequence [1]. Some sequences are annotated using direct experimental
evidence, but most annotations are inferred from sequence similarities or conserved patterns
associated with known characteristics [2–5]. Large publicly accessible databases of annotated
sequences make it possible to automatically annotate large collections of unknown sequences. This
is especially valuable for the interpretation of large sequence datasets generated by genome and
expressed sequence tag (EST) sequencing projects as well as gene and protein expression
experiments, such as DNA microarrays, and many other emerging research areas [6].
Sequence annotation is also important in transcriptomic experiments that aim to identify gene
clusters with similar expression patterns that are linked to a particular biological process or
experimental condition. Biological function can then be inferred from annotations shared within
these clusters [7].
# The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/
.0/), which permits non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
For commercial re-use, please contact journals.permissions@oup.com.
DNA RESEARCH 21, 341–353, (2014) doi:10.1093/dnares/dsu001
Advance Access publication on 5 February 2014
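The Sma3s abstract above quotes average sensitivity and specificity values of about 85%. As a reminder of how such figures are derived, here is a minimal sketch that assumes the standard confusion-matrix definitions; the counts are hypothetical and were chosen only so that both values come out at 0.85. The paper may define or average these measures differently.

```python
def sensitivity_specificity(tp, fp, tn, fn):
    """Standard definitions from true/false positive and negative counts."""
    sensitivity = tp / (tp + fn)   # fraction of real annotations recovered
    specificity = tn / (tn + fp)   # fraction of wrong annotations rejected
    return sensitivity, specificity


# Hypothetical evaluation of an annotator on sequences of known function
sens, spec = sensitivity_specificity(tp=85, fp=15, tn=85, fn=15)
print(sens, spec)
```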
More than one
equivalent tool

BioIn4Next
Our bioinformatic contribution to aquaculture
9
Transcriptomes: Solea senegalensis, Solea solea, Tisochrysis lutea, Ruditapes decussatus
Genomes: Solea senegalensis, Photobacterium damselae subsp. piscicida (x2)
SNPs: Mytilus edulis, Crassostrea angulata
Also listed: Tetraselmis chuii
Uses labelled on the slide: human food, aquaculture feed, aquaculture diseases

BioIn4Next
Bioinformatics tools based on
transcriptomes
10

BioIn4Next
NGS read pre-processing for 2 sole transcriptomes
11
NGS platform                Illumina           Illumina           454
Species                     S. senegalensis    S. solea           S. senegalensis
Total input reads           1,800,249,230      2,101,324,072      5,663,225
  mean length               76                 100                757
Rejected (total)       N    237,941,945        345,251,849        1,562,661
                       %    13.5               17.1               26.8
  by contamination     N    144,247,943        226,627,909        156,921
                       %    8.2                11.2               3.0
Useful reads           N    1,561,416,814      1,746,258,741      3,774,412
                       %    86.7               83.1               67.6
  paired reads         N    1,503,882,050      1,676,160,406      -
                       %    83.3               79.5               -
  single reads         N    57,534,764         70,098,335         3,774,412
                       %    3.2                3.3                67.6
  mean length               66                 89                 184
Benzekri et al. BMC Genomics 2014, 15:952
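The percentages in this table follow directly from the read counts. A short sketch of the calculation for the two Illumina columns is shown below; the counts are copied from the table, while the dictionary layout and the `percent` helper are only illustrative.

```python
# Read counts copied from the table above (Illumina columns only)
reads = {
    "S. senegalensis (Illumina)": {"total": 1_800_249_230, "useful": 1_561_416_814},
    "S. solea (Illumina)": {"total": 2_101_324_072, "useful": 1_746_258_741},
}


def percent(part, total):
    """Percentage of `part` over `total`, rounded to one decimal."""
    return round(100 * part / total, 1)


useful_pct = {name: percent(c["useful"], c["total"]) for name, c in reads.items()}
print(useful_pct)
```

The same helper can be applied to the rejected, paired and single read counts to cross-check the remaining rows.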

BioIn4Next
Overview of the two sole transcriptomes
12
                               S. senegalensis v3    S. senegalensis v4    S. solea v1
                               Unigenes   %          Unigenes   %          Unigenes   %
Total                          252,416    100.00     697,124    100.00     531,463    100.00
>500 bp                        37,593     14.90      156,083    22.24      165,860    31.22
>200 bp                        168,914    66.92      385,411    54.92      338,967    63.89
Longest unigene                6,050      -          40,163     -          68,559     -
Misassembled                   18         0.01       215        0.03       116        0.02
Putative chimera               984        0.39       6,345      0.91       9,447      1.80
Unigene report
With an orthologue (1)         81,348     32.23      147,536    21.74      121,696    22.90
  Different orthologue IDs     41,792     51.37      45,063     30.87      38,402     31.56
  Complete ORFs                6,742      8.31       39,727     26.12      52,051     42.77
  Different, complete ORFs     4,376      5.38       18,738     12.34      22,683     18.64
  C-terminus                   14,757     18.14      27,080     17.94      19,579     16.09
  N-terminus                   11,298     13.88      27,638     18.52      25,131     20.65
  Internal                     47,529     58.43      53,091     37.42      24,935     20.49
  Putative ncRNA               539        0.21       1,252      0.18       1,075      0.20
Without orthologue (1)         171,067    67.56      545,491    78.08      408,692    76.90
  Putative new genes           22,612     13.21      39,812     7.49       34,194     8.37
  Non-redundant put. new genes nc         –          14,451     2.51       14,528     3.55
  Unknown                      147,916    86.48      506,679    92.51      374,498    91.63
Reference transcriptome        nc         –          59,514     8.85       54,005     10.16
Callouts on the slide: Only 454; Only Illumina; 454 + Illumina; Very useful

BioIn4Next
Soles are transcriptomically similar
13
[Figure: pie charts comparing the distribution of GO terms in S. senegalensis and S. solea; panels A, B and C cover biological process, cellular component and molecular function]
USES

BioIn4Next
Soles and zebrafish are highly orthologous
14
[Figure caption, truncated on the slide: "...bution of the level of similarity between both sole reference transcriptomes for those transcripts with (dar..."]
Benzekri et al. BMC Genomics 2014, 15:952
Transcripts having a zebrafish
orthologue are more similar
between soles
Transcripts lacking a zebrafish
orthologue are still significantly
homologous between soles
USES

BioIn4Next
There are lineage-specific genes in teleosts
15
                                                     Likely protein coding    W/O zebrafish orthologue
Orthologs between soles of unknown function          137                      351
Orthologs in other teleosts proteins:
  Gadus morhua                                       7                        155
  Oryzias latipes                                    10                       190
  Oreochromis niloticus                              17                       241
  Tetraodon nigroviridis                             6                        198
  Gasterosteus aculeatus                             17                       235
  In at least one of these species                   27                       290
Orthologs in Cynoglossus semilaevis DNA (flatfish)   99                       287
Orthologs in teleosts but not in flatfish            3                        46
Specific orthologs only in flatfish                  75                       43
Without ortholog                                     35                       18
Callouts on the slide: sole-specific genes; flatfish-specific genes
USES

BioIn4Next
Development of microarray and qPCR primers
16
Benzekri et al. BMC Genomics 2014, 15:952
Feature selection algorithm for microarray printing
Input: unigenes of S. senegalensis v3
Full-LengtherNext classification:
- Complete: 6,742
- N-terminal: 11,268
- Internal: 47,259
- C-terminal: 14,757
- Coding: 22,612
ORF-Predictor: selection of unigenes qualified as coding and with ORF (with ORF: 22,612; ncRNA)
Selection steps:
- Select non-redundant complete unigene: longest, non-redundant, complete unigenes: 5,545
- Select the most 3', non-redundant unigene: most 3', non-redundant incomplete unigenes: 34,291
- CD-HIT: non-redundant coding: 21,314
- Select longer and non-redundant unigenes
Inconsistent unigenes: 51,218
Other counts shown in the diagram: 21,099; 30,119
Output: SELECTED unigenes for microarray
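One of the selection steps in this diagram keeps the longest, non-redundant, complete unigene. A minimal sketch of that idea follows, assuming that two unigenes are redundant when they share the same orthologue ID. That redundancy criterion, the record fields and the example values are all hypothetical; the slide itself relies on Full-LengtherNext, ORF-Predictor and CD-HIT.

```python
def select_longest_per_orthologue(unigenes):
    """For each orthologue ID, keep the longest unigene flagged as complete."""
    best = {}
    for u in unigenes:
        if u["status"] != "complete":
            continue                      # incomplete unigenes are handled elsewhere
        current = best.get(u["orthologue"])
        if current is None or u["length"] > current["length"]:
            best[u["orthologue"]] = u
    return sorted(best.values(), key=lambda u: u["id"])


# Hypothetical records, one per unigene
unigenes = [
    {"id": "u1", "orthologue": "P1", "status": "complete", "length": 900},
    {"id": "u2", "orthologue": "P1", "status": "complete", "length": 1200},
    {"id": "u3", "orthologue": "P2", "status": "internal", "length": 700},
    {"id": "u4", "orthologue": "P3", "status": "complete", "length": 500},
]

selected = select_longest_per_orthologue(unigenes)
print([u["id"] for u in selected])
```

The analogous step for incomplete unigenes would rank candidates by how close they lie to the 3' end rather than by length.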

microarray provided repetitive and consistent positive
hybridization signals.
Conclusions
De novo transcriptomes of S. solea and S. senegalensis
covering their main developmental stages and organs were
described based on a combined assembly approach that
can be applied to other transcriptomic studies. The huge
volume of reads processed in each species (>1,800 million,
the highest number of reads reported to date for any organism)
produced a high number of transcripts that were
mined to obtain a representative reference transcriptome
Figure 7 Schematic representation of the probe selection strategy for the construction of the Senegalese sole oligonucleotide
microarray. The number of transcripts that resulted after the described filtration is indicated.
Table 4 Validation of microarray data using qPCR
                                                                Microarray       qPCR
SoleaDB code     Gene name                            Gene      FC     p-value   FC     p-value
Unigene18736     Angiotensin I converting enzyme 2    ace2      4.5    <0.001    4.9    <0.05
Unigene49603     Angiotensinogen                      agt       3.5    <0.01     4.7    <0.05
Unigene39473     Na-K-Cl cotransporter2               nkcc2     2.5    <0.01     3.13   <0.01
Unigene252320    Transferrin                          tf        15.6   <0.001    10.5   <0.01
Unigene214993    Ferritin                             fth       2.1    <0.01     2.3    <0.05
Unigene39196     Heat shock protein 90-alpha          hsp90aa   2.7    <0.01     2.3    <0.01
Unigene54412     Trypsinogen1a                        try1      17.6   <0.001    12.0   <0.001
Unigene31826     Trypsinogen2                         try2      4.7    <0.001    7.8    <0.05
Unigene53434     Chymotrypsinogen2                    ctr2      7.2    <0.001    6.3    <0.05
Unigene52166     Elastase1                            cela1     8.7    <0.001    7.8    <0.05
Unigene53593     Elastase4                            cela4     7.1    <0.001    4.6    <0.05
Unigene54920     Complement component C3              c3        3.8    <0.05     34.0   <0.05
Unigene53521     Lysozyme g                           lyg       2.5    <0.05     3.6    <0.05
Unigene219622    Thyroid stimulating hormone, beta    tshb      2.5    <0.05     4.6    <0.001
Unigene52404     Transaldolase                        taldo     2.1    <0.05     2.5    <0.05
Fold-changes (FC) and p-values obtained for target genes by microarray and qPCR are indicated. Moreover, the transcript code in the SoleaDB for S. senegalensis
v3 transcriptome is also shown. For qPCR, data were normalized to those of gapdh2 and referred to the calibrator group (36 ppt 3 DPH).
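The note to Table 4 says that qPCR data were normalized to gapdh2 and referred to a calibrator group. The source does not state the formula used; a common choice for this kind of normalization is the 2^-ΔΔCt method, sketched below under that assumption. The function name and the Ct values are hypothetical and only illustrate the arithmetic; the sketch also assumes primer efficiencies close to 100%.

```python
def relative_expression(ct_target, ct_reference, ct_target_cal, ct_reference_cal):
    """Fold change of a target gene versus a calibrator group (2^-ddCt).

    ct_target / ct_reference: Ct values in the sample of interest
    ct_target_cal / ct_reference_cal: Ct values in the calibrator group
    Assumes about 100% amplification efficiency for both primer pairs.
    """
    delta_sample = ct_target - ct_reference              # normalize to the reference gene
    delta_calibrator = ct_target_cal - ct_reference_cal  # same for the calibrator group
    delta_delta = delta_sample - delta_calibrator
    return 2.0 ** (-delta_delta)


# Hypothetical Ct values: the target crosses the threshold two cycles earlier
# (relative to the reference gene) in the sample than in the calibrator.
fold_change = relative_expression(22.0, 18.0, 24.0, 18.0)
print(fold_change)
```

With these illustrative inputs the target is expressed four times more than in the calibrator group, since a difference of two cycles corresponds to 2^2.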
Microarray validation
[Figure: bar charts of relative gene expression (y axis) per gene (x axis) at 37 ppt and 10 ppt. Genes shown: C1qlike, c2, c3, c401, c402, c5, c9, factor h; ptgs1a, ptgs2; il1b, il11a, il8b.]

SoleaDB: transcriptome database
http://www.juntadeandalucia.es/agriculturaypesca/ifapa/soleadb_ifapa/

Current contents of SoleaDB
Benzekri et al. BMC Genomics 2014, 15:952

Browsing S. senegalensis transcriptome v 4.1
About the assembly
Download the complete transcriptome
Download all annotations
Download the full information for a subset of transcripts
Download raw reads

Browsing by transcript

Filtering options for displayed transcripts

More specific filtering/searching

Paginated
Included in the representative transcriptome

About one particular transcript

Markers: SNPs and SSRs
Displayed SSRs

SoleaDB: a huge source of molecular markers
representation of GATA repeats (<0.2% total repeat motifs)
confirmed by FISH analysis (Additional file 9). Comparison
of SSRs Blast-based orthologs in soles (Table 3
[7]. Two species-specific oligo-D
been reported in S. senegalensis an
limited number of unique transcri
number of ESTs available in soles [
was compensated to some extent u
croarrays [49]. The sole transcripto
study have overcome these restrictio
lect sole-specific probes is depicted
5,545 complete non-redundant tran
the 34,291 longest, non-redunda
cripts. Clustering them resulted in
redundant transcripts (Figure 7) tha
13,284 selected “Coding” transcrip
43,303 probes. The final panel of
related to reproduction, cell differ
stress, growth, biosynthetic and cat
port, embryonic development and i
other functions.
The microarray was tested with l
salinities (10 and 36 ppt). Hybrid
tected for 42,469 probes. A total
found differentially expressed (p <
were up-regulated and 175 down-re
pared to 36 ppt. Application of a
(expression ratio) > ±1 filtered 1,48
down-regulated probes. The differe
(DEGs) were involved in osmoregu
porters and the renin-angiotensin
Table 3 SSR summary statistics for whole and reference transcriptomes
Type of SSR                    S. senegalensis   S. solea
Whole transcriptome            266,434           316,388
  Di-nucleotide                107,828           126,260
  Tri-nucleotide               96,076            114,198
  Tetra-nucleotide             39,102            44,118
  Others                       23,428            31,812
Reference transcriptome        49,955            67,610
  Di-nucleotide                16,405            22,371
  Tri-nucleotide               22,394            29,764
  Tetra-nucleotide             6,935             8,829
  Others                       4,221             6,646
Blast-based orthologs          12,418            18,486
  Species-specific SSR (1)     1,273             4,803
  Conserved SSR                11,145            13,683
    Same repeat motif (2)      6,596             6,772
    Different repeat motif     4,549             6,911
Total number of SSRs and frequency according to their repeat motif are indicated.
(1) SSRs present in one species but not in orthologs of the other species.
(2) Exactly the same SSR repeat motif was found in both orthologs; in a few cases, SSR occurs once in one ortholog and twice in the other.
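Table 3 classifies SSRs by motif length (di-, tri- and tetra-nucleotide). The slides do not say which tool or thresholds produced these counts, so the following is only a minimal sketch of how perfect repeats of those three classes can be located with regular expressions. The function name, the minimum of four repeats and the example sequence are all assumptions made for illustration.

```python
import re

# Motif sizes reported in Table 3; the minimum repeat count is an assumption.
MOTIF_NAMES = {2: "di", 3: "tri", 4: "tetra"}


def find_ssrs(sequence, min_repeats=4):
    """Return (start, type, motif, repeat_count) tuples for perfect SSRs."""
    seq = sequence.upper()
    hits = []
    for size, name in MOTIF_NAMES.items():
        # One motif of `size` bases followed by at least min_repeats - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if len(set(motif)) == 1:
                continue  # homopolymer run, not a di/tri/tetra repeat
            if size == 4 and motif[:2] == motif[2:]:
                continue  # e.g. ACAC is really a di-nucleotide repeat
            repeats = len(match.group(0)) // size
            hits.append((match.start(), name, motif, repeats))
    return sorted(hits)


# Hypothetical sequence containing one (AC)4 and one (ATG)4 repeat.
example = "GG" + "AC" * 4 + "TT" + "ATG" * 4 + "CC"
for hit in find_ssrs(example):
    print(hit)
```

Imperfect or compound repeats, which dedicated SSR finders usually handle, are deliberately ignored here to keep the example short.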
USES

Overview of descriptions

Browsing ECs
More information about
this enzyme activity

Overview of KEGG pathways
List of S. senegalensis v4.1 enzymes for this pathway
The complete overview of this pathway

For example: steroid biosynthesis

Browsing by protein motifs and families

Study of apolipoprotein A-IV paralogs
was then carried out using SEQBOOT (100 replicates) in the PHYLIP
package (Felsenstein, 1989) followed by a Phyml reconstruction (100
replicates) (Guindon and Gascuel, 2003). The consensus phylogenetic
tree was subsequently obtained (CONSENSE). Trees were drawn using
Figtree v1.4.2 (http://tree.bio.ed.ac.uk/software/figtree/). Accession
numbers for sequences used in the phylogeny are indicated in
Supplementary file 1. Putative signal peptide was identified using SignalP
(http://www.cbs.dtu.dk/services/SignalP/).
Genomic sequences were retrieved after blasting sequences onto a
de novo genome assembly for a female sole using Oases software with
a 51 k-mer (Benzekri et al., unpublished results). To identify intron
and exon boundaries, the two genomic scaffolds containing the
apoA-IV gene cluster sequences were aligned with apoA-IV cDNA sequences
using Seqman software. Also, a blast analysis (blastx) at NCBI was
carried out to establish gene synteny and identify other gene coding
regions. The two scaffold sequences have been deposited at NCBI/EMBL/DDBJ
with accession numbers LC056058 and LC056059. Synteny analysis
was carried out using Ensembl (v79.01) and Genomicus genome
browser (http://www.genomicus.biologie.ens.fr/genomicus-79.01/cgi-
development. For apoA-IVAa1 and apoA-IVAa2, the incubation time
oscillated between 60 and 105 min (depending on the larval stage),
while for apoA-IVBa3 and apoA-IVBa4 a fixed time of 60 min was used
in all stages. In all cases, fasted and fed larvae at 3, 5 and 9 dph were
always managed in parallel and the same time for color development was
given. Twenty animals/sample-treatment/gene were used for each
WISH analysis. Digital images were captured using a Leica DFC290 HD
digital camera attached to a Leica DMIL LED inverted microscope.
2.4. RNA isolation and RT-qPCR analysis
Homogenization of samples, RNA isolation and cDNA synthesis
procedures were carried out as previously described (Armesto et al., 2014,
2015). Real-time analysis was carried out on a CFX96™ Real-Time System
(Bio-Rad) using Senegalese sole specific primers for each apoA-IV
transcript (Table 1). Real-time reactions were accomplished in a 10-μL
volume containing cDNA generated from 10 ng of original RNA template,
300 nM each of specific forward and reverse primers, and 5 μL
of SYBR Premix Ex Taq (Takara, Clontech). The amplification protocol
Table 1
EST information and primer sequences for apoA-IV paralogs. The total number of ESTs (N) encoding for each paralog found at SoleaDB (v4.1; Benzekri et al., 2014) and the unigene ID
(v3 and v4.1) for sequences used for CDS (*), 5′- (†) and 3′-UTR (§) identification are indicated. Moreover, primer sequences used for probe amplification (¥) and qPCR (‡) analysis and their
corresponding amplicons (bp) are also shown.
Columns: Paralog; SoleaDB; N; Primer name; Primer sequence (5′ → 3′); Size

apoA-IVAa1 (N = 35)
  SoleaDB: solea_v3.0_unigene29941*; solea_v4.1_unigene546584†§
  apoa41fc2(‡)        ATGGACCCAGAGGCGCTGAAGACCGTA
  apoa41rc2(‡)        GGCCTGCAGCTCATCAGTGCTCTTGT       90(‡)
  apoa41_3(¥)         GGACAGGAAGTCAATACCAGGATCGCTCA
  apaa41_4(¥)         TAAACAGGAGGTGGAAAGTTGGCTGGAGT    669(¥)

apoA-IVAa2 (N = 14)
  SoleaDB: solea_v4.1_unigene431170*; solea_v4.1_unigene546431_split_0†; solea_v4.1_unigene534078§
  apoA42F(‡)          CCATGCGCACTCAGGTGGCTCCTC
  apoA42R(‡)          CCTCGGCATAGGGCTGCAGATTGGT        132(‡)
  apoA42_1(¥)         CGACAGTCTGAGCTGGGAAAGG
  apoA42_2(¥)         GGCGGCAGCAGGAGAAAATAAC           667(¥)

apoA-IVBa3 (N = 24)
  SoleaDB: solea_v3.0_unigene3621*; solea_v4.1_unigene14920†§
  apoa43_1(‡,¥)       GTCCTCGTTGTGCTCGTCCTTGCTGT
  apoa43_R(‡)         CGTGTCCATCACTGGCTTGGGTGCATC      87(‡)
  apoa43_2(¥)         GCCTGCACCTCCTCGATGTATGGGGAA      719(¥)

apoA-IVBa4 (N = 18)
  SoleaDB: solea_v3.0_unigene34222*; solea_v4.1_unigene547274†§
  SseapoA44_F(‡)      AGCTGAGACACAGAGCCAACCTGGTGA
  SseapoA44_2(‡,¥)    CATTAGCTGGGCTTGGATGTCCTGGGT      107(‡)
  SseapoA44_1(¥)      ATGCCAACCTTCTCTATGCGGATCCAC      689(¥)
86 J. Roman-Padilla et al. / Comparative Biochemistry and Physiology, Part B 191 (2016) 84–98
Román-Padilla et al. CBP Part B (2016) 191:84-98
Fig. 4. Phylogenetic relationships among the predicted sequences of Senegalese sole apoA-IV paralogs and the corresponding deduced amino acid sequences from other vertebrates (see
Supplementary file 1) using the Maximum Likelihood method. The apolipoprotein type and taxonomic group (fish or tetrapod) are indicated on the right. Moreover, the clusters A and B as
well as the four subclades (a1–a4) in Acanthopterygii are shown. The apoE sequences were used as outgroup to root tree. Only bootstrap values higher than 50% are indicated on each
branch. The scale for branch length (0.4 substitutions/site) is shown below the tree. Species abbreviations: Sse, Solea senegalensis; Cse, Cynoglossus semilaevis; Gac, Gasterosteus aculeatus;
Tru, Takifugu rubripes; Ame, Astyanax mexicanus; Dre, Danio rerio; Xtr, Xenopus tropicalis; Hsa, Homo sapiens; Rno, Rattus norvegicus; Mmu, Mus musculus; and Gga, Gallus gallus.
and Acanthopterygii. In the former, two or three species-specific
paralogs can be found within each cluster depending on the species al-
that expression of apoA-IV in YSL could be involved in the efficient
mobilization of TAG-rich molecules (throughout the formation of VLDL
Fig. 14. Transcript abundance of apoA-IV paralogs in different tissues of Senegalese sole juveniles. Data are represented in logarithmic scale. Expression values were normalized to those of
18S rRNA. Data were expressed as the mean fold change (mean + SEM, n = 3) from the calibrator group (kidney). Different letters denote tissues that are significantly different from liver
(P < 0.05).
USES

Putative miRNA precursors

Ready for gene expression and more

Retrieving SoleaDB by sequence homology
Paste your sequence
Or upload your file of sequences
Select your preferred assemblies
Select your E-value filter

Retrieving SoleaDB by keywords

Soles retained the crystallin genes
Figure 6 Phylogenetic tree of Crybb and Crybb-like proteins in vertebrates. A neighbor-joining tree based on the alignment of vertebrate
Crybb and Crybb-like sequences was built. Species are indicated as Sse (Solea senegalensis), Sso (Solea solea), Dre (Danio rerio), Tni (Tetraodon nigroviridis),
Oni (Oreochromis niloticus), Ola (Oryzias latipes), Cse (Cynoglossus semilaevis), Xla (Xenopus laevis) and Gga (Gallus gallus; see Additional file 7 for accession
numbers). Solea sequences are indicated according to the transcript name assigned in SoleaDB. Clusters are indicated as arcs of a circle. The tree
obtained was rooted using Xenopus laevis Cryga. Numbers adjacent to nodes indicate percentage bootstrap support; only values larger than 70%
Benzekri et al. BMC Genomics 2014, 15:952
Fish-specific
crystallin?
Fish-specific
crystallin?
Absent in
flatfish
USES

Tisochrysis lutea database
Tisochrysis lutea
http://www.scbi.uma.es/isochrysisdb/
Quite similar to other microphytes (microalgae)
H. Benzekri (2016)

Ruditapes database
http://www.scbi.uma.es/ruditapesdb/
Browsing and contents similar to SoleaDB
H. Benzekri (2016)

Most Ruditapes genes seem to be identified
1 Illumina library: 127 × 10^6 reads, 2 × 75 nt
H. Benzekri (2016)
USES

Too many small transcripts
1 Illumina library: 127 × 10^6 reads, 2 × 75 nt
Unique orthologues: 12 764 (32%)
Ruditapes philippinarum: 9 747 genes
USES

Bioinformatics tools based on genomes

Two Photobacterium damselae subsp. piscicida

Table IV.25: Summary of the pre-processing of the original reads of L091106-03H and DI21
(#1 to #4: references to Figures IV.43 and IV.44)
Strain                                        L091106-03H         DI21
Total reads (#1)
  Paired                                      148 622             433 717
  Single                                      297 269             187 433
Mean length
  Paired                                      509                 445
  Single                                      1 195               550
Rejected reads (#2)
  Paired                                      48 403 (32.6%)      238 804 (55%)
  Single                                      49 530 (16.7%)      53 251 (28.4%)
Contamination
  Paired                                      21 556 (14.5%)      62 761 (14.5%)
  Single                                      46 766 (15.7%)      37 791 (20.1%)
Total useful reads (#3)                       382 755             396 450
Paired reads (#4)                             69 318 (23.3%)      132 550 (15.3%)
Single reads                                  313 437             263 900
  From the paired library (unpaired reads)    65 553 (44.1%)      129 264 (29.8%)
  From the single library                     247 884 (83.4%)     134 636 (71.8%)
IV.3.1.2. Assembly
The first genome we assembled was that of L091106-03H, since it was the first for which we
received sequencing data. Knowing the genome size of DI21 (4.77 Mb) allowed us to
estimate the read coverage for L091106-03H, which turned out to be
14x, a very low coverage according to the parameters we had previously calculated for
a correct assembly (section XXXX). To assemble these
reads, and based on the results of the tests on genomic reads of the
Roche/454 type (section IV.1.2.2), the program CABOG [55] was selected, since it is the most accurate
when coverage is low. The assembly strategy used is illustrated in
Figure IV.43, where CABOG generated 510 contigs and 25 scaffolds, which formed version 1 of the
L091106-03H genome draft.
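The coverage estimate mentioned above is, in its usual form, the total number of sequenced bases divided by the genome size. The thesis excerpt does not show which reads entered the 14x figure, so the sketch below only illustrates the arithmetic; the function name and the input numbers are hypothetical and are not taken from Table IV.25.

```python
def estimate_coverage(libraries, genome_size_bp):
    """Approximate sequencing depth as total bases / genome size.

    libraries: iterable of (read_count, mean_read_length_bp) pairs
    genome_size_bp: expected genome size in base pairs
    """
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    total_bases = sum(count * mean_length for count, mean_length in libraries)
    return total_bases / genome_size_bp


# Hypothetical example: one library of 100 000 reads of 500 bp on a 5 Mb genome.
print(estimate_coverage([(100_000, 500)], 5_000_000))
```

Using the size of a closely related genome, as done here with DI21, gives only an approximation when the target genome size is unknown.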

[…]80 bp and a G+C content of 40.6%. In both drafts, the longest scaffold exceeds
[half] of the genome (>2 Mb), so the N50 equals the length of this scaffold. Therefore, it can be
said that the assembly was equivalent for both strains. As the DI21 genome draft
[…] NCBI (GCA_000300355.3) contained 56 scaffolds with 846 993 undetermined bases (N), we can
state that, since the new draft has only 19 scaffolds and fewer N (561 264), the
assembly of this strain has been improved.
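The remark that the N50 equals the length of the longest scaffold follows from the definition of N50: the length of the scaffold at which the running sum of lengths, sorted from longest to shortest, first reaches half of the assembly. A minimal sketch of that calculation is given below; the function name and the example lengths are hypothetical.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx statistic (N50 when fraction is 0.5, N90 when 0.9)."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    threshold = sum(lengths) * fraction
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= threshold:
            return length


# Hypothetical assembly in which one scaffold holds more than half of the bases,
# so the N50 is simply the length of that longest scaffold.
scaffolds = [600, 100, 100, 100, 50]
print(nx(scaffolds))        # N50
print(nx(scaffolds, 0.9))   # N90
```

When a single scaffold covers more than half of the assembly, N50 says little about the remaining scaffolds, which is why N90 is reported alongside it.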
Table IV.27: Characteristics of the final genome draft of L091106-03H and DI21
Strain                           L091106-03H (v2)    DI21
Number of scaffolds > 500 bp     14                  17
Longest scaffold                 2 323 982           2 798 534
Shortest scaffold                1 007               437
Sum of lengths                   4 194 408           4 316 437
Number of N                      341 126             561 264
Mean length                      299 600             227 180
N50                              2 323 982           2 798 534
N90                              157 598             152 634
G+C content                      40%                 40.6%
1.4. Annotation of the two genome drafts
The annotation of the L091106-03H and DI21 genome drafts was carried out with the
automatic annotation program RAST (Rapid Annotation using Subsystem Technology) [125]
Figure IV.47: Similarity between L091106-03H v2 and other bacteria based on the percentage of identity
of the alignment between proteins
Figure IV.48: Dotplot representation of the nucleotide alignments between L091106-03H (v2) and DI2

Figure IV.50: Visualisation of the synteny between the L091106-03H v2 and
DI21 genome drafts according to the correspondence obtained with Photobacterium damselae proteins (minimum
identity of 97% in both species). Matches were plotted with Circos [190]
To check the collinearity between the genes of the two strains, the arrangement of the CDSs in
the L091106-03H v2 and DI21 genomes was examined using SEED Viewer [191], which is
integrated with the RAST annotation program (http://rast.nmpdr.org). Figure IV.51
shows two examples of the arrangement of two groups of orthologous genes in the genomes of the
strains. In the first example (Figure IV.51-A), the order of the orthologous genes is well
conserved between the two strains, whereas in the second example (Figure IV.51-B) the
group of orthologous genes (6, 22, 31, 30, 29, 32 and 35) lies among other genes that
differ, which indicates that this group is found in two distinct regions in
scaffold 5 of L091106-03H v2 and scaffold 11 of DI21; moreover, the order of these
orthologous genes is not conserved, since gene 30 has a different relative position in the two
genomes. This situation was very rare, but it supports the hypothesis that
some rearrangements occurred in the genomes during the evolution of the two strains. In contrast, the first
situation, where the order of the orthologous genes is conserved, was the predominant one, indicating that
the genomes of the two strains are in general collinear.
H. Benzekri (2016)
USES

N50 is provided by the longest contig

Both pathogenic strains are highly syntenic

Photobacterium-DB for browsing genomes
http://www.scbi.uma.es/photobacterium_damselae/
H. Benzekri (2016); P. Seoane-Zonjic (2016)

Searchable and downloadable
P. Seoane-Zonjic (2016)

Solea senegalensis genome assembly approach
2 × 75 nt (female)
3 kb paired-ends (female)
Long paired-ends (female)
8.7 × 10^8 reads, 11.1 × 10^8 reads, 8.3 × 10^6 reads
RAY scaffolds: 213 548 and 278 995
Breaking into artificial reads
NUCMER - GAM-NGS - SSPACE - GAPcloser
Final scaffolds: 34 176
Longest: 638 263 nt
Mean length: 14 565 nt
N50: 85 596 nt
Total length: 600 Mbp
H. Benzekri (2016)

Cynoglossus semilaevis and soles are highly syntenic
[Figure: synteny plots between sole sequences and C. semilaevis chromosomes (Chr4 to Chr15), one panel per comparison]
Based on protein identity > 70%
Based on transcript identity
H. Benzekri (2016); Manchado et al. (2016), in press
USES

some may contain genome regions or genes specific to the Senegalese sole that are absent from (or
very different in) Cynoglossus semilaevis.

Figure IV.58: Example of an alignment between scaffold 1145 of S. senegalensis and chromosome 1 of C.
semilaevis. The regions shown are approximately 150 kb in size. Note that the aligned fragments
USES

One step beyond: from scaffolds to chromosomes
Long reads (female)
AQUAGENET1 (female)
AQUAGENET3 (female)
RAY scaffolds: 213 548 and 278 995
Breaking into artificial reads
NUCMER - GAM-NGS - SSPACE - GAPcloser
Longest: 638 263 nt
N50: 85 596 nt
ICMapper
Super-scaffolds
C. semilaevis
Chromosomes
22
8 538 scaffolds
Longest: 638 263 nt
N50: 105 233 nt
Total length: 466.7 Mbp
H. Benzekri (2016)

S. senegalensis superscaffolds validated by molecular markers
Already established linkage groups
113/129 SSR validated
New markers
88/113 validated
Females lack Chr W
→ XY system?
USES

Gene structure and synteny of apolipoproteins A-IV
USES
Román-Padilla et al. CBP Part B (2016) 191:84-98
block followed by a long domain containing 9 putative tandem repeats
flanked by the unrelated coding regions (UCR) 1 and 2 (Fig. 2). The common
block was located into the exon 3 (except for apoA-IVAa1 in the
exon 2) and could be divided into the A, B and C segments. Seven out
of the 9 putative tandem repeats were 22-mer in length and contained
ters according the genomic clusters A and B, as described above. In
Ostariophysi, the apoA-IV duplicates within each cluster appeared closely
related each other in the same branch indicating a high similarity
between intraspecific paralogs. In contrast, the apoA-IV duplicates within
each cluster in Acanthopterygii could be split into two clearly
Fig. 1. Gene structure of the four apoA-IV paralogs in Senegalese sole. The wide bars represent the exons, and thin lines the introns. The wide bars in red represent the 5′ and 3′ untranslated
regions whereas the ORF is shown in blue indicating signal peptides (dark blue) from the mature peptide (light blue). The size of exons and introns is also indicated. Only the length of the
exons is drawn to scale. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
separated subclades (referred to as a1 and a2 for cluster A and a3 and a4
for cluster B). According to this phylogenetic tree, we named each
paralog adding the genomic cluster and the Acanthopterygii subclade
they belonged to. Nevertheless, it should be noted that not all
Acanthopterygii species bear the four paralog types. G. aculeatus lacked
the apoA-IVAa1 and had two apoA-IVAa2-like paralogs (referred to as 1
Fig. 2. Deduced amino acid sequences of the four apoA-IV paralogs in Senegalese sole. Dots indicate amino acids identical to those of apoA-IVAa1. Blue arrows indicate the position of
introns. The signal peptide cleavage site is marked by a vertical bar. The unrelated coding regions 1 and 2 (UCR1 and UCR2, respectively) as well as the three repeats (A, B, C) of the common
33-codon block are indicated and their residues are numbered. The 22-mer repeats are boxed and the P(Y/H)A motifs are shaded. The conserved proline residues 117, 129 and 183 are
indicated by asterisk. The cleavage site for matrix metalloproteinase 7 is denoted by $. (For interpretation of the references to color in this figure legend, the reader is referred to the
web version of this article.)
Fig. 3. Physical synteny of apoA-IV paralogs. Cluster A. Synteny for apoA-IVAa1 and apoA-IVAa2 paralogs. Cluster B, synteny for apoA-IVBa3 and apoA-IVBa4 paral
the chromosome or scaffold location are indicated on the right. Each gene is represented by a color within each cluster. The coding direction is indicated by the p
indicate non-syntenic genes. “*” in T. rubripes denotes a gene identified by sequence analysis, not available in Genomicus platform “**” indicates an Apo
(ENSDARG00000095050). Gene names: apoC-I, apolipoprotein C-I; apoC-II, apolipoprotein C-II; apo14, apolipoprotein 14 kDa; apoEa and apoEb, apolipoprote
(Asp-Glu-Ala-Asp) box polypeptide 6; lipea, lipase, hormone-sensitive a; mep1b, meprin A, beta; msto1, misato 1, mitochondrial distribution and morphology
nine-rich splicing factor 4; and tomm40, translocase of outer mitochondrial membrane 40 homolog.

Genosole: a database for S. senegalensis genome draft
http://www.scbi.uma.es/GenoSole/
P. Seoane-Zonjic (2016)
COMING SOON

Rafa
Gonzalo
Rocío
Noé
Darío
Gonzalo
Isabel
Elena
Rosario
Pedro
David
P10-CVI-6075
BIO267
RTA2013-00068-C03 
RTA2013-00023-C02
Marina
BioIn4Next
Hicham
M. Manchado
