160620 sole nomics v2
1. BioIn4Next
Bioinformatic platforms for the study of
marine organisms
M. Gonzalo Claros
Dpto Biología Molecular y Bioquímica
Plataforma Andaluza de Bioinformática
Universidad de Málaga
El Puerto de Santa María, 20-24/6/16
“Microalgae production technologies and applications to marine fish aquaculture”
El Puerto de Santa María, 20-24 June
IFAPA El Toruño centre
1st Seminar of the Algae4A-B project
pecialistas en el
onsorcio
http://about.me/mgclaros/
@MGClaros claros@uma.es
10. BioIn4Next
Our bioinformatic algorithms for non-model organisms
[Assembly workflow diagram; box labels as extracted]
Short reads: Raw short reads; SeqTrimNext (pre-processing); Oases (pre-assembling; kmer 23 & 47; paired-end + single); CD-HIT 99%; misassembly rejection; #2 Rejected
Long reads: Raw long reads; SeqTrimNext (pre-processing); MIRA (pre-assembling); EULER-SR (pre-assembling); CAP3 (reconciliation)
Mapping and testing: BOWTIE 2 (mapping test); mapped contigs; unmapped contigs; #2 Rejected; Full-LengtherNext; coding; non-coding; misassemblies; contigs; debris
Output: Better transcriptome
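The "CD-HIT 99%" box in the workflow removes near-identical contigs before the later steps. As a rough illustration of that idea only, the sketch below clusters sequences greedily, longest first, and keeps one representative per cluster. It is not CD-HIT itself: the `identity` function is a simple stand-in built on Python's `difflib`, and the names `identity`, `greedy_cluster` and the toy sequences are assumptions made for this example.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Matched characters divided by the length of the shorter sequence.

    Stand-in similarity measure (assumption); CD-HIT uses its own
    alignment-based identity.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs: dict, threshold: float = 0.99) -> dict:
    """Greedy incremental clustering: longest sequences become representatives."""
    clusters = {}
    # Process sequences from longest to shortest.
    for name, seq in sorted(seqs.items(), key=lambda kv: len(kv[1]), reverse=True):
        for rep in clusters:
            if identity(seqs[rep], seq) >= threshold:
                clusters[rep].append(name)  # redundant with an existing representative
                break
        else:
            clusters[name] = [name]  # no representative is similar enough
    return clusters


# Hypothetical toy contigs: "b" is a truncated copy of "a", "c" is unrelated.
contigs = {"a": "ACGT" * 25, "b": ("ACGT" * 25)[:99], "c": "TTGACC" * 15}
print(greedy_cluster(contigs))
```

Processing the longest sequences first means that each shorter, redundant contig is absorbed by a longer representative, which matches the goal of keeping the longest non-redundant contigs.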
AutoFlow, a Versatile Workflow Engine Illustrated by Assembling an
Optimised de novo Transcriptome for a Non-Model Species, such as Faba
Bean (Vicia faba)
Running title: AutoFlow, a versatile workflow engine
Pedro Seoane1, Sara Ocaña2, Rosario Carmona3, Rocío Bautista3, Eva Madrid4, Ana M. Torres2, M. Gonzalo Claros1,3,*
1 Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, E-29071, Malaga,
Spain
2 Área de Mejora y Biotecnología, IFAPA Centro “Alameda del Obispo”, Apdo 3092, E-14080
Cordoba, Spain
3 Plataforma Andaluza de Bioinformática, Universidad de Málaga, E-29071 Malaga, Spain
4 Institute for Sustainable Agriculture, CSIC, Apdo 4084, E-14080 Cordoba, Spain
* Corresponding author
Manuel Gonzalo Claros Díaz
Departamento de Biología Molecular y Bioquímica,
Facultad de Ciencias, Universidad de Málaga,
E-29071, Malaga (Spain)
Fax: +34 95 213 20 41
Tel: +34 95 213 72 84
E-mail: claros@uma.es
17. BioIn4Next
Transcriptome annotation for non-model organisms
[Annotation workflow diagram; box labels as extracted]
Input: Better transcriptome
Filter: Full-LengtherNext (including user database), separating artefacts & chimeras from useful transcripts
Annotation tools: Sma3s; MREPS; AutoFact; Full-LengtherNext (including TAIR & RefSeq)
Annotations produced: transcript DESCRIPTION; transcript MODEL ORTHOLOGUE; transcript SSRs; DESCRIPTION, GO, EC, KEGG pathway, InterPro; transcript ORF, STATUS & REFERENCE TRANSCRIPTOME
OPT
ANNOTATED transcriptome ready to import in a database
Full-LengtherNext: A tool for characterisation and testing de novo transcriptome assemblies of non-model organisms
Pedro Seoane1, Noé Fernández-Pozo1,2, Darío Guerrero-Fernández3, Rocío Bautista3 and M. Gonzalo Claros1,3,*
18. BioIn4Next
Transcriptome annotation for non-model organisms
A Web Tool to Discover Full-Length Sequences: Full-Lengther
Antonio J Lara1, Guillermo Pérez-Trabado2, David P Villalobos1, Sara Díaz-Moreno1, Francisco R Cantón1, and M Gonzalo Claros3
1 Biología Molecular y Bioquímica, Universidad de Málaga, Campus Universitario de Teatinos, E-29071 Málaga, Spain
2 Arquitectura de Computadores, E.T.S.I. Informática, Campus de Teatinos, E-29071 Málaga, Spain
3 Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, 29071 Málaga (Spain)
Tel: +34 95 213 72 84
Fax: +34 95 213 20 41
E-mail: claros@uma.es
Summary. Many Expressed Sequence Tag (EST) sequencing projects produce
thousands of sequences that must be cleaned and annotated. Full-Lengther,
an algorithm that can identify full-length cDNA sequences in EST data, is
presented here. To accomplish this task, Full-Lengther relies on a BLAST
report obtained against a protein database such as UniProt. The BLAST
alignments guide the location of protein-coding regions, mainly the start
codon. Full-Lengther also contains an ORF prediction algorithm for those
cases that show no alignment in the BLAST output. The algorithm is
implemented as a web tool to simplify its use and portability, and is
accessible worldwide via http://castanea.ac.uma.es/genuma/full-lengther/.
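The summary mentions an ORF prediction step for sequences without any BLAST alignment. The sketch below shows one generic way such a fallback could look: it scans the three forward reading frames for the longest stretch running from an ATG to an in-frame stop codon. This is an illustrative assumption, not Full-Lengther's actual algorithm; the function name `longest_orf` and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str):
    """Return (start, end) of the longest ATG...stop ORF on the forward strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    Returns None when no complete ORF is found.
    """
    seq = seq.upper()
    best = None  # (start, end) of the longest ORF seen so far
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon of a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best


# Hypothetical EST fragment with one short ORF in the third frame.
est = "CCATGAAATTTGGGTAACC"
orf = longest_orf(est)
if orf is not None:
    print(est[orf[0]:orf[1]])
```

A production tool would also need to handle the reverse strand and ORFs that are truncated at either end of the read, which is exactly where the BLAST-guided approach described above is more informative.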
1 Introduction
New biological technologies produce large numbers of sequences in the form
of ESTs (Expressed Sequence Tags). These sequences have to be thoroughly
annotated to uncover, for example, their function. Currently, the task of
annotating EST sequences does not keep pace with the rate at which they are
generated [1] since:
1. EST sequence annotation is computationally intensive and often returns
no results;
2. EST data suffer from inconsistency problems (error rate, contaminant
sequences, low-complexity regions, etc.);
3. gene identification programs perform inconsistently as they are sensitive
to errors.
Recycling
19. BioIn4Next
Transcriptome annotation for non-model organisms
Sma3s: A Three-Step Modular Annotator for Large Sequence Datasets
Antonio Muñoz-Mérida1, Enrique Viguera2, M. Gonzalo Claros3, Oswaldo Trelles1,4, and Antonio J. Pérez-Pulido5,*
1 Integrated Bioinformatics, National Institute for Bioinformatics, University of Málaga, Campus de Teatinos, Spain
2 Cellular Biology, Genetics and Physiology Department, University of Málaga, Campus de Teatinos, Spain
3 Molecular Biology and Biochemistry Department, University of Málaga, Campus de Teatinos, Spain
4 Computer Architecture Department, University of Málaga, Campus de Teatinos, Spain
5 Centro Andaluz de Biología del Desarrollo (CABD, UPO-CSIC-JA), Facultad de Ciencias Experimentales (Área de Genética), Universidad Pablo de Olavide, Sevilla 41013, Spain
* To whom correspondence should be addressed. Tel. +34 954-348-652. Fax. +34 954-349-376. E-mail: ajperez@upo.es
Edited by Prof. Kenta Nakai
(Received 29 October 2013; accepted 6 January 2014)
Abstract
Automatic sequence annotation is an essential component of modern ‘omics’ studies, which aim to extract
information from large collections of sequence data. Most existing tools use sequence homology to establish
evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a
similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the
correct configuration is critical and can be challenging for non-specialist users. Thus, the development of
robust automatic annotation techniques that generate high-quality annotations without needing expert
knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically
annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s
is composed of three modules that progressively annotate query sequences using either: (i) very similar
homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We
trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity
values of ∼85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide
variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms,
and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s
has already been used in the functional annotation of two published transcriptomes.
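The abstract reports average sensitivity and specificity values of about 85%. For readers unfamiliar with these measures, the sketch below computes both from a confusion matrix using their standard definitions. The counts in the example are invented for illustration and do not come from the Sma3s evaluation; the function names are likewise assumptions.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real annotations that the tool recovered: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of absent annotations correctly left unassigned: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, chosen only to illustrate the arithmetic.
tp, fn, tn, fp = 85, 15, 170, 30
print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
```

Reporting both values matters for annotation tools because a loose similarity threshold raises sensitivity at the cost of specificity, which is the trade-off the abstract describes.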
Key words: functional annotation; genome annotation; transcriptome annotation; bioinformatic tool
1. Introduction
Sequence annotation is the process of associating biological information to sequences of interest. Annotations can include the potential function, cellular localization, biological process or protein structure of a given sequence [1]. Some sequences are annotated using direct experimental evidence, but most annotations are inferred from sequence similarities or conserved patterns associated with known characteristics [2–5]. Large publicly accessible databases of annotated sequences make it possible to automatically annotate large collections of unknown sequences. This is especially valuable for the interpretation of large sequence datasets generated by genome and expressed sequence tag (EST) sequencing projects as well as gene and protein expression experiments, such as DNA microarrays, and many other emerging research areas [6].
Sequence annotation is also important in transcriptomic experiments that aim to identify gene clusters with similar expression patterns that are linked to a particular biological process or experimental condition. Biological function can then be inferred from annotations shared within these clusters [7].
© The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/
.0/), which permits non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
For commercial re-use, please contact journals.permissions@oup.com.
DNA RESEARCH 21, 341–353, (2014) doi:10.1093/dnares/dsu001
Advance Access publication on 5 February 2014
20. BioIn4Next
Transcriptome annotation for non-model organisms
More than one
equivalent tool
21. BioIn4Next
Our bioinformatic contribution to aquaculture
9
Transcriptomes
Solea senegalensis
Solea solea
Tisochrysis lutea
Ruditapes decussatus
Genomes
Solea senegalensis
Photobacterium damselae
subsp. piscicida (x2)
SNPs
Mytilus edulis
Crassostrea angulata
Human
food
Human
food
Aquaculture
feed
Human
food
Aquaculture
diseases
Human
food
Human
food
Tetraselmis chuii
22. BioIn4Next
Bioinformatics tools based on
transcriptomes
10
23. BioIn4Next
NGS read pre-processing for 2 sole transcriptomes
NGS platform            Illumina           Illumina           454
Species                 S. senegalensis    S. solea           S. senegalensis
Total input reads       1,800,249,230      2,101,324,072      5,663,225
  mean length           76                 100                757
Rejected (total)   N    237,941,945        345,251,849        1,562,661
                   %    13.5               17.1               26.8
  by contamination N    144,247,943        226,627,909        156,921
                   %    8.2                11.2               3.0
Useful reads       N    1,561,416,814      1,746,258,741      3,774,412
                   %    86.7               83.1               67.6
  paired reads     N    1,503,882,050      1,676,160,406      -
                   %    83.3               79.5               -
  single reads     N    57,534,764         70,098,335         3,774,412
                   %    3.2                3.3                67.6
  mean length           66                 89                 184
Benzekri et al. BMC Genomics 2014, 15:952
26. BioIn4Next
Soles and zebrafish are highly orthologous
14bution of the level of similarity between both sole reference transcriptomes for those transcripts with (dar
Benzekri et al. BMC Genomics 2014, 15:952
Transcripts having a zebrafish
orthologue are more similar
between soles
Transcripts lacking a zebrafish
orthologue are still significantly
homologous between soles
USES
27. BioIn4Next
There are lineage-specific genes in teleosts
                                                     Likely protein coding   W/O zebrafish orthologue
Orthologs between soles of unknown function          137                     351
Orthologs in other teleosts proteins:
  Gadus morhua                                       7                       155
  Oryzias latipes                                    10                      190
  Oreochromis niloticus                              17                      241
  Tetraodon nigroviridis                             6                       198
  Gasterosteus aculeatus                             17                      235
  In at least one of these species                   27                      290
Orthologs in Cynoglossus semilaevis DNA (flatfish)   99                      287
Orthologs in teleosts but not in flatfish            3                       46
Specific orthologs only in flatfish                  75                      43
Without ortholog                                     35                      18
Benzekri et al. BMC Genomics 2014, 15:952
sole-specific genes
flatfish-specific
genes
USES
29. BioIn4Next
[Probe selection diagram; box labels as extracted]
UNIGENES S. senegalensis v3
Complete: 6,742
N-terminal: 11,268
Internal: 47,259
C-terminal: 14,757
Coding: 22,612
With ORF: 22,612
ncRNA
Non-redundant coding: 21,314
Inconsistent unigenes: 51,218
SELECTED unigenes for microarray
Select the most 3', non-redundant unigene
Select, non-redundant complete unigene
Most 3', non-redundant incomplete unigenes: 34,291
Longest, non-redundant, complete unigenes: 5,545
Select longer and non-redundant unigenes
CD-HIT
Selection of unigenes qualified as coding and with ORF
ORF-Predictor
Full-LengtherNext
21,099
30,119
Development of microarray and qPCR primers
Benzekri et al. BMC Genomics 2014, 15:952
Feature selection
algorithm for
microarray printing
microarray provided repetitive and consistent positive hybridization signals.
Conclusions
De novo transcriptomes of S. solea and S. senegalensis covering their main developmental stages and organs were described based on a combined assembly approach that can be applied to other transcriptomic studies. The huge volume of reads processed in each species (>1,800 millions, the highest number of reads reported to date for any organism) produced a high number of transcripts that were mined to obtain a representative reference transcriptome
Figure 7 Schematic representation of the probe selection strategy for the construction of the Senegalese sole oligonucleotide
microarray. The number of transcripts that resulted after the described filtration is indicated.
Table 4 Validation of microarray data using qPCR
SoleaDB code   Gene                               Gene name  Microarray FC  Microarray p-value  qPCR FC  qPCR p-value
Unigene18736   Angiotensin I converting enzyme 2  ace2       4.5            <0.001              4.9      <0.05
Unigene49603   Angiotensinogen                    agt        3.5            <0.01               4.7      <0.05
Unigene39473   Na-K-Cl cotransporter2             nkcc2      2.5            <0.01               3.13     <0.01
Unigene252320  Transferrin                        tf         15.6           <0.001              10.5     <0.01
Unigene214993  Ferritin                           fth        2.1            <0.01               2.3      <0.05
Unigene39196   Heat shock protein 90-alpha        hsp90aa    2.7            <0.01               2.3      <0.01
Unigene54412   Trypsinogen1a                      try1       17.6           <0.001              12.0     <0.001
Unigene31826   Trypsinogen2                       try2       4.7            <0.001              7.8      <0.05
Unigene53434   Chymotrypsinogen2                  ctr2       7.2            <0.001              6.3      <0.05
Unigene52166   Elastase1                          cela1      8.7            <0.001              7.8      <0.05
Unigene53593   Elastase4                          cela4      7.1            <0.001              4.6      <0.05
Unigene54920   Complement component C3            c3         3.8            <0.05               34.0     <0.05
Unigene53521   Lysozyme g                         lyg        2.5            <0.05               3.6      <0.05
Unigene219622  Thyroid stimulating hormone, beta  tshb       2.5            <0.05               4.6      <0.001
Unigene52404   Transaldolase                      taldo      2.1            <0.05               2.5      <0.05
Fold-changes (FC) and p-values obtained for target genes by microarray and qPCR are indicated. Moreover, the transcript code in the SoleaDB for S. senegalensis v3 transcriptome is also shown. For qPCR, data were normalized to those of gapdh2 and referred to the calibrator group (36 ppt 3 DPH).
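The table note says that qPCR data were normalized to a reference gene (gapdh2) and referred to a calibrator group. The source does not state which formula was used; a common choice for this kind of relative quantification is the 2^-ΔΔCt scheme, sketched below under the assumption of roughly 100% amplification efficiency. The function name and the Ct values are hypothetical and are not taken from the study.

```python
def fold_change(ct_target_sample: float, ct_ref_sample: float,
                ct_target_calibrator: float, ct_ref_calibrator: float) -> float:
    """Relative expression by the 2^-ddCt scheme (assumed, not confirmed by the source)."""
    # Normalize the target gene to the reference gene within each group.
    delta_sample = ct_target_sample - ct_ref_sample
    delta_calibrator = ct_target_calibrator - ct_ref_calibrator
    # Refer the sample to the calibrator group.
    delta_delta = delta_sample - delta_calibrator
    return 2.0 ** (-delta_delta)


# Hypothetical Ct values: the target amplifies two cycles earlier in the
# sample than in the calibrator, relative to the reference gene.
print(fold_change(22.0, 18.0, 24.0, 18.0))
```

Because each cycle corresponds to a doubling under the efficiency assumption, a difference of two cycles translates into a four-fold change.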
Microarray
validation
[Figure: bar charts of relative gene expression (y-axis) by gene (x-axis) at 37 ppt and 10 ppt; panels A, B, D, E, F. Genes shown: C1qlike, c2, c3, c401, c402, c5, c9, factor h; ptgs1a, ptgs2; il1b, il11a, il8b. Asterisks appear above some bars.]
36. BioIn4Next
Browsing S. senegalensis transcriptome v 4.1
19
About the
assembling
Download the
complete transcriptome
Download all
annotations
Download the
full information
for a subset of
transcripts
Download
raw reads
43. BioIn4Next
Browsing by transcript
Benzekri et al. BMC Genomics 2014, 15:952
Filtering options for
deployed transcripts
More specific filtering/searching
Paginated
Included in the
representative transcriptome
46. BioIn4Next
Markers: SNPs and SSRs
Benzekri et al. BMC Genomics 2014, 15:952
Filtering options for
deployed SSRs
48. BioIn4Next
SoleaDB: a huge source of molecular markers
23
representation of GATA repeats (<0.2% total repeat mo-
tifs) confirmed by FISH analysis (Additional file 9). Com-
parison of SSRs Blast-based orthologs in soles (Table 3
[7]. Two species-specific oligo-D
been reported in S. senegalensis an
limited number of unique transcri
number of ESTs available in soles [
was compensated to some extent u
croarrays [49]. The sole transcripto
study have overcome these restrictio
lect sole-specific probes is depicted
5,545 complete non-redundant tran
the 34,291 longest, non-redunda
cripts. Clustering them resulted in
redundant transcripts (Figure 7) tha
13,284 selected “Coding” transcrip
43,303 probes. The final panel of
related to reproduction, cell differ
stress, growth, biosynthetic and cat
port, embryonic development and i
other functions.
The microarray was tested with l
salinities (10 and 36 ppt). Hybrid
tected for 42,469 probes. A total
found differentially expressed (p <
were up-regulated and 175 down-re
pared to 36 ppt. Application of a
(expression ratio) > ±1 filtered 1,48
down-regulated probes. The differe
(DEGs) were involved in osmoregu
porters and the renin-angiotensin
Table 3 SSR summary statistics for whole and reference transcriptomes
Type of SSR                    S. senegalensis   S. solea
Whole transcriptome            266,434           316,388
  Di-nucleotide                107,828           126,260
  Tri-nucleotide               96,076            114,198
  Tetra-nucleotide             39,102            44,118
  Others                       23,428            31,812
Reference transcriptome        49,955            67,610
  Di-nucleotide                16,405            22,371
  Tri-nucleotide               22,394            29,764
  Tetra-nucleotide             6,935             8,829
  Others                       4,221             6,646
Blast-based orthologs          12,418            18,486
  Species-specific SSR (1)     1,273             4,803
  Conserved SSR                11,145            13,683
    Same repeat motif (2)      6,596             6,772
    Different repeat motif     4,549             6,911
Total number of SSRs and frequency according to their repeat motif are indicated.
(1) SSRs present in one species but not in orthologs of the other species.
(2) Exactly the same SSR repeat motif was found in both orthologs; in a few cases, SSR occurs once in one ortholog and twice in the other.
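Table 3 classifies SSRs by the length of their repeat motif (di-, tri- and tetra-nucleotide). To make that classification concrete, the sketch below finds perfect tandem repeats with regular expressions. It is a simplified stand-in for a dedicated repeat finder such as the MREPS tool named earlier in the slides; the function names and the minimum of four repeats are assumptions for this example, not thresholds reported in the source.

```python
import re


def is_primitive(unit: str) -> bool:
    """True if the unit is not itself a repeat of a shorter motif (e.g. 'ACAC')."""
    n = len(unit)
    return not any(n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n))


def find_ssrs(seq: str, min_repeats: int = 4):
    """Return (motif, start, repeat_count) for perfect di-, tri- and tetra-nucleotide SSRs."""
    seq = seq.upper()
    hits = []
    for k in (2, 3, 4):
        # A k-mer followed by at least (min_repeats - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_primitive(unit):  # skip homopolymers and motifs of shorter period
                hits.append((unit, match.start(), len(match.group(0)) // k))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical transcript fragment with one di- and one tri-nucleotide SSR.
fragment = "GG" + "AC" * 6 + "TT" + "AGC" * 5 + "G"
for motif, start, count in find_ssrs(fragment):
    print(motif, start, count)
```

Comparing the motifs found at equivalent positions in two orthologous transcripts would then separate the "same repeat motif" and "different repeat motif" categories listed in the table.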
Benzekri et al. BMC Genomics 2014, 15:952
http://www.biomedcentral.com/1471-2164/15/952
Benzekri et al. BMC Genomics 2014, 15:952
USES
49. BioIn4Next
SoleaDB: a huge source molecular markers
23
representation of GATA repeats (<0.2% total repeat mo-
tifs) confirmed by FISH analysis (Additional file 9). Com-
parison of SSRs Blast-based orthologs in soles (Table 3
[7]. Two species-specific oligo-D
been reported in S. senegalensis an
limited number of unique transcri
number of ESTs available in soles [
was compensated to some extent u
croarrays [49]. The sole transcripto
study have overcome these restrictio
lect sole-specific probes is depicted
5,545 complete non-redundant tran
the 34,291 longest, non-redunda
cripts. Clustering them resulted in
redundant transcripts (Figure 7) tha
13,284 selected “Coding” transcrip
43,303 probes. The final panel of
related to reproduction, cell differ
stress, growth, biosynthetic and cat
port, embryonic development and i
other functions.
The microarray was tested with l
salinities (10 and 36 ppt). Hybrid
tected for 42,469 probes. A total
found differentially expressed (p <
were up-regulated and 175 down-re
pared to 36 ppt. Application of a
(expression ratio) > ±1 filtered 1,48
down-regulated probes. The differe
(DEGs) were involved in osmoregu
porters and the renin-angiotensin
Benzekri et al. BMC Genomics 2014, 15:952
USES
58. BioIn4Next
Overview of KEGG pathways
27
List of S. senegalensis v4.1 enzymes for this pathway
Benzekri et al. BMC Genomics 2014, 15:952
59. BioIn4Next
Overview of KEGG pathways
The complete overview of this pathway
Benzekri et al. BMC Genomics 2014, 15:952
64. BioIn4Next
Study of apolipoprotein A-IV paralogs
30
[...] was then carried out using SEQBOOT (100 replicates) in the PHYLIP package (Felsenstein, 1989) followed by a Phyml reconstruction (100 replicates) (Guindon and Gascuel, 2003). The consensus phylogenetic tree was subsequently obtained (CONSENSE). Trees were drawn using Figtree v1.4.2 (http://tree.bio.ed.ac.uk/software/figtree/). Accession numbers for sequences used in the phylogeny are indicated in Supplementary file 1. The putative signal peptide was identified using SignalP (http://www.cbs.dtu.dk/services/SignalP/).
Genomic sequences were retrieved after blasting sequences onto a de novo genome assembly for a female sole using Oases software with a 51 k-mer (Benzekri et al., unpublished results). To identify intron and exon boundaries, the two genomic scaffolds containing the apoA-IV gene cluster sequences were aligned with apoA-IV cDNA sequences using Seqman software. Also, a blast analysis (blastx) at NCBI was carried out to establish gene synteny and identify other gene coding regions. The two scaffold sequences have been deposited at NCBI/EMBL/DDBJ with accession numbers LC056058 and LC056059. Synteny analysis was carried out using ensembl (v79.01) and Genomicus genome browser (http://www.genomicus.biologie.ens.fr/genomicus-79.01/cgi-
[...] development. For apoA-IVAa1 and apoA-IVAa2, the incubation time ranged between 60 and 105 min (depending on the larval stage), while for apoA-IVBa3 and apoA-IVBa4 a fixed time of 60 min was used in all stages. In all cases, fasted and fed larvae at 3, 5 and 9 dph were always managed in parallel and the same time for color development was given. Twenty animals/sample-treatment/gene were used for each WISH analysis. Digital images were captured using a Leica DFC290 HD digital camera attached to a Leica DMIL LED inverted microscope.
2.4. RNA isolation and RT-qPCR analysis
Homogenization of samples, RNA isolation and cDNA synthesis procedures were carried out as previously described (Armesto et al., 2014, 2015). Real-time analysis was carried out on a CFX96™ Real-Time System (Bio-Rad) using Senegalese sole specific primers for each apoA-IV transcript (Table 1). Real-time reactions were accomplished in a 10-μL volume containing cDNA generated from 10 ng of original RNA template, 300 nM each of specific forward and reverse primers, and 5 μL of SYBR Premix Ex Taq (Takara, Clontech). The amplification protocol [...]
Table 1
EST information and primer sequences for apoA-IV paralogs. The total number of ESTs (N) encoding each paralog found at SoleaDB (v4.1; Benzekri et al., 2014) and the unigene ID (v3 and v4.1) for sequences used for CDS (*), 5′-UTR (†) and 3′-UTR (§) identification are indicated. Primer sequences used for probe amplification (¥) and qPCR (‡) analysis and their corresponding amplicons (bp) are also shown.
Paralog     SoleaDB                            N   Primer name       Primer sequence (5′ → 3′)      Size (bp)
apoA-IVAa1  solea_v3.0_unigene29941*           35  apoa41fc2(‡)      ATGGACCCAGAGGCGCTGAAGACCGTA    90(‡)
            solea_v4.1_unigene546584†§             apoa41rc2(‡)      GGCCTGCAGCTCATCAGTGCTCTTGT
                                                   apoa41_3(¥)       GGACAGGAAGTCAATACCAGGATCGCTCA  669(¥)
                                                   apaa41_4(¥)       TAAACAGGAGGTGGAAAGTTGGCTGGAGT
apoA-IVAa2  solea_v4.1_unigene431170*          14  apoA42F(‡)        CCATGCGCACTCAGGTGGCTCCTC       132(‡)
            solea_v4.1_unigene546431_split_0†      apoA42R(‡)        CCTCGGCATAGGGCTGCAGATTGGT
            solea_v4.1_unigene534078§              apoA42_1(¥)       CGACAGTCTGAGCTGGGAAAGG         667(¥)
                                                   apoA42_2(¥)       GGCGGCAGCAGGAGAAAATAAC
apoA-IVBa3  solea_v3.0_unigene3621*            24  apoa43_1(‡,¥)     GTCCTCGTTGTGCTCGTCCTTGCTGT     87(‡)
            solea_v4.1_unigene14920†§              apoa43_R(‡)       CGTGTCCATCACTGGCTTGGGTGCATC
                                                   apoa43_2(¥)       GCCTGCACCTCCTCGATGTATGGGGAA    719(¥)
apoA-IVBa4  solea_v3.0_unigene34222*           18  SseapoA44_F(‡)    AGCTGAGACACAGAGCCAACCTGGTGA    107(‡)
            solea_v4.1_unigene547274†§             SseapoA44_2(‡,¥)  CATTAGCTGGGCTTGGATGTCCTGGGT
                                                   SseapoA44_1(¥)    ATGCCAACCTTCTCTATGCGGATCCAC    689(¥)
86 J. Roman-Padilla et al. / Comparative Biochemistry and Physiology, Part B 191 (2016) 84–98
Román-Padilla et al. CBP Part B (2016) 191:84-98
Fig. 4. Phylogenetic relationships among the predicted sequences of Senegalese sole apoA-IV paralogs and the corresponding deduced amino acid sequences from other vertebrates (see
Supplementary file 1) using the Maximum Likelihood method. The apolipoprotein type and taxonomic group (fish or tetrapod) are indicated on the right. Moreover, the clusters A and B as
well as the four subclades (a1–a4) in Acanthopterygii are shown. The apoE sequences were used as outgroup to root tree. Only bootstrap values higher than 50% are indicated on each
branch. The scale for branch length (0.4 substitutions/site) is shown below the tree. Species abbreviations: Sse, Solea senegalensis; Cse, Cynoglossus semilaevis; Gac, Gasterosteus aculeatus;
Tru, Takifugu rubripes; Ame, Astyanax mexicanus; Dre, Danio rerio; Xtr, Xenopus tropicalis; Hsa, Homo sapiens; Rno, Rattus norvegicus; Mmu, Mus musculus; and Gga, Gallus gallus.
and Acanthopterygii. In the former, two or three species-specific
paralogs can be found within each cluster depending on the species al-
that expression of apoA-IV in YSL could be involved in the efficient mo-
bilization of TAG-rich molecules (throughout the formation of VLDL
Fig. 14. Transcript abundance of apoA-IV paralogs in different tissues of Senegalese sole juveniles. Data are represented in logarithmic scale. Expression values were normalized to those of
18S rRNA. Data were expressed as the mean fold change (mean + SEM, n = 3) from the calibrator group (kidney). Different letters denote tissues that are significantly different from liver
(P < 0.05).
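The caption describes expression values normalized to 18S rRNA and expressed as fold change relative to a calibrator tissue (kidney). One common way to obtain such values is the 2^-ΔΔCt calculation; the excerpt does not state which calculation was used, so the Python sketch below is an assumption, and the function name and the Ct values in the usage lines are hypothetical.

```python
def fold_change(ct_target, ct_ref, ct_target_cal, ct_ref_cal):
    """Relative expression by the 2^-ddCt method (assumes ~100% PCR efficiency).

    ct_target / ct_ref: Ct of the gene of interest and of the reference gene
    (e.g. 18S rRNA) in the tissue being measured.
    ct_target_cal / ct_ref_cal: the same two Ct values in the calibrator tissue.
    """
    d_ct_sample = ct_target - ct_ref                # normalize the sample
    d_ct_calibrator = ct_target_cal - ct_ref_cal    # normalize the calibrator
    dd_ct = d_ct_sample - d_ct_calibrator
    return 2.0 ** (-dd_ct)


# Hypothetical Ct values, invented for illustration only.
liver_vs_kidney = fold_change(ct_target=22.0, ct_ref=10.0,
                              ct_target_cal=25.0, ct_ref_cal=10.0)
print(f"Fold change relative to calibrator: {liver_vs_kidney:.1f}")
```

By construction the calibrator itself has a fold change of 1, which is why a logarithmic axis is convenient for plotting tissues that differ by orders of magnitude.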
USES
69. BioIn4Next
Retrieving SoleaDB by sequence homology
33
Benzekri et al. BMC Genomics 2014, 15:952
Paste your sequence
Or upload your file of sequences
70. BioIn4Next
Retrieving SoleaDB by sequence homology
Select your preferred assemblies
71. BioIn4Next
Retrieving SoleaDB by sequence homology
Select your E-value filter
Select your preferred assemblies
74. BioIn4Next
Soles retained the crystallin genes
35
Benzekri et al. BMC Genomics 2014, 15:952
Figure 6 Phylogenetic tree of Crybb and Crybb-like proteins in vertebrates. A neighbor-joining tree based on the alignment of vertebrate
Crybb and Crybb-like sequences was built. Species are indicated as Sse (Solea senegalensis), Sso (Solea solea), Dre (Danio rerio), Tni (Tetraodon nigroviridis),
Oni (Oreochromis niloticus), Ola (Oryzias latipes), Cse (Cynoglossus semilaevis), Xla (Xenopus laevis) and Gga (Gallus gallus; see Additional file 7 for accession
numbers). Solea sequences are indicated according to the transcript name assigned in SoleaDB. Clusters are indicated as arcs of a circle. The tree
obtained was rooted using Xenopus laevis Cryga. Numbers adjacent to nodes indicate percentage bootstrap support; only values larger than 70%
Fish-specific crystallin?
Fish-specific crystallin?
Absent in flatfish
USES
79. BioIn4Next
Most Ruditapes genes seem to be identified
38
H. Benzekri (2016)
1 Illumina library: 127 × 10^6 reads, 2 × 75 nt
USES
80. BioIn4Next
Most Ruditapes genes seem to be identified
Too many small transcripts
USES
81. BioIn4Next
Most Ruditapes genes seem to be identified
Unique orthologues: 12 764 (32%)
Ruditapes philippinarum: 9 747 genes
USES
82. BioIn4Next
Bioinformatics tools based on genomes
39
83. BioIn4Next
Two Photobacterium damselae subsp. piscicida
40
RESULTS AND DISCUSSION
Table IV.25: Summary of the pre-processing of the original reads of L091106-03H and DI21
(#1 to #4 refer to figures IV.43 and IV.44)
Strains                                  L091106-03H        DI21
Total reads (#1)
  Paired                                 148 622            433 717
  Single                                 297 269            187 433
Mean length
  Paired                                 509                445
  Single                                 1 195              550
Rejected reads (#2)
  Paired                                 48 403 (32.6%)     238 804 (55%)
  Single                                 49 530 (16.7%)     53 251 (28.4%)
Contamination
  Paired                                 21 556 (14.5%)     62 761 (14.5%)
  Single                                 46 766 (15.7%)     37 791 (20.1%)
Total useful reads (#3)                  382 755            396 450
Paired reads (#4)                        69 318 (23.3%)     132 550 (15.3%)
Single reads                             313 437            263 900
  From the paired library (unpaired)     65 553 (44.1%)     129 264 (29.8%)
  From the single library                247 884 (83.4%)    134 636 (71.8%)
IV.3.1.2. Assembly
The first genome we assembled was that of L091106-03H, since it was the first for which we received sequencing data. Knowing the genome size of DI21 (4.77 Mb) allowed us to estimate the read coverage for L091106-03H, which turned out to be 14x, a very low coverage according to the parameters we had previously calculated for a correct assembly (section XXXX). To assemble these reads, and based on the results of the tests on Roche/454 genomic reads (section IV.1.2.2), the CABOG program [55] was selected, since it is the most accurate when coverage is low. The assembly strategy used is illustrated in figure IV.43, where CABOG generated 510 contigs and 25 scaffolds, which formed version 1 of the draft genome of L091106-03H.
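The coverage estimate mentioned above follows from simple arithmetic: total sequenced bases divided by the expected genome size. The Python sketch below shows that calculation under stated assumptions. The function name estimate_coverage is hypothetical, and the mean read length after pre-processing is not given in this excerpt, so the 175 nt used in the usage lines is a placeholder rather than a value from the source.

```python
def estimate_coverage(n_reads, mean_read_length, genome_size):
    """Expected sequencing depth: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * mean_read_length / genome_size


# The DI21 genome size (4.77 Mb) and the number of useful L091106-03H reads
# (382 755) come from the text and Table IV.25. The mean read length is NOT
# given there: 175 nt is a placeholder used only to make the arithmetic concrete.
coverage = estimate_coverage(n_reads=382_755, mean_read_length=175,
                             genome_size=4_770_000)
print(f"Estimated coverage: {coverage:.1f}x")
```

Using the genome size of a related strain as the denominator, as done here with DI21, only gives an approximation, since the true genome size of the newly sequenced strain is unknown before assembly.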
[...] 80 bp and a G+C content of 40.6%. In both drafts, the longest scaffold exceeds half of the genome (>2 Mb), so the N50 equals the length of this scaffold. Therefore, the assembly can be said to be equivalent for both strains. Since the DI21 draft genome at NCBI (GCA_000300355.3) contained 56 scaffolds with 846 993 undetermined bases (N), we can say that, as the new draft has only 19 scaffolds and fewer N (561 264), the assembly of this strain has been improved.
Table IV.27: Characteristics of the final draft genomes of L091106-03H and DI21
Strains                          L091106-03H (v2)   DI21
Number of scaffolds > 500 bp     14                 17
Longest scaffold                 2 323 982          2 798 534
Shortest scaffold                1 007              437
Sum of lengths                   4 194 408          4 316 437
Number of N                      341 126            561 264
Mean length                      299 600            227 180
N50                              2 323 982          2 798 534
N90                              157 598            152 634
G+C content                      40%                40.6%
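The N50 and N90 rows can be understood with a short Python sketch. It uses the usual definition (the length L such that scaffolds of length at least L add up to the given fraction of the assembly); the function name nx and the toy scaffold lengths are hypothetical and are not the scaffold lengths of either strain. The toy data reproduce the situation described in the text, where one scaffold longer than half of the assembly fixes the N50.

```python
def nx(lengths, fraction=0.5):
    """Length L such that scaffolds of length >= L cover `fraction` of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("lengths must be a non-empty list of positive integers")


# Hypothetical scaffold lengths: the longest one alone exceeds half of the sum.
toy_scaffolds = [2_300_000, 900_000, 500_000, 300_000, 150_000, 1_000]
n50 = nx(toy_scaffolds, 0.5)
n90 = nx(toy_scaffolds, 0.9)
print("N50:", n50, "N90:", n90)
```

When the N50 coincides with the longest scaffold, as in Table IV.27, it says little about the rest of the assembly, so the N90 and the number of scaffolds are the more informative figures.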
1.4. Annotation of the two draft genomes
The draft genomes of L091106-03H and DI21 were annotated with the automatic annotation program RAST (Rapid Annotation using Subsystem Technology) [125]
Figure IV.47: Similarity between L091106-03H v2 and other bacteria based on the percentage identity of the alignment between proteins
Figure IV.48: Dot-plot representation of the nucleotide alignments between L091106-03H (v2) and DI2
Figure IV.50: Visualization of the synteny between the draft genomes of L091106-03H v2 and DI21 based on the correspondence obtained with the proteins of Photobacterium damselae (minimum identity of 97% in both species). Matches were plotted with Circos [190]
To check the collinearity between the genes of the two strains, the arrangement of the CDSs in the genomes of L091106-03H v2 and DI21 was inspected using SEED Viewer [191], which is integrated with the RAST annotation program (http://rast.nmpdr.org). Figure IV.51 shows two examples of the arrangement of two groups of orthologous genes in the genomes of the strains. In the first example (figure IV.51-A), the order of the orthologous genes is well conserved between the two strains. In the second example (figure IV.51-B), the group of orthologous genes (6, 22, 31, 30, 29, 32 and 35) is located amid other genes that differ, which indicates that this group lies in two distinct regions of scaffold 5 of L091106-03H v2 and scaffold 11 of DI21; moreover, the order of these orthologous genes is not conserved, since gene 30 has a different relative position in the two genomes. This pattern was very rare, but it confirms the hypothesis that some rearrangements occurred in the genomes during the evolution of the two strains. In contrast, the first pattern, in which the order of the orthologous genes is conserved, was by far the most common, indicating that in general the genomes of the two strains are collinear.
H. Benzekri (2016)
USES
84. BioIn4Next
Two Photobacterium damselae subsp. piscicida
40
N50 is provided by the longest contig
H. Benzekri (2016)
USES
85. BioIn4Next
Two Photobacterium damselae subsp. piscicida
40
Both pathogenic strains are highly syntenic
H. Benzekri (2016)
USES
95. BioIn4Next
Cynoglossus semilaevis and soles are highly syntenic
44
[Figure: two synteny plots for C. semilaevis and soles, with chromosome labels Chr4, Chr6, Chr8, Chr10, Chr11, Chr12, Chr13, Chr14 and Chr15. Panel labels: "Based on protein identity > 70%" and "Based on transcript identity".]
RESULTS AND DISCUSSION
[...] some may contain genome regions or genes specific to Senegalese sole that are absent from (or very different in) Cynoglossus semilaevis.
Figure IV.58: Example of an alignment between scaffold 1145 of S. senegalensis and chromosome 1 of C. semilaevis. The regions shown are approximately 150 kb in size. Note that the aligned fragments [...]
H. Benzekri (2016); Manchado et al. (2016), in press
USES
96. BioIn4Next
One step beyond: from scaffolds to chromosomes
45
Libraries: Long reads (female), AQUAGENET1 (female), AQUAGENET3 (female)
Reads: 8.7 × 10^8, 11.1 × 10^8 and 8.3 × 10^6
RAY assemblies: 213 548 and 278 995 scaffolds
NUCMER - GAM-NGS - SSPACE - CAPcloser
Breaking into artificial reads
Final scaffolds: 34 176
Longest: 638 263 nt
Mean length: 14 565 nt
N50: 85 596 nt
Total Length: 600 Mbp
H. Benzekri (2016)
97. BioIn4Next
One step beyond: from scaffolds to chromosomes
ICMapper
Super-scaffolds
C. semilaevis
Chromosomes
22
H. Benzekri (2016)
98. BioIn4Next
One step beyond: from scaffolds to chromosomes
8 538 scaffolds
Longest: 638 263 nt
Mean length: 54 673 nt
N50: 105 233 nt
Total Length: 466.7 Mbp
H. Benzekri (2016)
99. BioIn4Next
S. senegalensis superscaffolds validated by molecular markers
46
Already established linkage groups
113/129 SSR validated
H. Benzekri (2016); Manchado et al. (2016), in press
USES
101. BioIn4Next
S. senegalensis superscaffolds validated by molecular markers
46
New markers
88/113 validated
H. Benzekri (2016); Manchado et al. (2016), in press
USES
102. BioIn4Next
S. senegalensis superscaffolds validated by molecular markers
46
Females lack Chr W
→ XY system?
H. Benzekri (2016); Manchado et al. (2016), in press
USES
103. BioIn4Next
Gene structure and synteny of apolipoproteins A-IV
47
USES
Román-Padilla et al. CBP Part B (2016) 191:84-98
[...] block followed by a long domain containing 9 putative tandem repeats flanked by the unrelated coding regions (UCR) 1 and 2 (Fig. 2). The common block was located in exon 3 (except for apoA-IVAa1, in exon 2) and could be divided into the A, B and C segments. Seven out of the 9 putative tandem repeats were 22-mer in length and contained [...]
[...]ters according to the genomic clusters A and B, as described above. In Ostariophysi, the apoA-IV duplicates within each cluster appeared closely related to each other in the same branch, indicating a high similarity between intraspecific paralogs. In contrast, the apoA-IV duplicates within each cluster in Acanthopterygii could be split into two clearly separated subclades (referred to as a1 and a2 for cluster A and a3 and a4 for cluster B). According to this phylogenetic tree, we named each paralog adding the genomic cluster and the Acanthopterygii subclade they belonged to. Nevertheless, it should be noted that not all Acanthopterygii species bear the four paralog types. G. aculeatus lacked the apoA-IVAa1 and had two apoA-IVAa2-like paralogs (referred to as 1 [...]
Fig. 1. Gene structure of the four apoA-IV paralogs in Senegalese sole. The wide bars represent the exons, and thin lines the introns. The wide bars in red represent the 5′ and 3′ untranslated regions whereas the ORF is shown in blue, distinguishing signal peptides (dark blue) from the mature peptide (light blue). The size of exons and introns is also indicated. Only the length of the exons is drawn to scale. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 2. Deduced amino acid sequences of the four apoA-IV paralogs in Senegalese sole. Dots indicate amino acids identical to those of apoA-IVAa1. Blue arrows indicate the position of introns. The signal peptide cleavage site is marked by a vertical bar. The unrelated coding regions 1 and 2 (UCR1 and UCR2, respectively) as well as the three repeats (A, B, C) of the common 33-codon block are indicated and their residues are numbered. The 22-mer repeats are boxed and the P(Y/H)A motifs are shaded. The conserved proline residues 117, 129 and 183 are indicated by an asterisk. The cleavage sites for matrix metalloproteinase 7 are denoted by $. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 3. Physical synteny of apoA-IV paralogs. Cluster A. Synteny for apoA-IVAa1 and apoA-IVAa2 paralogs. Cluster B, synteny for apoA-IVBa3 and apoA-IVBa4 paral
the chromosome or scaffold location are indicated on the right. Each gene is represented by a color within each cluster. The coding direction is indicated by the p
indicate non-syntenic genes. “*” in T. rubripes denotes a gene identified by sequence analysis, not available in Genomicus platform “**” indicates an Apo
(ENSDARG00000095050). Gene names: apoC-I, apolipoprotein C-I; apoC-II, apolipoprotein C-II; apo14, apolipoprotein 14 kDa; apoEa and apoEb, apolipoprote
(Asp-Glu-Ala-Asp) box polypeptide 6; lipea, lipase, hormone-sensitive a; mep1b, meprin A, beta; msto1, misato 1, mitochondrial distribution and morphology
nine-rich splicing factor 4; and tomm40, translocase of outer mitochondrial membrane 40 homolog.
104. BioIn4Next
Genosole: a database for S. senegalensis genome draft
48
http://www.scbi.uma.es/GenoSole/
P. Seoane-Zonjic (2016)
COMING SOON