BioIn4Next
Bioinformatic platforms for the study of
marine organisms
M. Gonzalo Claros
Dpto Biología Molecular y Bioquímica
Plataforma Andaluza de Bioinformática
Universidad de Málaga
P.to S.ta M.ª 20-24/6/16
“Microalgae production technologies and applications to marine fish aquaculture”
El Puerto de Santa María, 20-24 June
IFAPA Centro El Toruño
1st Seminar of the Algae4A-B project
http://about.me/mgclaros/
@MGClaros claros@uma.es
Aquaculture is becoming a key source of food
All of them are non-model organisms
Non-model organisms: our expertise
http://www.scbi.uma.es/sustainpinedb/
http://www.juntadeandalucia.es/agriculturaypesca/ifapa/soleadb_ifapa/
http://reprolive.eez.csic.es/
http://www.scbi.uma.es/pgc/
http://mejgenvegetal.uco.es/fgb2/gbrowse/Ca/
CicerDB
Combinatory strategy
No single tool is the best
The best result is obtained by combining at least two different tools for the same analysis
Picasso: SuperComputing & BioInformatics @ UMA
Hardware diagram (labels as extracted from the slide):
Node types: 7 FAT nodes; computing nodes; THIN nodes; GPU nodes
Storage: hard disks; more disks
Resource blocks:
768 cores, 3 TB RAM, 8 GB/core
80 cores, 2 TB RAM, >25 GB/core
32 GPU, 1 TB RAM, 8 GB/core
984 cores, 4 TB RAM, 4 GB/core
Picasso: 2310 cores, 700 TB disk
Our bioinformatic algorithms for non-model organisms
Assembly workflow diagram (labels as extracted from the slide):
Raw short reads: SeqTrimNext (pre-processing); Oases (pre-assembling), kmer 23 & 47, paired-end + single; CD-HIT 99%; mis-assembly rejection; #2 Rejected
Raw long reads: SeqTrimNext (pre-processing); MIRA (pre-assembling); EULER-SR (pre-assembling); CAP3 (reconciliation)
Contig labels: unmapped contigs; mapped contigs; contigs; debris; non-coding; coding; misassemblies
BOWTIE 2 (mapping test); #2 Rejected; Full-LengtherNext
Better transcriptome
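One way to keep a multi-tool workflow like this explicit is to write it down as a dependency graph and let a scheduler derive a valid execution order. The Python sketch below is only illustrative: the step names echo the slide labels, but the edges are an assumed reading of the flattened diagram, and the code neither reproduces AutoFlow nor runs any of the tools.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each step lists the steps it waits for.
# The edges are an assumed reading of the slide, not a confirmed workflow.
WORKFLOW = {
    "seqtrimnext_short": {"raw_short_reads"},
    "oases_k23": {"seqtrimnext_short"},
    "oases_k47": {"seqtrimnext_short"},
    "cd_hit_99": {"oases_k23", "oases_k47"},
    "seqtrimnext_long": {"raw_long_reads"},
    "mira": {"seqtrimnext_long"},
    "euler_sr": {"seqtrimnext_long"},
    "cap3_reconciliation": {"cd_hit_99", "mira", "euler_sr"},
    "bowtie2_mapping_test": {"cap3_reconciliation"},
    "full_lengthernext": {"bowtie2_mapping_test"},
}


def execution_order(workflow):
    """Return one valid order in which the steps could be run."""
    return list(TopologicalSorter(workflow).static_order())


def ready_after(workflow, done):
    """Return the steps not yet run whose prerequisites are all in `done`."""
    finished = set(done)
    return sorted(
        step for step, deps in workflow.items()
        if step not in finished and deps <= finished
    )


if __name__ == "__main__":
    for step in execution_order(WORKFLOW):
        print(step)
```

Declaring prerequisites rather than a fixed script makes it easy to swap one assembler for an equivalent tool without rewriting the rest of the chain.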
AutoFlow, a Versatile Workflow Engine Illustrated by Assembling an
Optimised de novo Transcriptome for a Non-Model Species, such as Faba
Bean (Vicia faba)
Running title: AutoFlow, a versatile workflow engine
Pedro Seoane1, Sara Ocaña2, Rosario Carmona3, Rocío Bautista3, Eva Madrid4, Ana M. Torres2, M. Gonzalo Claros1,3,*
1 Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, E-29071, Malaga, Spain
2 Área de Mejora y Biotecnología, IFAPA Centro “Alameda del Obispo”, Apdo 3092, E-14080 Cordoba, Spain
3 Plataforma Andaluza de Bioinformática, Universidad de Málaga, E-29071 Malaga, Spain
4 Institute for Sustainable Agriculture, CSIC, Apdo 4084, E-14080 Cordoba, Spain
* Corresponding author
Manuel Gonzalo Claros Díaz
Departamento de Biología Molecular y Bioquímica,
Facultad de Ciencias, Universidad de Málaga,
E-29071, Malaga (Spain)
Fax: +34 95 213 20 41
Tel: +34 95 213 72 84
E-mail: claros@uma.es
SOFTWARE Open Access
SeqTrim: a high-throughput pipeline for
pre-processing any type of sequence read
Juan Falgueras1, Antonio J Lara2, Noé Fernández-Pozo3, Francisco R Cantón3, Guillermo Pérez-Trabado2,4, M Gonzalo Claros2,3*
Abstract
Background: High-throughput automated sequencing has enabled an exponential growth rate of sequencing
data. This requires increasing sequence quality and reliability in order to avoid database contamination with
artefactual sequences. The arrival of pyrosequencing enhances this problem and necessitates customisable pre-
processing algorithms.
Results: SeqTrim has been implemented both as a Web and as a standalone command line application. Already-
published and newly-designed algorithms have been included to identify sequence inserts, to remove low quality,
vector, adaptor, low complexity and contaminant sequences, and to detect chimeric reads. The availability of
several input and output formats allows its inclusion in sequence processing workflows. Due to its specific
algorithms, SeqTrim outperforms other pre-processors implemented as Web services or standalone applications. It
performs equally well with sequences from EST libraries, SSH libraries, genomic DNA libraries and pyrosequencing
reads and does not lead to over-trimming.
Conclusions: SeqTrim is an efficient pipeline designed for pre-processing of any type of sequence read, including
next-generation sequencing. It is easily configurable and provides a friendly interface that allows users to know
what happened with sequences at every pre-processing stage, and to verify pre-processing of an individual
sequence if desired. The recommended pipeline reveals more information about each sequence than previously
described pre-processors and can discard more sequencing or experimental artefacts.
Background
Sequencing projects and Expressed Sequence Tags (ESTs) are essential for gene discovery, mapping, functional genomics and for future efforts in genome annotations, which include identification of novel genes, gene location, polymorphisms and even intron-exon boundaries. The availability of high-throughput automated sequencing has enabled an exponential growth rate of sequence data, although not always with the desired quality. This exponential growth is enhanced by the so-called “next-generation sequencing”, and efforts have to be made in order to increase the quality and reliability of sequences incorporated into databases: up to 0.4% of sequences in nucleotide databases contain contaminant sequences [1,2]. The situation is even worse in the EST databases, where vector contamination rates reach 1.63% of sequences [3]. Hence, improved and user-friendly bioinformatic tools are required to produce more reliable high-throughput pre-processing methods.
Pre-processing includes filtering of low-quality sequences, identification of specific features (such as poly-A or poly-T tails, terminal transferase tails, and adaptors), removal of contaminant sequences (from vector to any other artefacts) and trimming the undesired segments. There are some bioinformatic tools that can accomplish individual pre-processing aspects (e.g. TrimSeq, TrimEST, VectorStrip, VecScreen, ESTPrep [4], crossmatch, Figaro [5]), and other programs that cope with the complete pre-processing pipeline such as PreGap4 [6] or the broadly used tools Lucy [7,8] and SeqClean [9]. Most of these require installation, are difficult to configure, environment-specific, or focused on specific needs (like a design only for ESTs), or require a change in implementation and design of either the program or the protocols within the laboratory itself.
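To make the pre-processing steps listed above more concrete, here is a minimal Python sketch of two of them: trimming a low-quality 3' end and removing a terminal poly-A tail. It is not SeqTrim's algorithm. The function names, the quality threshold of 20, the minimum tail length of 10 and the minimum insert length of 5 are all assumptions chosen only for illustration.

```python
import re


def trim_low_quality_3prime(seq, quals, min_q=20):
    """Cut trailing bases whose Phred quality is below min_q (assumed threshold)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def strip_poly_a(seq, quals, min_len=10):
    """Remove a terminal poly-A tail of at least min_len bases (assumed length)."""
    # Match a run of A's that reaches the very end of the read.
    match = re.search(r"A{%d,}$" % min_len, seq)
    if match is None:
        return seq, quals
    return seq[:match.start()], quals[:match.start()]


def preprocess(seq, quals, min_q=20, min_tail=10, min_insert=5):
    """Apply both steps; return None when too little sequence is left."""
    seq, quals = trim_low_quality_3prime(seq, quals, min_q)
    seq, quals = strip_poly_a(seq, quals, min_tail)
    if len(seq) < min_insert:
        return None
    return seq, quals
```

Quality trimming runs first in this sketch so that a poly-A tail hidden behind a few low-quality bases is still found at the end of the read. A real pre-processor would also handle vectors, adaptors and contaminants.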
* Correspondence: claros@uma.es
2 Plataforma Andaluza de Bioinformática, Universidad de Málaga, 29071 Málaga, Spain
Falgueras et al. BMC Bioinformatics 2010, 11:38
http://www.biomedcentral.com/1471-2105/11/38
© 2010 Falgueras et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
A Web Tool to Discover Full-Length Sequences: Full-Lengther
Antonio J Lara1, Guillermo Pérez-Trabado2, David P Villalobos1, Sara Díaz-Moreno1, Francisco R Cantón1, and M Gonzalo Claros3
1 Biología Molecular y Bioquímica, Universidad de Málaga, Campus Universitario de Teatinos, E-29071 Málaga, Spain
2 Arquitectura de Computadores, E.T.S.I. Informática, Campus de Teatinos, E-29071 Málaga, Spain
3 Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, 29071 Málaga (Spain)
Tel: +34 95 213 72 84
Fax: +34 95 213 20 41
E-mail: claros@uma.es
Summary. Many Expressed Sequence Tag (EST) sequencing projects produce thousands of sequences that must be cleaned and annotated. Here we present Full-Lengther, an algorithm that can find full-length cDNA sequences in EST data. To accomplish this task, Full-Lengther relies on a BLAST report obtained against a protein database such as UniProt. The BLAST alignments guide the location of protein-coding regions, mainly the start codon. Full-Lengther also contains an ORF prediction algorithm for those cases that show no alignment in the BLAST output. The algorithm is implemented as a web tool to simplify its use and portability, and it is accessible worldwide via http://castanea.ac.uma.es/genuma/full-lengther/.
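For sequences without a BLAST alignment, the summary mentions an ORF prediction step. The Python sketch below illustrates the general idea: it scans the three forward reading frames for the longest stretch between an ATG and an in-frame stop codon. This is a generic textbook approach, not Full-Lengther's actual algorithm, and the function name and returned tuple are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_forward_orf(seq):
    """Return (start, end, frame) of the longest ATG...stop ORF, or None.

    Coordinates are 0-based and end-exclusive, and they include the stop
    codon. Only the three forward frames are scanned in this sketch.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None  # position of the first ATG of the currently open ORF
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                candidate = (start, pos + 3, frame)
                if best is None or candidate[1] - candidate[0] > best[1] - best[0]:
                    best = candidate
                start = None
    return best
```

An ORF that is still open when the sequence ends is not reported here. In a full-length test, such a case would be a hint that the 3' end of the transcript may be missing.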
More than one equivalent tool
Choosing the best assembly in non-model organisms
Weighted PCA analysis
Transcriptome annotation for non-model organisms
Annotation workflow diagram (labels as extracted from the slide):
Better transcriptome; Full-LengtherNext (including user database); artefacts & chimeras; useful transcripts
Tools: Sma3s; MREPS; AutoFact; Full-LengtherNext (including TAIR & RefSeq)
Annotations: transcript DESCRIPTION; transcript MODEL ORTHOLOGUE; transcript SSRs; DESCRIPTION, GO, EC, KEGG pathway, InterPro; transcript ORF, STATUS & REFERENCE TRANSCRIPTOME OPT
ANNOTATED transcriptome ready to import into a database
Full-LengtherNext: A tool for characterisation and testing de novo transcriptome assemblies of non-model organisms
Pedro Seoane1, Noé Fernández-Pozo1,2, Darío Guerrero-Fernández3, Rocío Bautista3 and M. Gonzalo Claros1,3,*
A Web Tool to Discover Full-Length
Sequences: Full-Lengther
1 Introduction
New biological technology produces a large number of sequences in the form of ESTs (Expressed Sequence Tags). These sequences have to be thoroughly annotated to uncover, for example, their function. Currently, the task of annotating EST sequences does not keep pace with the rate at which they are generated [1] since:
1. EST sequence annotation is computationally intensive and often returns no results;
2. EST data suffer from inconsistency problems (error rate, contaminant sequences, low-complexity regions, etc.);
3. gene identification programs perform inconsistently as they are sensitive to errors.
Recycling
Sma3s: A Three-Step Modular Annotator for Large Sequence Datasets
Antonio Muñoz-Mérida1, Enrique Viguera2, M. Gonzalo Claros3, Oswaldo Trelles1,4, and Antonio J. Pérez-Pulido5,*
1 Integrated Bioinformatics, National Institute for Bioinformatics, University of Málaga, Campus de Teatinos, Spain
2 Cellular Biology, Genetics and Physiology Department, University of Málaga, Campus de Teatinos, Spain
3 Molecular Biology and Biochemistry Department, University of Málaga, Campus de Teatinos, Spain
4 Computer Architecture Department, University of Málaga, Campus de Teatinos, Spain
5 Centro Andaluz de Biología del Desarrollo (CABD, UPO-CSIC-JA), Facultad de Ciencias Experimentales (Área de Genética), Universidad Pablo de Olavide, Sevilla 41013, Spain
*To whom correspondence should be addressed. Tel. +34 954-348-652. Fax. +34 954-349-376.
E-mail: ajperez@upo.es
Edited by Prof. Kenta Nakai
(Received 29 October 2013; accepted 6 January 2014)
Abstract
Automatic sequence annotation is an essential component of modern ‘omics’ studies, which aim to extract
information from large collections of sequence data. Most existing tools use sequence homology to establish
evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a
similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the
correct configuration is critical and can be challenging for non-specialist users. Thus, the development of
robust automatic annotation techniques that generate high-quality annotations without needing expert
knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically
annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s
is composed of three modules that progressively annotate query sequences using either: (i) very similar
homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We
trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity
values of ∼85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide
variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms,
and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s
has already been used in the functional annotation of two published transcriptomes.
Key words: functional annotation; genome annotation; transcriptome annotation; bioinformatic tool
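The progressive, three-module design described in the abstract can be pictured as a cascade in which each module only sees the sequences that earlier modules left unannotated. The Python sketch below illustrates that control flow only. The module functions and lookup tables are hypothetical stand-ins, and nothing here reflects how Sma3s actually scores homologues, orthologues or enriched terms.

```python
def annotate_progressively(queries, modules):
    """Run annotation modules in order; later modules see only leftovers.

    `modules` is a list of (name, function) pairs. Each function takes a
    query identifier and returns an annotation string or None.
    """
    annotations = {}
    pending = list(queries)
    for name, module in modules:
        still_pending = []
        for query in pending:
            result = module(query)
            if result is None:
                still_pending.append(query)
            else:
                annotations[query] = (name, result)
        pending = still_pending
    return annotations, pending


# Hypothetical lookup tables standing in for real similarity searches.
VERY_SIMILAR = {"seq1": "kinase"}
ORTHOLOGUES = {"seq1": "kinase-like", "seq2": "transporter"}
ENRICHED_TERMS = {"seq3": "membrane"}

MODULES = [
    ("very_similar_homologues", VERY_SIMILAR.get),
    ("orthologues", ORTHOLOGUES.get),
    ("enriched_terms", ENRICHED_TERMS.get),
]
```

Ordering the modules from most to least stringent means that a sequence keeps the annotation from the strictest module able to annotate it, while looser modules are only consulted for the remainder.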
1. Introduction
Sequence annotation is the process of associating biological information to sequences of interest. Annotations can include the potential function, cellular localization, biological process or protein structure of a given sequence [1]. Some sequences are annotated using direct experimental evidence, but most annotations are inferred from sequence similarities or conserved patterns associated with known characteristics [2–5]. Large publicly accessible databases of annotated sequences make it possible to automatically annotate large collections of unknown sequences. This is especially valuable for the interpretation of large sequence datasets generated by genome and expressed sequence tag (EST) sequencing projects as well as gene and protein expression experiments, such as DNA microarrays, and many other emerging research areas [6].
Sequence annotation is also important in transcriptomic experiments that aim to identify gene clusters with similar expression patterns that are linked to a particular biological process or experimental condition. Biological function can then be inferred from annotations shared within these clusters [7].
© The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/
.0/), which permits non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
For commercial re-use, please contact journals.permissions@oup.com.
DNA RESEARCH 21, 341–353, (2014) doi:10.1093/dnares/dsu001
Advance Access publication on 5 February 2014
A Web Tool to Discover Full-Length Sequences: Full-Lengther
Antonio J Lara (1), Guillermo Pérez-Trabado (2), David P Villalobos (1), Sara Díaz-Moreno (1), Francisco R Cantón (1), and M Gonzalo Claros (3)
(1) Biología Molecular y Bioquímica, Universidad de Málaga, Campus Universitario de Teatinos, E-29071 Málaga, Spain
(2) Arquitectura de Computadores, E.T.S.I. Informática, Campus de Teatinos, E-29071 Málaga, Spain
(3) Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, 29071 Málaga (Spain)
Tel: +34 95 213 72 84
Fax: +34 95 213 20 41
E-mail: claros@uma.es
Summary. Many Expressed Sequence Tag (EST) sequencing projects produce thousands of sequences that must be cleaned and annotated. Here we present Full-Lengther, an algorithm that can identify full-length cDNA sequences in EST data. To accomplish this task, Full-Lengther relies on a BLAST report obtained against a protein database such as UniProt. The BLAST alignments guide the location of protein-coding regions, mainly the start codon. Full-Lengther also contains an ORF prediction algorithm for those sequences that show no alignment in the BLAST output. The algorithm is implemented as a web tool to simplify its use and portability. It is accessible worldwide via http://castanea.ac.uma.es/genuma/full-lengther/.
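The ORF-prediction fallback mentioned in the summary can be illustrated with a minimal sketch. It is not Full-Lengther's own code: the function names are hypothetical, and the rule used here (longest ATG-to-stop stretch over the six reading frames) is only an assumption about what a simple ORF predictor might do.

```python
# Illustrative only: hypothetical helpers, not taken from Full-Lengther.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def longest_orf(seq):
    """Return (strand, start, end, orf) for the longest ATG...stop ORF
    found in the six reading frames, or None when no complete ORF exists.
    Coordinates refer to the strand on which the ORF was found."""
    best = None
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i  # first ATG after the previous stop
                elif start is not None and codon in STOPS:
                    orf = s[start:i + 3]
                    if best is None or len(orf) > len(best[3]):
                        best = (strand, start, i + 3, orf)
                    start = None  # look for the next ORF in this frame
    return best


if __name__ == "__main__":
    print(longest_orf("CCATGAAATTTGGGTAACC"))
```

A sequence that yields `None` has no complete ORF; in a real pipeline it might still carry a partial (N-terminal, internal or C-terminal) coding region, which this sketch does not attempt to classify.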
1 Introduction
New biological technologies produce large numbers of sequences in the form of ESTs (Expressed Sequence Tags). These sequences have to be thoroughly annotated to uncover, for example, their function. Currently, the task of annotating EST sequences does not keep pace with the rate at which they are generated [1] since:
1. EST sequence annotation is computationally intensive and often returns
no results;
2. EST data suffers from inconsistency problems (error rate, contaminant
sequences, low complexity regions, etc.);
3. gene identification programs perform inconsistently as they are sensitive
to errors.
AutoFlow, a Versatile Workflow Engine Illustrated by Assembling an
Optimised de novo Transcriptome for a Non-Model Species, such as Faba
Bean (Vicia faba)
Running title: AutoFlow, a versatile workflow engine
Pedro Seoane1, Sara Ocaña2, Rosario Carmona3, Rocío Bautista3, Eva Madrid4, Ana M. Torres2, M. Gonzalo Claros1,3,*
1 Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, E-29071, Malaga,
Spain
2 Área de Mejora y Biotecnología, IFAPA Centro “Alameda del Obispo”, Apdo 3092, E-14080
Cordoba, Spain
3 Plataforma Andaluza de Bioinformática, Universidad de Málaga, E-29071 Malaga, Spain
4 Institute for Sustainable Agriculture, CSIC, Apdo 4084, E-14080 Cordoba, Spain
* Corresponding author
Manuel Gonzalo Claros Díaz
Departamento de Biología Molecular y Bioquímica,
Facultad de Ciencias, Universidad de Málaga,
E-29071, Malaga (Spain)
Fax: +34 95 213 20 41
Tel: +34 95 213 72 84
E-mail: claros@uma.es
Recycling
Full-LengtherNext: A tool for characterisation and testing
de novo transcriptome assemblies of non-model organisms
Pedro Seoane1, Noé Fernández-Pozo1,2, Darío Guerrero-Fernández3, Rocío Bautista3 and M. Gonzalo Claros1,3,*
BioIn4Next
Transcriptome annotation for non-model organisms
8
[Workflow diagram]
Input: better transcriptome
Filtering: Full-LengtherNext (including user database) separates artefacts & chimeras from useful transcripts
Annotation tools applied to the useful transcripts: Sma3s, MREPS, AutoFact, Full-LengtherNext (including TAIR & RefSeq)
Annotation outputs: transcript description; transcript model orthologue; transcript SSRs; description, GO, EC, KEGG pathway, InterPro; transcript ORF, status & reference transcriptome
Other label in the diagram: OPT
Result: annotated transcriptome ready to import in a database
Sma3s: A Three-Step Modular Annotator for Large Sequence Datasets
Antonio Muñoz-Mérida (1), Enrique Viguera (2), M. Gonzalo Claros (3), Oswaldo Trelles (1,4), and Antonio J. Pérez-Pulido (5,*)
(1) Integrated Bioinformatics, National Institute for Bioinformatics, University of Málaga, Campus de Teatinos, Spain
(2) Cellular Biology, Genetics and Physiology Department, University of Málaga, Campus de Teatinos, Spain
(3) Molecular Biology and Biochemistry Department, University of Málaga, Campus de Teatinos, Spain
(4) Computer Architecture Department, University of Málaga, Campus de Teatinos, Spain
(5) Centro Andaluz de Biología del Desarrollo (CABD, UPO-CSIC-JA), Facultad de Ciencias Experimentales (Área de Genética), Universidad Pablo de Olavide, Sevilla 41013, Spain
*To whom correspondence should be addressed. Tel. +34 954-348-652. Fax. +34 954-349-376.
E-mail: ajperez@upo.es
Edited by Prof. Kenta Nakai
(Received 29 October 2013; accepted 6 January 2014)
Abstract
Automatic sequence annotation is an essential component of modern ‘omics’ studies, which aim to extract
information from large collections of sequence data. Most existing tools use sequence homology to establish
evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a
similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the
correct configuration is critical and can be challenging for non-specialist users. Thus, the development of
robust automatic annotation techniques that generate high-quality annotations without needing expert
knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically
annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s
is composed of three modules that progressively annotate query sequences using either: (i) very similar
homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We
trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity
values of ∼85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide
variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms,
and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s
has already been used in the functional annotation of two published transcriptomes.
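As a reminder of what the sensitivity and specificity figures quoted in the abstract measure, here is a minimal sketch using the standard definitions. The counts are hypothetical illustration values, not data from the Sma3s evaluation, and how the paper counts true and false annotations may differ from this simple binary setting.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real annotations that were recovered: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of absent annotations correctly left out: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the arithmetic.
    print(sensitivity(true_pos=85, false_neg=15))
    print(specificity(true_neg=170, false_pos=30))
```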
More than one
equivalent tool
BioIn4Next
Our bioinformatic contribution to aquaculture
9
Transcriptomes: Solea senegalensis, Solea solea, Tisochrysis lutea, Ruditapes decussatus
Genomes: Solea senegalensis, Photobacterium damselae subsp. piscicida (x2)
SNPs: Mytilus edulis, Crassostrea angulata
Also listed: Tetraselmis chuii
Uses shown next to the species: human food, aquaculture feed, aquaculture diseases
BioIn4Next
Bioinformatics tools based on
transcriptomes
10
BioIn4Next
NGS read pre-processing for 2 sole transcriptomes
11
NGS platform                Illumina          Illumina          454
Species                     S. senegalensis   S. solea          S. senegalensis
Total input reads           1,800,249,230     2,101,324,072     5,663,225
  mean length               76                100               757
Rejected (total), N         237,941,945       345,251,849       1,562,661
Rejected (total), %         13.5              17.1              26.8
  by contamination, N       144,247,943       226,627,909       156,921
  by contamination, %       8.2               11.2              3.0
Useful reads, N             1,561,416,814     1,746,258,741     3,774,412
Useful reads, %             86.7              83.1              67.6
  paired reads, N           1,503,882,050     1,676,160,406     -
  paired reads, %           83.3              79.5              -
  single reads, N           57,534,764        70,098,335        3,774,412
  single reads, %           3.2               3.3               67.6
  mean length               66                89                184
Benzekri et al. BMC Genomics 2014, 15:952
BioIn4Next
Overview of the two sole transcriptomes
12
                                S. senegalensis v3     S. senegalensis v4     S. solea v1
                                Unigenes   %           Unigenes   %           Unigenes   %
Total                           252,416    100.00%     697,124    100.00%     531,463    100.00%
>500 bp                         37,593     14.90%      156,083    22.24%      165,860    31.22%
>200 bp                         168,914    66.92%      385,411    54.92%      338,967    63.89%
Longest unigene                 6,050      -           40,163     -           68,559     -
Misassembled                    18         0.01%       215        0.03%       116        0.02%
Putative chimera                984        0.39%       6,345      0.91%       9,447      1.80%
Unigene report
With an orthologue (1)          81,348     32.23%      147,536    21.74%      121,696    22.90%
Different orthologue IDs        41,792     51.37%      45,063     30.87%      38,402     31.56%
Complete ORFs                   6,742      8.31%       39,727     26.12%      52,051     42.77%
Different, complete ORFs        4,376      5.38%       18,738     12.34%      22,683     18.64%
C-terminus                      14,757     18.14%      27,080     17.94%      19,579     16.09%
N-terminus                      11,298     13.88%      27,638     18.52%      25,131     20.65%
Internal                        47,529     58.43%      53,091     37.42%      24,935     20.49%
Putative ncRNA                  539        0.21%       1,252      0.18%       1,075      0.20%
Without orthologue (1)          171,067    67.56%      545,491    78.08%      408,692    76.90%
Putative new genes              22,612     13.21%      39,812     7.49%       34,194     8.37%
Non-redundant put. new genes    nc         –           14,451     2.51%       14,528     3.55%
Unknown                         147,916    86.48%      506,679    92.51%      374,498    91.63%
Reference transcriptome         nc         –           59,514     8.85%       54,005     10.16%
Slide annotations: Only 454; Only Illumina; 454 + Illumina; Very useful
Benzekri et al. BMC Genomics 2014, 15:952
BioIn4Next
Soles are transcriptomically similar
13
[Figure: pie charts comparing the distribution of GO terms in S. senegalensis and S. solea. Panels A, B and C; the three GO categories shown are biological process, cellular component and molecular function.]
Benzekri et al. BMC Genomics 2014, 15:952
USES
BioIn4Next
Soles and zebrafish are highly orthologous
14
[Figure: distribution of the level of similarity between both sole reference transcriptomes; the caption is truncated in the source]
Benzekri et al. BMC Genomics 2014, 15:952
Transcripts having a zebrafish orthologue are more similar between soles
Transcripts lacking a zebrafish orthologue are still significantly homologous between soles
USES
BioIn4Next
There are lineage-specific genes in teleosts
15
                                                      Likely protein coding   W/O zebrafish orthologue
Orthologs between soles of unknown function           137                     351
Orthologs in other teleost proteins:
  Gadus morhua                                        7                       155
  Oryzias latipes                                     10                      190
  Oreochromis niloticus                               17                      241
  Tetraodon nigroviridis                              6                       198
  Gasterosteus aculeatus                              17                      235
  In at least one of these species                    27                      290
Orthologs in Cynoglossus semilaevis DNA (flatfish)    99                      287
Orthologs in teleosts but not in flatfish             3                       46
Specific orthologs only in flatfish                   75                      43
Without ortholog                                      35                      18
Benzekri et al. BMC Genomics 2014, 15:952
Slide annotations: sole-specific genes; flatfish-specific genes
USES
BioIn4Next
Development of microarray and qPCR primers
16
Benzekri et al. BMC Genomics 2014, 15:952
[Flowchart: feature selection algorithm for microarray printing, starting from the S. senegalensis v3 unigenes]
Unigene classes, as labelled in the flowchart: Complete 6,742; N-terminal 11,268; Internal 47,259; C-terminal 14,757; Coding 22,612; With ORF 22,612; ncRNA; non-redundant coding 21,314; Inconsistent unigenes 51,218
Selection steps:
Select the most 3', non-redundant unigene: most 3', non-redundant incomplete unigenes, 34,291
Select non-redundant complete unigene: longest, non-redundant, complete unigenes, 5,545
Selection of unigenes qualified as coding and with ORF (ORF-Predictor, Full-LengtherNext)
Select longer and non-redundant unigenes (CD-HIT)
Other counts shown in the flowchart: 21,099; 30,119
Result: SELECTED unigenes for microarray
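The "select longer and non-redundant unigenes" step can be pictured with the toy sketch below. It is not CD-HIT: CD-HIT clusters sequences above an identity threshold, whereas this illustration only discards a sequence when it is an exact substring of a longer sequence that was already kept. The function name and the example unigenes are hypothetical.

```python
def longest_nonredundant(seqs):
    """Greedy selection over a {name: sequence} dict.

    Sequences are visited from longest to shortest (ties broken by name);
    one is kept only if it is not fully contained in a sequence kept earlier.
    Exact containment on the given strand is the only redundancy test."""
    kept = {}
    ordered = sorted(seqs.items(), key=lambda item: (-len(item[1]), item[0]))
    for name, seq in ordered:
        if not any(seq in longer for longer in kept.values()):
            kept[name] = seq
    return kept


if __name__ == "__main__":
    # Hypothetical unigenes: u2 is a fragment of u1, u3 is unrelated.
    unigenes = {
        "u1": "ATGGCCATTGTAATGGGCCGC",
        "u2": "GCCATTGTAATG",
        "u3": "TTTTCCCCGGGG",
    }
    print(sorted(longest_nonredundant(unigenes)))
```

Visiting the longest sequences first is what makes the longest member of each redundant group the one that survives.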
microarray provided repetitive and consistent positive hybridization signals.
Conclusions
De novo transcriptomes of S. solea and S. senegalensis covering their main developmental stages and organs were described based on a combined assembly approach that can be applied to other transcriptomic studies. The huge volume of reads processed in each species (>1,800 million, the highest number of reads reported to date for any organism) produced a high number of transcripts that were mined to obtain a representative reference transcriptome
Figure 7 Schematic representation of the probe selection strategy for the construction of the Senegalese sole oligonucleotide
microarray. The number of transcripts that resulted after the described filtration is indicated.
Table 4 Validation of microarray data using qPCR
                                                                  Microarray        qPCR
SoleaDB code     Gene                                 Gene name   FC     p-value    FC     p-value
Unigene18736     Angiotensin I converting enzyme 2    ace2        4.5    <0.001     4.9    <0.05
Unigene49603     Angiotensinogen                      agt         3.5    <0.01      4.7    <0.05
Unigene39473     Na-K-Cl cotransporter2               nkcc2       2.5    <0.01      3.13   <0.01
Unigene252320    Transferrin                          tf          15.6   <0.001     10.5   <0.01
Unigene214993    Ferritin                             fth         2.1    <0.01      2.3    <0.05
Unigene39196     Heat shock protein 90-alpha          hsp90aa     2.7    <0.01      2.3    <0.01
Unigene54412     Trypsinogen1a                        try1        17.6   <0.001     12.0   <0.001
Unigene31826     Trypsinogen2                         try2        4.7    <0.001     7.8    <0.05
Unigene53434     Chymotrypsinogen2                    ctr2        7.2    <0.001     6.3    <0.05
Unigene52166     Elastase1                            cela1       8.7    <0.001     7.8    <0.05
Unigene53593     Elastase4                            cela4       7.1    <0.001     4.6    <0.05
Unigene54920     Complement component C3              c3          3.8    <0.05      34.0   <0.05
Unigene53521     Lysozyme g                           lyg         2.5    <0.05      3.6    <0.05
Unigene219622    Thyroid stimulating hormone, beta    tshb        2.5    <0.05      4.6    <0.001
Unigene52404     Transaldolase                        taldo       2.1    <0.05      2.5    <0.05
Fold-changes (FC) and p-values obtained for target genes by microarray and qPCR are indicated. Moreover, the transcript code in the SoleaDB for S. senegalensis v3 transcriptome is also shown. For qPCR, data were normalized to those of gapdh2 and referred to the calibrator group (36 ppt 3 DPH).
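One way to read Table 4 is to check whether both platforms agree on the direction and rough size of each change. The sketch below does this on log2 fold-changes using the values transcribed from the table. The choice of a Pearson correlation and the helper names are assumptions made for illustration, not the validation procedure of the paper.

```python
import math

# (gene name, microarray FC, qPCR FC), transcribed from Table 4.
TABLE4 = [
    ("ace2", 4.5, 4.9), ("agt", 3.5, 4.7), ("nkcc2", 2.5, 3.13),
    ("tf", 15.6, 10.5), ("fth", 2.1, 2.3), ("hsp90aa", 2.7, 2.3),
    ("try1", 17.6, 12.0), ("try2", 4.7, 7.8), ("ctr2", 7.2, 6.3),
    ("cela1", 8.7, 7.8), ("cela4", 7.1, 4.6), ("c3", 3.8, 34.0),
    ("lyg", 2.5, 3.6), ("tshb", 2.5, 4.6), ("taldo", 2.1, 2.5),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def compare(rows):
    """Return (same_direction, correlation, most_discordant_gene)."""
    micro = [math.log2(m) for _, m, _ in rows]
    qpcr = [math.log2(q) for _, _, q in rows]
    # A positive log2 fold-change means up-regulation on that platform.
    same_direction = all((m > 0) == (q > 0) for m, q in zip(micro, qpcr))
    worst = max(rows, key=lambda r: abs(math.log2(r[1]) - math.log2(r[2])))
    return same_direction, pearson(micro, qpcr), worst[0]


if __name__ == "__main__":
    print(compare(TABLE4))
```

Working on the log2 scale treats a two-fold overestimate and a two-fold underestimate symmetrically, which is why it is preferred here over comparing raw fold-changes.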
Microarray
validation
[Figure: bar charts of relative gene expression at 37 ppt and 10 ppt. Gene labels recoverable from the panels: C1qlike, c2, c3, c401, c402, c5, c9, factor h; ptgs1a, ptgs2; il1b, il11a, il8b. Panel letters A, B, D, E and F are visible.]
BioIn4Next
SoleaDB: transcriptome database
17
http://www.juntadeandalucia.es/agriculturaypesca/ifapa/soleadb_ifapa/
Benzekri et al. BMC Genomics 2014, 15:952
BioIn4Next Benzekri et al. BMC Genomics 2014, 15:952
Current contents of SoleaDB
18
BioIn4Next
Browsing S. senegalensis transcriptome v 4.1
19
About the assembly
Download the complete transcriptome
Download all annotations
Download the full information for a subset of transcripts
Download raw reads
BioIn4Next
Browsing by transcript
20
Benzekri et al. BMC Genomics 2014, 15:952
Filtering options for displayed transcripts
More specific filtering/searching
Paginated
Included in the representative transcriptome
BioIn4Next
About one particular transcript
21
BioIn4Next
Markers: SNPs and SSRs
22
Benzekri et al. BMC Genomics 2014, 15:952
Filtering options for displayed SSRs
BioIn4Next
SoleaDB: a huge source of molecular markers
23
Table 3 SSR summary statistics for whole and reference transcriptomes
Type of SSR                   S. senegalensis   S. solea
Whole transcriptome           266,434           316,388
  Di-nucleotide               107,828           126,260
  Tri-nucleotide              96,076            114,198
  Tetra-nucleotide            39,102            44,118
  Others                      23,428            31,812
Reference transcriptome       49,955            67,610
  Di-nucleotide               16,405            22,371
  Tri-nucleotide              22,394            29,764
  Tetra-nucleotide            6,935             8,829
  Others                      4,221             6,646
Blast-based orthologs         12,418            18,486
  Species-specific SSR (1)    1,273             4,803
  Conserved SSR               11,145            13,683
    Same repeat motif (2)     6,596             6,772
    Different repeat motif    4,549             6,911
Total number of SSRs and frequency according to their repeat motif are indicated.
(1) SSRs present in one species but not in orthologs of the other species.
(2) Exactly the same SSR repeat motif was found in both orthologs; in a few cases, SSR occurs once in one ortholog and twice in the other.
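To make the SSR classes in Table 3 concrete, the sketch below finds perfect di-, tri- and tetra-nucleotide repeats with a regular expression. It is not MREPS, the SSR tool named in the annotation workflow, and the minimum copy numbers (6, 5 and 4) are hypothetical thresholds chosen only for illustration.

```python
import re

# Hypothetical thresholds: motif length -> minimum number of tandem copies.
MIN_COPIES = {2: 6, 3: 5, 4: 4}


def find_ssrs(seq, min_copies=None):
    """Return a sorted list of (start, motif, copies) for perfect SSRs.

    The motif is reported in the phase where the regex first matches, so a
    GAT repeat preceded by a T may be reported as TGA."""
    thresholds = MIN_COPIES if min_copies is None else min_copies
    seq = seq.upper()
    hits = []
    for unit, copies in thresholds.items():
        # One captured motif followed by at least (copies - 1) exact repeats.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if len(set(motif)) == 1:
                continue  # homopolymer run, not a di/tri/tetra repeat
            if unit == 4 and motif[:2] == motif[2:]:
                continue  # e.g. ACAC is really a di-nucleotide repeat
            hits.append((match.start(), motif, len(match.group(0)) // unit))
    return sorted(hits)


if __name__ == "__main__":
    example = "GG" + "AC" * 7 + "CC" + "GAT" * 5 + "C"
    print(find_ssrs(example))
```

Imperfect repeats, which dedicated tools can tolerate, are deliberately ignored here to keep the example short.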
Benzekri et al. BMC Genomics 2014, 15:952
http://www.biomedcentral.com/1471-2164/15/952
USES
BioIn4Next
Overview of descriptions
24
BioIn4Next
Browsing by GOs
25
BioIn4Next
Browsing ECs
26
More information about
this enzyme activity
BioIn4Next
Overview of KEGG pathways
27
List of S. senegalensis v4.1 enzymes for this pathway
The complete overview of this pathway
Benzekri et al. BMC Genomics 2014, 15:952
BioIn4Next
For example: steroid biosynthesis
28
BioIn4Next
Browsing by protein motifs and families
29
BioIn4Next
Study of apolipoprotein A-IV paralogs
30
was then carried out using SEQBOOT (100 replicates) in the PHYLIP
package (Felsenstein, 1989) followed by a Phyml reconstruction (100
replicates) (Guindon and Gascuel, 2003). The consensus phylogenetic
tree was subsequently obtained (CONSENSE). Trees were drawn using
the Figtree v1.4.2 (http://tree.bio.ed.ac.uk/software/figtree/). Accession
numbers for sequences used in the phylogeny are indicated in Supple-
mentary file 1. Putative signal peptide was identified using SignalIP
(http://www.cbs.dtu.dk/services/SignalP/).
Genomic sequences were retrieved after blasting sequences onto a
de novo genome assembly for a female sole using Oases software with
a 51 k-mer (Benzekri et al., unpublished results). To identify intron
and exons boundaries, the two genomic scaffolds containing the apoA-
IV gene cluster sequences were aligned with apoA-IV cDNA sequences
using Seqman software. Also, a blast analysis (blastx) at NCBI was car-
ried out to establish gene synteny and identify other gene coding re-
gions. The two scaffold sequences have been deposited at NCBI/EMBL/
DDBJ with accession numbers LC056058 and LC056059. Synteny analy-
sis was carried out using ensembl (v79.01) and Genomicus genome
browser (http://www.genomicus.biologie.ens.fr/genomicus-79.01/cgi-
development. For apoA-IVAa1 and apoA-IVAa2, the incubation time os-
cillated between 60 and 105 min (depending on the larval stage),
while for apoA-IVBa3 and apoA-IVBa4 a fixed time of 60 min was used
in all stages. In all cases, fasted and fed larvae at 3, 5 and 9 dph were al-
ways managed in parallel and the same time for color development was
given. Twenty animals/sample-treatment/gene were used for each
WISH analysis. Digital images were captured using a Leica DFC290 HD
digital camera attached to a Leica DMIL LED inverted microscope.
2.4. RNA isolation and RT-qPCR analysis
Homogenization of samples, RNA isolation and cDNA synthesis procedures were carried out as previously described (Armesto et al., 2014, 2015). Real-time analysis was carried out on a CFX96™ Real-Time System (Bio-Rad) using Senegalese sole specific primers for each apoA-IV transcript (Table 1). Real-time reactions were accomplished in a 10-μL volume containing cDNA generated from 10 ng of original RNA template, 300 nM each of specific forward and reverse primers, and 5 μL of SYBR Premix Ex Taq (Takara, Clontech). The amplification protocol
Table 1
EST information and primer sequences for apoA-IV paralogs. The total number of ESTs (N) encoding each paralog found at SoleaDB (v4.1; Benzekri et al., 2014) and the unigene ID (v3 and v4.1) for sequences used for CDS (*), 5′-UTR (†) and 3′-UTR (§) identification are indicated. Primer sequences used for probe amplification (¥) and qPCR (‡) analysis and their corresponding amplicon sizes (bp) are also shown.
Paralog | SoleaDB unigene | N | Primer name | Primer sequence (5′ → 3′) | Size (bp)
apoA-IVAa1 | solea_v3.0_unigene29941*; solea_v4.1_unigene546584†§ | 35 | apoa41fc2(‡) | ATGGACCCAGAGGCGCTGAAGACCGTA | 90(‡)
 | | | apoa41rc2(‡) | GGCCTGCAGCTCATCAGTGCTCTTGT |
 | | | apoa41_3(¥) | GGACAGGAAGTCAATACCAGGATCGCTCA | 669(¥)
 | | | apaa41_4(¥) | TAAACAGGAGGTGGAAAGTTGGCTGGAGT |
apoA-IVAa2 | solea_v4.1_unigene431170*; solea_v4.1_unigene546431_split_0†; solea_v4.1_unigene534078§ | 14 | apoA42F(‡) | CCATGCGCACTCAGGTGGCTCCTC | 132(‡)
 | | | apoA42R(‡) | CCTCGGCATAGGGCTGCAGATTGGT |
 | | | apoA42_1(¥) | CGACAGTCTGAGCTGGGAAAGG | 667(¥)
 | | | apoA42_2(¥) | GGCGGCAGCAGGAGAAAATAAC |
apoA-IVBa3 | solea_v3.0_unigene3621*; solea_v4.1_unigene14920†§ | 24 | apoa43_1(‡,¥) | GTCCTCGTTGTGCTCGTCCTTGCTGT | 87(‡)
 | | | apoa43_R(‡) | CGTGTCCATCACTGGCTTGGGTGCATC |
 | | | apoa43_2(¥) | GCCTGCACCTCCTCGATGTATGGGGAA | 719(¥)
apoA-IVBa4 | solea_v3.0_unigene34222*; solea_v4.1_unigene547274†§ | 18 | SseapoA44_F(‡) | AGCTGAGACACAGAGCCAACCTGGTGA | 107(‡)
 | | | SseapoA44_2(‡,¥) | CATTAGCTGGGCTTGGATGTCCTGGGT |
 | | | SseapoA44_1(¥) | ATGCCAACCTTCTCTATGCGGATCCAC | 689(¥)
86 J. Roman-Padilla et al. / Comparative Biochemistry and Physiology, Part B 191 (2016) 84–98
Román-Padilla et al. CBP Part B (2016) 191:84-98
Fig. 4. Phylogenetic relationships among the predicted sequences of Senegalese sole apoA-IV paralogs and the corresponding deduced amino acid sequences from other vertebrates (see
Supplementary file 1) using the Maximum Likelihood method. The apolipoprotein type and taxonomic group (fish or tetrapod) are indicated on the right. Moreover, the clusters A and B as
well as the four subclades (a1–a4) in Acanthopterygii are shown. The apoE sequences were used as an outgroup to root the tree. Only bootstrap values higher than 50% are indicated on each
branch. The scale for branch length (0.4 substitutions/site) is shown below the tree. Species abbreviations: Sse, Solea senegalensis; Cse, Cynoglossus semilaevis; Gac, Gasterosteus aculeatus;
Tru, Takifugu rubripes; Ame, Astyanax mexicanus; Dre, Danio rerio; Xtr, Xenopus tropicalis; Hsa, Homo sapiens; Rno, Rattus norvegicus; Mmu, Mus musculus; and Gga, Gallus gallus.
and Acanthopterygii. In the former, two or three species-specific
paralogs can be found within each cluster depending on the species al-
that expression of apoA-IV in YSL could be involved in the efficient mobilization of TAG-rich molecules (throughout the formation of VLDL
Fig. 14. Transcript abundance of apoA-IV paralogs in different tissues of Senegalese sole juveniles. Data are represented in logarithmic scale. Expression values were normalized to those of 18S rRNA. Data were expressed as the mean fold change (mean + SEM, n = 3) from the calibrator group (kidney). Different letters denote tissues that are significantly different from liver (P < 0.05).
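The caption above describes expression normalized to a reference gene (18S rRNA) and reported as fold change relative to a calibrator group (kidney). One common way to compute such values is the 2^-ΔΔCt approach; the excerpt does not state the exact formula used, so the sketch below is an assumption, and the Ct values and function name are hypothetical.

```python
def fold_change(ct_target, ct_reference, ct_target_cal, ct_reference_cal):
    """Relative expression by the 2^-ddCt approach (assumed here, not stated in the source).

    ct_target / ct_reference: Ct of the gene of interest and of the reference
    gene in the sample; the *_cal arguments are the same values in the calibrator.
    """
    delta_ct_sample = ct_target - ct_reference
    delta_ct_calibrator = ct_target_cal - ct_reference_cal
    return 2.0 ** -(delta_ct_sample - delta_ct_calibrator)


# Hypothetical Ct values: the target amplifies 5 cycles earlier in the sample
# than in the calibrator, with identical reference-gene Ct in both.
example = fold_change(ct_target=20.0, ct_reference=10.0,
                      ct_target_cal=25.0, ct_reference_cal=10.0)
```

Under this approach a calibrator compared with itself gives a fold change of 1, which is why the calibrator tissue sits at the origin of a logarithmic axis.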
USES
BioIn4Next
Putative miRNA precursors
31
BioIn4Next
Ready for gene expression and more
32
BioIn4Next
Retrieving SoleaDB by sequence homology
33
Benzekri et al. BMC Genomics 2014, 15:952
Paste your sequence, or upload your file of sequences
Select your E-value filter
Select your preferred assemblies
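The E-value filter offered by the search form can be mimicked offline on BLAST tabular output. The sketch below assumes the standard 12-column tabular format (`-outfmt 6`), where the E-value is the 11th column; the query and unigene identifiers and the threshold are hypothetical, and this is not the code behind SoleaDB.

```python
def filter_hits(tabular_lines, max_evalue=1e-6):
    """Keep hits whose E-value is at most max_evalue.

    Returns tuples of (query id, subject id, percent identity, E-value).
    """
    hits = []
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        evalue = float(fields[10])  # 11th column of the standard tabular format
        if evalue <= max_evalue:
            hits.append((fields[0], fields[1], float(fields[2]), evalue))
    return hits


# Hypothetical hits: qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore
example = [
    "query1\tunigeneA\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-50\t350",
    "query1\tunigeneB\t75.0\t80\t20\t0\t1\t80\t5\t84\t0.002\t45.1",
]
kept = filter_hits(example)
```

A stricter threshold reduces spurious matches at the cost of missing divergent homologs, which matters when the query comes from a distant species.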
BioIn4Next
Retrieving SoleaDB by keywords
34
Benzekri et al. BMC Genomics 2014, 15:952
BioIn4Next
Soles retained the crystallin genes
35
Benzekri et al. BMC Genomics 2014, 15:952
Figure 6 Phylogenetic tree of Crybb and Crybb-like proteins in vertebrates. A neighbor-joining tree based on the alignment of vertebrate Crybb and Crybb-like sequences was built. Species are indicated as Sse (Solea senegalensis), Sso (Solea solea), Dre (Danio rerio), Tni (Tetraodon nigroviridis), Oni (Oreochromis niloticus), Ola (Oryzias latipes), Cse (Cynoglossus semilaevis), Xla (Xenopus laevis) and Gga (Gallus gallus; see Additional file 7 for accession numbers). Solea sequences are indicated according to the transcript name assigned in SoleaDB. Clusters are indicated as arcs of a circle. The tree obtained was rooted using Xenopus laevis Cryga. Numbers adjacent to nodes indicate percentage bootstrap support; only values larger than 70%
http://www.biomedcentral.com/1471-2164/15/952
Fish-specific crystallin?
Fish-specific crystallin?
Absent in flatfish
USES
BioIn4Next
Tisochrysis lutea database
36
Tisochrysis lutea
http://www.scbi.uma.es/isochrysisdb/
Quite similar to
other microphytes
(microalgae)
H. Benzekri (2016)
BioIn4Next
Ruditapes database
37
http://www.scbi.uma.es/ruditapesdb/
Browsing and contents
similar to SoleaDB
H. Benzekri (2016)
BioIn4Next
Most Ruditapes genes seem to be identified
38
H. Benzekri (2016)
Too many small transcripts
1 Illumina library: 127 × 10⁶ reads, 2 × 75 nt
Unique orthologues: 12 764 (32%)
Ruditapes philippinarum: 9 747 genes
USES
BioIn4Next
Bioinformatics tools based on genomes
39
BioIn4Next
Two Photobacterium damselae subsp. piscicida
40
Table IV.25: Summary of the pre-processing of the original reads of L091106-03H and DI21
Item | Reference to figures IV.43 and IV.44 | L091106-03H | DI21
Total reads, paired | #1 | 148 622 | 433 717
Total reads, single | #1 | 297 269 | 187 433
Mean length, paired | | 509 | 445
Mean length, single | | 1 195 | 550
Rejected reads, paired | #2 | 48 403 (32.6%) | 238 804 (55%)
Rejected reads, single | #2 | 49 530 (16.7%) | 53 251 (28.4%)
Contamination, paired | | 21 556 (14.5%) | 62 761 (14.5%)
Contamination, single | | 46 766 (15.7%) | 37 791 (20.1%)
Total useful reads | #3 | 382 755 | 396 450
Paired reads | #4 | 69 318 (23.3%) | 132 550 (15.3%)
Single reads | | 313 437 | 263 900
Single reads from the paired library (unpaired reads) | | 65 553 (44.1%) | 129 264 (29.8%)
Single reads from the single library | | 247 884 (83.4%) | 134 636 (71.8%)
IV.3.1.2. Assembly
The first genome we assembled was that of L091106-03H, since it was the first for which we received sequencing data. Knowing the genome size of DI21 (4.77 Mb) allowed us to estimate the read coverage of L091106-03H, which turned out to be 14x, a very low coverage according to the parameters we had previously calculated for a correct assembly (section XXXX). To assemble these reads, and based on the results of the tests on Roche/454 genomic reads (section IV.1.2.2), the CABOG program [55] was selected, since it is the most accurate when coverage is low. The assembly strategy used is illustrated in figure IV.43, where CABOG generated 510 contigs and 25 scaffolds, which formed version 1 of the L091106-03H genome draft.
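The coverage estimate mentioned above follows from a simple ratio: total sequenced bases divided by genome size. The sketch below assumes that definition; the function names and the toy numbers are hypothetical, and only the 14x and 4.77 Mb figures come from the text.

```python
def estimate_coverage(n_reads, mean_read_length, genome_size):
    """Approximate sequencing depth: total bases divided by genome size."""
    return n_reads * mean_read_length / genome_size


def bases_for_coverage(coverage, genome_size):
    """Total bases needed to reach a given depth on a genome of known size."""
    return coverage * genome_size


# Toy numbers (hypothetical): 1 000 reads of 500 nt on a 100 kb genome.
toy_depth = estimate_coverage(1_000, 500, 100_000)

# With the figures quoted in the text: 14x on a 4.77 Mb genome.
bases_at_14x = bases_for_coverage(14, 4_770_000)
```

Using the genome size of a closely related strain, as done in the text, gives only an approximation, since the true size of the genome being assembled is not yet known.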
80 bp and a G+C content of 40.6%. In both drafts, the longest scaffold exceeds [half] of the genome (>2 Mb), so the N50 equals the length of this scaffold. It can therefore be said that the assembly was equivalent for both strains. Since the DI21 genome draft at NCBI (GCA_000300355.3) contained 56 scaffolds with 846 993 undetermined bases (N), we can say that, as the new draft has only 19 scaffolds and fewer N (561 264), the assembly of this strain has been improved.
Table IV.27: Characteristics of the final genome draft of L091106-03H and DI21
Strain | L091106-03H (v2) | DI21
Number of scaffolds > 500 bp | 14 | 17
Longest scaffold | 2 323 982 | 2 798 534
Shortest scaffold | 1 007 | 437
Sum of lengths | 4 194 408 | 4 316 437
Number of N | 341 126 | 561 264
Mean length | 299 600 | 227 180
N50 | 2 323 982 | 2 798 534
N90 | 157 598 | 152 634
G+C content | 40% | 40.6%
1.4. Annotation of the two genome drafts
The genome drafts of L091106-03H and DI21 were annotated with the automatic annotation program RAST (Rapid Annotation using Subsystem Technology) [125]
Figure IV.47: Similarity between L091106-03H v2 and other bacteria, based on the percentage identity of the protein alignments
Figure IV.48: Dot plot of the nucleotide alignments between L091106-03H (v2) and DI2
N50 is provided
by the longest
contig
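The remark that the N50 is set by the longest sequence can be checked with a short sketch of the usual N50 definition: the length of the sequence at which the cumulative sum, taken from longest to shortest, first reaches half of the total. The function name and the toy lengths are hypothetical.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx statistic (N50 by default) of a list of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    raise ValueError("empty list of lengths")


# Toy assembly (hypothetical lengths): the longest sequence alone covers
# more than half of the total, so N50 equals its length.
dominated = nx([60, 20, 10, 10])

# A more balanced toy assembly.
balanced = nx([2, 2, 2, 3, 3, 4, 8, 8])
```

When one scaffold dominates the assembly in this way, N50 says little about the remaining scaffolds, so a statistic such as N90 is a useful complement.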
Figure IV.50: Visualization of the synteny between the genome drafts of L091106-03H v2 and DI21, based on the matches obtained with Photobacterium damselae proteins (minimum identity of 97% in both species). Matches were plotted with Circos [190]
To check the collinearity between the genes of the two strains, the arrangement of the CDSs in the genomes of L091106-03H v2 and DI21 was examined with SEED Viewer [191], which is integrated with the RAST annotation program (http://rast.nmpdr.org). Figure IV.51 shows two examples of the arrangement of two groups of orthologous genes in the genomes of the strains. In the first example (figure IV.51-A), the order of the orthologous genes is well conserved between the two strains, whereas in the second example (figure IV.51-B) the group of orthologous genes (6, 22, 31, 30, 29, 32 and 35) lies among other, different genes, which indicates that this group is located in two distinct regions in scaffold 5 of L091106-03H v2 and scaffold 11 of DI21. In addition, the order of these orthologous genes is not conserved, since gene 30 has a different relative position in the two genomes. This situation was very rare, but it supports the hypothesis that some rearrangements occurred in the genomes during the evolution of the two strains. In contrast, the first situation, where the order of the orthologous genes is conserved, was by far the most common, indicating that the genomes of the two strains are collinear overall.
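The collinearity check described above can be sketched as a comparison of ortholog order in two genomes, ignoring genes that have no counterpart. This is only an illustration of the idea, not what SEED Viewer does internally; the gene order given for the second strain is invented, since the text only states that gene 30 changes its relative position.

```python
def shared_order(order_a, order_b):
    """Restrict both gene orders to the genes present in both genomes."""
    shared = set(order_a) & set(order_b)
    return ([g for g in order_a if g in shared],
            [g for g in order_b if g in shared])


def is_collinear(order_a, order_b):
    """True if shared genes appear in the same order, on either strand."""
    a, b = shared_order(order_a, order_b)
    return a == b or a == b[::-1]


# Ortholog group quoted in the text, in the order listed there.
strain_a = [6, 22, 31, 30, 29, 32, 35]
# Hypothetical order in the other strain, with gene 30 moved and an
# unrelated gene ("x1") inserted.
strain_b = [6, 22, 30, 31, "x1", 29, 32, 35]

conserved = is_collinear(strain_a, strain_b)
```

Checking the reversed order as well means a group lying on the opposite strand of a scaffold still counts as collinear.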
Both pathogenic
strains are
highly syntenic
H. Benzekri (2016)
USES
BioIn4Next
Photobacterium-DB for browsing genomes
41
http://www.scbi.uma.es/photobacterium_damselae/
H. Benzekri (2016)
P. Seoane-Zonjic (2016)
BioIn4Next
Searchable and downloadable
42
P. Seoane-Zonjic (2016)
BioIn4Next
Solea senegalensis genome assembling approach
43
2 × 75 nt
Female
3 kb paired-ends
Female
8.7 × 10⁸ reads 11.1 × 10⁸ reads 8.3 × 10⁶ reads
RAY
Scaffolds Scaffolds
RAY
213 548 278 995
NUCMER - GAM-NGS - SSPACE - GAPcloser
Breaking into
artificial reads
Final scaffolds 34 176
Longest: 638 263 nt
Mean length: 14 565 nt
N50: 85 596 nt
Total Length: 600 Mbp
H. Benzekri (2016)
Long paired-ends
Female
BioIn4Next
Cynoglossus semilaevis and soles are highly syntenic
44
[Figure: two synteny plots; the chromosome labels shown include Chr4, Chr6, Chr8 and Chr10 to Chr15]
Based on protein
identity > 70%
Based on
transcript identity
some may contain genome regions or genes specific to Senegalese sole that are absent from (or very different in) Cynoglossus semilaevis.
Figure IV.58: Example of an alignment between scaffold 1145 of S. senegalensis and chromosome 1 of C. semilaevis. The regions shown are approximately 150 kb in size. Note that the aligned fragments
H. Benzekri (2016)
Manchado et al. (2016), in press
USES
BioIn4Next
One step beyond: from scaffolds to chromosomes
45
Long reads
Female
AQUAGENET1
Female
AQUAGENET3
Female
8.7 × 10⁸ reads 11.1 × 10⁸ reads 8.3 × 10⁶ reads
RAY
Scaffolds Scaffolds
RAY
213 548 278 995
NUCMER - GAM-NGS - SSPACE - GAPcloser
Breaking into
artificial reads
Final scaffolds 34 176
Longest: 638 263 nt
Mean length: 14 565 nt
N50: 85 596 nt
Total Length: 600 Mbp
8 538 scaffolds
Longest: 638 263 nt
Mean length: 54 673 nt
N50: 105 233 nt
Total Length: 466.7 Mbp
ICMapper
Super-scaffolds
C. semilaevis
Chromosomes
22
H. Benzekri (2016)
BioIn4Next
S. senegalensis superscaffolds validated by molecular markers
46
New markers
88/113 validated
Already established
linkage groups
113/129 SSR validated
Females lack Chr W
→ XY system?
H. Benzekri (2016)
Manchado et al. (2016), in press
USES
BioIn4Next
Gene structure and synteny of apolipoproteins A-IV
47
USES
Román-Padilla et al. CBP Part B (2016) 191:84-98
block followed by a long domain containing 9 putative tandem repeats flanked by the unrelated coding regions (UCR) 1 and 2 (Fig. 2). The common block was located in exon 3 (except for apoA-IVAa1, in exon 2) and could be divided into the A, B and C segments. Seven out of the 9 putative tandem repeats were 22-mer in length and contained
ters according to the genomic clusters A and B, as described above. In Ostariophysi, the apoA-IV duplicates within each cluster appeared closely related to each other in the same branch, indicating a high similarity between intraspecific paralogs. In contrast, the apoA-IV duplicates within each cluster in Acanthopterygii could be split into two clearly separated subclades (referred to as a1 and a2 for cluster A and a3 and a4 for cluster B). According to this phylogenetic tree, we named each paralog adding the genomic cluster and the Acanthopterygii subclade they belonged to. Nevertheless, it should be noted that not all Acanthopterygii species bear the four paralog types. G. aculeatus lacked the apoA-IVAa1 and had two apoA-IVAa2-like paralogs (referred to as 1
Fig. 1. Gene structure of the four apoA-IV paralogs in Senegalese sole. The wide bars represent the exons, and thin lines the introns. The wide bars in red represent the 5′ and 3′ untranslated regions whereas the ORF is shown in blue indicating signal peptides (dark blue) from the mature peptide (light blue). The size of exons and introns is also indicated. Only the length of the exons is drawn to scale. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 2. Deduced amino acid sequences of the four apoA-IV paralogs in Senegalese sole. Dots indicate amino acids identical to those of apoA-IVAa1. Blue arrows indicate the position of introns. The signal peptide cleavage site is marked by a vertical bar. The unrelated coding regions 1 and 2 (UCR1 and UCR2, respectively) as well as the three repeats (A, B, C) of the common 33-codon block are indicated and their residues are numbered. The 22-mer repeats are boxed and the P(Y/H)A motifs are shaded. The conserved proline residues 117, 129 and 183 are indicated by asterisks. The cleavage sites for matrix metalloproteinase 7 are denoted by $. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 3. Physical synteny of apoA-IV paralogs. Cluster A. Synteny for apoA-IVAa1 and apoA-IVAa2 paralogs. Cluster B, synteny for apoA-IVBa3 and apoA-IVBa4 paral
the chromosome or scaffold location are indicated on the right. Each gene is represented by a color within each cluster. The coding direction is indicated by the p
indicate non-syntenic genes. “*” in T. rubripes denotes a gene identified by sequence analysis, not available in Genomicus platform “**” indicates an Apo
(ENSDARG00000095050). Gene names: apoC-I, apolipoprotein C-I; apoC-II, apolipoprotein C-II; apo14, apolipoprotein 14 kDa; apoEa and apoEb, apolipoprote
(Asp-Glu-Ala-Asp) box polypeptide 6; lipea, lipase, hormone-sensitive a; mep1b, meprin A, beta; msto1, misato 1, mitochondrial distribution and morphology
nine-rich splicing factor 4; and tomm40, translocase of outer mitochondrial membrane 40 homolog.
BioIn4Next
Genosole: a database for S. senegalensis genome draft
48
http://www.scbi.uma.es/GenoSole/
P. Seoane-Zonjic (2016)
COMING SOON
Rafa
Gonzalo
Rocío
Noé
Darío
49
Gonzalo
Isabel
Elena
Rosario
Pedro
David
P10-CVI-6075
BIO267
RTA2013-00068-C03

RTA2013-00023-C02
Marina
BioIn4Next
Hicham
M. Manchado

160620 sole nomics v2

  • 1. BioIn4Next Bioinformatic platforms for the study of marine organisms. M. Gonzalo Claros, Dpto Biología Molecular y Bioquímica, Plataforma Andaluza de Bioinformática, Universidad de Málaga. El Puerto de Santa María, 20-24/6/16. “Microalgae production technologies and applications to marine fish aquaculture”, El Puerto de Santa María, 20-24 June, IFAPA centro El Toruño, 1st Seminar of the Algae4A-B project. http://about.me/mgclaros/ @MGClaros claros@uma.es
  • 4. BioIn4Next Aquaculture is becoming a key source of food. All of them are non-model organisms
  • 5. BioIn4Next Non-model organisms: our expertise. http://www.scbi.uma.es/sustainpinedb/ http://www.juntadeandalucia.es/agriculturaypesca/ifapa/soleadb_ifapa/ http://reprolive.eez.csic.es/ http://www.scbi.uma.es/pgc/ http://mejgenvegetal.uco.es/fgb2/gbrowse/Ca/ CicerDB
  • 8. BioIn4Next Combinatory strategy: no single tool is the best. The best result is obtained by combining at least two different tools for the same analysis
  • 9. BioIn4Next Picasso: SuperComputing & BioInformatics @ UMA. Diagram labels: Hard disks; 7 FAT nodes; Computing nodes; THIN nodes; More disks; GPU nodes. Figures shown: 768 cores, 3 TB RAM, 8 GB/core; 80 cores, 2 TB RAM, >25 GB/core; 32 GPU, 1 TB RAM, 8 GB/core; 984 cores, 4 TB RAM, 4 GB/core. Picasso: 2310 cores, 700 TB disk
  • 12. BioIn4Next Our bioinformatic algorithms for non-model organims 6 Raw short reads SeqTrimNext (pre-processing) Oases (pre-assembling) kmer 23 & 47 paired-end + single CD-HIT 99% Miss-assembly rejection #2 Rejected Raw long-reads SeqTrimNext (pre-processing) MIRA (pre-assembling) EULER-SR (pre-assembling) CAP3 (reconciliation) Unmapped contigs Better transcriptome Mapped contigs Contigs Debris Non-coding Coding unmapped contigs BOWTIE 2 (mapping test) #2 Rejected Full-LengtherNext Missassemblies Contigs SOFTWARE Open Access SeqTrim: a high-throughput pipeline for pre-processing any type of sequence read Juan Falgueras1 , Antonio J Lara2 , Noé Fernández-Pozo3 , Francisco R Cantón3 , Guillermo Pérez-Trabado2,4 , M Gonzalo Claros2,3* Abstract Background: High-throughput automated sequencing has enabled an exponential growth rate of sequencing data. This requires increasing sequence quality and reliability in order to avoid database contamination with artefactual sequences. The arrival of pyrosequencing enhances this problem and necessitates customisable pre- processing algorithms. Results: SeqTrim has been implemented both as a Web and as a standalone command line application. Already- published and newly-designed algorithms have been included to identify sequence inserts, to remove low quality, vector, adaptor, low complexity and contaminant sequences, and to detect chimeric reads. The availability of several input and output formats allows its inclusion in sequence processing workflows. Due to its specific algorithms, SeqTrim outperforms other pre-processors implemented as Web services or standalone applications. It performs equally well with sequences from EST libraries, SSH libraries, genomic DNA libraries and pyrosequencing reads and does not lead to over-trimming. Conclusions: SeqTrim is an efficient pipeline designed for pre-processing of any type of sequence read, including next-generation sequencing. 
It is easily configurable and provides a friendly interface that allows users to know what happened with sequences at every pre-processing stage, and to verify pre-processing of an individual sequence if desired. The recommended pipeline reveals more information about each sequence than previously described pre-processors and can discard more sequencing or experimental artefacts.
Background: Sequencing projects and Expressed Sequence Tags (ESTs) are essential for gene discovery, mapping, functional genomics and for future efforts in genome annotations, which include identification of novel genes, gene location, polymorphisms and even intron-exon boundaries. The availability of high-throughput automated sequencing has enabled an exponential growth rate of sequence data, although not always with the desired quality. This exponential growth is enhanced by the so-called “next-generation sequencing”, and efforts have to be made in order to increase the quality and reliability of sequences incorporated into databases: up to 0.4% of sequences in nucleotide databases contain contaminant sequences [1,2]. The situation is even worse in the EST databases, where the vector contamination rate reaches 1.63% of sequences [3]. Hence, improved and user-friendly bioinformatic tools are required to produce more reliable high-throughput pre-processing methods.
Pre-processing includes filtering of low-quality sequences, identification of specific features (such as poly-A or poly-T tails, terminal transferase tails, and adaptors), removal of contaminant sequences (from vector to any other artefacts) and trimming the undesired segments. There are some bioinformatic tools that can accomplish individual pre-processing aspects (e.g. TrimSeq, TrimEST, VectorStrip, VecScreen, ESTPrep [4], crossmatch, Figaro [5]), and other programs that cope with the complete pre-processing pipeline such as PreGap4 [6] or the broadly used tools Lucy [7,8] and SeqClean [9]. Most of these require installation, are difficult to configure, environment-specific, or focused on specific needs (like a design only for ESTs), or require a change in implementation and design of either the program or the protocols within the laboratory itself.
* Correspondence: claros@uma.es. ² Plataforma Andaluza de Bioinformática, Universidad de Málaga, 29071 Málaga, Spain. Falgueras et al. BMC Bioinformatics 2010, 11:38, http://www.biomedcentral.com/1471-2105/11/38. © 2010 Falgueras et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
AutoFlow, a Versatile Workflow Engine Illustrated by Assembling an Optimised de novo Transcriptome for a Non-Model Species, such as Faba Bean (Vicia faba). Running title: AutoFlow, a versatile workflow engine. Pedro Seoane¹, Sara Ocaña², Rosario Carmona³, Rocío Bautista³, Eva Madrid⁴, Ana M. Torres², M. Gonzalo Claros¹,³,*. ¹ Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, E-29071, Malaga, Spain; ² Área de Mejora y Biotecnología, IFAPA Centro “Alameda del Obispo”, Apdo 3092, E-14080 Cordoba, Spain; ³ Plataforma Andaluza de Bioinformática, Universidad de Málaga, E-29071 Malaga, Spain; ⁴ Institute for Sustainable Agriculture, CSIC, Apdo 4084, E-14080 Cordoba, Spain. * Corresponding author: Manuel Gonzalo Claros Díaz, Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, E-29071, Malaga (Spain). Fax: +34 95 213 20 41. Tel: +34 95 213 72 84. E-mail: claros@uma.es
A Web Tool to Discover Full-Length Sequences: Full-Lengther. Antonio J Lara¹, Guillermo Pérez-Trabado², David P Villalobos¹, Sara Díaz-Moreno¹, Francisco R Cantón¹, and M Gonzalo Claros³. ¹ Biología Molecular y Bioquímica, Universidad de Málaga, Campus Universitario de Teatinos, E-29071 Málaga, Spain; ² Arquitectura de Computadores, E.T.S.I. Informática, Campus de Teatinos, E-29071 Málaga, Spain; ³ Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, 29071 Málaga (Spain). Tel: +34 95 213 72 84. Fax: +34 95 213 20 41. E-mail: claros@uma.es
Summary. Many Expressed Sequence Tag (EST) sequencing projects produce thousands of sequences that must be cleaned and annotated. Here we present Full-Lengther, an algorithm that can identify full-length cDNA sequences in EST data. To accomplish this task, Full-Lengther relies on a BLAST report obtained against a protein database such as UniProt. BLAST alignments guide the location of protein-coding regions, mainly the start codon. Full-Lengther contains an ORF prediction algorithm for those cases that show no alignment in the BLAST output. The algorithm is implemented as a web tool to simplify its use and portability. It is accessible worldwide via http://castanea.ac.uma.es/genuma/full-lengther/.
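The Full-Lengther summary mentions an ORF prediction step for sequences that have no BLAST alignment. The slides do not describe that algorithm, so the following is only a minimal sketch of the general idea, not Full-Lengther's actual method: it scans the three forward reading frames and returns the longest stretch that runs from an ATG to an in-frame stop codon. The function name `longest_orf` and the forward-strand-only simplification are assumptions made for this illustration.

```python
# Illustrative sketch only: NOT the Full-Lengther algorithm.
# Assumption: forward strand only, standard stop codons, ATG as start.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end, frame) of the longest ATG..stop ORF.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    Returns None when no complete ORF is found.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None  # position of the first ATG of the ORF being read
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Keep this ORF if it is longer than the best one so far
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3, frame)
                start = None
    return best


print(longest_orf("CCATGAAATTTTAGGG"))  # (2, 14, 2)
```

A real tool would also have to examine the reverse strand and handle ORFs truncated at either end of the sequence, which this sketch deliberately ignores.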
  • 13. BioIn4Next Our bioinformatic algorithms for non-model organisms 6
Workflow diagram labels: Raw short reads; SeqTrimNext (pre-processing); Oases (pre-assembling), k-mers 23 & 47, paired-end + single; CD-HIT 99%; Misassembly rejection; #2 Rejected; Raw long reads; SeqTrimNext (pre-processing); MIRA (pre-assembling); EULER-SR (pre-assembling); CAP3 (reconciliation); Unmapped contigs; Better transcriptome; Mapped contigs; Contigs; Debris; Non-coding; Coding; unmapped contigs; BOWTIE 2 (mapping test); #2 Rejected; Full-LengtherNext; Misassemblies; Contigs
SOFTWARE, Open Access. SeqTrim: a high-throughput pipeline for pre-processing any type of sequence read. Juan Falgueras¹, Antonio J Lara², Noé Fernández-Pozo³, Francisco R Cantón³, Guillermo Pérez-Trabado²,⁴, M Gonzalo Claros²,³,*
Abstract
Background: High-throughput automated sequencing has enabled an exponential growth rate of sequencing data. This requires increasing sequence quality and reliability in order to avoid database contamination with artefactual sequences. The arrival of pyrosequencing enhances this problem and necessitates customisable pre-processing algorithms.
Results: SeqTrim has been implemented both as a Web and as a standalone command line application. Already-published and newly-designed algorithms have been included to identify sequence inserts, to remove low quality, vector, adaptor, low complexity and contaminant sequences, and to detect chimeric reads. The availability of several input and output formats allows its inclusion in sequence processing workflows. Due to its specific algorithms, SeqTrim outperforms other pre-processors implemented as Web services or standalone applications. It performs equally well with sequences from EST libraries, SSH libraries, genomic DNA libraries and pyrosequencing reads and does not lead to over-trimming.
Conclusions: SeqTrim is an efficient pipeline designed for pre-processing of any type of sequence read, including next-generation sequencing.
It is easily configurable and provides a friendly interface that allows users to know what happened with sequences at every pre-processing stage, and to verify pre-processing of an individual sequence if desired. The recommended pipeline reveals more information about each sequence than previously described pre-processors and can discard more sequencing or experimental artefacts.
More than one equivalent tool
  • 14. BioIn4Next Choosing the best assembly in non-model organisms 7
  • 15. BioIn4Next Choosing the best assembly in non-model organisms 7 1 2
  • 16. BioIn4Next Choosing the best assembly in non-model organisms 7 1 2 Weighted PCA analysis
  • 17. BioIn4Next Transcriptome annotation for non-model organisms 8
Workflow diagram labels: Better transcriptome; Full-LengtherNext (including user database); Artefacts & chimeras; Useful transcripts; Sma3s; MREPS; AutoFact; Full-LengtherNext (including TAIR & RefSeq); Transcript DESCRIPTION; Transcript MODEL ORTHOLOGUE; Transcript SSRs; DESCRIPTION, GO, EC, KEGG pathway, InterPro; Transcript ORF, STATUS & REFERENCE TRANSCRIPTOME OPT; ANNOTATED transcriptome ready to import in a database
Full-LengtherNext: A tool for characterisation and testing de novo transcriptome assemblies of non-model organisms. Pedro Seoane¹, Noé Fernández-Pozo¹,², Darío Guerrero-Fernández³, Rocío Bautista³ and M. Gonzalo Claros¹,³,*
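The annotation workflow above lists MREPS as the step that yields transcript SSRs (simple sequence repeats). As a rough illustration of what an SSR is, and explicitly not a re-implementation of MREPS, the sketch below uses a regular expression to report perfect tandem repeats. The function name `find_ssrs` and the thresholds (motifs of 2 to 6 nt, at least 4 copies) are assumptions chosen for the example.

```python
import re


def find_ssrs(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Return (start, end, motif, repeats) for perfect tandem repeats.

    Coordinates are 0-based and end-exclusive. The lazy quantifier makes
    the regex prefer the shortest motif, so a homopolymer run such as
    AAAAAAAA is reported with the dinucleotide motif AA.
    """
    seq = seq.upper()
    # A motif of min_unit..max_unit bases followed by at least
    # (min_repeats - 1) further copies of itself.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        motif = match.group(1)
        repeats = (match.end() - match.start()) // len(motif)
        hits.append((match.start(), match.end(), motif, repeats))
    return hits


print(find_ssrs("GGCACACACACATT"))  # [(2, 12, 'CA', 5)]
```

Dedicated repeat finders also detect approximate (imperfect) repeats, which a plain regular expression like this one cannot do.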
  • 18. BioIn4Next Transcriptome annotation for non-model organisms 8
A Web Tool to Discover Full-Length Sequences: Full-Lengther. 1 Introduction: New biological technology produces a large amount of sequences in the form of ESTs (Expressed Sequence Tags). These sequences have to be thoroughly annotated to uncover, for example, their function. Currently, the task of annotating EST sequences does not keep pace with the rate at which they are generated [1] since: 1. EST sequence annotation is computationally intensive and often returns no results; 2. EST data suffer from inconsistency problems (error rate, contaminant sequences, low complexity regions, etc.); 3. gene identification programs perform inconsistently as they are sensitive to errors.
Recycling
  • 19. BioIn4Next Transcriptome annotation for non-model organisms 8
Sma3s: A Three-Step Modular Annotator for Large Sequence Datasets. Antonio Muñoz-Mérida¹, Enrique Viguera², M. Gonzalo Claros³, Oswaldo Trelles¹,⁴ and Antonio J. Pérez-Pulido⁵,*
¹ Integrated Bioinformatics, National Institute for Bioinformatics, University of Málaga, Campus de Teatinos, Spain; ² Cellular Biology, Genetics and Physiology Department, University of Málaga, Campus de Teatinos, Spain; ³ Molecular Biology and Biochemistry Department, University of Málaga, Campus de Teatinos, Spain; ⁴ Computer Architecture Department, University of Málaga, Campus de Teatinos, Spain; ⁵ Centro Andaluz de Biología del Desarrollo (CABD, UPO-CSIC-JA), Facultad de Ciencias Experimentales (Área de Genética), Universidad Pablo de Olavide, Sevilla 41013, Spain
* To whom correspondence should be addressed. Tel. +34 954-348-652. Fax +34 954-349-376. E-mail: ajperez@upo.es
Edited by Prof. Kenta Nakai (Received 29 October 2013; accepted 6 January 2014)
Abstract: Automatic sequence annotation is an essential component of modern ‘omics’ studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ∼85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes.
Key words: functional annotation; genome annotation; transcriptome annotation; bioinformatic tool
1. Introduction: Sequence annotation is the process of associating biological information to sequences of interest. Annotations can include the potential function, cellular localization, biological process or protein structure of a given sequence [1]. Some sequences are annotated using direct experimental evidence, but most annotations are inferred from sequence similarities or conserved patterns associated with known characteristics [2–5]. Large publicly accessible databases of annotated sequences make it possible to automatically annotate large collections of unknown sequences. This is especially valuable for the interpretation of large sequence datasets generated by genome and expressed sequence tag (EST) sequencing projects as well as gene and protein expression experiments, such as DNA microarrays, and many other emerging research areas [6]. Sequence annotation is also important in transcriptomic experiments that aim to identify gene clusters with similar expression patterns that are linked to a particular biological process or experimental condition. Biological function can then be inferred from annotations shared within these clusters [7].
© The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/ .0/), which permits non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com. DNA Research 21, 341–353 (2014), doi:10.1093/dnares/dsu001. Advance Access publication on 5 February 2014.
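The Sma3s abstract on this slide reports average sensitivity and specificity values of about 85%. For readers less familiar with those metrics, here is a small sketch of their standard definitions. The counts used in the usage lines are hypothetical numbers picked to give 85%; they are not data from the Sma3s paper, and the slides do not say how true and false annotations were counted there.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real positives that were recovered: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of real negatives that were correctly rejected: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, for illustration only
print(round(sensitivity(true_pos=85, false_neg=15), 2))   # 0.85
print(round(specificity(true_neg=170, false_pos=30), 2))  # 0.85
```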
  • 20. BioIn4Next Transcriptome annotation for non-model organisms 8 More than one equivalent tool
  • 21. BioIn4Next Our bioinformatic contribution to aquaculture 9
Transcriptomes: Solea senegalensis; Solea solea; Tisochrysis lutea; Ruditapes decussatus
Genomes: Solea senegalensis; Photobacterium damselae subsp. piscicida (x2)
SNPs: Mytilus edulis; Crassostrea angulata
Use labels, in slide order: Human food; Human food; Aquaculture feed; Human food; Aquaculture diseases; Human food; Human food
Tetraselmis chuii
  • 22. BioIn4Next Bioinformatics tools based on transcriptomes 10
  • 23. BioIn4Next NGS read pre-processing for 2 sole transcriptomes 11
| | Illumina, S. senegalensis | Illumina, S. solea | 454, S. senegalensis |
|---|---|---|---|
| Total input reads | 1,800,249,230 | 2,101,324,072 | 5,663,225 |
| Mean length (input) | 76 | 100 | 757 |
| Rejected (total), N | 237,941,945 | 345,251,849 | 1,562,661 |
| Rejected (total), % | 13.5 | 17.1 | 26.8 |
| Rejected by contamination, N | 144,247,943 | 226,627,909 | 156,921 |
| Rejected by contamination, % | 8.2 | 11.2 | 3.0 |
| Useful reads, N | 1,561,416,814 | 1,746,258,741 | 3,774,412 |
| Useful reads, % | 86.7 | 83.1 | 67.6 |
| Paired reads, N | 1,503,882,050 | 1,676,160,406 | - |
| Paired reads, % | 83.3 | 79.5 | - |
| Single reads, N | 57,534,764 | 70,098,335 | 3,774,412 |
| Single reads, % | 3.2 | 3.3 | 67.6 |
| Mean length (useful reads) | 66 | 89 | 184 |
Benzekri et al. BMC Genomics 2014, 15:952
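The percentage rows in the pre-processing table are simple ratios of read counts to total input reads. As a small worked example, the sketch below recomputes the share of useful reads for the two Illumina datasets from the counts shown on the slide. The helper name `percent` and the dictionary layout are assumptions made for this illustration.

```python
def percent(part, total):
    """Percentage of `part` within `total`, rounded to one decimal place."""
    return round(100.0 * part / total, 1)


# Read counts copied from the Illumina columns of the slide
illumina = {
    "S. senegalensis": {"input": 1_800_249_230, "useful": 1_561_416_814},
    "S. solea": {"input": 2_101_324_072, "useful": 1_746_258_741},
}

useful_pct = {
    species: percent(counts["useful"], counts["input"])
    for species, counts in illumina.items()
}
print(useful_pct)  # {'S. senegalensis': 86.7, 'S. solea': 83.1}
```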
  • 24. BioIn4Next Overview of the two sole transcriptomes 12
| | S. senegalensis v3, unigenes | % | S. senegalensis v4, unigenes | % | S. solea v1, unigenes | % |
|---|---|---|---|---|---|---|
| Total | 252,416 | 100.00% | 697,124 | 100.00% | 531,463 | 100.00% |
| >500 bp | 37,593 | 14.90% | 156,083 | 22.24% | 165,860 | 31.22% |
| >200 bp | 168,914 | 66.92% | 385,411 | 54.92% | 338,967 | 63.89% |
| Longest unigene | 6,050 | - | 40,163 | - | 68,559 | - |
| Misassembled | 18 | 0.01% | 215 | 0.03% | 116 | 0.02% |
| Putative chimera | 984 | 0.39% | 6,345 | 0.91% | 9,447 | 1.80% |
| Unigene report | | | | | | |
| With an orthologue ¹ | 81,348 | 32.23% | 147,536 | 21.74% | 121,696 | 22.90% |
| Different orthologue IDs | 41,792 | 51.37% | 45,063 | 30.87% | 38,402 | 31.56% |
| Complete ORFs | 6,742 | 8.31% | 39,727 | 26.12% | 52,051 | 42.77% |
| Different, complete ORFs | 4,376 | 5.38% | 18,738 | 12.34% | 22,683 | 18.64% |
| C-terminus | 14,757 | 18.14% | 27,080 | 17.94% | 19,579 | 16.09% |
| N-terminus | 11,298 | 13.88% | 27,638 | 18.52% | 25,131 | 20.65% |
| Internal | 47,529 | 58.43% | 53,091 | 37.42% | 24,935 | 20.49% |
| Putative ncRNA | 539 | 0.21% | 1,252 | 0.18% | 1,075 | 0.20% |
| Without orthologue ¹ | 171,067 | 67.56% | 545,491 | 78.08% | 408,692 | 76.90% |
| Putative new genes | 22,612 | 13.21% | 39,812 | 7.49% | 34,194 | 8.37% |
| Non-redundant put. new genes | nc | - | 14,451 | 2.51% | 14,528 | 3.55% |
| Unknown | 147,916 | 86.48% | 506,679 | 92.51% | 374,498 | 91.63% |
| Reference transcriptome | nc | - | 59,514 | 8.85% | 54,005 | 10.16% |
Slide annotations: Only 454; Only Illumina; 454 + Illumina; Very useful
Benzekri et al. BMC Genomics 2014, 15:952
  • 25. BioIn4Next Soles are transcriptomically similar 13
    [Figure: pie charts of the distribution of GO terms in S. senegalensis and S. solea; three panels (A, B, C) covering Biological process, Cellular component and Molecular function.]
    Benzekri et al. BMC Genomics 2014, 15:952 USES
  • 26. BioIn4Next Soles and zebrafish are highly orthologous 14
    [Figure: distribution of the level of similarity between both sole reference transcriptomes for those transcripts with […] (caption truncated on the slide).]
    Transcripts having a zebrafish orthologue are more similar between soles. Transcripts lacking a zebrafish orthologue are still significantly homologous between soles.
    Benzekri et al. BMC Genomics 2014, 15:952 USES
  • 27. BioIn4Next There are lineage-specific genes in teleosts 15
    | | Likely protein coding | W/O zebrafish orthologue |
    |---|---|---|
    | Orthologs between soles of unknown function | 137 | 351 |
    | Orthologs in proteins of other teleosts: | | |
    | Gadus morhua | 7 | 155 |
    | Oryzias latipes | 10 | 190 |
    | Oreochromis niloticus | 17 | 241 |
    | Tetraodon nigroviridis | 6 | 198 |
    | Gasterosteus aculeatus | 17 | 235 |
    | In at least one of these species | 27 | 290 |
    | Orthologs in Cynoglossus semilaevis DNA (flatfish) | 99 | 287 |
    | Orthologs in teleosts but not in flatfish | 3 | 46 |
    | Specific orthologs only in flatfish | 75 | 43 |
    | Without ortholog | 35 | 18 |
    Slide annotations: sole-specific genes; flatfish-specific genes.
    Benzekri et al. BMC Genomics 2014, 15:952 USES
  • 29. BioIn4Next Development of microarray and qPCR primers 16
    Feature selection algorithm for microarray printing. Benzekri et al. BMC Genomics 2014, 15:952
    Flowchart labels (unigenes of S. senegalensis v3):
    - Complete 6,742; N-terminal 11,268; Internal 47,259; C-terminal 14,757; Coding 22,612; With ORF 22,612; ncRNA
    - Non-redundant coding 21,314; Inconsistent unigenes 51,218
    - Select the most 3', non-redundant unigene: most 3', non-redundant incomplete unigenes 34,291
    - Select non-redundant complete unigene: longest, non-redundant, complete unigenes 5,545
    - Select longer and non-redundant unigenes (CD-HIT)
    - Selection of unigenes qualified as coding and with ORF (ORF-Predictor, Full-LengtherNext)
    - 21,099; 30,119
    - SELECTED unigenes for microarray
    Figure 7: Schematic representation of the probe selection strategy for the construction of the Senegalese sole oligonucleotide microarray. The number of transcripts that resulted after the described filtration is indicated.
    Paper excerpt: "[…] microarray provided repetitive and consistent positive hybridization signals. Conclusions: De novo transcriptomes of S. solea and S. senegalensis covering their main developmental stages and organs were described based on a combined assembly approach that can be applied to other transcriptomic studies. The huge volume of reads processed in each species (>1,800 millions, the highest number of reads reported to date for any organism) produced a high number of transcripts that were mined to obtain a representative reference transcriptome […]"
    Table 4: Validation of microarray data using qPCR
    | SoleaDB code | Gene name | Gene | Microarray FC | Microarray p-value | qPCR FC | qPCR p-value |
    |---|---|---|---|---|---|---|
    | Unigene18736 | Angiotensin I converting enzyme 2 | ace2 | 4.5 | <0.001 | 4.9 | <0.05 |
    | Unigene49603 | Angiotensinogen | agt | 3.5 | <0.01 | 4.7 | <0.05 |
    | Unigene39473 | Na-K-Cl cotransporter2 | nkcc2 | 2.5 | <0.01 | 3.13 | <0.01 |
    | Unigene252320 | Transferrin | tf | 15.6 | <0.001 | 10.5 | <0.01 |
    | Unigene214993 | Ferritin | fth | 2.1 | <0.01 | 2.3 | <0.05 |
    | Unigene39196 | Heat shock protein 90-alpha | hsp90aa | 2.7 | <0.01 | 2.3 | <0.01 |
    | Unigene54412 | Trypsinogen1a | try1 | 17.6 | <0.001 | 12.0 | <0.001 |
    | Unigene31826 | Trypsinogen2 | try2 | 4.7 | <0.001 | 7.8 | <0.05 |
    | Unigene53434 | Chymotrypsinogen2 | ctr2 | 7.2 | <0.001 | 6.3 | <0.05 |
    | Unigene52166 | Elastase1 | cela1 | 8.7 | <0.001 | 7.8 | <0.05 |
    | Unigene53593 | Elastase4 | cela4 | 7.1 | <0.001 | 4.6 | <0.05 |
    | Unigene54920 | Complement component C3 | c3 | 3.8 | <0.05 | 34.0 | <0.05 |
    | Unigene53521 | Lysozyme g | lyg | 2.5 | <0.05 | 3.6 | <0.05 |
    | Unigene219622 | Thyroid stimulating hormone, beta | tshb | 2.5 | <0.05 | 4.6 | <0.001 |
    | Unigene52404 | Transaldolase | taldo | 2.1 | <0.05 | 2.5 | <0.05 |
    Fold-changes (FC) and p-values obtained for target genes by microarray and qPCR are indicated. The transcript code in SoleaDB for the S. senegalensis v3 transcriptome is also shown. For qPCR, data were normalized to those of gapdh2 and referred to the calibrator group (36 ppt 3 DPH).
    [Figure: Microarray validation. Multi-panel bar charts of relative gene expression per gene at 37 ppt and 10 ppt; genes shown include C1qlike, c2, c3, c5, c9, factor h, ptgs1a, ptgs2, il1b, il11a and il8b.]
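The qPCR fold changes in Table 4 are normalized to a reference gene (gapdh2) and referred to a calibrator group. The slide does not say which formula was used; a common choice is the 2^-ΔΔCt method, sketched below as an assumption. The function name and the Ct values are hypothetical and serve only as illustration.

```python
def fold_change_ddct(ct_target_sample, ct_ref_sample,
                     ct_target_calibrator, ct_ref_calibrator):
    """Relative expression by the 2^-ddCt method.

    Assumes (not stated in the source) that target and reference
    primers amplify with roughly 100% efficiency.
    """
    # Normalize the target gene to the reference gene in each group
    d_ct_sample = ct_target_sample - ct_ref_sample
    d_ct_calibrator = ct_target_calibrator - ct_ref_calibrator
    # Refer the sample to the calibrator group
    dd_ct = d_ct_sample - d_ct_calibrator
    return 2.0 ** (-dd_ct)


# Hypothetical Ct values (not taken from the study):
# the target crosses the threshold 2 cycles earlier in the sample
# than in the calibrator, while the reference gene is unchanged.
print(fold_change_ddct(22.0, 18.0, 24.0, 18.0))  # 4.0
```

A fold change above 1 means higher expression in the sample than in the calibrator group; values below 1 mean lower expression.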
  • 31. BioIn4Next Benzekri et al. BMC Genomics 2014, 15:952 Current contents of SoleaDB 18
  • 38. BioIn4Next Browsing S. senegalensis transcriptome v 4.1 19
    Callouts: About the assembly; Download the complete transcriptome; Download all annotations; Download the full information for a subset of transcripts; Download raw reads.
  • 43. BioIn4Next Browsing by transcript 20
    Callouts: Filtering options for displayed transcripts; More specific filtering/searching; Paginated; Included in the representative transcriptome.
    Benzekri et al. BMC Genomics 2014, 15:952
  • 47. BioIn4Next Markers: SNPs and SSRs 22
    Callout: Filtering options for displayed SSRs.
    Benzekri et al. BMC Genomics 2014, 15:952
  • 48. BioIn4Next SoleaDB: a huge source of molecular markers 23
    Paper excerpt (cropped on the slide, so the text is truncated): "[…] representation of GATA repeats (<0.2% total repeat motifs) confirmed by FISH analysis (Additional file 9). Comparison of SSRs […] Blast-based orthologs in soles (Table 3 […]". The adjacent column is cut mid-line; the legible fragments mention 5,545 complete non-redundant transcripts, the 34,291 longest non-redundant transcripts, 13,284 selected "Coding" transcripts, 43,303 probes, a microarray test at two salinities (10 and 36 ppt), and hybridization detected for 42,469 probes.
    Table 3: SSR summary statistics for whole and reference transcriptomes
    | Type of SSR | S. senegalensis | S. solea |
    |---|---|---|
    | Whole transcriptome | 266,434 | 316,388 |
    | Di-nucleotide | 107,828 | 126,260 |
    | Tri-nucleotide | 96,076 | 114,198 |
    | Tetra-nucleotide | 39,102 | 44,118 |
    | Others | 23,428 | 31,812 |
    | Reference transcriptome | 49,955 | 67,610 |
    | Di-nucleotide | 16,405 | 22,371 |
    | Tri-nucleotide | 22,394 | 29,764 |
    | Tetra-nucleotide | 6,935 | 8,829 |
    | Others | 4,221 | 6,646 |
    | Blast-based orthologs | 12,418 | 18,486 |
    | Species-specific SSR (1) | 1,273 | 4,803 |
    | Conserved SSR | 11,145 | 13,683 |
    | Same repeat motif (2) | 6,596 | 6,772 |
    | Different repeat motif | 4,549 | 6,911 |
    Total number of SSRs and frequency according to their repeat motif are indicated. (1) SSRs present in one species but not in orthologs of the other species. (2) Exactly the same SSR repeat motif was found in both orthologs; in a few cases, the SSR occurs once in one ortholog and twice in the other.
    Benzekri et al. BMC Genomics 2014, 15:952 http://www.biomedcentral.com/1471-2164/15/952 USES
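Table 3 groups SSRs by the length of the repeated motif (di-, tri- and tetra-nucleotide). As a rough illustration of how such repeats can be located in a transcript, here is a minimal regular-expression sketch. It is not the pipeline used for SoleaDB: the minimum repeat counts, function names and test sequence are assumptions chosen for the example.

```python
import re
from collections import Counter

# Minimum number of tandem copies per motif length.
# Illustrative thresholds only; the source does not state the ones used.
MIN_REPEATS = {2: 6, 3: 5, 4: 4}
CLASS_NAMES = {2: "Di-nucleotide", 3: "Tri-nucleotide", 4: "Tetra-nucleotide"}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, motif, copies) for each perfect SSR found in seq."""
    seq = seq.upper()
    hits = []
    for motif_len, n in sorted(min_repeats.items()):
        # A motif of motif_len bases followed by at least n-1 more copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (AAA, ATAT...)
            if any(motif == motif[:k] * (motif_len // k)
                   for k in range(1, motif_len) if motif_len % k == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // motif_len))
    return hits


def summarize(hits):
    """Count SSRs per class, mirroring the row labels of Table 3."""
    return Counter(CLASS_NAMES[len(motif)] for _, motif, _ in hits)


# Hypothetical test sequence: 7 copies of AC and 4 copies of GATA
demo = "GG" + "AC" * 7 + "TT" + "GATA" * 4 + "C"
print(find_ssrs(demo))  # [(2, 'AC', 7), (18, 'GATA', 4)]
```

The filter on motifs built from a shorter unit avoids reporting the same repeat twice, for example a long AC run counted again as an ACAC tetra-nucleotide.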
  • 56. BioIn4Next Browsing ECs 26 More information about this enzyme activity
  • 59. BioIn4Next Overview of KEGG pathways 27
    Callouts: List of S. senegalensis v4.1 enzymes for this pathway; The complete overview of this pathway.
    Benzekri et al. BMC Genomics 2014, 15:952
  • 62. BioIn4Next Browsing by protein motifs and families 29
  • 64. BioIn4Next Study of apolipoprotein A-IV paralogs 30
    Excerpts from Román-Padilla et al. CBP Part B (2016) 191:84-98 (J. Roman-Padilla et al. / Comparative Biochemistry and Physiology, Part B 191 (2016) 84–98), cropped on the slide:
    "[…] was then carried out using SEQBOOT (100 replicates) in the PHYLIP package (Felsenstein, 1989) followed by a Phyml reconstruction (100 replicates) (Guindon and Gascuel, 2003). The consensus phylogenetic tree was subsequently obtained (CONSENSE). Trees were drawn using Figtree v1.4.2 (http://tree.bio.ed.ac.uk/software/figtree/). Accession numbers for sequences used in the phylogeny are indicated in Supplementary file 1. Putative signal peptide was identified using SignalP (http://www.cbs.dtu.dk/services/SignalP/). Genomic sequences were retrieved after blasting sequences onto a de novo genome assembly for a female sole using Oases software with a 51 k-mer (Benzekri et al., unpublished results). To identify intron and exon boundaries, the two genomic scaffolds containing the apoA-IV gene cluster sequences were aligned with apoA-IV cDNA sequences using Seqman software. Also, a blast analysis (blastx) at NCBI was carried out to establish gene synteny and identify other gene coding regions. The two scaffold sequences have been deposited at NCBI/EMBL/DDBJ with accession numbers LC056058 and LC056059. Synteny analysis was carried out using ensembl (v79.01) and Genomicus genome browser (http://www.genomicus.biologie.ens.fr/genomicus-79.01/cgi- […]"
    "[…] development. For apoA-IVAa1 and apoA-IVAa2, the incubation time oscillated between 60 and 105 min (depending on the larval stage), while for apoA-IVBa3 and apoA-IVBa4 a fixed time of 60 min was used in all stages. In all cases, fasted and fed larvae at 3, 5 and 9 dph were always managed in parallel and the same time for color development was given. Twenty animals/sample-treatment/gene were used for each WISH analysis. Digital images were captured using a Leica DFC290 HD digital camera attached to a Leica DMIL LED inverted microscope."
    "2.4. RNA isolation and RT-qPCR analysis. Homogenization of samples, RNA isolation and cDNA synthesis procedures were carried out as previously described (Armesto et al., 2014, 2015). Real-time analysis was carried out on a CFX96™ Real-Time System (Bio-Rad) using Senegalese sole specific primers for each apoA-IV transcript (Table 1). Real-time reactions were accomplished in a 10-μL volume containing cDNA generated from 10 ng of original RNA template, 300 nM each of specific forward and reverse primers, and 5 μL of SYBR Premix Ex Taq (Takara, Clontech). The amplification protocol […]"
    Table 1: EST information and primer sequences for apoA-IV paralogs. The total number of ESTs (N) encoding each paralog found at SoleaDB (v4.1; Benzekri et al., 2014) and the unigene ID (v3 and v4.1) for sequences used for CDS (*), 5′-UTR (†) and 3′-UTR (§) identification are indicated. Primer sequences used for probe amplification (¥) and qPCR (‡) analysis and their corresponding amplicons (bp) are also shown.
    | Paralog | SoleaDB | N | Primer name | Primer sequence (5′ → 3′) | Size |
    |---|---|---|---|---|---|
    | apoA-IVAa1 | solea_v3.0_unigene29941*; solea_v4.1_unigene546584†§ | 35 | apoa41fc2(‡) | ATGGACCCAGAGGCGCTGAAGACCGTA | |
    | | | | apoa41rc2(‡) | GGCCTGCAGCTCATCAGTGCTCTTGT | 90(‡) |
    | | | | apoa41_3(¥) | GGACAGGAAGTCAATACCAGGATCGCTCA | |
    | | | | apaa41_4(¥) | TAAACAGGAGGTGGAAAGTTGGCTGGAGT | 669(¥) |
    | apoA-IVAa2 | solea_v4.1_unigene431170*; solea_v4.1_unigene546431_split_0†; solea_v4.1_unigene534078§ | 14 | apoA42F(‡) | CCATGCGCACTCAGGTGGCTCCTC | |
    | | | | apoA42R(‡) | CCTCGGCATAGGGCTGCAGATTGGT | 132(‡) |
    | | | | apoA42_1(¥) | CGACAGTCTGAGCTGGGAAAGG | |
    | | | | apoA42_2(¥) | GGCGGCAGCAGGAGAAAATAAC | 667(¥) |
    | apoA-IVBa3 | solea_v3.0_unigene3621*; solea_v4.1_unigene14920†§ | 24 | apoa43_1(‡,¥) | GTCCTCGTTGTGCTCGTCCTTGCTGT | |
    | | | | apoa43_R(‡) | CGTGTCCATCACTGGCTTGGGTGCATC | 87(‡) |
    | | | | apoa43_2(¥) | GCCTGCACCTCCTCGATGTATGGGGAA | 719(¥) |
    | apoA-IVBa4 | solea_v3.0_unigene34222*; solea_v4.1_unigene547274†§ | 18 | SseapoA44_F(‡) | AGCTGAGACACAGAGCCAACCTGGTGA | |
    | | | | SseapoA44_2(‡,¥) | CATTAGCTGGGCTTGGATGTCCTGGGT | 107(‡) |
    | | | | SseapoA44_1(¥) | ATGCCAACCTTCTCTATGCGGATCCAC | 689(¥) |
    Fig. 4. Phylogenetic relationships among the predicted sequences of Senegalese sole apoA-IV paralogs and the corresponding deduced amino acid sequences from other vertebrates (see Supplementary file 1) using the Maximum Likelihood method. The apolipoprotein type and taxonomic group (fish or tetrapod) are indicated on the right. The clusters A and B as well as the four subclades (a1–a4) in Acanthopterygii are also shown. The apoE sequences were used as outgroup to root the tree. Only bootstrap values higher than 50% are indicated on each branch. The scale for branch length (0.4 substitutions/site) is shown below the tree. Species abbreviations: Sse, Solea senegalensis; Cse, Cynoglossus semilaevis; Gac, Gasterosteus aculeatus; Tru, Takifugu rubripes; Ame, Astyanax mexicanus; Dre, Danio rerio; Xtr, Xenopus tropicalis; Hsa, Homo sapiens; Rno, Rattus norvegicus; Mmu, Mus musculus; and Gga, Gallus gallus.
    "[…] and Acanthopterygii. In the former, two or three species-specific paralogs can be found within each cluster depending on the species […]"
    "[…] that expression of apoA-IV in YSL could be involved in the efficient mobilization of TAG-rich molecules (throughout the formation of VLDL […]"
    Fig. 14. Transcript abundance of apoA-IV paralogs in different tissues of Senegalese sole juveniles. Data are represented in logarithmic scale. Expression values were normalized to those of 18S rRNA. Data were expressed as the mean fold change (mean + SEM, n = 3) from the calibrator group (kidney). Different letters denote tissues that are significantly different from liver (P < 0.05).
    USES
  • 67. BioIn4Next Ready for gene expression and more 32
  • 71. BioIn4Next Retrieving SoleaDB by sequence homology 33
    Callouts: Paste your sequence, or upload your file of sequences; Select your preferred assemblies; Select your E-value filter.
    Benzekri et al. BMC Genomics 2014, 15:952
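The same kind of E-value filter can be applied offline to a list of hits. The sketch below assumes the hits are available in the standard 12-column BLAST tabular layout, where the E-value is the 11th column; whether SoleaDB exports this layout is not stated on the slide. The function name, the default threshold and the example identifiers are hypothetical.

```python
import csv
import io


def filter_hits(tabular_text, max_evalue=1e-6):
    """Keep (query, subject, evalue) for rows with E-value <= max_evalue.

    Expects tab-separated rows in the 12-column BLAST tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore
    """
    kept = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        # Skip blank lines and comment lines
        if not row or row[0].startswith("#"):
            continue
        evalue = float(row[10])  # 11th column holds the E-value
        if evalue <= max_evalue:
            kept.append((row[0], row[1], evalue))
    return kept


# Two made-up hits: one strong, one weak
example = (
    "query1\tunigene_a\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-50\t380\n"
    "query1\tunigene_b\t80.0\t50\t10\t0\t5\t54\t1\t50\t0.5\t30\n"
)
print(filter_hits(example))  # [('query1', 'unigene_a', 1e-50)]
```

A stricter threshold keeps only close homologues, while a looser one helps when searching for divergent genes in non-model organisms.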
  • 72. BioIn4Next Retrieving SoleaDB by keywords 34Benzekri et al. BMC Genomics 2014, 15:952
  • 74. BioIn4Next Soles retained the crystallin genes 35
    Figure 6: Phylogenetic tree of Crybb and Crybb-like proteins in vertebrates. A neighbor-joining tree based on the alignment of vertebrate Crybb and Crybb-like sequences was built. Species are indicated as Sse (Solea senegalensis), Sso (Solea solea), Dre (Danio rerio), Tni (Tetraodon nigroviridis), Oni (Oreochromis niloticus), Ola (Oryzias latipes), Cse (Cynoglossus semilaevis), Xla (Xenopus laevis) and Gga (Gallus gallus; see Additional file 7 for accession numbers). Solea sequences are indicated according to the transcript name assigned in SoleaDB. Clusters are indicated as arcs of a circle. The tree obtained was rooted using Xenopus laevis Cryga. Numbers adjacent to nodes indicate percentage bootstrap support; only values larger than 70% […]
    Slide annotations: Fish-specific crystallin? (on two clusters); Absent in flatfish.
    Benzekri et al. BMC Genomics 2014, 15:952, page 10 of 18. http://www.biomedcentral.com/1471-2164/15/952 USES
  • 76. BioIn4Next Tisochrysis lutea database 36 Tisochrysis lutea http://www.scbi.uma.es/isochrysisdb/ Quite similar to other microphytes (microalgae) H. Benzekri (2016)
  • 81. BioIn4Next Most Ruditapes genes seem to be identified 38
    1 Illumina library: 127 × 10⁶ reads, 2 × 75 nt. Too many small transcripts. Unique orthologues: 12 764 (32%). Ruditapes philippinarum: 9 747 genes.
    H. Benzekri (2016) USES
  • 82. BioIn4Next Bioinformatics tools based on genomes 39
  • 84. BioIn4Next Two Photobacterium damselae subsp. piscicida 40
    Slide annotation: N50 is provided by the longest contig.
    Thesis excerpts, H. Benzekri (2016), Results and Discussion (pages 144 and 152; some lines are cropped on the slide):
    Table IV.25: Summary of the pre-processing of the original reads of L091106-03H and DI21
    | Strains | Reference to figures IV.43 and IV.44 | Library | L091106-03H | DI21 |
    |---|---|---|---|---|
    | Total reads | #1 | Paired | 148,622 | 433,717 |
    | | | Single | 297,269 | 187,433 |
    | Mean length | | Paired | 509 | 445 |
    | | | Single | 1,195 | 550 |
    | Rejected reads | #2 | Paired | 48,403 (32.6%) | 238,804 (55%) |
    | | | Single | 49,530 (16.7%) | 53,251 (28.4%) |
    | Contamination | | Paired | 21,556 (14.5%) | 62,761 (14.5%) |
    | | | Single | 46,766 (15.7%) | 37,791 (20.1%) |
    | Total useful reads | #3 | | 382,755 | 396,450 |
    | Paired reads | #4 | | 69,318 (23.3%) | 132,550 (15.3%) |
    | Single reads | | | 313,437 | 263,900 |
    | From the paired library (unpaired reads) | | | 65,553 (44.1%) | 129,264 (29.8%) |
    | From the single library | | | 247,884 (83.4%) | 134,636 (71.8%) |
    IV.3.1.2. Assembly. The first genome we assembled was that of L091106-03H, since it was the first for which we received sequencing data. Knowing the genome size of DI21 (4.77 Mb) allowed us to estimate the read coverage of L091106-03H, which turned out to be 14x, a very low coverage according to the parameters we had previously calculated for a correct assembly (section XXXX). To assemble these reads, and based on the results of the tests on Roche/454 genomic reads (section IV.1.2.2), the program CABOG [55] was selected, since it is the most accurate when coverage is low. The assembly strategy used is illustrated in figure IV.43; CABOG generated 510 contigs and 25 scaffolds, which made up version 1 of the draft genome of L091106-03H.
    […] 80 bp and a G+C percentage of 40.6%. In both drafts, the longest scaffold exceeds [half] of the genome (>2 Mb), so the N50 equals the length of this scaffold. It can therefore be said that the assembly was equivalent for both strains. Since the draft genome of DI21 at NCBI (GCA_000300355.3) contained 56 scaffolds with 846,993 undetermined bases (N), we can say that, as the new draft has only 19 scaffolds and fewer Ns (561,264), the assembly of this strain has been improved.
    Table IV.27: Characteristics of the final draft genome of L091106-03H and DI21
    | Strains | L091106-03H (v2) | DI21 |
    |---|---|---|
    | Number of scaffolds > 500 bp | 14 | 17 |
    | Longest scaffold | 2,323,982 | 2,798,534 |
    | Shortest scaffold | 1,007 | 437 |
    | Sum of lengths | 4,194,408 | 4,316,437 |
    | Number of N | 341,126 | 561,264 |
    | Mean length | 299,600 | 227,180 |
    | N50 | 2,323,982 | 2,798,534 |
    | N90 | 157,598 | 152,634 |
    | G+C content | 40% | 40.6% |
    […]1.4. Annotation of the two draft genomes. The draft genomes of L091106-03H and DI21 were annotated with the automatic annotation system RAST (Rapid Annotation using Subsystem Technology) [125] […]
    Figure IV.47: Similarity between L091106-03H v2 and other bacteria, based on the percentage identity of the protein alignments.
    Figure IV.48: Dot-plot representation of the nucleotide alignments between L091106-03H (v2) and DI2
    Figure IV.50: Visualization of the synteny between the draft genomes of L091106-03H v2 and DI21, based on the correspondence obtained with Photobacterium damselae proteins (minimum identity of 97% in both species). Matches were plotted with Circos [190].
    To check the collinearity between the genes of the two strains, the arrangement of the CDSs in the genomes of L091106-03H v2 and DI21 was examined using SEED Viewer [191], which is integrated with the RAST annotation program (http://rast.nmpdr.org). Figure IV.51 shows two examples of the arrangement of two groups of orthologous genes in the genomes of the strains. In the first example (figure IV.51-A), the order of the orthologous genes is well conserved between the two strains. In the second example (figure IV.51-B), the group of orthologous genes (6, 22, 31, 30, 29, 32 and 35) is located amid other genes that differ, which indicates that this group lies in two distinct regions in scaffold 5 of L091106-03H v2 and scaffold 11 of DI21; in addition, the order of these orthologous genes is not conserved, since gene 30 has a different relative position in the two genomes. This pattern occurred very rarely, but it supports the hypothesis that some rearrangements took place in the genomes during the evolution of the two strains. In contrast, the first pattern, in which the order of the orthologous genes is conserved, was by far the most common, indicating that the genomes of the two strains are collinear overall.
    H. Benzekri (2016) USES
  • 85. BioIn4Next Two Photobacterium damselae subsp. piscicida 40
    RESULTS AND DISCUSSION
    Table IV.25: Summary of the pre-processing of the original reads of L091106-03H and DI21 (labels #1 to #4 refer to figures IV.43 and IV.44)
      Strain                                      L091106-03H        DI21
      Total reads (#1), paired                    148 622            433 717
      Total reads (#1), single                    297 269            187 433
      Mean length, paired                         509                445
      Mean length, single                         1 195              550
      Rejected reads (#2), paired                 48 403 (32.6 %)    238 804 (55 %)
      Rejected reads (#2), single                 49 530 (16.7 %)    53 251 (28.4 %)
      Contamination, paired                       21 556 (14.5 %)    62 761 (14.5 %)
      Contamination, single                       46 766 (15.7 %)    37 791 (20.1 %)
      Total useful reads (#3)                     382 755            396 450
      Paired reads (#4)                           69 318 (23.3 %)    132 550 (15.3 %)
      Single reads                                313 437            263 900
        from the paired library (unpaired reads)  65 553 (44.1 %)    129 264 (29.8 %)
        from the single library                   247 884 (83.4 %)   134 636 (71.8 %)
    IV.3.1.2. Assembly
    The first genome we assembled was that of L091106-03H, since it was the first for which we received sequencing data. Knowing the genome size of DI21 (4.77 Mb) allowed us to estimate the read coverage of L091106-03H at 14x, a very low coverage according to the parameters we had previously calculated for a correct assembly (section XXXX). To assemble these reads, and based on the results of the tests on Roche/454 genomic reads (section IV.1.2.2), we selected CABOG [55], since it is the most accurate program when coverage is low. The assembly strategy is illustrated in figure IV.43: CABOG generated 510 contigs and 25 scaffolds, which formed version 1 of the L091106-03H genome draft.
    … 80 bp and a G+C percentage of 40.6%. In both drafts, the longest scaffold exceeds half of the genome (>2 Mb), so the N50 equals the length of this scaffold. The assembly can therefore be said to be equivalent for both strains. Since the DI21 genome draft in NCBI (GCA_000300355.3) contained 56 scaffolds with 846 993 undetermined bases (N), we can say that the new draft, with only 19 scaffolds and fewer N (561 264), improves the assembly of this strain.
    Table IV.27: Characteristics of the final genome drafts of L091106-03H and DI21
      Strain                            L091106-03H (v2)   DI21
      Number of scaffolds > 500 bp      14                 17
      Longest scaffold (bp)             2 323 982          2 798 534
      Shortest scaffold (bp)            1 007              437
      Sum of lengths (bp)               4 194 408          4 316 437
      Number of N                       341 126            561 264
      Mean length (bp)                  299 600            227 180
      N50 (bp)                          2 323 982          2 798 534
      N90 (bp)                          157 598            152 634
      G+C content                       40%                40.6%
    N50 is provided by the longest contig
    1.4. Annotation of the two genome drafts
    The genome drafts of L091106-03H and DI21 were annotated with the automatic annotation system RAST (Rapid Annotation using Subsystem Technology) [125]
    Figure IV.47: Similarity between L091106-03H v2 and other bacteria, based on the percentage identity of the protein alignments
    Figure IV.48: Dot plot of the nucleotide alignments between L091106-03H (v2) and DI2
    Figure IV.50: Synteny between the genome drafts of L091106-03H v2 and DI21, based on the correspondence obtained with Photobacterium damselae proteins (minimum identity of 97% in both species). The matches were drawn with Circos [190]
    To check collinearity between the genes of the two strains, the arrangement of the CDS in the L091106-03H v2 and DI21 genomes was inspected with SEED Viewer [191], which is integrated with the RAST annotation program (http://rast.nmpdr.org). Figure IV.51 shows two examples of the arrangement of two groups of orthologous genes in the genomes of the strains. In the first example (figure IV.51-A), the order of the orthologous genes is well conserved between the two strains. In the second example (figure IV.51-B), the group of orthologous genes (6, 22, 31, 30, 29, 32 and 35) lies among other, different genes, which indicates that this group is found in two distinct regions, scaffold 5 of L091106-03H v2 and scaffold 11 of DI21. Moreover, the order of these orthologous genes is not conserved, since gene 30 has a different relative position in the two genomes. This second pattern was very rare, but it supports the hypothesis that some rearrangements occurred in the genomes during the evolution of the two strains. The first pattern, with conserved ortholog order, was by far the most common, indicating that the genomes of the two strains are collinear overall.
    Both pathogenic strains are highly syntenic
    H. Benzekri (2016) USES
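The remark that N50 is provided by the longest scaffold follows directly from how Nx statistics are defined. The sketch below is only an illustration: the function name `nx` and the toy scaffold lengths are assumptions, not the actual L091106-03H or DI21 scaffold lists, which the slide does not give in full. It shows that whenever the longest scaffold alone covers more than half of the assembly, N50 equals its length.

```python
def nx(lengths, fraction):
    """Return the Nx statistic of an assembly.

    Nx is the length L such that scaffolds of length >= L together
    contain at least `fraction` of the total assembly length
    (fraction=0.5 gives N50, fraction=0.9 gives N90).
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    ordered = sorted(lengths, reverse=True)  # longest first
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]


# Hypothetical scaffold lengths (bp), chosen so that the longest
# scaffold is longer than half of the total, as in both drafts.
toy_scaffolds = [2_300_000, 900_000, 500_000, 300_000, 150_000, 1_000]

n50 = nx(toy_scaffolds, 0.5)
n90 = nx(toy_scaffolds, 0.9)
print(n50, n90)
```

In this situation N50 says little about the contiguity of the rest of the assembly, which is why the table also reports N90.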
  • 86. BioIn4Next Photobacterium-DB for browsing genomes 41 http://www.scbi.uma.es/photobacterium_damselae/ H. Benzekri (2016); P. Seoane-Zonjic (2016)
  • 92. BioIn4Next Solea senegalensis genome assembling approach 43
    Libraries (all female): 2 × 75 nt; 3 kb paired-ends; long paired-ends
    Reads: 8.7 × 10^8; 11.1 × 10^8; 8.3 × 10^6
    Two RAY assemblies: 213 548 and 278 995 scaffolds
    Breaking into artificial reads
    NUCMER - GAM-NGS - SSPACE - GAPcloser
    Final scaffolds: 34 176 (longest: 638 263 nt; mean length: 14 565 nt; N50: 85 596 nt; total length: 600 Mbp)
    H. Benzekri (2016)
  • 95. BioIn4Next Cynoglossus semilaevis and soles are highly syntenic 44
    [Figure: two synteny panels, "Based on protein identity > 70%" and "Based on transcript identity". Chromosome labels: Chr4, Chr6, Chr8, Chr10, Chr11, Chr12, Chr13, Chr14, Chr15. Values shown: 755, 752, 720, 701, 695, 688, 681, 678, 228]
    RESULTS AND DISCUSSION
    … some may contain genome regions or genes specific to the Senegalese sole that are absent from (or very different in) Cynoglossus semilaevis.
    Figure IV.58: Example of an alignment between scaffold 1145 of S. senegalensis and chromosome 1 of C. semilaevis. The regions shown are approximately 150 kb in size. Note that the aligned fragments …
    H. Benzekri (2016); Manchado et al. (2016), in press. USES
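As a rough illustration of how identity-filtered matches can relate scaffolds to the chromosomes of a reference species, the sketch below keeps only hits above the 70% identity threshold quoted on the slide and assigns each scaffold to the chromosome that collects most of its hits. This is a simplified majority vote written for this note; it is not the procedure used in the thesis, and the function name, the tuple layout and all example hits are hypothetical.

```python
from collections import Counter, defaultdict


def assign_scaffolds(hits, min_identity=70.0):
    """Assign each scaffold to the reference chromosome with most hits.

    `hits` is an iterable of (scaffold, chromosome, percent_identity)
    tuples. Hits at or below `min_identity` are ignored, and scaffolds
    with no remaining hit are left out of the result.
    """
    votes = defaultdict(Counter)
    for scaffold, chromosome, identity in hits:
        if identity > min_identity:
            votes[scaffold][chromosome] += 1
    return {
        scaffold: counts.most_common(1)[0][0]
        for scaffold, counts in votes.items()
    }


# Hypothetical protein hits between sole scaffolds and C. semilaevis
# chromosomes; the names and identities are invented for illustration.
example_hits = [
    ("scaffold_A", "Chr4", 92.1),
    ("scaffold_A", "Chr4", 85.0),
    ("scaffold_A", "Chr6", 71.3),
    ("scaffold_B", "Chr8", 65.0),   # discarded: identity <= 70%
    ("scaffold_B", "Chr11", 88.4),
    ("scaffold_C", "Chr12", 50.0),  # discarded: scaffold_C stays unassigned
]

assignment = assign_scaffolds(example_hits)
print(assignment)
```

Scaffolds left unassigned by such a filter are the ones that, as the thesis excerpt notes, may carry regions specific to the Senegalese sole.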
  • 98. BioIn4Next One step beyond: from scaffolds to chromosomes 45
    Libraries (all female): long reads; AQUAGENET1; AQUAGENET3
    Reads: 8.7 × 10^8; 11.1 × 10^8; 8.3 × 10^6
    Two RAY assemblies: 213 548 and 278 995 scaffolds
    Breaking into artificial reads
    NUCMER - GAM-NGS - SSPACE - GAPcloser
    Final scaffolds: 34 176 (longest: 638 263 nt; mean length: 14 565 nt; N50: 85 596 nt; total length: 600 Mbp)
    ICMapper: super-scaffolds; C. semilaevis chromosomes: 22
    8 538 scaffolds (longest: 638 263 nt; mean length: 54 673 nt; N50: 105 233 nt; total length: 466.7 Mbp)
    H. Benzekri (2016)
  • 102. BioIn4Next S. senegalensis superscaffolds validated by molecular markers 46
    Already established linkage groups: 113/129 SSR validated
    New markers: 88/113 validated
    Females lack Chr W → XY system?
    H. Benzekri (2016); Manchado et al. (2016), in press. USES
  • 103. BioIn4Next Gene structure and synteny of apolipoproteins A-IV 47 USES Román-Padilla et al. CBP Part B (2016) 191:84-98
    … block followed by a long domain containing 9 putative tandem repeats flanked by the unrelated coding regions (UCR) 1 and 2 (Fig. 2). The common block was located into the exon 3 (except for apoA-IVAa1 in the exon 2) and could be divided into the A, B and C segments. Seven out of the 9 putative tandem repeats were 22-mer in length and contained …
    … clusters according the genomic clusters A and B, as described above. In Ostariophysi, the apoA-IV duplicates within each cluster appeared closely related each other in the same branch indicating a high similarity between intraspecific paralogs. In contrast, the apoA-IV duplicates within each cluster in Acanthopterygii could be splitted into two clearly separated subclades (referred to as a1 an a2 for cluster A and a3 and a4 for cluster B). According to this phylogenetic tree, we named each paralog adding the genomic cluster and the Acanthopterygii subclade they belonged to. Nevertheless, it should be noted that not all Acanthopterygii species bear the four paralog types. G. aculeatus lacked the apoA-IVAa1 and had two apoA-IVAa2-like paralogs (referred to as …
    Fig. 1. Gene structure of the four apoA-IV paralogs in Senegalese sole. The wide bars represent the exons, and thin lines the introns. The wide bars in red represent the 5′ and 3′ untranslated regions whereas the ORF is shown in blue indicating signal peptides (dark blue) from the mature peptide (light blue). The size of exons and introns is also indicated. Only the length of the exons is drawn to scale.
    Fig. 2. Deduced amino acid sequences of the four apoA-IV paralogs in Senegalese sole. Dots indicate amino acids identical to those of apoA-IVAa1. Blue arrows indicate the position of introns. The signal peptide cleavage site is marked by a vertical bar. The unrelated coding regions 1 and 2 (UCR1 and UCR2, respectively) as well as the three repeats (A, B, C) of the common 33-codon block are indicated and their residues are numbered. The 22-mer repeats are boxed and the P(Y/H)A motifs are shaded. The conserved proline residues 117, 129 and 183 are indicated by asterisk. The cleave site for matrix metalloproteinase 7 are denoted by $.
    Fig. 3. Physical synteny of apoA-IV paralogs. Cluster A. Synteny for apoA-IVAa1 and apoA-IVAa2 paralogs. Cluster B, synteny for apoA-IVBa3 and apoA-IVBa4 paral… the chromosome or scaffold location are indicated on the right. Each gene is represented by a color within each cluster. The coding direction is indicated by the p… indicate non-syntenic genes. “*” in T. rubripes denotes a gene identified by sequence analysis, not available in Genomicus platform. “**” indicates an Apo… (ENSDARG00000095050). Gene names: apoC-I, apolipoprotein C-I; apoC-II, apolipoprotein C-II; apo14, apolipoprotein 14 kDa; apoEa and apoEb, apolipoprote… (Asp-Glu-Ala-Asp) box polypeptide 6; lipea, lipase, hormone-sensitive a; mep1b, meprin A, beta; msto1, misato 1, mitochondrial distribution and morphology … nine-rich splicing factor 4; and tomm40, translocase of outer mitochondrial membrane 40 homolog.
  • 104. BioIn4Next Genosole: a database for S. senegalensis genome draft 48 http://www.scbi.uma.es/GenoSole/ P. Seoane-Zonjic (2016) COMING SOON