Materials and Methods

One of the purposes of this study was to create a computer program which allows the eficient obtention of the selenoproteins in a genome, in our case, Neotoma lepida. In order to identify these selenoproteins, an homology based approach using a well annotated reference genome was used.

We decided to compare our genome with Mus musculus due to the fact that it is the closest species in evolutive thermes and because of this it is expected to have a similar selenoproteome to Neotoma lepida. Due to the fact that not all the expected selenoproteins could be found we also compared our genome with Rattus norvegicus, since it is also a close species, and with Homo sapiens, since it is the best annotated genome.

In order to acomplish the purpose of our study, several steps were followed:

Obtention of Neotoma lepida genome and the querys

The genome of our species, Neotoma lepida, was facilitated by the university and can be found as a multifasta file in the following directory:


The sequence of Mus musculus selenoproteins, with which the genome of our organism was confronted, was obtained from SelenoDB. Mus musculus selenoproteins where obtained in a multifasta file, so the first step in our program was to separate the different proteins in singlefasta documents. The script used in order to do this can be consulted here.

Change the U for X

The first step after acquiring the querys was to change the U symbol for an X, which was used to represent the selenocystein aminoacid in the proteins obtained from SelenoDB, since some of the programs used, such as Exonerate, T-Coffee and GeneWise, might not be able to read it correctly. In the same step, the erroneous symbols, such us %, @ and #, that were included in the protein sequence were also eliminated.

The script used to make these changes can be consulted here.

Comparing Neotoma lepida genome with Mus musculus selenoproteins sequence

In order to find the selenoproteins in Neotoma lepida we confronted its genome with the sequences of Mus musculus selenoproteins. To do this we used the program BLAST (Basic Local Alignment Search Tool). BLAST uses an heuristic algorithm to compare the two sequences in order to find similarities between them, giving the statistic significance of these similarities. The statistic significance allows us to apply a threshold to keep just those hits which are less probable to be obtained by chance and therefore the ones which are more significant. In this project a significance threshold to keep just those hits with an e-value lower than 0,01 was applied.

There exist different versions of BLAST. The one which was used in this project is tblastn, which compares a sequence of aminoacids in a protein, the sequence of the selenoproteins in Mus musculus, with a sequence of nucleotides, the sequence in the genome of Neotoma lepida.

The model command which needs to be used is

tblastn -query queryfile.fa -db nombbddBLAST -out outputfile

The -query needs to be the reference sequence, so the sequences of the selenoproteins in Mus musculus. The -db needs to be the genome of Neotoma lepida, as it is the database in which we want to search our selenoproteins of interest. The -out needs to be the file that wants to be generated.

Therefore, the final command that we used in our program was:

tblastn -query $dirname/$protein/$protein.U_X.fa -db /cursos/20428/BI/genomes/2016/Neotoma_lepida/genome.fa -evalue 0.01 -outfmt 7 -out $dirname/$protein/$protein.blast.fa

After applying this command for each alignment found we obtained: an scaffold, which tells us the location of the alignment in the Neotoma lepida genome; the Identity, which indicates how similar the two sequences are; the start and end position, between which our hit is compressed and an e-value, which informs of the number of times we expect to find by chance an alignment as good as the one obtained in the database.

Obtention of the segments of interest

After the obtention of BLAST we needed to extract those parts of the genome of Neotoma lepida which match with the selenoproteins in the reference genome.

The first step to do this usually would be the indexation of the Neotoma lepida genome, in order to obtain different sections of our genome, known as contigs or scaffolds, but in this case the index was already done. The command which needs to be used when the indexation of the database is needed is:

fastaindex inputfile.fa indexedfile.index

Where the input file is the database, therefore Neotoma lepida's genome.

Once the genome under study is indexed, it is necessary to separate those contigs of it which contain the sequences with a proper alignment with the reference selenoproteins from the rest of the genome.

To do this, the command fastafetch was used:

fastafetch inputfile.fa indexedfile.index $contig > contigname.fa

Where the input file is the database and the indexedfile is the one obtained using fastaindex.

Therefore, the command which we used was:

fastafetch /cursos/20428/BI/genomes/2016/Neotoma_lepida/genome.fa /cursos/20428/BI/genomes/2016/Neotoma_lepida/genome.index '$contig' > $dirname/$protein/$protein" . "_" . "$contig.fa

After having the contig containing our gene of interest isolated it was necessary to bound the concrete region which contains the selenoprotein. To make sure that the region includes the whole gene the sequence of the hit needs to be extended in both sides, ideally 50.000 nucleotides per side, since eukaryotic genes contain introns. Due to the fact that not all the contigs are big enough, before doing this step, an extra command was used in order to know the length of each contig. This command was:

fastalength $dirname/$protein/${protein}_${contiganterior}.fa

Then, when the contig containing the region of interest is shorter than 100.000 nucleotides the whole contig was considered. If the contig is longer than this, the command fastasubseq was used to obtain the sequence of the alignment.

fastasubseq scaffold.fa startposition length > subseq.fa

Where scaffold.fa is the file containing the contig of interest and start position and length is the parameters of the alignment obtained from BLAST.

Therefore, the command used in our case was:

fastasubseq $dirname/$protein/${protein}_${contiganterior}.fa $startfastasubseq $length > $dirname/$protein/${protein}_${contiganterior}.subseq.fa

Genes prediction

After isolating our region of interest from the rest of the genome exonerate was used in order to compare it with the model protein, obtained from Mus musculus. Exonerate generates the most probable alignment between the two sequences in order to make a prediction of the sequence of nucleotides in our gene. Exonerate also predicts which are the coding regions, the exons, and the non-coding regions, the introns.

The general command used to do exonerate is:

exonerate -m p2g –-exhaustive yes --showtargetgff -q query.fa -t fastasubseq.fa > prediction.exonerate.gff

Where the query.fa is the file with the Mus musculus selenoprotein and fastasubseq is the file which contains the fragment extracted from Neotoma lepida genome. The exhaustive function for exonerate was used in order to obtain the maximum sensitivity for the alignment.

Therefore, the concrete command we used was:

exonerate -m p2g --exhaustive yes --showtargetgff -q '$dirname/$protein/$protein.U_X.fa' -t $dirname/$protein/$filename > $dirname/$protein/exonerate/genspredits/$exonername

Actually, the part of the sequence in which we have interest is the coding part, so we are only interested in the exons. Because of that, we used a second command to generate an other file containing only the exons of the gene. The command used was:

egrep -w exon > prediction.exonerate.gff

Where exon is the file generated with exonerate.

So the command that we used was:

egrep -w exon $dirname/$protein/exonerate/genspredits/$exonername > $dirname/$protein/exonerate/exons/$exonname

Obtaining the sequence of nucleotides forming cDNA

Once we know which is the coding part of our sequence it is necessary to know which cDNA is going to be generated considering the exons predicted. To do that we use fastaseqfromGFF: exons.egrep.gff > exons.fa

Where exons.egrep.gff is the file where the exons of our gen have been printed. So the concrete command used was: $dirname/$protein/$filename $dirname/$protein/exonerate/exons/$exonname > $dirname/$protein/exonerate/cDNA/$cDNAname

Predicting the protein sequence

Having cDNA, all that needs to be done is to translate the nucleotide sequence into the amino acid sequence, since amino acids are what really conformates the proteins. The general command is:

$ fastatranslate –f exons.fa –F 1 > protein.fa

Where exons.fa were the file containing the predicted cDNA.

Therefore the concrete command which we used was:

fastatranslate -f $dirname/$protein/exonerate/cDNA/$cDNAname -F 1 > $dirname/$protein/exonerate/protpredites/$translate

Before applying t-coffee we needed to change the * for U in the file latter generated.

The program used to make these changes can be found here.

Comparing the protein predicted with the model protein

Once the amino acid sequence of the protein has been predicted we applied t-coffee (Tree-based Consistency Objective Function Evaluation for alignment) in order to compare our prediction with the model protein obtained from Mus musculus. The aim of this step is to determine if the two proteins are homologous and if there is a highly conserved aminoacid sequence or not, being specially important for us the selenocystein aminoacid.

The general command is:

t_coffee FASTAsequence1.fa FASTAsequence2.fa

Where the sequence1 is the sequence obtained from Mus musculus and the sequence2 is the one predicted for Neotoma lepida.

Therefore, the concrete command that we used was:

t_coffee '$dirname/$protein/$protein.U_X.fa' $dirname/$protein/exonerate/protcanviX_U/$canviu > $dirname/$protein/Tcoffee/exonerate/$alineament


GeneWise can be used in order to obtain an alternative result for the alignment so as we can have a more contrasted information or an alternative in those cases where the t-coffee is not correct. GeneWise compares a nucleotidic sequence with an aminoacidic sequence. Therefore, exonerate is not needed in this case and the file used is the one generated from fastasubseq command.

The general command is:

genewise -pep -pretty -cdna -gff modelprotein.fa subseq.fa

Where the modelprotein.fa is the Mus musculus protein after changing the U for X, and subseq.fa is the nucleotidic sequence of the Neotoma lepida predicted protein.

In our case, two different commands, one for the reverse strand and one for the forward strand, were used:

Forward strand: genewise -pep -pretty -cdna -gff '$dirname/$protein/$protein.fa' $dirname/$protein/$filename > $dirname/$protein/genewise/$gwise

Reverse strand: genewise -pep -pretty -trev -cdna -gff '$dirname/$protein/$protein.fa' $dirname/$protein/$filename > $dirname/$protein/genewise/$gwise

Automatation of the process

In the previous sections the main parts of the program have been explained separately, but in order to make the selenoprotein obtention easier a Perl script was created. The script can be consulted here.

Here you can obtain the bash script which includes all the exports needed to execute the program and also gives the order for the program initiation.

Searching for the SECIS

SECIS elements are structural motifs formed by around 60 nucleotides which adopt a stem-loop structure that is located in the 3'-UTR of the genes encoding for a selenoprotein. These elements are essential for the UAG codon to be considered as a selenocystein codon, which will be translated, and not as a STOP codon. Because of this, the presence of a SECIS element is necessary for the confirmation of predicted selenoproteins.

In order to identify SECIS and its location Seblastian and SECISearch3 can be used. These are online tools in which you can introduce your predicted selenoprotein sequence and it looks for possible SECIS elements.