The aim of our study is to characterize all the sequences codifying for selenoproteins in the genome of Monopterus albus and to determine its machinery proteins. To perform this study we considered Danio rerio as it is phylogenetically close to our specie. Therefore, its detailed genome has been used to compare its previously notated selenoprotein sequences with Monopterus albus genome in an homology-based approach.
Danio rerio genome was chosen due to its phylogenetic proximity to Monopterus albus. Zebrafish is also a well known model organism and they are both found in the Actinopterygii class. The genome of Monopterus albus was obtained from the following source on the terminal, created by our professors:
The index genome was also obtained from the following source:
The selenoproteins of Danio rerio were obtained through the website of SelenoDB and were all saved in the same file substituting its ID for its name manually.This file was called protzebrafish.fa. Afterwards, we separated each protein from the common file into a separate FASTA file called protein.fa
Our protein list from Zebrafish included: Sel15, eEFsec, SELENOE, GPx1a, GPx1b, GPx2, GPx3a, GPx3b, GPx4a, GPx4b, GPx7, GPx8, DIO1, DIO2, DIO3a, DIO3b, MsrA, PSTK, SBP2, SecS, SEPHS, SEPHS2, SELENOH, SELENOI, SELENOJ1, SELENOK, SELENOL, SELENOM, SELENON, SELENOO1, SELENOO2, SELENOP, SELENOP2, MSRB1a, MSRB1b, MSRB2, MSRB3, SELENOS, SELENOT1, SELENOT1b, SELENOT2, SELENOU1a, SELENOU2, SELENOU3, SELENOW, TXNRD2, TXNRD3, SECp43a, SECp43b.
To start with the prediction of selenoproteins in Monopterus albus we had to load the following modules to run all the programmes:
|module load modulepath/goolf-1.7.20 |
module load BLAST+/2.2.30-goolf-1.7.20
module load Exonerate/2.2.0-goolf-1.7.20
module load T-Coffee/11.00.8cbe486-goolf-1.7.20
Firstly, we performed a tBLASTn for each query in order to do comparisons between Danio rerio amino acid queries and the indexed genome of Monopterus albus:
|tblastn -query queryfile.fa -db BLASTname -out outfilename|
After performing the tBLASTn, we obtained a table containing: the contig in which the hit was found, the percentage of similarity, the start position of the hit, the end position of the hit and the E-value which is a measure of the significance of the hit. We selected those contigs which had the best E-value and the best coverage manually. For the protein families, we compared the hits of the tBLASTn individually and assigned them manually. For families like the GPx or the DIO, lots of similarities were found and hence we got very similar hits for their members, and the same hits gave similar e-values on different proteins. In order to determine which scaffold would be assigned to every protein we made an excel sheet with a table comprising the e-values, the length, the % of similarity and the query protein coverage. Considering this, we first assigned the most clear results and then the most ambiguous ones, based basically on the coverage and the length in those cases where the e-values were very similar.
For those proteins that had the same name but different sequence we only chose the one that had a better E-value and eliminated the rest.
After the manual selection of the best contig for each protein, we created a file with the sequence of the scaffold. This was made with the fastafetch command:
|fastafetch genome.fa genome.index "contig" > contig.fa|
Then, we put the region of interest of that scaffold in a file called genomic.fa. This region of interest was 50.000 nuceotides to each side of the first nucleotide of the first exon. To do so, we first wrote the start position of the sequence which was the position minus 50.000 nucleotides so that we could work with the whole selenoprotein and we also put the length of the sequence which was 100.000. We did it using the command fastasubseq:
|fastasubseq genome.fa start length > genomic.fa|
Once we got the scaffold sequence in a file, we predicted the exons that conform the protein of Monopterus albus with two commands: exonerate and egrep. Exonerate predicts the sequence that would contain our protein of interest by aligning the obtained sequence from fastasubseq with the initial query sequence:
|exonerate -m p2g --showtargetgff -q protein.fa -t genomic.fa|
The output we obtain from exonerate labels the exons and introns in any identified part region of the protein. Afterwards, we used egrep command to extract the exons of the predicted protein:
|egrep -w exon > protein.exonerate.gff|
With FastaseqfromGFF we could translate our coordinates in a GFF format to FASTA format :
|fastaseqfromGFF.pl genomic.fa protein.exonerate.gff|
To translate the previously obtained sequence from nucleotides to amino acids we used the fastatranslate command:
|fastatranslate -f protein.scaffold.nt -F 1 > protein.scaffold.p|
Finally, we used t-coffee to predict the alignment between Danio rerio protein and Monopterus albus predicted protein:
|T_coffee Daniorerioprotein.fa protein.scaffold.p > protein.scaffold.tcoffee|
We used Perl in order to automate the prediction of Monopterus albus selenoproteins and saved it as programa.pl. The program had all the steps of the prediction automatized so that all the files generated in each step were saved in a folder with the name of the protein.
Our program performed all the steps detailed above automatically except from the tBLASTn. We did a common tBLASTn in a tabular format for all our query proteins and saved it as a file. Then, every time we wanted to run the program for a specific protein, we had to search the protein in the tblastn output and look for the scaffold and the starting nucleotide: the inputs we gave to the program.
We decided to design it this way because we thought that manually was the optimal way to decide which scaffold we would assign to each protein, as in our first attempt to analyze it we realized that most of the proteins had multiple hits from different scaffolds, and that the first hit was not always the one we would choose. This is why the process of assigning the scaffold to the protein was done manually, to optimize the process of scaffold-protein assignation. We were able to do it manually because we worked with a relatively small number of proteins (n=53 in zebrafish). If we had had to analyze a larger number of proteins or multiple species this would had to be done automatically, using a much more complex algorism comparing the coverage, the identity and the e-value of the hits.
We also decided that the program would analyze the proteins one at a time. The program asked us for the protein name, the first nucleotide and the scaffold, which we entered manually. After that, the shell printed the t-coffee result for us to analyze it. We found this was a useful way to verify if the obtention process and the alignment were satisfactory, and we decided if we had to modify any parameter of the program, repeat it or do it again using an exhaustive exonerate. Only proteins where the alignment was really bad (ex missing exons, huge gaps,...) were analyzed again using exhaustive exonerate. This time the alignment was satisfactory in almost all the cases.
You can see the automated program here.
Finally, we used the Selenoprotein prediction server seblastian.crg.es to predict the SECIS structures found in each protein. To search them, the file we used for each protein was the one obtained with fastasubseq. We discarded the SECIS that were on 5’ UTR extreme or located in the opposite strand of our protein. In the cases where we found more than one possible SECI we selected both predictions. In the cases where we did not found predicted proteins but there were SECIS predicted we used SECISearch3 to predict SECIS structures and information.
Both SECIS image and SECIS information were saved in different files named protein.png and protein.seblastian respectively.
We created a phylogenetic tree for each protein family with Phylogeny.fr in order to see if every predicted protein was close to its query. Phylogeny.fr is a free, simple to use web service dedicated to reconstructing and analysing phylogenetic relationships between molecular sequences. Phylogeny.fr runs and connects various bioinformatic programs to reconstruct a robust phylogenetic tree from a set of sequences.