The aim of this study was to annotate all the selenoproteins in the genome of Chlorocebus sabaeus using an in silico approach. This approach relies on homology with members of the known eukaryotic selenoprotein families. The analysis was complemented with a SECIS element prediction.
To find the homologous selenoproteins in C. sabaeus, we used the selenoproteins annotated in both Macaca mulatta and Homo sapiens as queries. We used these two organisms instead of a single one because, while the macaque is the closest organism with annotated selenoproteins, the human selenoproteins are more thoroughly studied and better annotated.
The results obtained with macaque were notably similar to those obtained with human. Moreover, most of the selenoproteins predicted in the green monkey were also very similar to the ones annotated in other primates, as expected given their phylogenetic proximity. In most cases, the results obtained with the “manual search” are consistent with the results obtained with Selenoprofiles, although minor differences were observed in some predicted genes.

We have characterized a total of 25 selenoproteins, 11 Cys-homologs and 5 proteins involved in the synthesis of selenoproteins, as well as several pseudogenes and interrupted duplications. Here we summarize all the selenoproteins, Cys-homologs and machinery proteins identified in Chlorocebus sabaeus:

  • Selenoproteins: GPx1, GPx2, GPx3, GPx4, GPx6, DI1, DI2, DI3, SPS2, Sel15, SelH, SelI, SelK, SelM, SelN, SelO, SelR1, SelS, SelT, SelV, TR1, TR2, TR3 and SelW1.
  • Cys-homologues (or other homologues): GPx5, GPx7, GPx8, MsrA, SPS1, SelR2, SelR3, SelU1, SelU2, SelU3, SelW2.
  • Machinery: SecS, Pstk, Secp43, SBP2, eEFSec.
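
The distinction between selenoproteins and Cys-homologs in the list above depends on which residue aligns with the selenocysteine of the query. The following is a minimal sketch of that decision, not the procedure actually used in this study. It assumes that both sequences come from a gapped pairwise alignment of equal length and that selenocysteine is written as U in both. The function name and the toy sequences are hypothetical.

```python
def classify_by_sec_position(query_aln: str, target_aln: str) -> str:
    """Classify a prediction by the residue aligned to the query's Sec (U).

    Both arguments are rows of one pairwise alignment (gaps as '-').
    Assumption: selenocysteine is encoded as 'U' in both rows.
    """
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned sequences must have equal length")

    # Alignment columns where the query carries a selenocysteine.
    sec_columns = [i for i, res in enumerate(query_aln) if res == "U"]
    if not sec_columns:
        return "no Sec in query"

    # Residues of the prediction found in those columns.
    aligned = {target_aln[i] for i in sec_columns}
    if aligned == {"U"}:
        return "selenoprotein"
    if aligned == {"C"}:
        return "Cys-homolog"
    return "other"


# Toy sequences for illustration only (not real selenoprotein sequences).
print(classify_by_sec_position("MKAUGLT", "MKAUGLT"))
print(classify_by_sec_position("MKAUGLT", "MKACGLT"))
print(classify_by_sec_position("MKAUGLT", "MKA-GLT"))
```

A prediction in which the Sec column aligns to a gap or to another residue falls into the "other" class, which would need manual inspection.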

We have also searched for SECIS elements in the 3'UTR of all the predicted genes. In theory, only those proteins that contain a selenocysteine residue should present a SECIS element. This premise was confirmed in all cases.
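
The premise above can be expressed as a simple consistency check. This is a minimal sketch under stated assumptions: the `Prediction` record and the `secis_inconsistencies` helper are hypothetical, and the three example entries are illustrative values chosen to agree with the statement that the premise held in all cases. They are not the study's actual result table.

```python
from typing import List, NamedTuple


class Prediction(NamedTuple):
    """Hypothetical summary of one predicted gene."""
    name: str
    has_sec: bool    # a selenocysteine residue was predicted
    has_secis: bool  # a SECIS element was predicted in the 3'UTR


def secis_inconsistencies(predictions: List[Prediction]) -> List[str]:
    """Return the names of predictions that break the premise.

    The premise holds when a SECIS element is present if and only if
    the protein contains a selenocysteine.
    """
    return [p.name for p in predictions if p.has_sec != p.has_secis]


# Illustrative entries only: one selenoprotein, one Cys-homolog and
# one machinery protein taken from the lists above.
examples = [
    Prediction("GPx1", has_sec=True, has_secis=True),
    Prediction("GPx5", has_sec=False, has_secis=False),
    Prediction("SecS", has_sec=False, has_secis=False),
]

print(secis_inconsistencies(examples))
```

An empty list means that every entry is consistent with the premise. Any name returned would point to a prediction worth reviewing manually.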

Furthermore, we have identified several partial duplications and pseudogenes for the following proteins: SPS1, SPS2, SelI, SelK, SelR1, SelR2, SelR3, SelT, SelU1, SelU2, SelW1, SBP2, eEFSec.

Interestingly, some of these pseudogenes and duplications have conserved the selenocysteine and exhibit very high sequence similarity to the original gene, even in the SECIS element (SelK).
Although a primate pseudogene of GPx1 has been reported, we did not find it in the genome of Chlorocebus sabaeus.

In addition, we should also bear in mind the limitations of our approach.
The main theoretical limitation of our study is that it entirely relies on homology. This means that it has allowed us to identify only those proteins that have already been annotated in other species. Thus, a possibility to take into consideration is that C. sabaeus might contain additional selenoproteins which have not been described yet. Further research should be done to look for the existence of unpredicted selenoproteins.

Furthermore, if a selenoprotein has diverged beyond a certain threshold, it might not be identified by homology and would therefore require a de novo prediction approach.
Another limitation of the homology approach is that a priori information about the selenoprotein sequence is required. Although macaque and human are also primates and their genomes are relatively well annotated, they are not optimal for a homology search. The ideal query organism would be phylogenetically closer to the green monkey, such as another species of the genus Chlorocebus.

We also acknowledge that a robust statistical analysis would have been needed to further distinguish genuine SECIS elements and protein-coding genes from spurious predictions. For instance, normalization to neutral regions of the genome might have helped to detect the presence or absence of selection in our predicted genes, pseudogenes and SECIS elements.

Another important limitation of our study is the poor annotation of selenoprotein isoforms in both SelenoDB and NCBI. In most cases there is no consensus regarding the biologically relevant isoform, and consequently multiple proteins are associated with a single selenoprotein. This problem has produced multiple sequence alignments with several gaps, particularly at the beginning or at the end of the protein.
This observation illustrates the difficulty of in silico selenoprotein annotation and highlights the importance of experimental data for selenoprotein annotation.

Finally, we would like to highlight that we have not noticed any problems regarding the sequencing depth and the assembly of the genome.

In conclusion, our study provides insight into the primate selenoproteome and contributes to the current knowledge of selenoproteins.