Comprehensive comparative homeobox gene annotation in human and mouse

Database (Oxford). 2015 Sep 27;2015:bav091. doi: 10.1093/database/bav091. Print 2015.

Abstract

Homeobox genes encode transcription factors that contain a DNA-binding helix-turn-helix structure called a homeodomain, and they play a crucial role in pattern formation during embryogenesis. Many homeobox genes are located in clusters, and some of these, most notably the HOX genes, are known to have antisense or opposite-strand long non-coding RNA (lncRNA) genes that play a regulatory role. Because automated annotation of both gene clusters and non-coding genes is fraught with difficulty (over-prediction, under-prediction, inaccurate transcript structures), we set out to manually annotate all homeobox genes in the mouse and human genomes. The annotation includes all supported splice variants, pseudogenes, and both antisense and flanking lncRNAs. Duplicated gene clusters are one of the areas where manual annotation has a significant advantage. After comprehensive annotation of all homeobox genes and their antisense genes in human and mouse, we found some discrepancies with the current RefSeq gene set regarding exact gene structures and the assignment of loci as coding or pseudogene. We also identified previously unannotated pseudogenes in the DUX, Rhox and Obox gene clusters, which helped us re-evaluate and update the gene nomenclature in these regions. We found that human homeobox genes are enriched in antisense lncRNA loci compared to their mouse orthologues, and some of these loci are known to play a role in gene or gene cluster regulation. Of the annotated set of 241 human protein-coding homeobox genes, 98 have an antisense locus (41%), while of the 277 orthologous mouse genes, only 62 protein-coding genes have an antisense locus (22%), based on publicly available transcriptional evidence.
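
The two percentages quoted at the end of the abstract follow directly from the reported counts. As a minimal sketch, the arithmetic can be checked as below; the function name `antisense_percentage` is a hypothetical helper introduced here for illustration and does not come from the paper.

```python
def antisense_percentage(with_antisense: int, total: int) -> float:
    """Return the percentage of genes in a set that have an antisense locus."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * with_antisense / total


# Counts as reported in the abstract.
human_pct = antisense_percentage(98, 241)   # human protein-coding homeobox genes
mouse_pct = antisense_percentage(62, 277)   # orthologous mouse genes

print(f"human: {human_pct:.0f}%")  # human: 41%
print(f"mouse: {mouse_pct:.0f}%")  # mouse: 22%
```

This only reproduces the rounding of the published figures; it makes no statement about the statistical significance of the difference.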

Publication types

  • Comparative Study

MeSH terms

  • Animals
  • Databases, Nucleic Acid*
  • Genome, Human*
  • Helix-Turn-Helix Motifs
  • Homeodomain Proteins / genetics*
  • Humans
  • Mice
  • Molecular Sequence Annotation / methods*
  • Multigene Family*
  • Pseudogenes*
  • RNA, Long Noncoding / genetics

Substances

  • Homeodomain Proteins
  • RNA, Long Noncoding