Domino

Data collection

The genes used in this study corresponded to the autosomal genes that were present in the ExAC database, build r0.3,¹ and for which there was at least one rare synonymous and one rare non-synonymous variant. The inheritance mode for genes of the training set was inferred from various sources: hOMIM, a curated database for Mendelian diseases in humans;² RetNet (https://sph.uth.edu/retnet/), listing all genes involved in inherited retinal conditions; the Nosology of genetic skeletal diseases;³ and the complete index of new disease genes for Mendelian disorders published from 2009 to 2015 in the American Journal of Human Genetics. To ensure quality, we manually curated all the sources by discarding (i) all genes linked to both AD and AR inheritance, (ii) genes directly linked to cancer, (iii) disease genes that were not reported in the literature in more than one pedigree without additional very strong molecular evidence, and (iv) genes associated with non-clinical phenotypes.

Features

Properties of DNA variants

We extracted data relative to 47,800,457 variants from ExAC, build r0.3, filtered to be non-intronic and rare (MAF < 0.01). Then, they were classified as missense, synonymous, stop-gain, splice donor, splice acceptor or frameshift insertion/deletion. Missenses and synonymous variants were then annotated with PhyloP46way.placenta.⁴ All genes were annotated with the number of rare variants in ExAC, according to the type of changes. To take into account gene-specific mutational rates,⁵ the same features were divided by the number of synonymous variants, making in the end 11 features. We also extracted the number of heterozygotes and the number of homozygotes (for missenses, nonsenses, and synonymous), the SIFT⁶ and the PolyPhen⁷ scores (for missenses), as annotated in the ExAC dataset. The same was done for the PhyloP score. From these features, we computed the minimum, maximum, average and median scores for each gene, producing in the end 40 additional features. Furthermore, we annotated all genes with “functional gene constraints” from ExAC,⁸ namely synonymous, missense and loss-of-function (LoF) Z-scores and pLI (probability of being LoF intolerant), pRec (probablility to be intolerant of homozygous LoF variants) and pNull (probability of being tolerant to LoF variants). Since some of genes lack those values (mainly due to low coverage), we attributed to them Z-scores of 0 and probabilities of 0.5. At the end of this process, the total number of features related to DNA variants was 57.

Functional properties

We used BioMart9 from the Bioconductor package¹⁰ in the R programming language to extract the following functional features:
1. The number of transcripts per gene
2. The GC content per gene
3. The 50 most represented GO terms
4. The 50 most represented GOslim terms
5. All three GO domains per gene
6. The number of Interpro domains
7. The presence of at least one transmembrane domain for the encoded protein
8. The “Low complexity” tag
9. The presence of a signal peptide sequence
10. The presence of coiled coils
11. The length of 3’ and 5’ UTRs
12. The length of the coding sequence (CDS)
13. The number of paralogs (0, 1 or 2, for 2 or more paralogs)
14. The average, minimum and maximum of dN/dS with human paralogs
15. The average, minimum and maximum of percent identity with human paralogs
16. dN/dS between human and 67 different animals
The total number of functional features was 187.

Expression properties

We used the Gene Expression Database (GXD) from MGI¹¹ to assess whether a gene was expressed up to embryonic day 15 (E15) in the mouse embryo or whether it was not expressed after E15 (using immunohistochemistry, in situ reporters, RT-PCR and western blot experiments), adding 2 new features.

Evolutionary properties

We used PhyloP46way.vertebrate⁴ to obtain the conservation score for 5’UTR regions (-500 to +500bp from the TSS of each gene), which was identified as informative for haploinsufficient genes by Huang et al.¹² We then computed how many of these 1,000 values were lower than -0.5, 0 or 0.5, higher that 1, 1.3, 1.6 or 1.9 and also the mean and the median values, adding 9 features.

Interaction properties

We downloaded data from STRING v10¹³ to obtain interactions between genes and their relative confidence scores. We then computed, for each gene, the number of its interaction with AD genes of the training set, for the 8 different types of interaction present in the database, namely: neighborhood, gene fusion, co-occurence, co-expression, experiments, databases, textmining and combined score. For each type, we selected various confidence thresholds (150, 200, 250, 300, 350, 400, 450, 500 or 600) and various maximal numbers of interactions (3, 4, 5, 6, 7 or 8) to make 432 supplementary gene features.

Nonsense-mediated mRNA decay (NMD)

We downloaded the list of p-values of the meta-analysis performed by Colombo et al.¹⁴ to identify NMD targets in the human genome. This dataset contains 9 p-values per gene linked to differential mRNA expression for knock-down, rescue or meta experiments, making 9 new features for our model. For genes with missing values, we set all p-values at 1.

mRNA half-life

We used data obtained from the experiments by Sharova et al.¹⁵ reporting the mRNA half-life for 19,977 genes, expressed in mouse embryonic stem cells. mRNA half-life was categorized into 4 classes (>10h, >15h, <3h or <5h), making 4 more features for our model.

In total, we obtained 700 features per gene, in addition to their inheritance mode (for the complete list, please refer to the Supplementary table S2 of the paper).

Preselection of features

Features were preselected not to be constant in either AD or AR genes of the training set. Then a second filter was applied by taking only features which were significantly different between AD and AR genes of the training set (Wilcoxon signed-rank test, p<0.05, with Bonferroni correction). This allowed 308 features to enter in the LDA algorithm (Figure 1c of the paper).

LDA algorithm

The LDA model was composed of four steps:
1. For the first iteration: go to step 3. For the following iterations, test the individual removal of each feature of the previous model and remove it only if the AUC (Area Under the Curve) is decreasing by less than 0.0005. If so, go to step 4.
2. Test the individual replacement of each feature from the previous model with each feature not present in the previous model, from the pool of 308 pre-selected features. If the AUC is increasing, keep the change and go to step 4.
3. Add the non-selected feature that increases the most the AUC.
4. Compute the average AUC for training, testing and validation sets in 10x 10-fold cross-validation and retain the model. Go back to step 1.

Conversion of LDA score to probabilities

We approximated the distributions of scores for AD and AR genes with normal distributions by taking the average and the standard deviation of the scores (Figure 2a of the paper). More specifically, given a gene with score s, we computed its probability of being associated with dominant conditions (P(AD)) as the ratio between the probability density function (PDF) for AD genes divided by the sum of both AD and AR genes’ PDFs:

Weight of features in the LDA model

The weight for each feature was computed with the following formula (m=mean, c_i=LDA coefficient for the i^th feature of the pool of j selected features):

De novo variants analysis in intellectual disability and epilepsy

We downloaded denovo-db v1.5 (http://denovo-db.gs.washington.edu/denovo-db/Download.jsp) and extracted variants related to intellectual disability or epilepsy from exomes or genomes. We filtered them to have an impact on protein or expression (missense, nonsense, frameshift or splicing) and to be never seen in ExAC and ESP databases (see Web Resources) and selected genes with mutations present in at least two individuals with the same disease. We performed a hypergeometric test to assess if the ratio of genes with P(AD) ≥ 0.95 (h) over genes with P(AD) < 0.95 (l) was significantly higher in selected genes for intellectual disability or epilepsy compared to all autosomal human genes with an available P(AD) (H for ≥ 0.95 and L for < 0.95). We defined the enrichment factor as (h/l) / (H/L).

Software

RStudio (www.rstudio.com) version 0.99.893 was used for computational analysis and graphics with R version 3.3.2 (2016-10-31). The packages pROC¹⁶ and bioMart¹⁷ were also used.

References

1. Lek, M., Karczewski, K.J., Minikel, E.V., Samocha, K.E., Banks, E., Fennell, T., O'Donnell-Luria, A.H., Ware, J.S., Hill, A.J., Cummings, B.B., et al. (2016). Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285-291.
2. Blekhman, R., Man, O., Herrmann, L., Boyko, A.R., Indap, A., Kosiol, C., Bustamante, C.D., Teshima, K.M., and Przeworski, M. (2008). Natural selection on genes that underlie human disease susceptibility. Curr. Biol. 18, 883-889.
3. Bonafe, L., Cormier-Daire, V., Hall, C., Lachman, R., Mortier, G., Mundlos, S., Nishimura, G., Sangiorgi, L., Savarirayan, R., Sillence, D., et al. (2015). Nosology and classification of genetic skeletal disorders: 2015 revision. Am. J. Med. Genet. A 167A, 2869-2892.
4. Pollard, K.S., Hubisz, M.J., Rosenbloom, K.R., and Siepel, A. (2010). Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110-121.
5. Hurst, L.D. (2002). The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends Genet. 18, 486.
6. Ng, P.C., and Henikoff, S. (2003). SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 31, 3812-3814.
7. Adzhubei, I., Jordan, D.M., and Sunyaev, S.R. (2013). Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet Chapter 7, Unit7 20.
8. Samocha, K.E., Robinson, E.B., Sanders, S.J., Stevens, C., Sabo, A., McGrath, L.M., Kosmicki, J.A., Rehnstrom, K., Mallick, S., Kirby, A., et al. (2014). A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944-950.
9. Smedley, D., Haider, S., Durinck, S., Pandini, L., Provero, P., Allen, J., Arnaiz, O., Awedh, M.H., Baldock, R., Barbiera, G., et al. (2015). The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic Acids Res. 43, W589-598.
10. Huber, W., Carey, V.J., Gentleman, R., Anders, S., Carlson, M., Carvalho, B.S., Bravo, H.C., Davis, S., Gatto, L., Girke, T., et al. (2015). Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115-121.
11. Smith, C.M., Finger, J.H., Hayamizu, T.F., McCright, I.J., Xu, J., Berghout, J., Campbell, J., Corbani, L.E., Forthofer, K.L., Frost, P.J., et al. (2014). The mouse Gene Expression Database (GXD): 2014 update. Nucleic Acids Res. 42, D818-824.
12. Huang, N., Lee, I., Marcotte, E.M., and Hurles, M.E. (2010). Characterising and predicting haploinsufficiency in the human genome. PLoS Genet. 6, e1001154.
13. Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., Simonovic, M., Roth, A., Santos, A., Tsafou, K.P., et al. (2015). STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43, D447-452.
14. Colombo, M., Karousis, E.D., Bourquin, J., Bruggmann, R., and Muhlemann, O. (2017). Transcriptome-wide identification of NMD-targeted human mRNAs reveals extensive redundancy between SMG6- and SMG7-mediated degradation pathways. RNA 23, 189-201.
15. Sharova, L.V., Sharov, A.A., Nedorezov, T., Piao, Y., Shaik, N., and Ko, M.S. (2009). Database for mRNA half-life of 19 977 genes obtained by DNA microarray analysis of pluripotent and differentiating mouse embryonic stem cells. DNA Res. 16, 45-58.
16. Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.C., and Muller, M. (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 77.
17. Durinck, S., Spellman, P.T., Birney, E., and Huber, W. (2009). Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat. Protoc. 4, 1184-1191.

Supplemental Methods