cMonkey: systems biology (bi)clustering using multiple relevant data-types.

Grouping genes into functionally related and putatively co-regulated clusters is an essential first step for the inference of regulatory networks (there are many other reasons to cluster genes, but network inference is a good problem to start with). It is widely known that regulatory relationships among genes can vary under diverse environmental settings, and that co-expressed genes are not necessarily under the control of the same regulator(s). As a result, patterns of co-expression are often valid under some, but not all, of the observed conditions.
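The idea that co-expression can hold under some conditions but not others can be illustrated with a small sketch. The expression values below are invented for illustration only (they are not from any data set used with cMonkey), and the helper name `pearson` is our own. Two hypothetical genes track each other closely in the first three conditions and diverge in the rest, so correlation computed over all conditions understates a relationship that is strong within the subset.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences of floats."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Invented expression levels for two genes across six conditions.
gene_a = [1.0, 2.0, 3.0, 2.0, 2.0, 2.0]
gene_b = [1.1, 2.1, 2.9, 3.0, 1.0, 2.0]

# Hypothetical subset of conditions in which the genes are co-expressed.
subset = [0, 1, 2]

r_all = pearson(gene_a, gene_b)
r_subset = pearson([gene_a[i] for i in subset], [gene_b[i] for i in subset])

print(f"correlation over all conditions:   {r_all:.3f}")
print(f"correlation over condition subset: {r_subset:.3f}")
```

A biclustering method searches for the gene set and the condition subset together, rather than fixing the conditions in advance as this toy example does.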

With such considerations in mind, we have developed cMonkey, an unsupervised learning procedure for detecting putatively co-regulated gene clusters by integrating diverse systems biology data including: (1) mRNA and/or protein expression levels, (2) cis-regulatory sequences, and (3) functional association and physical interaction networks. The method determines, for each cluster detected, the significant (a) subset of experimental conditions under which its genes are co-expressed, (b) cis-regulatory sequence motifs that putatively mediate their regulation, and (c) highly-connected subnetworks in association networks that provide supporting evidence that the genes are functionally related.
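As a rough illustration of what integrating several data-types into one score might look like, the sketch below combines three per-gene evidence values (expression, motif, network) into a single number and maps it to a membership probability. This is a minimal sketch under our own assumptions, not cMonkey's actual scoring function: the `Evidence` class, the weights, the logistic midpoint, and all function names are hypothetical, and the evidence values are invented.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical per-data-type scores for one gene relative to one cluster.

    Each value is treated as a log10 p-value: more negative means stronger
    evidence that the gene belongs in the cluster.
    """
    expression: float
    motif: float
    network: float


def combined_score(ev, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of the three evidence terms (weights are assumptions)."""
    w_expr, w_motif, w_net = weights
    return w_expr * ev.expression + w_motif * ev.motif + w_net * ev.network


def membership_probability(score, midpoint=-3.0, steepness=1.0):
    """Logistic map: scores well below the midpoint give probabilities near 1."""
    return 1.0 / (1.0 + math.exp(steepness * (score - midpoint)))


def sample_membership(candidates, rng):
    """One stochastic sweep: include each gene with its membership probability."""
    members = set()
    for gene, ev in candidates.items():
        if rng.random() < membership_probability(combined_score(ev)):
            members.add(gene)
    return members


# Invented evidence for two hypothetical genes.
candidates = {
    "geneA": Evidence(expression=-6.0, motif=-4.0, network=-2.0),
    "geneB": Evidence(expression=-0.5, motif=-0.2, network=0.0),
}

rng = random.Random(0)  # fixed seed so repeated runs give the same draw
for gene, ev in candidates.items():
    p = membership_probability(combined_score(ev))
    print(f"{gene}: score={combined_score(ev):.2f}, p(member)={p:.3f}")
print("sampled members:", sorted(sample_membership(candidates, rng)))
```

The appeal of a design like this is that adding a new data-type only requires a new evidence term and a weight; the sampling step is unchanged.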

We have performed this analysis on publicly available data sets of widely varying size and quality for four distantly related organisms covering all three domains of life: Helicobacter pylori, Escherichia coli, Saccharomyces cerevisiae, and Halobacterium NRC-1. In each of these organisms, cMonkey has enabled the discovery of both known regulons and novel putative regulons. Where a regulon corresponds to a previously characterized TF binding site, we see good agreement with the motifs detected by cMonkey. Gene clusters detected using cMonkey provide a firm foundation for inferring genetic regulatory networks and for assigning putative functions to genes of unknown function.

This work is being carried out in collaboration with Nitin Baliga and David Reiss at the ISB, Seattle. We have performed this analysis on several organisms, but a clear challenge, and the heart of future development, lies in applying and adapting the method to larger organisms, such as human or some of the multicellular systems being studied at NYU. The bottom line is that we have not yet met a data-type that we can't integrate into the cMonkey score, and thus into the MCMC procedure. In practice, this means we have a lot of work cut out for the near future.
