Sampling with Expectation maximization for Motif Elicitation

Simultaneously Learning DNA Motif along with Its Position and Sequence Rank Preferences through EM Algorithm

Although de novo motifs can be discovered through mining over-represented patterns, this approach misses some real motifs and generates many false positives. On the other hand, additional information for the input sequences are found to be helpful to improve motif finding. For promoter sequences extracted from a set of co-expressed genes, some transcription factor (TF) binding motifs (e.g. TATA-box) are localized to certain intervals with respect to the transcription start site (TSS) of the gene. In this case, the position information can help to filter spurious sites. In protein binding microarray (PBM) data, the De Bruijn sequences are ranked by the binding affinity and we expect the correct motif occurs in the high ranking sequences; such data has a rank preference. In the ChIP-seq data, the ChIPed TF's motif (ChIPed TF is the TF pulled down in the ChIP experiment) prefers to occur in sequences with high ChIP intensity and occur near the ChIP peak summits (have both position and rank preference). Hence, if we know the position preference and the sequence rank preference of the TF motifs in the input sequences, we can improve motif finding. In fact, many existing motif finders already utilize such additional information. Some only considers high ranking sequences to generate its initial candidate motifs. Some allow users to specify the prior distribution of position preference or sequence rank preference or others allow users to add such preferences as a prior knowledge component in their scoring functions. However, the users may not know the correct prior(s) to begin with. Even worse, different motifs may have different preferences. For example, in ChIP-seq experiments, some motifs prefer to occur in high ranking sequences and at the center of the ChIP peak summit while others (co-regulate TF motifs) do not.

To resolve such problem, we propose a novel motif finding algorithm called SEME ( Sampling with Expectation maximization for Motif Elicitation). SEME assumes the set of input sequences is a mixture of two models: a motif model and a background model. It uses unsupervised mixture model learning to learn the motif pattern (PWM), position preference and sequence rank preference at the same time; instead of asking users to provide them as inputs. SEME does not assume the presence of both preferences but automatically detect them during the motif refinement process by statistical significance testing. We also observe that EM algorithms are generally slow in analyzing large scale high throughput data. To improve the efficiency, SEME developed two EM procedures. The two EM procedures are based on the observations that the correct motifs usually have a short conserved pattern in it and majority of the sites in the input sequences are non-motif sites. For the first EM procedure, called extending EM (EEM), starts by finding all over-represented short kmers and then attempts to include and refine the flanking positions around the kmers within the EM iterations. This way, SEME recovers the proper motif length within a single run thus saving a substantial amount of time by avoiding multiple runs with different motif length (as done in many existing motif finders). The second EM procedure, called the re-sampling EM (REM), tries to further refine the motif produced by EEM. It is based on a theorem similar to importance sampling, which stated that the motif parameters can be learned unbiasedly using a biased subsampling. By this principle, we can sample more sites which are similar to the EEM's motif and less sites from the background. This way, REM is able to learn the correct motifs using significantly less background sites. In our implementation, REM is capable to produce the correct TF motifs using approximately 1% of the sites normally considered in a normal EM procedure.

SEME Pipeline

SEME is described in our paper:
"Simultaneously Learning DNA Motif along with Its Position and Sequence Rank Preferences through EM Algorithm", Z.Zhang, C.Cheng, W.Hugo, E.Cheung and W.Sung, RECOMB 2012, accepted.

Supplementary Document

Supplementary Data

Try SEME Online

Obtaining a local copy of SEME

Download ChIP-seq peak data (our lab)

Download synthetic benchmark datasets

Result for 164 large scale chipseq datasets
Result for CoTF motif finding
Result for the JASPAR synthetic datasets
Result for the metazoan compendium