HOmo sapiens COmprehensive MOdel COllection (HOCOMOCO) contains transcription factor (TF) binding motifs represented as classic Position Weight Matrices (PWMs, also known as Position-Specific Scoring Matrices, PSSMs).
The conversion of Position Count Matrices (PCMs) to PWMs in HOCOMOCO follows the scheme of MACRO-APE (see the respective manual, pages 20–21). Uniform background frequencies were used for estimating the log-odds weights and for computing the downloadable ‘PWM threshold-to-P-value’ tables.
HOCOMOCO motifs were constructed with ChIPMunk by systematic motif discovery from thousands of ChIP-Seq and HT-SELEX datasets. Please refer to the HOCOMOCO v12 paper for more details on the motif discovery procedure, and to the Codebook MEX paper for details on the data sources and the motif discovery pipeline of the v13-v14 update.
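As a rough illustration of what a PCM-to-PWM conversion against a uniform background looks like, here is a minimal Python sketch. The function name, the pseudocount value, and the use of the natural logarithm are assumptions made for this example only; the exact pseudocount and formula used for HOCOMOCO are those given in the MACRO-APE manual and are not reproduced here.

```python
import math


def pcm_to_pwm(pcm, pseudocount=1.0, background=0.25):
    """Convert a count matrix (rows = positions, columns = A, C, G, T)
    into log-odds weights against a uniform background.

    NOTE: the pseudocount of 1.0 is a placeholder chosen for illustration,
    not the value prescribed by the MACRO-APE manual.
    """
    pwm = []
    for counts in pcm:
        total = sum(counts)
        row = [
            # Smoothed frequency divided by the background frequency.
            math.log((c + pseudocount * background)
                     / ((total + pseudocount) * background))
            for c in counts
        ]
        pwm.append(row)
    return pwm


# Toy two-position matrix: a strongly conserved A, then a near-uniform column.
example_pcm = [[10, 0, 0, 0], [2, 3, 3, 2]]
example_pwm = pcm_to_pwm(example_pcm)
```

With this scheme, a nucleotide seen more often than the background expectation gets a positive weight, a rarer one gets a negative weight, and a perfectly uniform column yields weights of zero.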
[Motif finding; Sequence scanning]
HOCOMOCO provides PWMs accompanied by precomputed score thresholds. The thresholds and P-values for HOCOMOCO v14 motifs were estimated against uniform background probabilities. To interactively visualize predicted TF binding sites (TFBS) in a small set of sequences, we provide MoLoTool. For large-scale analysis, we suggest using command-line tools such as our SPRY-SARUS or MEME's FIMO.
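To make the role of the score threshold concrete, the following Python sketch scans both strands of a sequence with a PWM and reports every window that scores at or above the threshold. It is a simplified illustration, not the algorithm of MoLoTool, SPRY-SARUS, or FIMO; the function names, the toy matrix, and the threshold value are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def score_window(pwm, window):
    """Sum the per-position weights of the nucleotides in the window."""
    return sum(row[INDEX[base]] for row, base in zip(pwm, window))


def scan(pwm, sequence, threshold):
    """Return (start, strand, score) for each window scoring >= threshold."""
    sequence = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in INDEX for base in window):
            continue  # skip windows with N or other ambiguous symbols
        revcomp = window.translate(COMPLEMENT)[::-1]
        for strand, site in (("+", window), ("-", revcomp)):
            score = score_window(pwm, site)
            if score >= threshold:
                hits.append((start, strand, score))
    return hits


# Toy 3-bp matrix (columns A, C, G, T) that prefers the word "ACG".
toy_pwm = [
    [1.0, -1.0, -1.0, -1.0],
    [-1.0, 1.0, -1.0, -1.0],
    [-1.0, -1.0, 1.0, -1.0],
]
hits = scan(toy_pwm, "TTACGTT", threshold=3.0)
```

In practice the threshold would be taken from the precomputed threshold-to-P-value table of the respective motif rather than picked by hand as in this toy case.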
[Motif benchmarking; Performance metrics]
To assemble the motif collection of HOCOMOCO v14, we used multiple benchmarking protocols that evaluate motif performance for TFBS recognition in genomic regions (in vivo data: ChIP-Seq), in artificial oligonucleotides (in vitro data: HT-SELEX, GHT-SELEX, SMiLE-Seq, and PBM), and for predicting regulatory single-nucleotide variants and polymorphisms (rSNPs); please refer to the HOCOMOCO v12 paper for details. Newer motifs for poorly studied transcription factors were derived and benchmarked within the scope of the Codebook project; please refer to the Codebook MEX paper and the Codebook MEX website for more details on the benchmarking protocols and the performance metrics.
Each model in the collection has a quality rating from A to D, where A represents motifs with the highest confidence. A-quality motifs and subtypes were found in datasets obtained with at least two types of assays (ChIP-Seq, HT-SELEX, GHT-SELEX, SMiLE-Seq, or PBM). B-quality motifs were found in datasets from at least two different experiments of the same type. C-quality motifs passed expert curation but came from only a single experiment. In the core collection, D quality marks the least reliable motifs, which were inherited from HOCOMOCO v11 and not benchmarked in later releases; there are only a few such cases in v13. In sub-collections, D quality denotes all motifs not tested in the respective benchmarks (i.e., no ChIP-Seq for v13-invivo; no HT-SELEX, GHT-SELEX, SMiLE-Seq, or PBM for v13-invitro; no rSNPs for v13-rsnp).
Since v11, alternative binding motifs of a particular TF have been ranked from 0 (the primary model) to 1, 2, ... (the alternative motifs). The motifs of rank 0 are the most 'general' variants with the best performance across the available benchmark data (see the HOCOMOCO v12 paper for details).
HOCOMOCO v12 used two data types for motif discovery: ChIP-Seq and HT-SELEX. The latter came in two variants: traditional HT-SELEX and methyl-HT-SELEX with mCpGs. Since HOCOMOCO v13, three additional data types have been used: GHT-SELEX (HT-SELEX with an input library of random genomic fragments), SMiLE-Seq, and PBM. Additionally, in benchmarking, we used information on differential transcription factor binding to single-nucleotide variants, obtained in SNP-SELEX and identified from ChIP-Seq (allele-specific binding, see ADASTRA). The motif ID encodes the experiment types that yielded patterns assigned to the same subtype during expert curation. We use the following abbreviations of experiment types: P (ChIP-Seq), S (HT-SELEX), M (Methyl-HT-SELEX), G (Genomic HT-SELEX), I (SMiLE-Seq), and B (PBM). Motif IDs can include any combination of those six letters (PSMGIB) when a motif was found in datasets from several different types of assays.
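The A to C criteria above can be summarized as a small decision rule. The Python sketch below is only a paraphrase of that description: the function name and the input representation (pairs of assay type and experiment identifier) are hypothetical, expert curation is assumed to have already happened, and D quality is deliberately left out because it depends on benchmarking history rather than on experiment counts.

```python
def quality_rating(experiments):
    """Assign A, B, or C from the experiments supporting a curated motif.

    experiments: iterable of (assay_type, experiment_id) pairs,
    e.g. ("ChIP-Seq", "exp1"). This representation is an assumption
    made for the example, not a HOCOMOCO data format.
    """
    distinct = set(experiments)
    if not distinct:
        raise ValueError("at least one supporting experiment is required")
    assay_types = {assay for assay, _ in distinct}
    if len(assay_types) >= 2:
        return "A"  # supported by at least two types of assays
    if len(distinct) >= 2:
        return "B"  # several experiments, all of the same type
    return "C"      # a single experiment


rating_a = quality_rating([("ChIP-Seq", "exp1"), ("HT-SELEX", "exp2")])
rating_b = quality_rating([("ChIP-Seq", "exp1"), ("ChIP-Seq", "exp2")])
rating_c = quality_rating([("PBM", "exp1")])
```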
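The experiment-type letters can be expanded programmatically. The Python sketch below decodes only the letter combination itself (for example "PSM"); the function name is hypothetical, and the position of the letters within a full motif ID is not assumed here, so extracting them from an ID is left to the caller.

```python
EXPERIMENT_TYPES = {
    "P": "ChIP-Seq",
    "S": "HT-SELEX",
    "M": "Methyl-HT-SELEX",
    "G": "Genomic HT-SELEX",
    "I": "SMiLE-Seq",
    "B": "PBM",
}


def decode_experiment_types(code):
    """Expand a letter combination such as "PSM" into experiment type names."""
    unknown = [letter for letter in code if letter not in EXPERIMENT_TYPES]
    if unknown:
        raise ValueError(f"unknown experiment type letters: {unknown}")
    return [EXPERIMENT_TYPES[letter] for letter in code]


decoded = decode_experiment_types("PSM")
```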