k-Means clustering is an unsupervised method for grouping genes by their expression patterns across all samples. Genes are first ranked by standard deviation, and only the top 2000 are used for clustering by default. Users can increase the number of genes up to 6000. The default number of clusters is 4, which can be adjusted to any value between 2 and 20. The data are normalized so that all rows (genes) have the same sum (L1 norm):
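x = 100 * x[1:n, ] / apply(x[1:n, ], 1, sum)  # keep the first n rows (genes); scale each row to sum to 100
cl = kmeans(x, k, iter.max = 50)  # k-means with k clusters, at most 50 iterations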
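The ranking, filtering, normalization, and clustering steps described above can be sketched end to end as follows. This is a minimal illustration, not the tool's actual code: the toy matrix `expr`, its dimensions, and the small values of `n` and `k` are assumptions chosen so the example runs quickly, and the expression values are assumed to be non-negative so that row sums are meaningful.

```r
# Hypothetical toy data: 50 genes x 6 samples of positive expression values.
set.seed(1)
expr <- matrix(rexp(50 * 6, rate = 0.1) + 1, nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("s", 1:6)))

n <- 20  # number of top-ranked genes to keep (2000 by default in the text)
k <- 4   # number of clusters (the default in the text)

# Rank genes by standard deviation across samples, most variable first.
gene_sd <- apply(expr, 1, sd)
x <- expr[order(gene_sd, decreasing = TRUE), ]

# Scale each retained gene (row) so that its values sum to 100.
# Dividing a matrix by a vector of length nrow recycles down the columns,
# so every entry in row i is divided by the sum of row i.
x <- 100 * x[1:n, ] / apply(x[1:n, ], 1, sum)

# Cluster the normalized genes.
cl <- kmeans(x, k, iter.max = 50)
table(cl$cluster)
```

Because `kmeans` starts from randomly chosen centers, cluster labels and sizes can change between runs unless a seed is set.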
Enrichment analysis is conducted for each cluster using the gene sets selected in the left bar. In addition, enriched transcription factor (TF) binding motifs are identified for each cluster. Transcript annotation and promoter sequences are retrieved from Ensembl. For genes with multiple transcripts, the transcription start site (TSS) shared by the most transcripts is used. If multiple TSS locations have the same number of transcripts, the most upstream TSS is used. Promoters are pre-scanned using TF binding motifs in CIS-BP (Weirauch et al., 2014). Instead of defining a binary outcome of binding or not binding, which depends on arbitrary cutoffs, we recorded the best score for each TF in every promoter sequence. Student's t-test is then used to compare the scores observed in a group of genes against those of the remaining genes. The P-values are corrected for multiple testing using the false discovery rate (FDR).
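The motif comparison described above might look like the following sketch. Everything about the data is hypothetical: the matrix `scores` stands in for the pre-computed best motif score per promoter and TF, `in_cluster` marks the genes of one cluster, and one motif (`TF1`) is artificially shifted so there is something to detect. A two-sided test with `var.equal = TRUE` is assumed here to match the classic Student's t-test. Omitting that argument gives R's default Welch variant, and the source does not say which one the tool uses.

```r
# Hypothetical best motif scores: 200 promoters (rows) x 5 TFs (columns).
set.seed(2)
scores <- matrix(rnorm(200 * 5, mean = 10, sd = 2), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("TF", 1:5)))

# Hypothetical cluster membership: the first 40 genes form the cluster.
in_cluster <- rep(c(TRUE, FALSE), times = c(40, 160))

# Simulate one enriched motif by raising TF1 scores inside the cluster.
scores[in_cluster, "TF1"] <- scores[in_cluster, "TF1"] + 3

# For each TF, compare scores in the cluster against the remaining genes.
p_values <- apply(scores, 2, function(s) {
  t.test(s[in_cluster], s[!in_cluster], var.equal = TRUE)$p.value
})

# Correct for multiple testing (Benjamini-Hochberg FDR).
fdr <- p.adjust(p_values, method = "fdr")
sort(fdr)
```

Using the continuous scores directly avoids picking a binding threshold, at the cost of assuming the scores are roughly suitable for a t-test.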
Weirauch, M.T., Yang, A., Albu, M., Cote, A.G., Montenegro-Montero, A., Drewe, P., Najafabadi, H.S., Lambert, S.A., Mann, I., Cook, K., et al. (2014). Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431-1443.