Filtering: Some genes are not expressed in any samples. Others are expressed at extremely low levels. We need to remove these genes from further analysis. By default, a gene has to have more than 0.5 counts per million (CPM) in at least one sample. Otherwise, the gene is removed. Users can specify that genes have to be above 0.5 CPM in two samples by changing the “n libraries” option to 2.
CPMs are calculated by normalizing the read counts by the total counts per sample. For example, if sample 1 has 20 million counts in total, then the read counts are divided by 20 to get the CPM.
The R command used for filtering is this.
x <- x[ which( apply( cpm(DGEList(counts = x)), 1, function(y) sum(y>=minCPM)) >= nLibraries ) , ]
The data is normalized by cpm function in edgeR. The number of samples above a minCPM is counted. Only genes with levels above minCPM in at least nLibraries are retained.
It is very common to filter out 30% or even 50% of the genes. Keeping lowly expressed genes just adds noise to downstream analyses. But aggressive filtering may lead missing some genes that are expressed low but are significantly different. minCPM can be reduced if your library size is big.
Transformation: There are 3 options for transformation of counts data for clustering analysis and principal component analysis:
- VST: variance stabilizing transform
- rlog: regularized log (only for N<10)
- Started log: log2(x+c)
VST is performed according to (Anders and Huber 2010) and rlog is based on (Love, Huber et al. 2014). When there are more than 10 samples, rlog becomes slow. The default is started log, where a pseudo count c is added to all counts before log transformation. The constant c can range from 1 to 10. The bigger this number is, the less sensitive it is to noise from low counts. We have found c=4 offer balanced transformation based on several datasets we tried. The started log transformation is equivalent to the logCPM offered in edgeR. The only difference is that the libraries are not all scaled to 1 million per sample. The libraries are scaled first using DESeq2’s normalized = TRUE option:
x <- log2( counts(dds, normalized=TRUE) + c )
The effect of these transformations can be visualized below between technical replicates. You can see that VST is very aggressive in transforming the data. The rlog option is slower, especially when there is many samples, but it makes the distribution across samples closer than a simple log.
For counts data, transformed data is used for exploratory data analysis (EDA, such as clustering analysis and PCA, MDS). If you choose limma-trend to identify differentially expressed genes (DEGs), then transformed data is also used. Both DESeq2 and limma-voom use original read counts data, not the transformed data.
iDEP produces a barplot representing total read counts per library. When the library sizes are more than 3 times different, limma-trend method is not recommended for identifying DEGs. See the manual for limma.
Bias in sequencing depth can exist in samples. The above is an example, where fewer reads in the p53 wild-type samples after radiation treatment. An ANOVA is conducted by iDEP and if P<0.01, a warning is produced. This presents a confounding factor. iDEP also calculates the ratio of the max/min total counts among groups. It is believed that as long as the ratio is within 1.5, it should be acceptable. See this discussion on Twitter.
FPKM, microarray or other normalized expression data
For normalized expression data, a filter is also applied to remove genes expressed at low levels across all samples. For example, users can choose to include only genes expressed at the level of 1 or higher in at least one sample. This number works for FPKM or RPKM data, but it should be changed according to the data format. For cDNA microarrays, where the expression levels are log ratios, we need to set this to a negative number such as -1e20 to disable this filter.
This filter is very important as it affects downstream analyses. For FPKM, RPKM data, many people choose 1 as a cutoff. For genome-wide expression analysis using DNA microarray, it is safe to choose a cutoff that removes the bottom 20-40% of genes when ranked by maximum expression level across samples in descending order.
Users can choose to perform log transformation. iDEP calculates kurtosis for each of the data columns, and if the mean kurtosis is bigger than 50, a log2 transformation is enforced. Large kurtosis usually indicates the presence of extremely large numbers in the data set that warrants log-transformation.
Users can double check the effects of data transformation by examining the box plot and density plot on this page.
Anders, S. and W. Huber (2010). “Differential expression analysis for sequence count data.” Genome Biol 11(10): R106.
Love, M. I., W. Huber and S. Anders (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biol 15(12): 550.