iDEP accepts 4 types of files, namely:
- Gene expression matrix (example1, example2),
- fold-change and P values (example),
- experimental design (example), and
- gene sets (example).
These example files can also be accessed here.
To help users prepare these files we also developed a tool for extracting a column from multiple files. A tool for downloading processed public RNA-seq data. And another tool for converting Gene Ontology files into the desired GMT format. ShinyGO is another related tool that conducts enrichment analysis on a set of gene IDs.
1. Gene expression matrix
Upload a CSV (comma-separated value) or tab-delimited text files containing gene expression data. For RNA-seq data, gene-level read count data is recommended, as differentially expressed genes (DEGs) can be identified based on statistical modeling of counts using DESeq2. FPKM, RPKM, TPM or other types of normalized expression data (DNA microarray, proteomics, etc) are also accepted. File size limit is 100 Mb.
Each row represents expression level of a gene. The first row is treated as sample names. Each column contains data on a sample with the first column as gene IDs. You do not need to log-transform your data before uploading to iDEP. The first few lines of a typical read-count file look like this:
WT_Rep1 | WT_Rep2 | hoxa1_Rep1 | hoxa1_Rep2 | |
ENSG00000198888 | 17528 | 23007 | 24418 | 29152 |
ENSG00000198763 | 21264 | 26720 | 28878 | 32416 |
ENSG00000198804 | 130975 | 151207 | 178130 | 196727 |
ENSG00000198712 | 49769 | 61906 | 66478 | 69758 |
ENSG00000228253 | 9304 | 11160 | 12608 | 13041 |
Gene IDs
iDEP can convert most types of common gene IDs to Ensembl gene IDs, which is used internally for enrichment analyses. It also uses gene IDs to guess what organism your data is derived from. See a list of plant and animal species covered by the database here. If your gene IDs does not match anything in the database, you can still use the website to do clustering analysis and identify DEGs.
If multiple IDs are mapped to the same ENSEMBL gene, only the one with largest standard deviation is kept. Gene IDs not recognized by iDEP will be kept in the data using original gene IDs.
Users can also avoid gene ID conversion by checking the “Do not convert gene IDs to Ensembl” checkbox in the “Pre-Process” page. This is useful when the user’s data is already Ensembl gene IDs, or the user just wants to conduct exploratory data analysis (EDA) and identify DEGs.
If you are working on a species covered by iDEP, but your ID is not recognized, you can send us a file that maps your IDs to Ensembl gene IDs. We will incorporate it into the system, and you should be able to use the website after a few days.
Gene IDs should not contain quotation marks like ‘ or “. This causes problems in the reading of data file and also the mapping of gene IDs.
Sample names
Name columns carefully as iDEP parses column names to define sample groups. Replicates should be denoted by “_Rep1”, “_Rep2”, “_Rep3” at the end. Or it can be simply “_1”, “_2”, “_3”. For example, Control_1, Control_2, TreatmentA_1, TreatmentA_2, TreatmentB_1, TreatmentB_2. The first part defines 3 sample groups that form the basis for differential expression analysis. All pair-wise comparisons are listed and analyzed. Also, avoid using a hyphen “-” or a dot “.” in sample names. It affects the parsing of sample names. But underscore “_” is allowed.
More complex study designs can be represented by uploading a study design file. See below.
2. Fold-changes and P values from other programs
Users can also upload a file containing log fold-changes (LFCs) and P values that were derived from Cuffdiff or other programs. It does not even have to be RNA-seq data. iDEP will use these numbers directly for DEGs and downstream analysis such as clustering, Venn diagrams, enrichment analysis, and pathway analysis. Below is an example of two comparisons (trtA, trtB) where LFC and FDR derived from Cuffdiff.
trtA_LFC | FDR | trtB_LFC | FDR | |
Atp13a1 | -1.6474 | 0.001 | 1.2 | 0.3 |
Pten | -2.0093 | 0.8 | 0 | 0.12 |
Blnk | 1.4008 | 0.02 | -1.5 | 0.001 |
Btk | -1.9195 | 0.34 | 2.3 | 0.01 |
Zmynd8 | 2.2636 | 0.12 | -0.5 | 0.8 |
Pgd | -2.2744 | 0.001 | 2.7 | 0.001 |
Users can also just upload log fold-changes. With such data, we can run EDA and pathway analysis.
LFC1 | LFC2 | |
Atp13a1 | -1.6474 | 1.2 |
Pten | -2.0093 | 0 |
Blnk | 1.4008 | -1.5 |
Btk | -1.9195 | 2.3 |
Zmynd8 | 2.2636 | -0.5 |
Pgd | -2.2744 | 2.7 |
3. Experiment design file
You can upload a sample information file (CSV or tab-delimited text file) representing experimental design. This is especially useful when the experiment is complex or you want to control for batch effect or paired samples. The column names in design file must match the those in the expression matrix file.
A. Paired samples (ignored by many)
Paired samples are quite common, but are often ignored in analysis. For example, researchers want to know if eating ice cream will change gene expression in the mouse blood cells. On Monday they draw blood on the first mouse before and after feeding ice cream. To get more robust results, they repeated the experiment on Tuesday and Wednesday using genetically identical mice. We have three replicates. But clearly the first control and first treatment samples are derived from the same mice and the experiment is done on Monday. There are special one-to-one relationship among control and treatment samples. To account for the systematic biases, we need to control for this paired effect. This is similar to the many situations of paired t-tests. To achieve this we need a design file like this:
Ctrl_1 | Ctrl_2 | Ctrl_3 | Treat_1 | Treat_2 | Treat_3 | |
Treatment | Ctrl | Ctrl | Ctrl | Treat | Treat | Treat |
pairs | 1 | 2 | 3 | 1 | 2 | 3 |
The sample names in the first row must match the sample names in the expression matrix file. Here we have two factors, “Treatment” and “pairs”. In the DEG1 tab, we can choose pairs as a block factor. This will enable DESeq2 or limma to run paired tests by using a statistical model: ~ group + pairs. More generally, block factors can be used to control many other confounding factors.
The levels of different factors must be distinct. Otherwise, iDEP will add factor names to these levels to avoid confusion.
Information in the sample information file can be used to label samples in visualizations. We can also define desired models, including interaction terms, using DESeq2 or limma.
B. 2×2 factorial design
Many types of experiments can be represented as factorial designs. This is a classic 2×2 factor example, where two genotypes are studied before and after treatment. For example, in one of our demo data, the mutant has p53 knocked out and the treatment is ionizing radiation (IR). The researchers wanted to study p53-dependent and independent pathways in IR response.
WT_Ctrl_1 | WT_Ctrl_2 | WT_Trt_1 | WT_Trt_2 | Mu_Ctrl_1 | Mu_Ctrl_2 | Mu_Trt_1 | Mu_Trt_2 | |
Genotype | WT | WT | WT | WT | Mu | Mu | Mu | Mu |
Treatment | Ctrl | Ctrl | Trt | Trt | Ctrl | Ctrl | Trt | Trt |
If we do not suspect any difference in IR response(two factors are independent), we could use a model: ~ Genotype + Treatment. The results include two comparisons:
- What is the difference between genotypes (WT vs Mu)?
- What is the response to Treatment (Trt vs Ctrl)?
C. 2×2 factorial design with interaction
If we suspect that the mutant will response differently, then we should use the interaction term using this model: ~ Genotype + Treatment + Genotype*Treatment
The interaction term will tell use the difference in response. DESeq2 or limma will give us results to answer these questions:
- What is the response in WT,
- What is the response in Mu,
- The difference in response (interaction term),
- The difference between WT and Mu, without treatment,
- The difference between WT and Mu, with treatment.
Sometimes factorial design is used but not obvious. For example, a researcher knocked out two genes, ESR1 and ESR2, first separately and then both. Here we have four biological samples, wild-type, ESR1 knockout, ESR2 knockout, and double knock out (DKO). The researcher is very interested in the effect of double knockout. This can be represented by this design matrix:
WT_1 | WT_2 | ESR1_1 | ESR1_2 | ESR2_1 | ESR2_2 | DKO_1 | DKO_2 | |
ESR1 | 1WT | 1WT | 1KO | 1KO | 1WT | 1WT | 1KO | 1KO |
ESR2 | 2WT | 2WT | 2WT | 2WT | 2KO | 2KO | 2KO | 2KO |
You can set up a model in iDEP like : ~ ESR1 + ESR2 + ESR1:ESR2, where the last term represents interaction term, representing the special effects of double knockout.
D. 2×3, 3×3 designs
If two genotypes are treated with two drugs, then it is a 2×3 design. This is because the treatment factor now has 3 levels, i.e. control, treatment1, treatment2.
If three cell lines are treated with 2 drugs, then it is 3×3 design.
But we still have two factors. Just these factors are having more than 2 levels.
E. Three or more factors
Some researcher tried to get a lot of information into the design matrix. On human cancer samples, for example, they put in sex, disease type, etc. One common error is that the model is so complex that it requires a lot samples for the model to be fitted. This is specially true when you consider interaction terms with factors. Sometimes, users created a models with dozens of samples, but they only provided 10 samples.
iDEP is designed to handle simple experiment designs. It is limited in dealing with complex models. Currently interactions terms using the limma package is only available for two factor design.
F. Time course data
Time points can be represented as factors too. Here researchers place plants with cold temperature and wanted to see gene expression change in both root and leaves. Note that the untreated samples were labeled as 0 hour.
leaf_0h_1 | leaf_0h_2 | leaf_12h_1 | leaf_12h_2 | leaf_48h_1 | leaf_48h_2 | root_0h_1 | root_0h_2 | root_12h_1 | root_12h_2 | root_48h_1 | root_48h_2 | |
time | 0h | 0h | 12h | 12h | 48h | 48h | 0h | 0h | 12h | 12h | 48h | 48h |
tissue | leaf | leaf | leaf | leaf | leaf | leaf | root | root | root | root | root | root |
4. Upload pathway data for new species
Users can also upload their own gene set file for enrichment analysis and pathway analysis. This is possible by choosing “**NEW SPECIES**” in step 3 of the data upload page.
Gene set files must be in GMT format, which is explained in details here. An example is shown below. The first column is pathway ID, the 2nd column is some memo text that is ignored, the rest are gene IDs. The gene IDs do not have to be Ensembl IDs, but must match the gene IDs in uploaded expression data. This is because no gene ID conversion will be done. Pathway data used by iDEP is available here.
GOBP_hsa_ens_GO:0000966_RNA_5-end_processing | AnyText | ENSG00000146109 | ENSG00000197181 | ENSG00000138035 | ENSG00000174173 | ENSG00000087269 | ENSG00000120800 | |||||||||
GOBP_hsa_ens_GO:0001503_ossification | AnyText | ENSG00000020633 | ENSG00000087494 | ENSG00000118785 | ENSG00000136929 | ENSG00000138756 | ENSG00000159216 | ENSG00000162337 | ENSG00000168646 | ENSG00000173868 | ENSG00000197594 | ENSG00000011028 | ENSG00000012223 | ENSG00000014123 | ENSG00000017427 | ENSG00000029559 |
GOBP_hsa_ens_GO:0001510_RNA_methylation | AnyText | ENSG00000059588 | ENSG00000068438 | ENSG00000104907 | ENSG00000105202 | ENSG00000108592 | ENSG00000111641 | ENSG00000122435 | ENSG00000122687 | ENSG00000126749 | ENSG00000127804 | ENSG00000130299 | ENSG00000130305 | ENSG00000134077 | ENSG00000135297 | ENSG00000138050 |
Sample size
To use limma or DESeq2 to identify DEGs, you should have at least two groups (biological samples), each with two or more replicates. For running exploratory data analysis such as PCA, hierarchical clustering and k-Means clustering, you can have just two data columns in your file.
R code for reading data:
x <- read.csv(inFile) # try CSV if(dim(x)[2] <= 2 ) # ifless than 3 columns, try tab-deliminated x <-read.table(inFile, sep="\t",header=TRUE) x[,1] <- toupper(x[,1]) x = x[order(-apply(x[,2:dim(x)[2]],1,sd) ),] x <- x[!duplicated(x[,1]) ,] rownames(x) <- x[,1] x <- as.matrix(x[,-1])
R code for parsing sample names:
detectGroups <- function(x){ # x are col names tem <-gsub("[0-9]*$","",x) # Remove all numbers from end tem <-gsub("_Rep$","",tem); # remove "_Rep" from end tem <-gsub("_rep$","",tem); # remove "_rep" from end tem <-gsub("_REP$","",tem) # remove "_REP" from end return( tem ) } groups = detectGroups( colnames(data) )
As you could see from here, all numbers at the end of sample names are disregarded. Thus you can name your samples like WT1, WT2, Treatment1, Treatment2, etc.
Updated 8/6/2022