iDEP accepts 4 types of files, namely:
- Gene expression matrix (example1, example2),
- fold-change and P values (example),
- experimental design (example), and
- gene sets (example).
These example files can also be accessed here.
To help users prepare these files we also developed a tool for extracting a column from multiple files. A tool for downloading processed public RNA-seq data. And another tool for converting Gene Ontology files into the desired GMT format. ShinyGO is another related tool that conducts enrichment analysis on a set of gene IDs.
1. Gene expression matrix
Upload a CSV (comma-separated value) or tab-delimited text files containing gene expression data. For RNA-seq data, gene-level read count data is recommended, as differentially expressed genes (DEGs) can be identified based on statistical modeling of counts using DESeq2. FPKM, RPKM, TPM or other types of normalized expression data (DNA microarray, proteomics, etc) are also accepted. File size limit is 100 Mb.
Each row represents expression level of a gene. The first row is treated as sample names. Each column contains data on a sample with the first column as gene IDs. You do not need to log-transform your data before uploading to iDEP. The first few lines of a typical read-count file look like this:
WT_Rep1 | WT_Rep2 | hoxa1_Rep1 | hoxa1_Rep2 | |
ENSG00000198888 | 17528 | 23007 | 24418 | 29152 |
ENSG00000198763 | 21264 | 26720 | 28878 | 32416 |
ENSG00000198804 | 130975 | 151207 | 178130 | 196727 |
ENSG00000198712 | 49769 | 61906 | 66478 | 69758 |
ENSG00000228253 | 9304 | 11160 | 12608 | 13041 |
Gene IDs
iDEP can convert most types of common gene IDs to Ensembl gene IDs, which is used internally for enrichment analyses. It also uses gene IDs to guess what organism your data is derived from. See a list of plant and animal species covered by the database here. If your gene IDs does not match anything in the database, you can still use the website to do clustering analysis and identify DEGs.
If multiple IDs are mapped to the same ENSEMBL gene, only the one with largest standard deviation is kept. Gene IDs not recognized by iDEP will be kept in the data using original gene IDs.
Users can also avoid gene ID conversion by checking the “Do not convert gene IDs to Ensembl” checkbox in the “Pre-Process” page. This is useful when the user’s data is already Ensembl gene IDs, or the user just wants to conduct exploratory data analysis (EDA) and identify DEGs.
If you are working on a species covered by iDEP, but your ID is not recognized, you can send us a file that maps your IDs to Ensembl gene IDs. We will incorporate it into the system, and you should be able to use the website after a few days.
Gene IDs should not contain quotation marks like ‘ or “. This causes problems in the reading of data file and also the mapping of gene IDs.
Sample names
Name columns carefully as iDEP parses column names to define sample groups. Replicates should be denoted by “_Rep1”, “_Rep2”, “_Rep3” at the end. Or it can be simply “_1”, “_2”, “_3”. For example, Control_1, Control_2, TreatmentA_1, TreatmentA_2, TreatmentB_1, TreatmentB_2. The first part defines 3 sample groups that form the basis for differential expression analysis. All pair-wise comparisons are listed and analyzed. Also, avoid using a hyphen “-” or a dot “.” in sample names. It affects the parsing of sample names. But underscore “_” is allowed.
More complex study designs can be represented by uploading a study design file. See below.
2. Fold-changes and P values from other programs
Users can also upload a file containing log fold-changes (LFCs) and P values that were derived from Cuffdiff or other programs. It does not even have to be RNA-seq data. iDEP will use these numbers directly for DEGs and downstream analysis such as clustering, Venn diagrams, enrichment analysis, and pathway analysis. Below is an example of two comparisons (trtA, trtB) where LFC and FDR derived from Cuffdiff.
trtA_LFC | FDR | trtB_LFC | FDR | |
Atp13a1 | -1.6474 | 0.001 | 1.2 | 0.3 |
Pten | -2.0093 | 0.8 | 0 | 0.12 |
Blnk | 1.4008 | 0.02 | -1.5 | 0.001 |
Btk | -1.9195 | 0.34 | 2.3 | 0.01 |
Zmynd8 | 2.2636 | 0.12 | -0.5 | 0.8 |
Pgd | -2.2744 | 0.001 | 2.7 | 0.001 |
Users can also just upload log fold-changes. With such data, we can run EDA and pathway analysis.
LFC1 | LFC2 | |
Atp13a1 | -1.6474 | 1.2 |
Pten | -2.0093 | 0 |
Blnk | 1.4008 | -1.5 |
Btk | -1.9195 | 2.3 |
Zmynd8 | 2.2636 | -0.5 |
Pgd | -2.2744 | 2.7 |
3. Experiment design file
Optionally, you can upload a sample information file (CSV or tab-delimited text file) representing experimental design. This is especially useful when the experiment is complex or you want to control for batch effect or paired samples.
WT_1 | WT_2 | WT_3 | Hoxa1_1 | Hoxa1_2 | Hoxa1_3 | |
genotype | WT | WT | WT | Mutant | Mutant | Mutant |
batch | A | B | C | A | B | C |
The sample names in the first row must match the sample names in the expression matrix file. Here we have two factors, “genotype” and “batch”. “genotype” has two levels (WT or Mutant), while “batch” has three levels (A, B, C) corresponding to sample pairs. This will enable DESeq2 or limma to run paired tests by using a statistical model: ~ group + batch.
The levels of different factors must be distinct. Otherwise, iDEP will add factor names to these levels to avoid confusion.
Information in the sample information file can be used to label samples in visualizations. We can also define desired models, including interaction terms, using DESeq2 or limma.
Many types of experiments can be represented as factorial designs. This is a classic 2×2 factor example, where two genotypes are studied before and after treatment.
WT_Ctrl_1 | WT_Ctrl_2 | WT_Trt_1 | WT_Trt_2 | Mu_Ctrl_1 | Mu_Ctrl_2 | Mu_Trt_1 | Mu_Trt_2 | |
Genotype | WT | WT | WT | WT | Mu | Mu | Mu | Mu |
Treatment | Ctrl | Ctrl | Trt | Trt | Ctrl | Ctrl | Trt | Trt |
Sometimes factorial design is used but not obvious. For example, a researcher knocked out two genes, ESR1 and ESR2, first separately and then both. Here we have four biological samples, wild-type, ESR1 knockout, ESR2 knockout, and double knock out (DKO). The researcher is very interested in the effect of double knockout. This can be represented by this design matrix:
WT_1 | WT_2 | ESR1_1 | ESR1_2 | ESR2_1 | ESR2_2 | DKO_1 | DKO_2 | |
ESR1 | 1WT | 1WT | 1KO | 1KO | 1WT | 1WT | 1KO | 1KO |
ESR2 | 2WT | 2WT | 2WT | 2WT | 2KO | 2KO | 2KO | 2KO |
You can set up a model in iDEP like : ~ ESR1 + ESR2 + ESR1:ESR2, where the last term represents interaction term, representing the special effects of double knockout.
Time points can be represented as factors too. Here researchers place plants with cold temperature and wanted to see gene expression change in both root and leaves. Note that the untreated samples were labeled as 0 hour.
leaf_0h_1 | leaf_0h_2 | leaf_12h_1 | leaf_12h_2 | leaf_48h_1 | leaf_48h_2 | root_0h_1 | root_0h_2 | root_12h_1 | root_12h_2 | root_48h_1 | root_48h_2 | |
time | 0h | 0h | 12h | 12h | 48h | 48h | 0h | 0h | 12h | 12h | 48h | 48h |
tissue | leaf | leaf | leaf | leaf | leaf | leaf | root | root | root | root | root | root |
4. Upload pathway data for new species
Users can also upload their own gene set file for enrichment analysis and pathway analysis. This is possible by choosing “**NEW SPECIES**” in step 3 of the data upload page.
Gene set files must be in GMT format, which is explained in details here. An example is shown below. The first column is pathway ID, the 2nd column is some memo text that is ignored, the rest are gene IDs. The gene IDs do not have to be Ensembl IDs, but must match the gene IDs in uploaded expression data. This is because no gene ID conversion will be done. Pathway data used by iDEP is available here.
GOBP_hsa_ens_GO:0000966_RNA_5-end_processing | AnyText | ENSG00000146109 | ENSG00000197181 | ENSG00000138035 | ENSG00000174173 | ENSG00000087269 | ENSG00000120800 | |||||||||
GOBP_hsa_ens_GO:0001503_ossification | AnyText | ENSG00000020633 | ENSG00000087494 | ENSG00000118785 | ENSG00000136929 | ENSG00000138756 | ENSG00000159216 | ENSG00000162337 | ENSG00000168646 | ENSG00000173868 | ENSG00000197594 | ENSG00000011028 | ENSG00000012223 | ENSG00000014123 | ENSG00000017427 | ENSG00000029559 |
GOBP_hsa_ens_GO:0001510_RNA_methylation | AnyText | ENSG00000059588 | ENSG00000068438 | ENSG00000104907 | ENSG00000105202 | ENSG00000108592 | ENSG00000111641 | ENSG00000122435 | ENSG00000122687 | ENSG00000126749 | ENSG00000127804 | ENSG00000130299 | ENSG00000130305 | ENSG00000134077 | ENSG00000135297 | ENSG00000138050 |
Sample size
To use limma or DESeq2 to identify DEGs, you should have at least two groups (biological samples), each with two or more replicates. For running exploratory data analysis such as PCA, hierarchical clustering and k-Means clustering, you can have just two data columns in your file.
R code for reading data:
x <- read.csv(inFile) # try CSV if(dim(x)[2] <= 2 ) # ifless than 3 columns, try tab-deliminated x <-read.table(inFile, sep="\t",header=TRUE) x[,1] <- toupper(x[,1]) x = x[order(-apply(x[,2:dim(x)[2]],1,sd) ),] x <- x[!duplicated(x[,1]) ,] rownames(x) <- x[,1] x <- as.matrix(x[,-1])
R code for parsing sample names:
detectGroups <- function(x){ # x are col names tem <-gsub("[0-9]*$","",x) # Remove all numbers from end tem <-gsub("_Rep$","",tem); # remove "_Rep" from end tem <-gsub("_rep$","",tem); # remove "_rep" from end tem <-gsub("_REP$","",tem) # remove "_REP" from end return( tem ) } groups = detectGroups( colnames(data) )
As you could see from here, all numbers at the end of sample names are disregarded. Thus you can name your samples like WT1, WT2, Treatment1, Treatment2, etc.