Data format

iDEP accepts 4 types of files, namely:

  1. Gene expression matrix (example1, example2),
  2. fold-change and P values (example),
  3. experimental design (example), and
  4. gene sets (example).

These example files can also be accessed here.

To help users prepare these files we also developed a tool for extracting a column from multiple files. A tool for downloading processed public RNA-seq data. And another tool for converting Gene Ontology files into the desired GMT format.  ShinyGO is another related tool that conducts enrichment analysis on a set of gene IDs.

1. Gene expression matrix

Upload a CSV (comma-separated value) or tab-delimited text files containing gene expression data. For RNA-seq data, gene-level read count data is recommended, as differentially expressed genes (DEGs) can be identified based on statistical modeling of counts using DESeq2. FPKM, RPKM, TPM or other types of normalized expression data (DNA microarray, proteomics, etc) are also accepted.  File size limit is 100 Mb.

Each row represents expression level of a gene. The first row is treated as sample names. Each column contains data on a sample with the first column as gene IDs. You do not need to log-transform your data before uploading to iDEP. The first few lines of a typical read-count file look like this:

WT_Rep1 WT_Rep2 hoxa1_Rep1 hoxa1_Rep2
ENSG00000198888 17528 23007 24418 29152
ENSG00000198763 21264 26720 28878 32416
ENSG00000198804 130975 151207 178130 196727
ENSG00000198712 49769 61906 66478 69758
ENSG00000228253 9304 11160 12608 13041

Gene IDs

iDEP can convert most types of common gene IDs to Ensembl gene IDs, which is used internally for enrichment analyses. It also uses gene IDs to guess what organism your data is derived from. See a list of plant and animal species covered by the database here. If your gene IDs does not match anything in the database, you can still use the website to do clustering analysis and identify DEGs.

If multiple IDs are mapped to the same ENSEMBL gene, only the one with largest standard deviation is kept. Gene IDs not recognized by iDEP will be kept in the data using original gene IDs.

Users can also avoid gene ID conversion by checking the “Do not convert gene IDs to Ensembl” checkbox in the “Pre-Process” page. This is useful when the user’s data is already Ensembl gene IDs, or the user just wants to conduct exploratory data analysis (EDA) and identify DEGs.

If you are working on a species covered by iDEP, but your ID is not recognized, you can send us a file that maps your IDs to Ensembl gene IDs. We will incorporate it into the system, and you should be able to use the website after a few days.

Gene IDs should not contain quotation marks like  ‘ or “.  This causes problems in the reading of data file and also the mapping of gene IDs.

Sample names

Name columns carefully as iDEP parses column names to define sample groups. Replicates should be denoted by “_Rep1”, “_Rep2”, “_Rep3” at the end. Or it can be simply “_1”, “_2”, “_3”. For example, Control_1, Control_2, TreatmentA_1, TreatmentA_2, TreatmentB_1, TreatmentB_2. The first part defines 3 sample groups that form the basis for differential expression analysis.  All pair-wise comparisons are listed and analyzed. Also, avoid using a hyphen “-” or a dot “.” in sample names. It affects the parsing of sample names. But underscore “_” is allowed.

More complex study designs can be represented by uploading a study design file. See below.

2. Fold-changes and P values from other programs

Users can also upload a file containing log fold-changes (LFCs) and P values that were derived from Cuffdiff or other programs. It does not even have to be RNA-seq data. iDEP will use these numbers directly for DEGs and downstream analysis such as clustering, Venn diagrams, enrichment analysis, and pathway analysis.  Below is an example of two comparisons (trtA, trtB) where LFC and FDR derived from Cuffdiff.

trtA_LFC FDR trtB_LFC FDR
Atp13a1 -1.6474 0.001 1.2 0.3
Pten -2.0093 0.8 0 0.12
Blnk 1.4008 0.02 -1.5 0.001
Btk -1.9195 0.34 2.3 0.01
Zmynd8 2.2636 0.12 -0.5 0.8
Pgd -2.2744 0.001 2.7 0.001

Users can also just upload log fold-changes. With such data, we can run EDA and pathway analysis.

LFC1 LFC2
Atp13a1 -1.6474 1.2
Pten -2.0093 0
Blnk 1.4008 -1.5
Btk -1.9195 2.3
Zmynd8 2.2636 -0.5
Pgd -2.2744 2.7

3. Experiment design file

You can upload a sample information file (CSV or tab-delimited text file) representing experimental design. This is especially useful when the experiment is complex or you want to control for batch effect or paired samples. The column names in design file must match the those in the expression matrix file.

A. Paired samples (ignored by many)

Paired samples are quite common, but are often ignored in analysis. For example, researchers want to know if eating ice cream will change gene expression in the mouse blood cells. On Monday they draw blood on the first mouse before and after feeding ice cream. To get more robust results, they repeated the experiment on Tuesday and Wednesday using genetically identical mice.  We have three replicates. But clearly the first control and first treatment samples are derived from the same mice and the experiment is done on Monday. There are special one-to-one relationship among control and treatment samples. To account for the systematic biases, we need to control for this paired effect. This is similar to the many situations of paired t-tests. To achieve this we need a design file like this:

Ctrl_1 Ctrl_2 Ctrl_3 Treat_1 Treat_2 Treat_3
Treatment Ctrl Ctrl Ctrl Treat Treat Treat
pairs 1 2 3 1 2 3

The sample names in the first row must match the sample names in the expression matrix file. Here we have two factors, “Treatment” and “pairs”.  In the DEG1 tab, we can choose pairs as a block factor. This will enable DESeq2 or limma to run paired tests by using a statistical model:  ~ group + pairs. More generally, block factors can be used to control many other confounding factors.

The levels of different factors must be distinct. Otherwise, iDEP will add factor names to these levels to avoid confusion.

Information in the sample information file can be used to label samples in visualizations. We can also define desired models, including interaction terms, using DESeq2 or limma.

B. 2×2 factorial design 

Many types of experiments can be represented as factorial designs.  This is a classic 2×2 factor example, where two genotypes are studied before and after treatment. For example, in one of our demo data, the mutant has p53 knocked out and the treatment is ionizing radiation (IR). The researchers wanted to study p53-dependent and independent pathways in IR response.

WT_Ctrl_1 WT_Ctrl_2 WT_Trt_1 WT_Trt_2 Mu_Ctrl_1 Mu_Ctrl_2 Mu_Trt_1 Mu_Trt_2
Genotype WT WT WT WT Mu Mu Mu Mu
Treatment Ctrl Ctrl Trt Trt Ctrl Ctrl Trt Trt

If we do not suspect any difference in IR response(two factors are independent), we could use a model:   ~ Genotype + Treatment.  The results include two comparisons:

  1. What is the difference between genotypes (WT vs Mu)?
  2. What is the response to Treatment (Trt vs Ctrl)?

C. 2×2 factorial design with interaction 

If we suspect that the mutant will response differently, then we should use the interaction term using this model:   ~ Genotype + Treatment + Genotype*Treatment

The interaction term will tell use the difference in response.  DESeq2 or limma will give us results to answer these questions:

  1. What is the response in WT,
  2. What is the response in Mu,
  3. The difference in response (interaction term),
  4. The difference between WT and Mu, without treatment,
  5. The difference between WT and Mu, with treatment.

Sometimes factorial design is used but not obvious. For example, a researcher knocked out two genes, ESR1 and ESR2, first separately and then both. Here we have four biological samples, wild-type, ESR1 knockout, ESR2 knockout, and double knock out (DKO). The researcher is very interested in the effect of double knockout. This can be represented by this design matrix:

WT_1 WT_2 ESR1_1 ESR1_2 ESR2_1 ESR2_2 DKO_1 DKO_2
ESR1 1WT 1WT 1KO 1KO 1WT 1WT 1KO 1KO
ESR2 2WT 2WT 2WT 2WT 2KO 2KO 2KO 2KO

You can set up a model in iDEP like : ~ ESR1 + ESR2 + ESR1:ESR2, where the last term represents interaction term, representing the special effects of double knockout.

D. 2×3, 3×3 designs

If two genotypes are treated with two drugs, then it is a 2×3 design. This is because the treatment factor now has 3 levels, i.e. control, treatment1, treatment2.

If three cell lines are treated with 2 drugs, then it is 3×3 design.

But we still have two factors. Just these factors are having more than 2 levels.

E. Three or more factors

Some researcher tried to get a lot of information into the design matrix. On human cancer samples, for example, they put in sex, disease type, etc. One common error is that the model is so complex that it requires a lot samples for the model to be fitted. This is specially true when you consider interaction terms with factors. Sometimes, users created a models with dozens of  samples, but they only provided 10 samples.

iDEP is designed to handle simple experiment designs. It is limited in dealing with complex models. Currently interactions terms using the limma package is only available for two factor design.

F. Time course data

Time points can be represented as factors too. Here researchers place plants with cold temperature and wanted to see gene expression change in both root and leaves.  Note that the untreated samples were labeled as 0 hour.

leaf_0h_1 leaf_0h_2 leaf_12h_1 leaf_12h_2 leaf_48h_1 leaf_48h_2 root_0h_1 root_0h_2 root_12h_1 root_12h_2 root_48h_1 root_48h_2
time 0h 0h 12h 12h 48h 48h 0h 0h 12h 12h 48h 48h
tissue leaf leaf leaf leaf leaf leaf root root root root root root

4. Upload pathway data for new species

Users can also upload their own gene set file for enrichment analysis and pathway analysis. This is possible by choosing “**NEW SPECIES**” in step 3 of the data upload page.

Gene set files must be in GMT format, which is explained in details here. An example is shown below. The first column is pathway ID, the 2nd column is some memo text that is ignored, the rest are gene IDs.  The gene IDs do not have to be Ensembl IDs, but must match the gene IDs in uploaded expression data. This is because no gene ID conversion will be done. Pathway data used by iDEP is available here.

GOBP_hsa_ens_GO:0000966_RNA_5-end_processing AnyText ENSG00000146109 ENSG00000197181 ENSG00000138035 ENSG00000174173 ENSG00000087269 ENSG00000120800
GOBP_hsa_ens_GO:0001503_ossification AnyText ENSG00000020633 ENSG00000087494 ENSG00000118785 ENSG00000136929 ENSG00000138756 ENSG00000159216 ENSG00000162337 ENSG00000168646 ENSG00000173868 ENSG00000197594 ENSG00000011028 ENSG00000012223 ENSG00000014123 ENSG00000017427 ENSG00000029559
GOBP_hsa_ens_GO:0001510_RNA_methylation AnyText ENSG00000059588 ENSG00000068438 ENSG00000104907 ENSG00000105202 ENSG00000108592 ENSG00000111641 ENSG00000122435 ENSG00000122687 ENSG00000126749 ENSG00000127804 ENSG00000130299 ENSG00000130305 ENSG00000134077 ENSG00000135297 ENSG00000138050

Sample size

To use limma or DESeq2 to identify DEGs, you should have at least two groups (biological samples), each with two or more replicates. For running exploratory data analysis such as PCA, hierarchical clustering and k-Means clustering, you can have just two data columns in your file.

R code for reading data:

x <- read.csv(inFile) # try CSV
if(dim(x)[2] <= 2 )   # ifless than 3 columns, try tab-deliminated
  x <-read.table(inFile, sep="\t",header=TRUE)       
x[,1] <- toupper(x[,1])
x = x[order(-apply(x[,2:dim(x)[2]],1,sd) ),]
x <- x[!duplicated(x[,1]) ,]
rownames(x) <- x[,1]
x <- as.matrix(x[,-1])

R code for parsing sample names:

detectGroups <- function(x){  # x are col names
     tem <-gsub("[0-9]*$","",x) # Remove all numbers from end
     tem <-gsub("_Rep$","",tem); # remove "_Rep" from end
     tem <-gsub("_rep$","",tem); # remove "_rep" from end
     tem <-gsub("_REP$","",tem)  # remove "_REP" from end
     return( tem )
 }
groups = detectGroups( colnames(data) )

As you could see from here, all numbers at the end of sample names are disregarded. Thus you can name your samples like WT1, WT2, Treatment1, Treatment2, etc.

Updated 8/6/2022

%d bloggers like this:
search previous next tag category expand menu location phone mail time cart zoom edit close