Data format

iDEP accepts 4 types of files, namely:

  1. Gene expression matrix (example1, example2),
  2. fold-change and P values (example),
  3. experimental design (example), and
  4. gene sets (example).

These example files can also be accessed here.

To help users prepare these files we also developed a tool for extracting a column from multiple files. A tool for downloading processed public RNA-seq data. And another tool for converting Gene Ontology files into the desired GMT format.  ShinyGO is another related tool that conducts enrichment analysis on a set of gene IDs.

1. Gene expression matrix

Upload a CSV (comma-separated value) or tab-delimited text files containing gene expression data. For RNA-seq data, gene-level read count data is recommended, as differentially expressed genes (DEGs) can be identified based on statistical modeling of counts using DESeq2. FPKM, RPKM or other types of normalized expression data also accepted. File size limit is 100 Mb.

Each row represents expression level of a gene. The first row is treated as sample names. Each column contains data on a sample with the first column as gene IDs. You do not need to log-transform your data before uploading to iDEP. The first few lines of a typical read-count file look like this:

WT_Rep1 WT_Rep2 hoxa1_Rep1 hoxa1_Rep2
ENSG00000198888 17528 23007 24418 29152
ENSG00000198763 21264 26720 28878 32416
ENSG00000198804 130975 151207 178130 196727
ENSG00000198712 49769 61906 66478 69758
ENSG00000228253 9304 11160 12608 13041

Gene IDs

iDEP can convert most types of common gene IDs to Ensembl gene IDs, which is used internally for enrichment analyses. It also uses gene IDs to guess what organism your data is derived from. See a list of plant and animal species covered by the database here. If your gene IDs does not match anything in the database, you can still use the website to do clustering analysis and identify DEGs.

If multiple IDs are mapped to the same ENSEMBL gene, only the one with largest standard deviation is kept. Gene IDs not recognized by iDEP will be kept in the data using original gene IDs.

Users can also avoid gene ID conversion by checking the “Do not convert gene IDs to Ensembl” checkbox in the “Pre-Process” page. This is useful when the user’s data is already Ensembl gene IDs, or the user just wants to conduct exploratory data analysis (EDA) and identify DEGs.

If you are working on a species covered by iDEP, but your ID is not recognized, you can send us a file that maps your IDs to Ensembl gene IDs. We will incorporate it into the system, and you should be able to use the website after a few days.

Gene IDs should not contain quotation marks like  ‘ or “.  This causes problems in the reading of data file and also the mapping of gene IDs.

Sample names

Name columns carefully as iDEP parses column names to define sample groups. Replicates should be denoted by “_Rep1”, “_Rep2”, “_Rep3” at the end. Or it can be simply “_1”, “_2”, “_3”. For example, Control_1, Control_2, TreatmentA_1, TreatmentA_2, TreatmentB_1, TreatmentB_2. The first part defines 3 sample groups that form the basis for differential expression analysis.  All pair-wise comparisons are listed and analyzed. Also, avoid using a hyphen “-” or a dot “.” in sample names. It affects the parsing of sample names. But underscore “_” is allowed.

More complex study designs can be represented by uploading a study design file. See below.

2. Fold-changes and P values from other programs

Users can also upload a file containing log fold-changes (LFCs) and P values that were derived from Cuffdiff or other programs. It does not even have to be RNA-seq data. Instead of running differential gene expression analysis, iDEP will just use these numbers directly for downstream analysis such as clustering, Venn diagrams, enrichment analysis, and pathway analysis.  Below is an example of two comparisons (trtA, trtB) where LFC and FDR derived from Cuffdiff.

trtA_LFC FDR trtB_LFC FDR
Atp13a1 -1.6474 0.001 1.2 0.3
Pten -2.0093 0.8 0 0.12
Blnk 1.4008 0.02 -1.5 0.001
Btk -1.9195 0.34 2.3 0.01
Zmynd8 2.2636 0.12 -0.5 0.8
Pgd -2.2744 0.001 2.7 0.001

Users can also just upload log fold-changes. With such data, we can run EDA and pathway analysis.

LFC1 LFC2
Atp13a1 -1.6474 1.2
Pten -2.0093 0
Blnk 1.4008 -1.5
Btk -1.9195 2.3
Zmynd8 2.2636 -0.5
Pgd -2.2744 2.7

3. Experiment design file

Optionally, you can upload a sample information file (CSV or tab-delimited text file) representing experimental design. This is especially useful when the experiment is complex or you want to control for batch effect or paired samples.

WT_1 WT_2 WT_3 Hoxa1_1 Hoxa1_2 Hoxa1_3
genotype WT WT WT Mutant Mutant Mutant
batch A B C A B C

The sample names in the first row must match the sample names in the expression matrix file. Here we have two factors, “genotype” and “batch”. “genotype” has two levels (WT or Mutant), while “batch” has three levels (A, B, C) corresponding to sample pairs.  This will enable DESeq2 or limma to run paired tests by using a statistical model:  ~ group + batch.

The levels of different factors must be distinct. Otherwise, iDEP will add factor names to these levels to avoid confusion.

Information in the sample information file can be used to label samples in visualizations. We can also define desired models, including interaction terms, using DESeq2 or limma.

Many types of experiments can be represented as factorial designs.  This is a classic 2×2 factor example, where two genotypes are studied before and after treatment.

WT_Ctrl_1 WT_Ctrl_2 WT_Trt_1 WT_Trt_2 Mu_Ctrl_1 Mu_Ctrl_2 Mu_Trt_1 Mu_Trt_2
Genotype WT WT WT WT Mu Mu Mu Mu
Treatment Ctrl Ctrl Trt Trt Ctrl Ctrl Trt Trt

Sometimes factorial design is used but not obvious. For example, a researcher knocked out two genes, ESR1 and ESR2, first separately and then both. Here we have four biological samples, wild-type, ESR1 knockout, ESR2 knockout, and double knock out (DKO). The researcher is very interested in the effect of double knockout. This can be represented by this design matrix:

WT_1 WT_2 ESR1_1 ESR1_2 ESR2_1 ESR2_ DKO_1 DKO_2
ESR1 1WT 1WT 1KO 1KO 1WT 1WT 1KO 1KO
ESR2 2WT 2WT 2WT 2WT 2KO 2KO 2KO 2KO

You can set up a model in iDEP like : ~ ESR1 + ESR2 + ESR1:ESR2, where the last term represents interaction term, representing the special effects of double knockout.

Time points can be represented as factors too. Here researchers place plants with cold temperature and wanted to see gene expression change in both root and leaves.  Note that the untreated samples were labeled as 0 hour.

leaf_0h_1 leaf_0h_2 leaf_12h_1 leaf_12h_2 leaf_48h_1 leaf_48h_2 root_0h_1 root_0h_2 root_12h_1 root_12h_2 root_48h_1 root_48h_2
time 0h 0h 12h 12h 48h 48h 0h 0h 12h 12h 48h 48h
tissue leaf leaf leaf leaf leaf leaf root root root root root root

4. Upload pathway data for new species

Users can also upload their own gene set file for enrichment analysis and pathway analysis. This is possible by choosing “**NEW SPECIES**” in step 3 of the data upload page.

Gene set files must be in GMT format, which is explained in details here. An example is shown below. The first column is pathway ID, the 2nd column is some memo text that is ignored, the rest are gene IDs.  The gene IDs do not have to be Ensembl IDs, but must match the gene IDs in uploaded expression data. This is because no gene ID conversion will be done. Pathway data used by iDEP is available here.

GOBP_hsa_ens_GO:0000966_RNA_5-end_processing AnyText ENSG00000146109 ENSG00000197181 ENSG00000138035 ENSG00000174173 ENSG00000087269 ENSG00000120800
GOBP_hsa_ens_GO:0001503_ossification AnyText ENSG00000020633 ENSG00000087494 ENSG00000118785 ENSG00000136929 ENSG00000138756 ENSG00000159216 ENSG00000162337 ENSG00000168646 ENSG00000173868 ENSG00000197594 ENSG00000011028 ENSG00000012223 ENSG00000014123 ENSG00000017427 ENSG00000029559
GOBP_hsa_ens_GO:0001510_RNA_methylation AnyText ENSG00000059588 ENSG00000068438 ENSG00000104907 ENSG00000105202 ENSG00000108592 ENSG00000111641 ENSG00000122435 ENSG00000122687 ENSG00000126749 ENSG00000127804 ENSG00000130299 ENSG00000130305 ENSG00000134077 ENSG00000135297 ENSG00000138050

Sample size

To use limma or DESeq2 to identify DEGs, you should have at least two groups (biological samples), each with two or more replicates. For running exploratory data analysis such as PCA, hierarchical clustering and k-Means clustering, you can have just two data columns in your file.

R code for reading data:

x <- read.csv(inFile) # try CSV
if(dim(x)[2] <= 2 )   # ifless than 3 columns, try tab-deliminated
  x <-read.table(inFile, sep="\t",header=TRUE)       
x[,1] <- toupper(x[,1])
x = x[order(-apply(x[,2:dim(x)[2]],1,sd) ),]
x <- x[!duplicated(x[,1]) ,]
rownames(x) <- x[,1]
x <- as.matrix(x[,-1])

R code for parsing sample names:

detectGroups <- function(x){  # x are col names
     tem <-gsub("[0-9]*$","",x) # Remove all numbers from end
     tem <-gsub("_Rep$","",tem); # remove "_Rep" from end
     tem <-gsub("_rep$","",tem); # remove "_rep" from end
     tem <-gsub("_REP$","",tem)  # remove "_REP" from end
     return( tem )
 }
groups = detectGroups( colnames(data) )

As you could see from here, all numbers at the end of sample names are disregarded. Thus you can name your samples like WT1, WT2, Teatment1, Treatment2, etc.

Advertisements
%d bloggers like this:
search previous next tag category expand menu location phone mail time cart zoom edit close