R : using Bioconductor
Report: Gene Expression profiling of paediatric AML
AIM:
The aim of this experiment is to identify pediatric AML subtypes based on gene expression profiles of leukemic blasts. Sub type identification can facilitate the development of personalized treatment options, specific to the gene expression profile. This could by extension, improve the statistics of AML patients cure.
Introduction:
The 5 known risk groups of AML (subtypes) covered in this experiment:
- t(15;17)[PML-RAR]
- t(8;21)[AML1-ETO],
- inv16[CBF-MYH11]
- MLL chimeric fusion genes
- acute megakaryocytic morphology (FAB-M7)
In order to classify samples in distinctive subtypes (classes), oligonucleotide microarrays were utilized to analyze the expression of more than 22,000 genes in diagnostic leukemic blasts from 84 pediatric AML patients; The data was divided randomly into two set, training set (with 62 samples) and test set (with 21 samples). Train data was used to model a support vector machine (SVM) that is used to classify the samples into specific subtype class, and the test data was used to validate this SVM.
|
Table 1 : Paediatric AML subgroup distribution |
||
| Genetic Subgroup | train | test |
| t(15;17)[PML-RAR] | 11 | 4 |
| t(8;21)[AML1-ETO] | 16 | 5 |
| inv16[CBF-MYH11] | 10 | 4 |
| MLL chimeric fusion genes | 18 | 5 |
| (FAB-M7) | 7 | 3 |
| Total number of sample | 63 | 21 |
Step 0 : Before we begin – Load Libraries & Train Data
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | #===============================================================================
# Load libraries
#===============================================================================
# Clear workspace
rm(list=ls())
# Close any open graphics devices
graphics.off()
library(affy)
library(affyPLM)
library(genefilter)
library(limma)
library(e1071)
library(Biobase)
library(annaffy)
#===============================================================================
# Load Train data from affymetrix '.CEL' files
#===============================================================================
cat("\n\n Loading Train Data...\n\n")
# first create an 'AnnotatedDataFrame' object with an experiment description file
Train.Expt <- read.AnnotatedDataFrame("train_set.txt", header=TRUE, row.names=1, sep="\t")
# then load CEL files specified in the experiment description file (generates an
# AffyBatch object)
Train.Data <- ReadAffy(filenames=rownames(pData(Train.Expt)), phenoData=Train.Expt, verbose=TRUE)
# check the data
dim(Train.Data) #712 712
show(Train.Data)
pData(Train.Data)
head(exprs(Train.Data)) #first lines of expression data contained in Affy.Data |
Step 1 : Quality assesement of raw Data
To Assess data quality at ‘probe level’.
The plots clearly shows that it requied pre-processing.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 | #===============================================================================
# Step 1: QC plots on raw train data
#===============================================================================
cat("\n\nHit a key to proceed to QC plots on raw data...
\nGraphs will be saved in the current directory\n\n")
scan("")
#creating a vector for colours
pdat<-pData(Train.Data)
colour<-NULL
for(i in 1:length(pdat$Sample.type)){
clas<-pdat$Sample.type[i]
if(clas=="inv16-"){
colour=c(colour,"green")
} else if(clas=="MLL-"){
colour=c(colour,"cyan")
} else if(clas=="Other-M7-"){
colour=c(colour,"#B580FE")
} else if(clas=="t(15:17)-"){
colour=c(colour,"magenta")
}else if(clas=="t(8:21)-"){
colour=c(colour,"yellow")
} else{
colour=c(colour,"white")
}
}
# box-and-whisker plot
tiff(filename="boxplot_Raw_TrainData.tif")
boxplot(Train.Data, col= colour, las=3, cex.axis=0.5, main = "Boxplot of raw data")
legend(-2, 15, horiz = TRUE, c("inv16-", "MLL-", "Other-M7-", "t(15:17)-", "t(8:21)-"),
fill = c("green", "cyan", "#B580FE", "magenta", "yellow"))
dev.off()
# Data distribution per array
tiff(filename="histogram_Raw_TrainData.tif")
hist(Train.Data,type="l", col=colour,main = "Histogram of raw data")
legend(12, 1.5, c("inv16-", "MLL-", "Other-M7-", "t(15:17)-", "t(8:21)-"),
fill = c("green", "cyan", "#B580FE", "magenta", "yellow"))
dev.off()
# Chip images (for first six array)
par(mfrow=c(2,3))
for (i in 1:6){ #for (i in 1:ncol(exprs(Train.Data))){
image(Train.Data[,i])
}
# to save time, just save one image
tiff(filename="ChipImage1_Raw_TrainData.tif")
image(Train.Data[,1])
dev.off()
# RNA degradation
tiff(filename="RNAdeg_Raw_TrainData.tif")
RNA.deg <- AffyRNAdeg(Train.Data)
plotAffyRNAdeg(RNA.deg)
dev.off()
cat("\n\n the plots have been saved in the current directory...\n\n") |

Figure 1 : the box plot shows uneven distribution - the data requires pre-processing, especially normalisation

Figure 2: the histogram also shows the uneven distribution - also confirming the need for pre-processing

Figure 3: the RNA degradation plot shows the negative gradients at the right end; hence confirms that pre-processing is necessary
Step 2: Pre-processing – background correction, normalisation, summarisation
- Background correction: RMA
- Normalisation: quantile
- summarisation: Median.polish
The plots shows the dataset has improved vastly. Nonetheless, there seems to be a little degree of factuality between the samples, hence, deducing that filtering would be required.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 |
#===============================================================================
# Step 2: Preprocessing of Train data
#===============================================================================
cat("\n\nHit a key to proceed to Normalisation of Train Data set...\n\n")
scan("")
# Close any open graphics devices
graphics.off()
Train.eset <- threestep(Train.Data,background.method="RMA.2",
normalize.method="quantile",summary.method="median.polish")
#===============================================================================
# QC on pre-processed data at probe set level (i.e. genes)
#===============================================================================
cat("\n\nProceeding to QC on pre-processed data...
\nGraphs will be saved in the current directory\n\n")
dim(Train.eset) #Features: 22283 Samples: 62
# Boxplot
tiff(filename="boxplots_norm_TrainData.tif")
boxplot(as.data.frame(exprs(Train.eset)), las=3, cex.axis=0.5, col=colour,
pch=".",main = "Boxplots normalised data")
legend(-2, 14, horiz = TRUE, c("inv16-", "MLL-", "Other-M7-", "t(15:17)-", "t(8:21)-"),
fill = c("green", "cyan", "#B580FE", "magenta", "yellow"))
dev.off()
# Scatter matrix of first 6 arrays against one another
tiff(filename="scatter_norm_TrainData.tif")
pairs(exprs(Train.eset)[,1:6], pch=".",main="Scatter plots",col=colour)
dev.off()
# MA plots of first 6 arrays against one another (log-intensity vs log-ratio)
tiff(filename="MA_plot_norm_TrainData.tif")
mva.pairs(exprs(Train.eset)[,1:6],,col=colour)
dev.off()
cat("\n\n the plots have been saved in the current directory...\n\n") |

Figure 6: the MVA plots show that even if the data is normalised, the plots do look fairly good, but the data needs to be filtered
Step 3: Filtering
This step reduced the number genes by selecting for those of potential interest.
Three filtering steps:
- remove all genes that showed no or very low expression in all samples
- a function was created that filtered all the genes with the expression level above log2(10), valid cut-off point in relation to the dataset
- The variance across the probes was analysed and top 1000 potential genes were selected.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 |
#===============================================================================
# Step 3: Filter
#===============================================================================
cat("\n\nHit a key to proceed to Filtering & getting a subset...\n\n")
scan("")
#
# Filter out genes that are absent in all samples
#
#Performs the Wilcoxon signed rank-based gene expression presence/absence
eset.mas5 <- mas5calls(Train.Data)
#create a function
mascallsfilter <- function(cutoff = "A", number){
function(x){
sum((x) == (cutoff)) != number
}
}
#filtering the absent genes
m5cfun <- mascallsfilter(number=dim(eset.mas5)[2])
mcfun <- filterfun(m5cfun)
mcsub<-genefilter(eset.mas5,mcfun)
sum(mcsub) #16073
#a subset of samples (it no longer has aby absent genes)
X <- exprs(Train.eset[mcsub,])
dim(X) # 16073 62
#
# Filter genes with an expression level log2(10)
# log2(10)because it seemed reasonable value (considering the scale of the data)
#
#create a function
kfilt <- kOverA(dim(X)[2],log2(10))
kff <- filterfun(kfilt)
mksub <- genefilter(X, kff)
sum(mksub) # 13029
#a subset of samples (it no longer has aby absent genes or genes with an expression level below log2(10))
X <- exprs(Train.eset[mksub,])
dim(X) #18238 62
#
# filtering for top 1000 genes with highest variance
#
#assess the variance of a probe set across all samples.
vars <- apply(X, 1, var)
#Filtering most significant genes based on varaince
hvsub <- vars > quantile(vars, (nrow(X) - 1000)/nrow(X))
sum(hvsub) #1000
#===============================================================================
# subset of training data (after filtering)
#===============================================================================
# subset the RMA expression set with the genes selected above
X <- exprs(Train.eset)
dim(X)
X <- X[mksub,]
dim(X) #18238 62
TrainSub<-Train.eset[mksub,]
exprs(TrainSub) <- X
X <- X[hvsub,] ### only retain best genes
dim(X) #1000 62
TrainSub<-TrainSub[hvsub,]
exprs(TrainSub) <- X |
Step 4: Exploratory analysis (HCA) and Heatmap
The genes selected after filtering were analysed using unsupervised 2-dimensional hierarchical clustering algorithm (HCA) to assess the major grouping of the subtypes based exclusively on their gene expression profiles. The clusters identified most of the major subgroups.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 |
#===============================================================================
# Step 4: Exploratory analysis (HCA) and Heatmap
#===============================================================================
cat("\n\nProceeding to QC on filtered data...
\nGraphs will be saved in the current directory\n\n")
A<-t(exprs(TrainSub))
dim(A) #62 1000
label<-as.vector(TrainSub$Sample.type)
#creating a vector for colours
pdat<-pData(TrainSub)
colour<-NULL
for(i in 1:length(pdat$Sample.type)){
clas<-pdat$Sample.type[i]
if(clas=="inv16-"){
colour=c(colour,"green")
} else if(clas=="MLL-"){
colour=c(colour,"cyan")
} else if(clas=="Other-M7-"){
colour=c(colour,"#B580FE")
} else if(clas=="t(15:17)-"){
colour=c(colour,"magenta")
}else if(clas=="t(8:21)-"){
colour=c(colour,"yellow")
} else{
colour=c(colour,"white")
}
}
#computes the distance matrix
my.dist <- function(x) dist(x, method="euclidean")
#Hierarchical cluster analysis on the filtered genes
my.hclust <- function(d) hclust(d, method="ward")
#create and save heatmap
tiff(filename="HeatMap.tif")
heatmap(A, distfun=my.dist, hclustfun=my.hclust, labRow=label, col=colour,main="Heatmap of filtered Data")
dev.off()
#create Hierarchical cluster analysis dendogram
d <- as.dist(dist(A, method="euclidean"))
fit <- hclust(d, method="ward")
#save Hierarchical cluster analysis dendogram
tiff(filename="Hclust.tif")
plot(fit, hang=-1, labels=label)
dev.off()
#PCA
#pca <- prcomp(A, scale=T) # Performs principal component analysis after scaling the data.
#summary(pca) # Prints variance summary for all principal components.
#library(scatterplot3d) # Loads 3D library.
#scatterplot3d(pca$x, pch=20, color=colour)
cat("\n\n the plots have been saved in the current directory...\n\n") |

Figure 8: Unsupervised cluster analysis of paediatric AMLs - the clusters are highlughted by colours
Step 5: Statistical Analysis for class discriminating genes
To identify expression signatures for each of the subtypes and to distinguish the class discriminating genes out of the 1000 genes that were selected after filtering; by applying parametric tests to rank genes with respect to the resulting p-values.
Steps:
- A design matrix the represented the experimental design was
- a linear model was fit to the data set.
- A contrast matrix was generated by venturing the contrasts of difference between the subtypes.
- The contrast matrix was fit to the linear model which was used to compute moderate t-statistics using Bayesian model, ebayes. The ebayes analysis is analogous to the standard single-sample t-test with the exception of making use of information between genes.
- The raw p values were corrected
- the genes with adjusted p-values less than 0.05 were selected, 483 altogether, were used for classification.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 |
#===============================================================================
# Step 5: Statistical Analysis for class discriminating genes
#===============================================================================
cat("\n\nHit a key to proceed to Statistical analysis using ebayes...\n\n")
scan("")
pda<-pData(TrainSub)
clas<-pda$Sample.type
# create design matrix
design <- model.matrix(~-1+factor(clas))
colnames(design)<- c("inv16","MLL","Other_M7","t15_17","t8_21")
# fit linear model to the data
fit.train <- lmFit(TrainSub, design)
dim(fit.train) # 1000 5
# create contrasts (group comparisons) of interest, in this case 10 comp.
contrast.matrix <- makeContrasts(
+ inv16 - MLL, inv16 - Other_M7, inv16 - t15_17, inv16 - t8_21,
+ MLL - Other_M7, MLL - t15_17, MLL - t8_21,
+ Other_M7 - t15_17, Other_M7 - t8_21, t15_17 - t8_21, levels=design)
dim(contrast.matrix) #5 10
# fit contrasts to linear model
fit2.train <- contrasts.fit(fit.train, contrast.matrix)
dim(fit2.train) # 1000 10
# Given a series of related parameter estimates & compute moderated t-statistics
eb.train <- eBayes(fit2.train)
dim(eb.train) # 1000 10
#write.table(eb.train, file="eb.train.txt", quote=F, sep="\t")
# correct raw p-values for multiple testing, (Benjamini & Hochberg)
#and give top-ranked genes
top.all <- topTable(eb.train,n=nrow(exprs(TrainSub)),adjust="BH",coef=2)
#===============================================================================
# Discriminating Genes
#===============================================================================
#getting a subget of only discriminating genes
class.sub<-top.all$ID[top.all$adj.P.Val<0.05]
length(class.sub) #483
Train.sub <- TrainSub[class.sub,]
#===============================================================================
# Visualisation of results
#===============================================================================
cat("\n\nProceeding to QC on results from ebayes...
\nplots will be saved in the current directory\n\n")
#observing unadjusted p-values
tiff(filename="histpvalues.tif")
hist(top.all$P.Value,main='Histogram of P-values')
dev.off()
#comparing unadjusted and adjusted p-values
tiff(filename="pvalues.tif")
plot(top.all$P.Value,top.all$adj.P.Val,main="p values vs adjusted p values")
abline(0,1)
dev.off()
# Moderated t-statistic
tiff(filename="ebayes.tif")
qqt(eb.train$t,df=eb.train$df.prior+eb.train$df.residual,
main="Moderated t-statistic")
abline(0,1) # Points off the line may be differentially expressed
dev.off()
#Volcano plot of log2 foldchanges versus significance (+label top 10)
tiff(filename="volcano.tif")
plot(top.all$logFC,-log2(top.all$P.Value),pch="+",cex=0.1,
xlab="Log Fold-Change from non-filtered eBayes",
ylab="-log10 P-Value from non-filtered eBayes",
main='Volcano Plot')
t <- is.element(top.all$ID,top.all$ID[top.all$adj.P.Val<0.05])
points(top.all$logFC[t],-log10(top.all$P.Value[t]),pch=16,col='red')
t <- is.element(top.all$ID,top.all$ID[top.all$adj.P.Val<0.05][1:25])
points(top.all$logFC[t],-log10(top.all$P.Value[t]),pch=7,col='black')
legend(-4.5,69,
c('sig. eBayes (p.value<0.05)','top 25 eBayes (p.value<0.05)','other'),
pch=c(16,7,16),col=c('red','black','grey85'),cex=1)
dev.off()
#MA plot
tiff(filename="MA_plot.tif")
M <- top.all$logFC #log fold-change
A <- top.all$AveExpr # ave. expr. level
mat <- cbind(A,M)
plotMA(mat,main="MA plot")
abline(h=3,col="red")
dev.off()
cat("\n\n the plots have been saved in the current directory...\n\n") |

Figure 9: the graph showing p values against the adjusted p values. adjusted p values were used in this analysis

Figure 10: Histogram of p -values - shows that most of the genes have a lower p-value, and removing genes with p-values>0.05 and classing the as insignificant will remove right genes that is causing the right skewness of the histogram

Figure 12: MA plot – seems normal – it is centred at M=0 and there is no trend to the M vs. A relationship
CLASSIFICATION
Step 6: SUPPORT VECTOR MACHINES (SVM)
SVM models are a popular classification method for highly multivariate data like this. The SVM was trained using type ‘C-SVM’ and a linear kernel as function settings. Once the SVM was trained, the number of samples misclassified was tested on the training set; result was zero misclassification.
The performance of SVM was also accessed by 10 fold cross validation on the training set; result was 100% accuracy for all the subtype.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 |
#===============================================================================
# CLASSIFICATION
#===============================================================================
cat("\n\nHit a key to proceed to Classfication of genes using SVM...\n\n")
scan("")
#===============================================================================
# Step 6:. SUPPORT VECTOR MACHINES
#===============================================================================
#===============================================================================
# Train SVM
#===============================================================================
#library(e1071)
#library(Biobase)
cat("\n\nTraining the SVM using results from Train data set...\n\n")
Xm <- t(exprs(Train.sub))
dim(Xm) #62 944 #483
resp <- phenoData(Train.sub)$Sample.type
length(resp) #62
#train a support vector machine.
svm.model <- svm(Xm, resp, type="C-classification", kernel="linear")
#View the model.
print(svm.model)
summary(svm.model)
#===============================================================================
# Estimate training error
#===============================================================================
cat("\n\nEstimating training error...\n\n")
#Check if the model is able to classify your *training* data.
# test on training set
trpred <- predict(svm.model, Xm)
# count differences
sum( trpred != resp) #0 (no misclassification)
# confusion matrix
table(resp, trpred)
#===============================================================================
# 10 fold cross-validation
#===============================================================================
cat("\n\nCross Validation to assess training perfomance...\n\n")
trcv <- svm(Xm, resp, type="C-classification", kernel="linear",cross=10)
summary(trcv) #accuracy: 98.3871 |
Step 7: Predict classes of test set
The SVM was then used on the test data set which would determine the true accuracy of the classification; result was zero misclassification.
This provides validation that that the SVM has a high extent of accuracy for classification of samples into its respective subtypes.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 |
#===============================================================================
# Step 7: Predict classes of test set
#===============================================================================
#=======================
# Modify test set
#=======================
cat("\n\nLoading, normalization and filtering test data set...\n\n")
#Load data
Test.Expt <- read.AnnotatedDataFrame("test_set.txt", header=TRUE, row.names=1, sep="\t")
Test.Data <- ReadAffy(filenames=rownames(pData(Test.Expt)), phenoData=Test.Expt, verbose=TRUE)
#pre-processing
Test.eset <- threestep(Test.Data,background.method="RMA.2",normalize.method="quantile",summary.method="median.polish")
#filter
Xt <- exprs(Test.eset)
TestSub <- Test.eset[class.sub,]
exprs(TestSub) <- Xt[class.sub,]
#=========================
# Predict classes
#=========================
cat("\n\nPredicting the classes of training set...\n\n")
#Validate dataset.
Xmtr <- t(exprs(TestSub))
tepred <- predict(svm.model, Xmtr)
resptr <- phenoData(TestSub)$Sample.type
sum(tepred != resptr) #0 (no misclassification)
table(tepred, resptr) |
|
Table 2: Diagnostic accuracies for paediatric AML |
||
| Genetic Subgroup | Training, n = 62 | Test, n = 21 |
| t(15;17)[PML-RAR] | 100 | 100 |
| t(8;21)[AML1-ETO] | 100 | 100 |
| inv16[CBF-MYH11] | 100 | 100 |
| MLL chimeric fusion genes | 100 | 100 |
| (FAB-M7) | 100 | 100 |
Step 8: Random replacement of classes from train set
A test was tried by randomly replacing the labels on the training set and redoing the SVM classification on the missed up samples. The result gave a very low accuracy, usually less than 40%.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
#===============================================================================
# Step 8: Random replacement of classes from train set
#===============================================================================
cat("\n\nRandom replacement of classes from train set...\n\n")
newlabs <- sample(c(rep("A",10), rep("B",18),rep("C",7),
rep("D",11), rep("E",16)), 62)
trainlabs<-phenoData(Train.eset)$Sample.type
table(newlabs, trainlabs)
funnysvm <- svm(Xm, newlabs, type="C-classification", kernel="linear")
fpred <- predict(funnysvm, Xm)
sum( fpred != newlabs)
table(fpred, newlabs)
#cross validation
trfcv <- svm(Xm, newlabs, type="C-classification", kernel="linear",cross=10)
summary(trfcv) #35 : accuracy |
Step 9: Classification of the other data set (Adult samples)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 |
#===============================================================================
# Step 9: Clssaification of the other data set (ADULT SET)
#===============================================================================
cat("\n\nHit a key to proceed to Classification of the other data set ...\n\n")
scan("")
#=======================
# Modify data set
#=======================
#Load data
other.Expt <- read.AnnotatedDataFrame("other_set.txt", header=TRUE, row.names=1, sep="\t")
other.Data <- ReadAffy(filenames=rownames(pData(other.Expt)), phenoData=other.Expt, verbose=TRUE)
#pre-processing
other.eset <- threestep(other.Data,background.method="RMA.2",normalize.method="quantile",summary.method="median.polish") ####i wrote this
#filter and get subset
Xo <- exprs(other.eset)
OtherSub <- other.eset[class.sub,]
exprs(OtherSub) <- Xo[class.sub,]
#=========================
# Predict classes
#=========================
#predict
oXm <- t(exprs(OtherSub))
opred <- predict(svm.model, oXm)
write.table(opred, file="Otherpred.txt", quote=F, sep="\t")
#cross validation
o<- svm(oXm,opred, type="C-classification", kernel="linear",cross=10)
summary(o)
cat("\n\nClassification of the other data set complete ! \n\n")
#===============================================================================
# Annotation table
#===============================================================================
#library(rgu34a.db)
annotation(TrainSub) #"hgu133a"
# library(hgu133a)
# generate an annotation handler
annot.colums <- aaf.handler()[c(1:5, 7,10,12)]
# generate an annotation table of the 10 genes with the lowest p-value
#anntable<- aafTableAnn( opred,"rgu34a.db",annot.colums)
# save the result in an html table
#saveHTML(anntable, "opred.html", title = "Prediction") |
The dataset with adult samples came with 5 additional pediatric cases of T-ALL with MLL translocations, hence that dataset 25 sample altogether. 10 Fold validation gave a total percentage accuracy of 88%.
| Table 3a:
T-ALL case distribution |
Table 3b:
Adults case distribution Results and results from the paper |
Difference between results and the results from paper (Adults cases only) |
||
| Genetic Subgroup | No. of cases (T-ALL) | No. of cases (Adults) | No. of cases (Adults) from paper | |
| t(15;17)[PML-RAR] | 0 | 4 | 4 | 0 |
| t(8;21)[AML1-ETO] | 0 | 7 | 6 | 1 |
| inv16[CBF-MYH11] | 0 | 3 | 2 | 1 |
| MLL chimeric fusion genes | 5 | 6 | 6 | 0 |
| (FAB-M7) | 0 | 0 | 2 | 2 |
| Total | 5 | 20 | 20 | 2 genes are mismatched
(from FAB-M7) |
These results are slightly different from that from the paper, it could probably be because different techniques were utilized by the paper, such as SAM and ANN instead of ebayes and SVM.






