Exercise: Classification with microarray data

From teaching

Jump to: navigation, search



This exercise will illustrate classification based on gene expression microarray data. We have chosen to focus on k-Nearest Neighbors (kNN) and Support Vector Machine (SVM) classifiers, because these classifiers have built-in cross-validation (at least in the R implementation).

We will use a publicly available data set that was obtained here. This data resulted from a study to find blood-based gene expression markers of Huntington's disease (HD) progression.

Copy the commands given in the exercise below into the R terminal.

Please write down your answers during the exercise, so you can compare the results obtained in the different parts of the exercise!

Getting started with the data

First we will load the data into R, and briefly examine it:

  1. Start a fresh R session.
  2. Check that your R workspace is empty:

    This will return character(0) if the workspace is empty.

    Actually, it isn't necessary for your workspace to be empty; we just recommend this to avoid confusion.

    If your workspace is not empty, this probably means you have saved previous work; recall that you can empty your workspace using rm(list = ls()). If you want to avoid loading this data automatically each time you start R, you can remove the file ".RData" from the working directory. But it is also fine if you want to leave it.

  3. Load the Biobase package, which includes (among other things) functions for working with ExpressionSet objects:
  4. Load the data for this exercise directly from the CBS server:

    Note: Alternatively, you could first download the data to your machine and then load it from there.

    Confirm that you have successfully loaded the data:


    You should now have an object called "gds1331" loaded in your workspace. This object is an ExpressionSet that was previously created using the GEOquery package.

  5. Examine the data:

    Typing the name of an ExpressionSet object returns a brief summary. Under the "phenoData" section, then under the "varLabels" subsection, you can see what data is available for each patient. We are interested in the "disease.state". You can access the phenoData slots using the dollar sign, like this:


    The table() function can save you a bit of time counting:


    How many patients have Huntington's disease? (count both presymptomatic and symptomatic)

  6. In this exercise, you will attempt to distinguish Normal ("N", not HD) patients from patients with HD (either presymptomatic or symptomatic). You must define the classes you are interested in predicting:
    known.classes <- factor(ifelse(gds1331$disease.state == "normal", yes = "N", no = "HD"))

    Do you understand what the "ifelse()" function does? If not, look at its help page!

    Confirm that you now have a vector of known classes:


    How many patients of each class do we have?


  7. Look at the first few rows of the expression matrix:

    This displays the first 6 rows of the expression matrix, where each row corresponds to a different gene (labeled by Affymetrix ID), and each column corresponds to a different patient. The values in the matrix are gene expression values (MAS5) for each combination of sample and gene.

Classification using k-nearest neighbors (kNN)

kNN using all genes

Continue the R session started above. Enter the following commands to perform a k-nearest neighbor classification based on the Euclidian distance with all 22283 genes.

  1. Load the class package, which contains various functions for classification:
  2. Create the input matrix for the classifier. In this first exercise, we will simply use the entire expression matrix (all 22283 genes):
    mat1 <- exprs(gds1331)
  3. Perform k-nearest neighbor classification with leave-one-out-cross-validation (LOOCV) and k=3:
    predictions1 <- knn.cv(train = t(mat1), cl = known.classes, k = 3)

    Note: here we have used the function t() to transpose the input matrix, because the function knn.cv() expects the samples to correspond to rows rather than columns.

  4. Check your classification accuracy by comparing the known classes to the predictions:

    A slightly easier way to compare these two vectors is to view them together in a data frame:

    data.frame(known.classes, predictions1)

    Better yet, you can summarize this data with a confusion matrix:

    table(known.classes, predictions1)

Question 1. How good is your classifier at classifying the 31 patients: number of true positives (TP), true negatives (TN), false negatives (FN), false positives (FP)? (define "positive" = "HD")

Question 2. What is the classification accuracy?

Question 3. What is the sensitivity? Specificity?

Question 4. How useful do you think this classifier would be for diagnosing patients?

kNN using all genes, log-transformed

Now we attempt to improve on the classification performance by taking the logarithm of the data. This transformation tends to reduce the influence of very highly expressed genes, which can be susceptible to saturation and therefore unreliable. Any logarithm base could conceivably be used, but in microarray data analysis it is traditional to use log base 2.

  1. Create the input matrix for the classifier, this time using a log (base 2) transformation:
    mat2 <- log2(exprs(gds1331))
  2. Perform k-nearest neighbor classification with leave-one-out-cross-validation (LOOCV) and k=3:
    predictions2 <- knn.cv(train = t(mat2), cl = known.classes, k = 3)
  3. Check your classification accuracy as done previously.
    table(known.classes, predictions2)

Question 5. How well does this classifier perform compared to the classifier based on the untransformed data?

kNN using filtered genes

Often it is helpful to filter out genes that are expressed only at low levels, since these expression values are especially susceptible to measurement error.

  1. For each gene, calculate the maximum over all samples:
    rowmax <- rowMax(exprs(gds1331))
  2. Create a subset of the original log-transformed matrix, including only genes whose expression value exceeds 200 in at least one sample.
    mat3 <- mat2[rowmax > 200, ]
  3. Perform kNN classification as you did previously, using this matrix as the training data.

Question 6. How many genes have you selected?

Question 7. How well does the resulting classifier perform?

Classification using support vector machine (SVM)

To avoid overfitting to sparse high-dimensional data, we will reduce the dimensionality of the data set, by performing principal components analysis (PCA) and selecting two principal components as input to SVM.

  1. If you have not already done so, install the e1071 package, which implements SVM. (you only need to do this once)
  2. Load the e1071 package:
  3. First, perform PCA on the log-transformed expression values:
    pca <- prcomp(t(log2(exprs(gds1331))))
  4. Visualize the first two principal components:
    plot(x = pca$x[,2], y = pca$x[,1], 
      xlab = 'PC2', ylab = 'PC1',
      col = as.integer(known.classes))
    legend('bottomright', legend = c('HD', 'Normal'), pch = 1, col = 1:2, bg = 'gray90')

    Hmmm... this looks like it might be difficult! (Why?) But we can try it anyway.

  5. Fit the SVM using the first two principal components as input data:
    data1 <- data.frame(cl = known.classes, pca$x[,c(1, 2)])
    model1 <- svm(cl ~ ., data = data1, cross = 31)

    Note: setting "cross = 31" does LOOCV.

  6. Visualize the classification space:
    plot(model1, data1)

    New patients can be classified according to where they fall in this diagram. In case you're curious, "x" indicates the support vectors (i.e. the patients that constrain the margin), and "o" indicates the other patients.

  7. Generate a confusion matrix:
    table(known.classes, model1$fitted)
  8. Get the cross-validation accuracy:

    OK. What if we had used a different set of principal components (instead of the first two)?

  9. Try visualizing principal components 1 and 3:
    plot(x = pca$x[,3], y = pca$x[,1], 
      xlab = 'PC3', ylab = 'PC1',
      col = as.integer(known.classes))
    legend('bottomright', legend = c('HD', 'Normal'), pch = 1, col = 1:2, bg = 'gray90')

    Here, the separation between the two classes looks a bit better! Maybe PC1 and PC3 will be better features to use for the SVM.

  10. Fit the SVM using PC1 and PC3 as the input data:
    data2 <- data.frame(cl = known.classes, pca$x[,c(1, 3)])
    model2 <- svm(cl ~ ., data = data2, cross = 31)
  11. Visualize the classification space:
    plot(model2, data2)
  12. Generate a confusion matrix:
    table(known.classes, model2$fitted)
  13. Get the cross-validation accuracy:

Question 8. Which set of principal components performs best as inputs into the classifier?

Optional additional problems

  1. Try modifying parameters in the classifiers above. E.g. try different values of k in kNN, or try different kernels in SVM. Does your accuracy improve?

  2. Try to implement an LDA-based classifier. Hints:
Personal tools