2008-12-07

The Distant Segments Kernel: a tutorial

This post is a tutorial on how to use the distant segments kernel. String kernels were recently introduced as a more precise way to perform pairwise comparisons. First, I will define the concept of kernels. A kernel is a function that takes two objects in an input space and multiply them by mapping them to a vectorial feature space. A kernel associates a real number to a pair of instances. String kernels are a particular case of kernels. The input space of string kernels is the set of strings. Recall that strings are sequences generated with a given alphabet. The distant segments kernel is a string kernel. For the distant segments kernel, the feature vector associated to a string is the distribution of its distant segments. See this paper for more details.

In this tutorial, command lines are shown in red. First, download the source code.


seb@ubuntu:~$ wget http://boisvert.info/software/PermutationDSKernel.cpp
This software performs the kernel matrix computation of a set of strings. Now, compile the c++ file with g++.

seb@ubuntu:~$ g++ PermutationDSKernel.cpp -O4 -Wall -o DSK

Executing the program without providing parameters will output the usage.

seb@ubuntu:~$ ./DSK
Usage:
DSkernel l sequenceFile delta_m theta_m out

Basically, sequenceFile is a file containing sequences. In this file, you must put one sequence per line. Note that this file is not a fasta file.

In this tutorial, I will use the file sequences.txt. You can copy the content below in sequences.txt on your computer. The sequences.txt file contains protein sequences from the SCOP database.

seb@ubuntu:~$ cat sequences.txt
svydaaaqltadvkkdlrdswkvigsdkkgngvalmttlfadnqetigyfkrlgnvsqgmandklrghsitlmyalqnfidqldnpddlvcvvekfavnhitrkisaaefgkingpikkvlasknfgdkyanawaklvavvqaal
lsaaqkdnvksswakasaawgtagpeffmalfdahddvfakfsglfsgaakgtvkntpemaaqaqsfkglvsnwvdnldnagalegqcktfaanhkargisagqleaafkvlagfmksyggdegawtavagalmgmirpdm
glsaaqrqviaatwkdiagadngagvgkkclikflsahpqmaavfgfsgasdpgvaalgakvlaqigvavshlgdegkmvaqmkavgvrhkgygnkhikaqyfeplgasllsamehriggkmnaaakdawaaayadisgalisglqs
vlsegewqlvlhvwakveadvaghgqdilirlfkshpetlekfdrfkhlkteaemkasedlkkhgvtvltalgailkkkghheaelkplaqshatkhkipikylefiseaiihvlhsrhpgdfgadaqgamnkalelfrkdiaakykelgy
xslsaaeadlagkswapvfanknangldflvalfekfpdsanffadfkgksvadikaspklrdvssriftrlnefvnnaanagkmsamlsqfakehvgfgvgsaqfenvrsmfpgfvasvaappagadaawtklfgliidalkaaga
lsadqistvqasfdkvkgdpvgilyavfkadpsimakftqfagkdlesikgtapfethanrivgffskiigelpnieadvntfvashkprgvthdqlnnfragfvsymkahtdfagaeaawgatldtffgmifskm
seb@ubuntu:~$ wc -l sequences.txt
6 sequences.txt

To perform a pairwise comparison, one can use string kernels. Note that pairwise comparisons are usually done with alignment software. Instead of relying on an alignment to perform this task, here we will utilize the computer program that we just compiled from its source code.

seb@ubuntu:~$ ./DSK 6 sequences.txt 1000 3 KernelMatrix-DSK-1000-3.txt

This performed the computation for the 6 sequences in sequences.txt with the distant segments kernel. Hyperparameters deltaM and theteM were set respectively to 1000 and 3. See the paper for a clear explanation of the role of those hyperparameters in the distant segments kernel. The last argument that we provided is simply the output file.

The output file contains a square matrix. For each pair of sequences, it contains the image of the distant segments kernel. Let us look at the content of this file.

seb@ubuntu:~$ cat KernelMatrix-DSK-1000-3.txt
x 0 1 2 3 4 5
0 95925 5707 6163 5114 6218 4130
1 5707 93627 9197 5396 9088 5900
2 6163 9197 103266 6711 8617 5547
3 5114 5396 6711 105790 5833 4359
4 6218 9088 8617 5833 101886 6256
5 4130 5900 5547 4359 6256 83565

As we can see, a diagonal dominance occurs and inner products are large. The distant segments kernel has broad applicability. One of the applications is to use the distant segments kernel as the similarity operator of the support vector machine. svmlight, an implementation of the SVM algorithm, provides ways to plug in any kernel. You can download this package to use a precomputed matrix with svmlight. Note that you need to download svmlight separately here. Note also that svmlight is free for academic use.

For real-world bioinformatics application, the kernel matrix needs to be normalized. One of the reasons is that proteins share subsequences, but have a varying length. One way to normalize the matrix is to transform all associated vectors to vectors of unit norm. Here, in the last part of this tutorial, I show how to normalize the kernel matrix associated to the distant segments kernel. Let us compile the computer program Normalize.

seb@ubuntu:~$ wget http://genome.ulaval.ca/dav/boiseb01/pub/software/svmoptimize-v1.zip
seb@ubuntu:~$ unzip svmoptimize-v1.zip
seb@ubuntu:~$ gcc -O4 -Wall svmoptimize-v1/Normalize.c -o svmoptimize-v1/Normalize -lm

Again, executing the program without arguments will show the correct usage.

seb@ubuntu:~$ ./svmoptimize-v1/Normalize
usage: program matrix n
k'(x,y) <- k(x,y)/sqrt(k(x,x) k(y,y))

The normalized matrix is generated and written to KernelMatrix-DSK-1000-3.txt.1

seb@ubuntu:~$ ./svmoptimize-v1/Normalize KernelMatrix-DSK-1000-3.txt 6 > KernelMatrix-DSK-1000-3.txt.1
seb@ubuntu:~$ cat KernelMatrix-DSK-1000-3.txt.1
x 0 1 2 3 4 5
0 1.000000 0.060220 0.061922 0.050766 0.062897 0.046129
1 0.060220 1.000000 0.093533 0.054219 0.093049 0.066702
2 0.061922 0.093533 1.000000 0.064208 0.084008 0.059713
3 0.050766 0.054219 0.064208 1.000000 0.056184 0.046361
4 0.062897 0.093049 0.084008 0.056184 1.000000 0.067800
5 0.046129 0.066702 0.059713 0.046361 0.067800 1.000000

Using a normalized matrix keeps the same angle between proteins, but sets the norm of proteins to 1. Presumably, the angle between proteins is more important than the norm of proteins. This normalized matrix is suitable for supervised learning with the SVM, if labels are available. This concludes this tutorial.



References:

Sebastien Boisvert, Mario Marchand, Francois Laviolette, and Jacques Corbeil. Hiv-1 coreceptor usage prediction without multiple alignments: an application of string kernels. Retrovirology, 5(1):110, Dec 2008. [ bib | DOI | http ]

2008-12-02

My research activities

I started my master's degree in September. I also have a new web site. If everything goes as planned,I should start a PhD with professor Jacques Corbeil and professor Mario Marchand at the Université Laval by next year. So far, in my two first internships I focused on microarray for gene expression (Ubeda et al., Rochette et al.). Last year, in my third internship, I also had the chance to develop a new method for the prediction of HIV coreceptor usage with Mario Marchand, François Laviolette and Jacques Corbeil. Our manuscript was recently accepted for publication in the journal Retrovirology (to appear). Since existing methods for this particular task all need each V3 loop (a component of HIV) to be aligned, we saw a good opportunity to apply string kernels. Recall that a kernel is a similary measure and that it maps objects to a feature space to perform a dot product. Recall also that string kernels are the family of kernels whose input space is the set of strings. Alignments can be the cause of many problems. Just to name one, alignments break the i.i.d. hypothesis (each example are identically and independantly distributed according to an unknown, but constant distribution). In particular, cross-validation can not be performed on a set of sequences that were aligned to each other because the i.i.d. hypothesis is broken. Consequently, we developped the distant segments kernel (the source of inspiration for naming my blog), a kernel with a very large feature space. Its feature space is constructed by counting the distances between pairs of segments inside the primary structure of a protein. I also tested this kernel on the SCOP data set. With SVMlight (C=10) and the distant segments kernel (theta_m=3, delta_m=1000), I obtained very competitive results (i.g. ROC AUC). The mean AUC is 0.9265 for the 54 families. This is impressive because I did not applied any twisted changes to the diagonal of the kernel matrix. The kernel matrix was generated and feature vectors were implicitly normalized to unit norm. Also, positive examples were duplicated in the training sets until a balance (i.g. 1:1) is achieved. However, I don't think that the results on this data set are significant, for any algorithm, because the training and test sets are not build from the same distribution (see this paper). During the next year, I want to assign biological functions to some of the trypasomatid proteins with unknown function. To tackle this task, I will employ presumably kernel methods. However, the best approach has not yet been determined by our marvelous team. This problem can be interpreted as a ranking problem, as a supervised learning problem (see this data set), or even in a semi-supervised learning problem. Currently, I am working on the assembly and annotation of the genome of Leishmania tarentolae. Also, I maintain an online bibliography.

Jean-Philippe Vert

Jean-Philippe Vert is giving two talks in Québec this week. The first one will be on his contributions of machine learning in the field of bioinformatics. His second talk will be about biological networks.

Talk 1
Some contributions of machine learning in bioinformatics
December third, 2008, 10h30, room PLT-2744 (Pavillon Andrien-Pouliot)
see http://www.ulaval.ca/Al/interne/plan/AdrienPouliot/reference.htm

Many problems in bioinformatics can be formulated as pattern recognition problems on non-standard objects, such as strings, graphs or high-dimensional vectors with particular structure. They have triggered many original developments in machine learning recently, in particular in the way data are represented and prior knowledge is introduced in the algorithm. In this talk I will present some of these developments through several examples in microarray data analysis, virtual screening, and inference of biological networks.

Talk 2
Inferring and using biological networks
Decembre fourth, 2008, 15h00, Amphithéâtre Fisher, CRCHUL Porte TR-54
see http://maps.google.ca/maps?f=q&hl=en&geocode=&q=46.769017,+-71.281705&ie=UTF8&ll=46.769017,-71.281705&spn=0.001911,0.003648&t=h&z=18&iwloc=addr&om=1

Protein and gene networks have become important tools recently to represent
and think about the complexity of biological processes involving interactions
among many molecules. Two problems must nevertheless be solved in order to
make them useful: (i) how to obtain large-scale networks of good quality, and (ii)
how to use them in order to extract useful information from, e.g., expression data.
In this talk I will address both issues with a machine learning approach.

2008-04-23

First post

Hello to those reading this.

I updated my web site earlier today.

Recently I was amazed that my old blog is still available through some sort of creepy site whose purpose is to store forever blogs.

This old blog holds posts about Linux, my internships, my courses and thoughts. My friend Olivier once told me that he did read my blog in a recurrent manner. As I found that my old posts are still on Internet and as I enjoy writing about random things, I decided to start another blog, this time under the dskernel. dskernel stands for Distant Segments Kernel, that is my current research topic with collaborators at Université Laval. We aim to publish this work (soon enough) in the Bioinformatics journal. Talking about Research, I'll be starting off a master degree at Université Laval (see my web site). I'll be working on bioinformatics as it is the thing I enjoy the most in research. Of course, I will also be doing some nasty biology, such as genomics, transcriptomics and other *-omics.

As I said, my old blog is at the address.

This trimester I had ift603, bim606, bcm514 and bft600. I worked in the laboratory of Luc Gaudreau for my bioinformatics project (bft600). I am finishing my baccalaureate tomorrow (2008-04-23) at the Université de Sherbrooke. As for my 3 final tests, I think I did great in bim606 (Biologie moléculaire et cellulaire II). Moreover, I obtained the best score in the group in ift603 (Techniques d'apprentissage) which means I'll have a A+. However, this course is not in my degree so it's some sort of extra. Tomorrow I'll have my bcm514 test (Biochimie des protéines). I aim to have at least B in both bim606 and bcm514. As for ift603 and bft600, I think I'll get a A+ for both, which gives me a total of 11 A+ (if I counted correctly) for my B.Sc.. Amazing isn't it.

This summer, more precisely in August, I will be in Great Britain as a volunteer with Chantier Jeunesse. As for May, June and July, I have to decide soon either to work as a computational biologist or to relax for a while. Next, on thursday, my brother will help me to move all my stuff back in Québec.

This summer, I want to try to get on a "Long Board". I also want to figure out why the SVM soft-margin so-called objective function is quadratic and why is it trivial to find a solution, given a kernel matrix.

Upon completion of a baccalaureate in bioinformatics, I figured out that I have the most affinity with mathematics. Also, I have presented 2 conferences this trimester (in chronological order):

Club informatique - Noyaux et machines à vecteurs de support


and

Club informatique - Ruby on Rails, une architecture MVC facilitant le développement de systèmes d'information web

Finally, for the classic "new toys" section, I bought a eee PC (which I want to sell ASAP), a Quad-Core (for scientific purpose, like kernel matrix computation).
There was an error in this gadget