StringKernel

java.lang.Object
- weka.classifiers.functions.supportVector.Kernel
- - weka.classifiers.functions.supportVector.StringKernel

All Implemented Interfaces:

java.io.Serializable, CapabilitiesHandler, OptionHandler, RevisionHandler, TechnicalInformationHandler
```
public class StringKernel
extends Kernel
implements TechnicalInformationHandler
```
Implementation of the subsequence kernel (SSK) as described in [1] and of the subsequence kernel with lambda pruning (SSK-LP) as described in [2].

For more information, see

Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins (2002). Text Classification using String Kernels. Journal of Machine Learning Research. 2:419-444.

F. Kleedorfer, A. Seewald (2005). Implementation of a String Kernel for WEKA. Wien, Austria.
BibTeX:
```
 @article{Lodhi2002,
    author = {Huma Lodhi and Craig Saunders and John Shawe-Taylor and Nello Cristianini and Christopher J. C. H. Watkins},
    journal = {Journal of Machine Learning Research},
    pages = {419-444},
    title = {Text Classification using String Kernels},
    volume = {2},
    year = {2002},
    HTTP = {http://www.jmlr.org/papers/v2/lodhi02a.html}
 }
 
 @techreport{Kleedorfer2005,
    address = {Wien, Austria},
    author = {F. Kleedorfer and A. Seewald},
    institution = {Oesterreichisches Forschungsinstitut fuer Artificial Intelligence},
    number = {TR-2005-13},
    title = {Implementation of a String Kernel for WEKA},
    year = {2005}
 }
 
```
Valid options are:
```
 -D
  Enables debugging output (if available) to be printed.
  (default: off)
```
```
 -no-checks
  Turns off all checks - use with caution!
  (default: checks on)
```
```
 -P <0|1>
  The pruning method to use:
  0 = No pruning
  1 = Lambda pruning
  (default: 0)
```
```
 -C <num>
  The size of the cache (a prime number).
  (default: 250007)
```
```
 -IC <num>
  The size of the internal cache (a prime number).
  (default: 200003)
```
```
 -L <num>
  The lambda constant. Penalizes non-continuous subsequence
  matches. Must be in (0,1).
  (default: 0.5)
```
```
 -ssl <num>
  The length of the subsequence.
  (default: 3)
```
```
 -ssl-max <num>
  The maximum length of the subsequence.
  (default: 9)
```
```
 -N
  Use normalization.
  (default: no)
```
Theory

Overview
The algorithm computes a measure of similarity between two texts based on the number and form of their common subsequences, which need not be contiguous. This method can be parametrized by specifying the subsequence length k, the penalty factor lambda, which penalizes non-contiguous matches, and optional 'lambda pruning', which takes maxLambdaExponent, m, as parameter. Lambda pruning causes very 'stretched' substring matches not to be counted, thus speeding up the computation. The functionality of SSK and SSK-LP is explained in the following using simple examples.
Explanation & Examples
for all of the following examples, we assume these parameter values:
```
 
k=2
lambda=0.5
m=8 (for SSK-LP examples)
```
SSK

Example 1
```
SSK(2,"ab","axb")=0.5^5 = 0,03125
```
There is one subsequence of the length of 2 that both strings have in common, "ab". The result of SSK is computed by raising lambda to the power of L, where L is the length of the subsequence match in the one string plus the length of the subsequence match in the other, in our case:
```
   ab    axb
L= 2  +   3 = 5
 
```
hence, the kernel yields 0.5^5 = 0,03125
Example 2
```
SSK(2,"ab","abb")=0.5^5 + 0.5^4 = 0,09375
```
Here, we also have one subsequence of the length of 2 that both strings have in common, "ab". The result of SSK is actually computed by summing over all values computed for each occurrence of a common subsequence match. In this example, there are two possible cases:
```
ab    abb
--    --  L=4
--    - - L=5
 
```
we have two matches, one of the length of 2+2=4, one of the length of 2+3=5, so we get the result 0.5^5 + 0.5^4 = 0,09375.
SSK-LP
Without lambda pruning, the string kernel finds *all* common subsequences of the given length, whereas with lambda pruning, common subsequence matches that are too much stretched in both strings are not taken into account. It is argued that the value yielded for such a common subsequence is too low (lambda ^(length[match_in_s] + length[match_in_t]) . Tests have shown that a tremendous speedup can be achieved using this technique while suffering from very little quality loss.
Lambda pruning is parametrized by the maximum lambda exponent. It is a good idea to choose that value to be about 3 or 4 times the subsequence length as a rule of thumb. YMMV.
Example 3
Without lambda pruning, one common subsequence, "AB" would be found in the following two strings. (With k=2)
```
SSK(2,"ab","axb")=0.5^14 = 0,00006103515625
```
lambda pruning allows for the control of the match length. So, if m (the maximum lambda exponent) is e.g. 8, these two strings would yield a kernel value of 0:
```
with lambda pruning:    SSK-LP(2,8,"AxxxxxxxxxB","AyB")= 0
without lambda pruning: SSK(2,"AxxxxxxxxxB","AyB")= 0.5^14 = 0,00006103515625  
```
This is because the exponent for lambda (=the length of the subsequence match) would be 14, which is > 8. In Contrast, the next result is > 0
```
m=8
SSK-LP(2,8,"AxxB","AyyB")=0.5^8 = 0,00390625
```
because the lambda exponent would be 8, which is just accepted by lambda pruning.
Normalization
When the string kernel is used for its main purpose, as the kernel of a support vector machine, it is not normalized. The normalized kernel can be switched on by -F (feature space normalization) but is much slower. Like most unnormalized kernels, K(x,x) is not a fixed value, see the next example.
Example 4
```
SSK(2,"ab","ab")=0.5^4 = 0.0625
SSK(2,"AxxxxxxxxxB","AxxxxxxxxxB") = 12.761724710464478
```
SSK is evaluated twice, each time for two identical strings. A good measure of similarity would produce the same value in both cases, which should indicate the same level of similarity. The value of the normalized SSK would be 1.0 in both cases. So for the purpose of computing string similarity the normalized kernel should be used. For SVM the unnormalized kernel is usually sufficient.
Complexity of SSK and SSK-LP
The time complexity of this method (without lambda pruning and with an infinitely large cache) is
```
O(k*|s|*|t|)
```
Lambda Pruning has a complexity (without caching) of
```
O(m*binom(m,k)^2*(|s|+n)*|t|)
```
```
k...          subsequence length (ssl)
s,t...        strings
|s|...        length of string s
binom(x,y)... binomial coefficient (x!/[(x-y)!y!])
m...          maxLambdaExponent (ssl-max)
```
Keep in mind that execution time can increase fast for long strings and big values for k, especially if you don't use lambda pruning. With lambda pruning, computation is usually so fast that switching on the cache leads to slower computation because of setup costs. Therefore caching is switched off for lambda pruning.

For details and qualitative experiments about SSK, see [1]
For details about lambda pruning and performance comparison of SSK and SSK-LP (SSK with lambda pruning), see [2] Note that the complexity estimation in [2] assumes no caching of intermediate results, which has been implemented in the meantime and greatly improves the speed of the SSK without lambda pruning.

Notes for usage within Weka
Only instances of the following form can be processed using string kernels:
```
+----------+-------------+---------------+
|attribute#|     0       |       1       |
+----------+-------------+---------------+
| content  | [text data] | [class label] |
+----------------------------------------+
 ... or ...
+----------+---------------+-------------+
|attribute#|     0         |     1       |
+----------+---------------+-------------+
| content  | [class label] | [text data] |
+----------------------------------------+
```
Version:

$Revision: 5518 $

Author:

Florian Kleedorfer (kleedorfer@austria.fm), Alexander K. Seewald (alex@seewald.at)

See Also:

Serialized Form

Field Summary

Fields
Modifier and Type	Field and Description
`static int`	`PRUNING_LAMBDA` Pruning method: Lambda See [2] for details.
`static int`	`PRUNING_NONE` Pruning method: No Pruning
`static Tag[]`	`TAGS_PRUNING` Pruning methods

Constructor Summary

Constructors
Constructor and Description
`StringKernel()` default constructor
`StringKernel(Instances data, int cacheSize, int subsequenceLength, double lambda, boolean debug)` creates a new StringKernel object.

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`void`	`buildKernel(Instances data)` builds the kernel with the given data.
`java.lang.String`	`cacheSizeTipText()` Returns the tip text for this property
`void`	`clean()` Frees the memory used by the kernel.
`double`	`eval(int id1, int id2, Instance inst1)` Computes the result of the kernel function for two instances.
`int`	`getCacheSize()` Gets the size of the cache
`Capabilities`	`getCapabilities()` Returns the Capabilities of this kernel.
`int`	`getInternalCacheSize()` Gets the size of the internal cache
`double`	`getLambda()` Gets the lambda constant used in the string kernel
`int`	`getMaxSubsequenceLength()` Returns the maximum length of the subsequence
`java.lang.String[]`	`getOptions()` Gets the current settings of the Kernel.
`SelectedTag`	`getPruningMethod()` Gets the method used for pruning.
`java.lang.String`	`getRevision()` Returns the revision string.
`int`	`getSubsequenceLength()` Returns the length of the subsequence
`TechnicalInformation`	`getTechnicalInformation()` Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
`boolean`	`getUseNormalization()` Returns whether normalization is used.
`java.lang.String`	`globalInfo()` Returns a string describing the kernel
`java.lang.String`	`internalCacheSizeTipText()` Returns the tip text for this property
`java.lang.String`	`lambdaTipText()` Returns the tip text for this property
`java.util.Enumeration`	`listOptions()` Returns an enumeration describing the available options.
`java.lang.String`	`maxSubsequenceLengthTipText()` Returns the tip text for this property
`double`	`normalizedKernel(char[] s, char[] t)` evaluates the normalized kernel between s and t.
`int`	`numCacheHits()` Returns the number of dot product cache hits.
`int`	`numEvals()` Returns the number of kernel evaluation performed.
`java.lang.String`	`pruningMethodTipText()` Returns the tip text for this property
`void`	`setCacheSize(int value)` Sets the size of the cache to use (a prime number)
`void`	`setInternalCacheSize(int value)` sets the size of the internal cache for intermediate results.
`void`	`setLambda(double value)` Sets the lambda constant used in the string kernel
`void`	`setMaxSubsequenceLength(int value)` Sets the maximum length of the subsequence.
`void`	`setOptions(java.lang.String[] options)` Parses a given list of options.
`void`	`setPruningMethod(SelectedTag value)` Sets the method used to for pruning.
`void`	`setSubsequenceLength(int value)` Sets the length of the subsequence.
`void`	`setUseNormalization(boolean value)` Sets whether to use normalization.
`java.lang.String`	`subsequenceLengthTipText()` Returns the tip text for this property
`double`	`unnormalizedKernel(char[] s, char[] t)` evaluates the unnormalized kernel between s and t.
`java.lang.String`	`useNormalizationTipText()` Returns the tip text for this property

Methods inherited from class weka.classifiers.functions.supportVector.Kernel
checksTurnedOffTipText, debugTipText, forName, getChecksTurnedOff, getDebug, makeCopies, makeCopy, setChecksTurnedOff, setDebug

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - PRUNING_NONE
```
public static final int PRUNING_NONE
```
    Pruning method: No Pruning
    
    See Also:
    
    Constant Field Values
  - PRUNING_LAMBDA
```
public static final int PRUNING_LAMBDA
```
    Pruning method: Lambda See [2] for details.
    
    See Also:
    
    Constant Field Values
  - TAGS_PRUNING
```
public static final Tag[] TAGS_PRUNING
```
    Pruning methods
- Constructor Detail
  - StringKernel
```
public StringKernel()
```
    default constructor
  - StringKernel
```
public StringKernel(Instances data,
                    int cacheSize,
                    int subsequenceLength,
                    double lambda,
                    boolean debug)
             throws java.lang.Exception
```
    creates a new StringKernel object. Initializes the kernel cache and the 'lambda cache', i.e. the precalculated powers of lambda from lambda^2 to lambda^MAX_POWER_OF_LAMBDA
    
    Parameters:
    
    data - the dataset to use
    
    cacheSize - the size of the cache
    
    subsequenceLength - the subsequence length
    
    lambda - the lambda value
    
    debug - whether to output debug information
    
    Throws:
    
    java.lang.Exception - if something goes wrong
- Method Detail
  - globalInfo
```
public java.lang.String globalInfo()
```
    Returns a string describing the kernel
    
    Specified by:
    
    globalInfo in class Kernel
    
    Returns:
    
    a description suitable for displaying in the explorer/experimenter gui
  - getTechnicalInformation
```
public TechnicalInformation getTechnicalInformation()
```
    Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
    
    Specified by:
    
    getTechnicalInformation in interface TechnicalInformationHandler
    
    Returns:
    
    the technical information about this class
  - listOptions
```
public java.util.Enumeration listOptions()
```
    Returns an enumeration describing the available options.
    
    Specified by:
    
    listOptions in interface OptionHandler
    
    Overrides:
    
    listOptions in class Kernel
    
    Returns:
    
    an enumeration of all the available options.
  - setOptions
```
public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
```
    Parses a given list of options.
    Valid options are:
```
 -D
  Enables debugging output (if available) to be printed.
  (default: off)
```
```
 -no-checks
  Turns off all checks - use with caution!
  (default: checks on)
```
```
 -P <0|1>
  The pruning method to use:
  0 = No pruning
  1 = Lambda pruning
  (default: 0)
```
```
 -C <num>
  The size of the cache (a prime number).
  (default: 250007)
```
```
 -IC <num>
  The size of the internal cache (a prime number).
  (default: 200003)
```
```
 -L <num>
  The lambda constant. Penalizes non-continuous subsequence
  matches. Must be in (0,1).
  (default: 0.5)
```
```
 -ssl <num>
  The length of the subsequence.
  (default: 3)
```
```
 -ssl-max <num>
  The maximum length of the subsequence.
  (default: 9)
```
```
 -N
  Use normalization.
  (default: no)
```
    Specified by:
    
    setOptions in interface OptionHandler
    
    Overrides:
    
    setOptions in class Kernel
    
    Parameters:
    
    options - the list of options as an array of strings
    
    Throws:
    
    java.lang.Exception - if an option is not supported
  - getOptions
```
public java.lang.String[] getOptions()
```
    Gets the current settings of the Kernel.
    
    Specified by:
    
    getOptions in interface OptionHandler
    
    Overrides:
    
    getOptions in class Kernel
    
    Returns:
    
    an array of strings suitable for passing to setOptions
  - pruningMethodTipText
```
public java.lang.String pruningMethodTipText()
```
    Returns the tip text for this property
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - setPruningMethod
```
public void setPruningMethod(SelectedTag value)
```
    Sets the method used to for pruning.
    
    Parameters:
    
    value - the pruning method to use.
  - getPruningMethod
```
public SelectedTag getPruningMethod()
```
    Gets the method used for pruning.
    
    Returns:
    
    the pruning method to use.
  - setCacheSize
```
public void setCacheSize(int value)
```
    Sets the size of the cache to use (a prime number)
    
    Parameters:
    
    value - the size of the cache
  - getCacheSize
```
public int getCacheSize()
```
    Gets the size of the cache
    
    Returns:
    
    the cache size
  - cacheSizeTipText
```
public java.lang.String cacheSizeTipText()
```
    Returns the tip text for this property
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - setInternalCacheSize
```
public void setInternalCacheSize(int value)
```
    sets the size of the internal cache for intermediate results. Memory consumption is about 16x this amount in bytes. Only use when lambda pruning is switched off.
    
    Parameters:
    
    value - the size of the internal cache
  - getInternalCacheSize
```
public int getInternalCacheSize()
```
    Gets the size of the internal cache
    
    Returns:
    
    the cache size
  - internalCacheSizeTipText
```
public java.lang.String internalCacheSizeTipText()
```
    Returns the tip text for this property
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - setSubsequenceLength
```
public void setSubsequenceLength(int value)
```
    Sets the length of the subsequence.
    
    Parameters:
    
    value - the length
  - getSubsequenceLength
```
public int getSubsequenceLength()
```
    Returns the length of the subsequence
    
    Returns:
    
    the length
  - subsequenceLengthTipText
```
public java.lang.String subsequenceLengthTipText()
```
    Returns the tip text for this property
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - setMaxSubsequenceLength
```
public void setMaxSubsequenceLength(int value)
```
    Sets the maximum length of the subsequence.
    
    Parameters:
    
    value - the maximum length
  - getMaxSubsequenceLength
```
public int getMaxSubsequenceLength()
```
    Returns the maximum length of the subsequence
    
    Returns:
    
    the maximum length
  - maxSubsequenceLengthTipText
```
public java.lang.String maxSubsequenceLengthTipText()
```
    Returns the tip text for this property
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - setLambda
```
public void setLambda(double value)
```
    Sets the lambda constant used in the string kernel
    
    Parameters:
    
    value - the lambda value to use
  - getLambda
```
public double getLambda()
```
    Gets the lambda constant used in the string kernel
    
    Returns:
    
    the current lambda constant
  - lambdaTipText
```
public java.lang.String lambdaTipText()
```
    Returns the tip text for this property
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - setUseNormalization
```
public void setUseNormalization(boolean value)
```
    Sets whether to use normalization. Each time this value is changed, the kernel cache is cleared.
    
    Parameters:
    
    value - whether to use normalization
  - getUseNormalization
```
public boolean getUseNormalization()
```
    Returns whether normalization is used.
    
    Returns:
    
    true if normalization is used
  - useNormalizationTipText
```
public java.lang.String useNormalizationTipText()
```
    Returns the tip text for this property
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - eval
```
public double eval(int id1,
                   int id2,
                   Instance inst1)
            throws java.lang.Exception
```
    Computes the result of the kernel function for two instances. If id1 == -1, eval use inst1 instead of an instance in the dataset.
    
    Specified by:
    
    eval in class Kernel
    
    Parameters:
    
    id1 - the index of the first instance in the dataset
    
    id2 - the index of the second instance in the dataset
    
    inst1 - the instance corresponding to id1 (used if id1 == -1)
    
    Returns:
    
    the result of the kernel function
    
    Throws:
    
    java.lang.Exception - if something goes wrong
  - clean
```
public void clean()
```
    Frees the memory used by the kernel. (Useful with kernels which use cache.) This function is called when the training is done. i.e. after that, eval will be called with id1 == -1.
    
    Specified by:
    
    clean in class Kernel
  - numEvals
```
public int numEvals()
```
    Returns the number of kernel evaluation performed.
    
    Specified by:
    
    numEvals in class Kernel
    
    Returns:
    
    the number of kernel evaluation performed.
  - numCacheHits
```
public int numCacheHits()
```
    Returns the number of dot product cache hits.
    
    Specified by:
    
    numCacheHits in class Kernel
    
    Returns:
    
    the number of dot product cache hits, or -1 if not supported by this kernel.
  - normalizedKernel
```
public double normalizedKernel(char[] s,
                               char[] t)
```
    evaluates the normalized kernel between s and t. See [1] for details about the normalized SSK.
    
    Parameters:
    
    s - first input string
    
    t - second input string
    
    Returns:
    
    a double indicating their distance, or similarity
  - unnormalizedKernel
```
public double unnormalizedKernel(char[] s,
                                 char[] t)
```
    evaluates the unnormalized kernel between s and t. See [1] for details about the unnormalized SSK.
    
    Parameters:
    
    s - first input string
    
    t - second input string
    
    Returns:
    
    a double indicating their distance, or similarity
  - getCapabilities
```
public Capabilities getCapabilities()
```
    Returns the Capabilities of this kernel.
    
    Specified by:
    
    getCapabilities in interface CapabilitiesHandler
    
    Overrides:
    
    getCapabilities in class Kernel
    
    Returns:
    
    the capabilities of this object
    
    See Also:
    
    Capabilities
  - buildKernel
```
public void buildKernel(Instances data)
                 throws java.lang.Exception
```
    builds the kernel with the given data.
    
    Overrides:
    
    buildKernel in class Kernel
    
    Parameters:
    
    data - the data to base the kernel on
    
    Throws:
    
    java.lang.Exception - if something goes wrong, e.g., the data does not consist of one string attribute and the class
  - getRevision
```
public java.lang.String getRevision()
```
    Returns the revision string.
    
    Specified by:
    
    getRevision in interface RevisionHandler
    
    Overrides:
    
    getRevision in class Kernel
    
    Returns:
    
    the revision

Class StringKernel

Theory

Overview

Explanation & Examples

SSK

Example 1

Example 2

SSK-LP

Example 3

Normalization

Example 4

Complexity of SSK and SSK-LP

Notes for usage within Weka

Field Summary

Constructor Summary

Method Summary

Methods inherited from class weka.classifiers.functions.supportVector.Kernel

Methods inherited from class java.lang.Object

Field Detail

PRUNING_NONE

PRUNING_LAMBDA

TAGS_PRUNING

Constructor Detail

StringKernel

StringKernel

Method Detail

globalInfo

getTechnicalInformation

listOptions

setOptions

getOptions

pruningMethodTipText

setPruningMethod

getPruningMethod

setCacheSize

getCacheSize

cacheSizeTipText

setInternalCacheSize

getInternalCacheSize

internalCacheSizeTipText

setSubsequenceLength

getSubsequenceLength

subsequenceLengthTipText

setMaxSubsequenceLength

getMaxSubsequenceLength

maxSubsequenceLengthTipText

setLambda

getLambda

lambdaTipText

setUseNormalization

getUseNormalization

useNormalizationTipText

eval

clean

numEvals

numCacheHits

normalizedKernel

unnormalizedKernel

getCapabilities

buildKernel

getRevision