public class StringKernel extends Kernel implements TechnicalInformationHandler
@article{Lodhi2002, author = {Huma Lodhi and Craig Saunders and John Shawe-Taylor and Nello Cristianini and Christopher J. C. H. Watkins}, journal = {Journal of Machine Learning Research}, pages = {419-444}, title = {Text Classification using String Kernels}, volume = {2}, year = {2002}, HTTP = {http://www.jmlr.org/papers/v2/lodhi02a.html} } @techreport{Kleedorfer2005, address = {Wien, Austria}, author = {F. Kleedorfer and A. Seewald}, institution = {Oesterreichisches Forschungsinstitut fuer Artificial Intelligence}, number = {TR-2005-13}, title = {Implementation of a String Kernel for WEKA}, year = {2005} }Valid options are:
-D Enables debugging output (if available) to be printed. (default: off)
-P <0|1> The pruning method to use: 0 = No pruning 1 = Lambda pruning (default: 0)
-C <num> The size of the cache (a prime number). (default: 250007)
-IC <num> The size of the internal cache (a prime number). (default: 200003)
-L <num> The lambda constant. Penalizes non-continuous subsequence matches. Must be in (0,1). (default: 0.5)
-ssl <num> The length of the subsequence. (default: 3)
-ssl-max <num> The maximum length of the subsequence. (default: 9)
-N Use normalization. (default: no)
m
, as
parameter. Lambda pruning causes very 'stretched' substring matches not to be
counted, thus speeding up the computation. The functionality of SSK and
SSK-LP is explained in the following using simple examples.
k=2 lambda=0.5 m=8 (for SSK-LP examples)
SSK(2,"ab","axb")=0.5^5 = 0,03125There is one subsequence of the length of 2 that both strings have in common, "ab". The result of SSK is computed by raising lambda to the power of L, where L is the length of the subsequence match in the one string plus the length of the subsequence match in the other, in our case:
ab axb L= 2 + 3 = 5hence, the kernel yields 0.5^5 = 0,03125
SSK(2,"ab","abb")=0.5^5 + 0.5^4 = 0,09375Here, we also have one subsequence of the length of 2 that both strings have in common, "ab". The result of SSK is actually computed by summing over all values computed for each occurrence of a common subsequence match. In this example, there are two possible cases:
ab abb -- -- L=4 -- - - L=5we have two matches, one of the length of 2+2=4, one of the length of 2+3=5, so we get the result 0.5^5 + 0.5^4 = 0,09375.
lambda ^(length[match_in_s] + length[match_in_t]
) . Tests have
shown that a tremendous speedup can be achieved using this technique while
suffering from very little quality loss. SSK(2,"ab","axb")=0.5^14 = 0,00006103515625lambda pruning allows for the control of the match length. So, if m (the maximum lambda exponent) is e.g. 8, these two strings would yield a kernel value of 0:
with lambda pruning: SSK-LP(2,8,"AxxxxxxxxxB","AyB")= 0 without lambda pruning: SSK(2,"AxxxxxxxxxB","AyB")= 0.5^14 = 0,00006103515625This is because the exponent for lambda (=the length of the subsequence match) would be 14, which is > 8. In Contrast, the next result is > 0
m=8 SSK-LP(2,8,"AxxB","AyyB")=0.5^8 = 0,00390625because the lambda exponent would be 8, which is just accepted by lambda pruning.
SSK(2,"ab","ab")=0.5^4 = 0.0625 SSK(2,"AxxxxxxxxxB","AxxxxxxxxxB") = 12.761724710464478SSK is evaluated twice, each time for two identical strings. A good measure of similarity would produce the same value in both cases, which should indicate the same level of similarity. The value of the normalized SSK would be 1.0 in both cases. So for the purpose of computing string similarity the normalized kernel should be used. For SVM the unnormalized kernel is usually sufficient.
O(k*|s|*|t|)Lambda Pruning has a complexity (without caching) of
O(m*binom(m,k)^2*(|s|+n)*|t|)
k... subsequence length (ssl) s,t... strings |s|... length of string s binom(x,y)... binomial coefficient (x!/[(x-y)!y!]) m... maxLambdaExponent (ssl-max)Keep in mind that execution time can increase fast for long strings and big values for k, especially if you don't use lambda pruning. With lambda pruning, computation is usually so fast that switching on the cache leads to slower computation because of setup costs. Therefore caching is switched off for lambda pruning.
+----------+-------------+---------------+ |attribute#| 0 | 1 | +----------+-------------+---------------+ | content | [text data] | [class label] | +----------------------------------------+ ... or ... +----------+---------------+-------------+ |attribute#| 0 | 1 | +----------+---------------+-------------+ | content | [class label] | [text data] | +----------------------------------------+
Modifier and Type | Field and Description |
---|---|
static int |
PRUNING_LAMBDA
Pruning method: Lambda See [2] for details.
|
static int |
PRUNING_NONE
Pruning method: No Pruning
|
static Tag[] |
TAGS_PRUNING
Pruning methods
|
Constructor and Description |
---|
StringKernel()
default constructor
|
StringKernel(Instances data,
int cacheSize,
int subsequenceLength,
double lambda,
boolean debug)
creates a new StringKernel object.
|
Modifier and Type | Method and Description |
---|---|
void |
buildKernel(Instances data)
builds the kernel with the given data.
|
java.lang.String |
cacheSizeTipText()
Returns the tip text for this property
|
void |
clean()
Frees the memory used by the kernel.
|
double |
eval(int id1,
int id2,
Instance inst1)
Computes the result of the kernel function for two instances.
|
int |
getCacheSize()
Gets the size of the cache
|
Capabilities |
getCapabilities()
Returns the Capabilities of this kernel.
|
int |
getInternalCacheSize()
Gets the size of the internal cache
|
double |
getLambda()
Gets the lambda constant used in the string kernel
|
int |
getMaxSubsequenceLength()
Returns the maximum length of the subsequence
|
java.lang.String[] |
getOptions()
Gets the current settings of the Kernel.
|
SelectedTag |
getPruningMethod()
Gets the method used for pruning.
|
java.lang.String |
getRevision()
Returns the revision string.
|
int |
getSubsequenceLength()
Returns the length of the subsequence
|
TechnicalInformation |
getTechnicalInformation()
Returns an instance of a TechnicalInformation object, containing detailed
information about the technical background of this class, e.g., paper
reference or book this class is based on.
|
boolean |
getUseNormalization()
Returns whether normalization is used.
|
java.lang.String |
globalInfo()
Returns a string describing the kernel
|
java.lang.String |
internalCacheSizeTipText()
Returns the tip text for this property
|
java.lang.String |
lambdaTipText()
Returns the tip text for this property
|
java.util.Enumeration<Option> |
listOptions()
Returns an enumeration describing the available options.
|
java.lang.String |
maxSubsequenceLengthTipText()
Returns the tip text for this property
|
double |
normalizedKernel(char[] s,
char[] t)
evaluates the normalized kernel between s and t.
|
int |
numCacheHits()
Returns the number of dot product cache hits.
|
int |
numEvals()
Returns the number of kernel evaluation performed.
|
java.lang.String |
pruningMethodTipText()
Returns the tip text for this property
|
void |
setCacheSize(int value)
Sets the size of the cache to use (a prime number)
|
void |
setInternalCacheSize(int value)
sets the size of the internal cache for intermediate results.
|
void |
setLambda(double value)
Sets the lambda constant used in the string kernel
|
void |
setMaxSubsequenceLength(int value)
Sets the maximum length of the subsequence.
|
void |
setOptions(java.lang.String[] options)
Parses a given list of options.
|
void |
setPruningMethod(SelectedTag value)
Sets the method used to for pruning.
|
void |
setSubsequenceLength(int value)
Sets the length of the subsequence.
|
void |
setUseNormalization(boolean value)
Sets whether to use normalization.
|
java.lang.String |
subsequenceLengthTipText()
Returns the tip text for this property
|
double |
unnormalizedKernel(char[] s,
char[] t)
evaluates the unnormalized kernel between s and t.
|
java.lang.String |
useNormalizationTipText()
Returns the tip text for this property
|
debugTipText, forName, getChecksTurnedOff, getDebug, getDoNotCheckCapabilities, makeCopies, makeCopy, setChecksTurnedOff, setDebug, setDoNotCheckCapabilities
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
makeCopy
public static final int PRUNING_NONE
public static final int PRUNING_LAMBDA
public static final Tag[] TAGS_PRUNING
public StringKernel()
public StringKernel(Instances data, int cacheSize, int subsequenceLength, double lambda, boolean debug) throws java.lang.Exception
data
- the dataset to usecacheSize
- the size of the cachesubsequenceLength
- the subsequence lengthlambda
- the lambda valuedebug
- whether to output debug informationjava.lang.Exception
- if something goes wrongpublic java.lang.String globalInfo()
globalInfo
in class Kernel
public TechnicalInformation getTechnicalInformation()
getTechnicalInformation
in interface TechnicalInformationHandler
public java.util.Enumeration<Option> listOptions()
listOptions
in interface OptionHandler
listOptions
in class Kernel
public void setOptions(java.lang.String[] options) throws java.lang.Exception
-D Enables debugging output (if available) to be printed. (default: off)
-P <0|1> The pruning method to use: 0 = No pruning 1 = Lambda pruning (default: 0)
-C <num> The size of the cache (a prime number). (default: 250007)
-IC <num> The size of the internal cache (a prime number). (default: 200003)
-L <num> The lambda constant. Penalizes non-continuous subsequence matches. Must be in (0,1). (default: 0.5)
-ssl <num> The length of the subsequence. (default: 3)
-ssl-max <num> The maximum length of the subsequence. (default: 9)
-N Use normalization. (default: no)
setOptions
in interface OptionHandler
setOptions
in class Kernel
options
- the list of options as an array of stringsjava.lang.Exception
- if an option is not supportedpublic java.lang.String[] getOptions()
getOptions
in interface OptionHandler
getOptions
in class Kernel
public java.lang.String pruningMethodTipText()
public void setPruningMethod(SelectedTag value)
value
- the pruning method to use.public SelectedTag getPruningMethod()
public void setCacheSize(int value)
value
- the size of the cachepublic int getCacheSize()
public java.lang.String cacheSizeTipText()
public void setInternalCacheSize(int value)
value
- the size of the internal cachepublic int getInternalCacheSize()
public java.lang.String internalCacheSizeTipText()
public void setSubsequenceLength(int value)
value
- the lengthpublic int getSubsequenceLength()
public java.lang.String subsequenceLengthTipText()
public void setMaxSubsequenceLength(int value)
value
- the maximum lengthpublic int getMaxSubsequenceLength()
public java.lang.String maxSubsequenceLengthTipText()
public void setLambda(double value)
value
- the lambda value to usepublic double getLambda()
public java.lang.String lambdaTipText()
public void setUseNormalization(boolean value)
value
- whether to use normalizationpublic boolean getUseNormalization()
public java.lang.String useNormalizationTipText()
public double eval(int id1, int id2, Instance inst1) throws java.lang.Exception
eval
in class Kernel
id1
- the index of the first instance in the datasetid2
- the index of the second instance in the datasetinst1
- the instance corresponding to id1 (used if id1 == -1)java.lang.Exception
- if something goes wrongpublic void clean()
public int numEvals()
public int numCacheHits()
numCacheHits
in class Kernel
public double normalizedKernel(char[] s, char[] t)
s
- first input stringt
- second input stringpublic double unnormalizedKernel(char[] s, char[] t)
s
- first input stringt
- second input stringpublic Capabilities getCapabilities()
getCapabilities
in interface CapabilitiesHandler
getCapabilities
in class Kernel
Capabilities
public void buildKernel(Instances data) throws java.lang.Exception
buildKernel
in class Kernel
data
- the data to base the kernel onjava.lang.Exception
- if something goes wrong, e.g., the data does not consist
of one string attribute and the classpublic java.lang.String getRevision()
getRevision
in interface RevisionHandler
getRevision
in class Kernel