public class SimpleKMeans extends RandomizableClusterer implements NumberOfClustersRequestable, WeightedInstancesHandler, TechnicalInformationHandler
@inproceedings{Arthur2007,
   author = {D. Arthur and S. Vassilvitskii},
   booktitle = {Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms},
   pages = {1027-1035},
   title = {k-means++: the advantages of careful seeding},
   year = {2007}
}
Valid options are:
-N <num> Number of clusters. (default 2).
-init Initialization method to use. 0 = random, 1 = k-means++, 2 = canopy, 3 = farthest first. (default = 0)
-C Use canopies to reduce the number of distance calculations.
-max-candidates <num> Maximum number of candidate canopies to retain in memory at any one time when using canopy clustering. The T2 distance, together with the data characteristics, determines how many candidate canopies are formed before periodic and final pruning are performed, which might result in excess memory consumption. This setting avoids large numbers of candidate canopies consuming memory. (default = 100)
-periodic-pruning <num> How often to prune low density canopies when using canopy clustering. (default = every 10,000 training instances)
-min-density Minimum canopy density, when using canopy clustering, below which a canopy will be pruned during periodic pruning. (default = 2 instances)
-t2 The T2 distance to use when using canopy clustering. Values < 0 indicate that a heuristic based on attribute std. deviation should be used to set this. (default = -1.0)
-t1 The T1 distance to use when using canopy clustering. A value < 0 is taken as a positive multiplier for T2. (default = -1.5)
-V Display std. deviations for centroids.
-M Don't replace missing values with mean/mode.
-A <classname and options> Distance function to use. (default: weka.core.EuclideanDistance)
-I <num> Maximum number of iterations.
-O Preserve order of instances.
-fast Enables faster distance calculations, using cut-off values. Disables the calculation/output of squared errors/distances.
-num-slots <num> Number of execution slots. (default 1 - i.e. no parallelism)
-S <num> Random number seed. (default 10)
-output-debug-info If set, clusterer is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, clusterer capabilities are not checked before clusterer is built (use with caution).
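The `-init 1` option selects the k-means++ seeding described in the cited paper. The following is a minimal, self-contained sketch of that seeding idea on plain `double[]` points. It is not Weka's implementation: the class `KMeansPlusPlusSketch` and its methods are hypothetical names introduced for illustration, and squared Euclidean distance is assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Illustrative k-means++ seeding; hypothetical names, not part of the Weka API. */
final class KMeansPlusPlusSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the indices of k points chosen as initial centres. */
    static List<Integer> chooseSeeds(double[][] points, int k, long seed) {
        int n = points.length;
        if (k < 1 || k > n) {
            throw new IllegalArgumentException("k must be between 1 and the number of points");
        }
        Random rnd = new Random(seed);
        List<Integer> seeds = new ArrayList<>();
        seeds.add(rnd.nextInt(n)); // first centre: chosen uniformly at random

        // minDist[i] = squared distance from point i to its nearest chosen centre
        double[] minDist = new double[n];
        Arrays.fill(minDist, Double.POSITIVE_INFINITY);

        while (seeds.size() < k) {
            double[] last = points[seeds.get(seeds.size() - 1)];
            double total = 0.0;
            int lastPositive = -1;
            for (int i = 0; i < n; i++) {
                minDist[i] = Math.min(minDist[i], squaredDistance(points[i], last));
                total += minDist[i];
                if (minDist[i] > 0.0) {
                    lastPositive = i;
                }
            }
            if (lastPositive < 0) {
                throw new IllegalArgumentException("fewer than k distinct points");
            }
            // Sample index i with probability minDist[i] / total.
            double r = rnd.nextDouble() * total;
            int chosen = lastPositive; // fallback guards against floating-point rounding
            double cumulative = 0.0;
            for (int i = 0; i < n; i++) {
                cumulative += minDist[i];
                if (cumulative > r) {
                    chosen = i;
                    break;
                }
            }
            seeds.add(chosen);
        }
        return seeds;
    }
}
```

Calling `KMeansPlusPlusSketch.chooseSeeds(points, 2, 10L)` returns two distinct indices into `points`. Points far from the centres already chosen are more likely to be picked, which is the property the seeding relies on.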
See Also: RandomizableClusterer

Modifier and Type | Field and Description |
---|---|
static int | CANOPY |
static int | FARTHEST_FIRST |
static int | KMEANS_PLUS_PLUS |
static int | RANDOM |
static Tag[] | TAGS_SELECTION Initialization methods |
Constructor and Description |
---|
SimpleKMeans() The default constructor. |
Modifier and Type | Method and Description |
---|---|
void | buildClusterer(Instances data) Generates a clusterer. |
java.lang.String | canopyMaxNumCanopiesToHoldInMemoryTipText() Returns the tip text for this property. |
java.lang.String | canopyMinimumCanopyDensityTipText() Returns the tip text for this property. |
java.lang.String | canopyPeriodicPruningRateTipText() Returns the tip text for this property. |
java.lang.String | canopyT1TipText() Returns the tip text for this property. |
java.lang.String | canopyT2TipText() Returns the tip text for this property. |
int | clusterInstance(Instance instance) Assigns a given instance to a cluster. |
java.lang.String | displayStdDevsTipText() Returns the tip text for this property. |
java.lang.String | distanceFunctionTipText() Returns the tip text for this property. |
java.lang.String | dontReplaceMissingValuesTipText() Returns the tip text for this property. |
java.lang.String | fastDistanceCalcTipText() Returns the tip text for this property. |
int[] | getAssignments() Gets the cluster assignment for each instance. |
int | getCanopyMaxNumCanopiesToHoldInMemory() Gets the maximum number of candidate canopies to retain in memory during training. |
double | getCanopyMinimumCanopyDensity() Gets the minimum T2-based density below which a canopy will be pruned during periodic pruning. |
int | getCanopyPeriodicPruningRate() Gets how often to prune low-density canopies during training (if using canopy clustering). |
double | getCanopyT1() Gets the T1 radius to use when canopy clustering is used for start points and/or to reduce the number of distance calculations. |
double | getCanopyT2() Gets the T2 radius to use when canopy clustering is used for start points and/or to reduce the number of distance calculations. |
Capabilities | getCapabilities() Returns the default capabilities of the clusterer. |
Instances | getClusterCentroids() Gets the cluster centroids. |
double[][][] | getClusterNominalCounts() Returns, for each cluster, the weighted frequency counts for the values of each nominal attribute. |
double[] | getClusterSizes() Gets the sum of weights for all the instances in each cluster. |
Instances | getClusterStandardDevs() Gets the standard deviations of the numeric attributes in each cluster. |
boolean | getDisplayStdDevs() Gets whether standard deviations and nominal counts are displayed in the clustering output. |
DistanceFunction | getDistanceFunction() Returns the distance function currently in use. |
boolean | getDontReplaceMissingValues() Gets whether replacement of missing values with the mean/mode is disabled. |
boolean | getFastDistanceCalc() Gets whether to use faster distance calculation. |
SelectedTag | getInitializationMethod() Gets the initialization method to use. |
int | getMaxIterations() Gets the maximum number of iterations to be executed. |
int | getNumClusters() Gets the number of clusters to generate. |
int | getNumExecutionSlots() Gets the degree of parallelism to use. |
java.lang.String[] | getOptions() Gets the current settings of SimpleKMeans. |
boolean | getPreserveInstancesOrder() Gets whether the order of instances must be preserved. |
boolean | getReduceNumberOfDistanceCalcsViaCanopies() Gets whether to use canopies to reduce the number of distance computations required. |
java.lang.String | getRevision() Returns the revision string. |
double | getSquaredError() Gets the squared error for all clusters. |
TechnicalInformation | getTechnicalInformation() Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on. |
java.lang.String | globalInfo() Returns a string describing this clusterer. |
java.lang.String | initializationMethodTipText() Returns the tip text for this property. |
java.util.Enumeration<Option> | listOptions() Returns an enumeration describing the available options. |
static void | main(java.lang.String[] args) Main method for executing this class. |
java.lang.String | maxIterationsTipText() Returns the tip text for this property. |
int | numberOfClusters() Returns the number of clusters. |
java.lang.String | numClustersTipText() Returns the tip text for this property. |
java.lang.String | numExecutionSlotsTipText() Returns the tip text for this property. |
java.lang.String | preserveInstancesOrderTipText() Returns the tip text for this property. |
java.lang.String | reduceNumberOfDistanceCalcsViaCanopiesTipText() Returns the tip text for this property. |
void | setCanopyMaxNumCanopiesToHoldInMemory(int max) Sets the maximum number of candidate canopies to retain in memory during training. |
void | setCanopyMinimumCanopyDensity(double dens) Sets the minimum T2-based density below which a canopy will be pruned during periodic pruning. |
void | setCanopyPeriodicPruningRate(int p) Sets how often to prune low-density canopies during training (if using canopy clustering). |
void | setCanopyT1(double t1) Sets the T1 radius to use when canopy clustering is used for start points and/or to reduce the number of distance calculations. |
void | setCanopyT2(double t2) Sets the T2 radius to use when canopy clustering is used for start points and/or to reduce the number of distance calculations. |
void | setDisplayStdDevs(boolean stdD) Sets whether standard deviations and nominal counts are displayed in the clustering output. |
void | setDistanceFunction(DistanceFunction df) Sets the distance function to use for instance comparison. |
void | setDontReplaceMissingValues(boolean r) Sets whether replacement of missing values with the mean/mode is disabled. |
void | setFastDistanceCalc(boolean value) Sets whether to use faster distance calculation. |
void | setInitializationMethod(SelectedTag method) Sets the initialization method to use. |
void | setMaxIterations(int n) Sets the maximum number of iterations to be executed. |
void | setNumClusters(int n) Sets the number of clusters to generate. |
void | setNumExecutionSlots(int slots) Sets the degree of parallelism to use. |
void | setOptions(java.lang.String[] options) Parses a given list of options. |
void | setPreserveInstancesOrder(boolean r) Sets whether the order of instances must be preserved. |
void | setReduceNumberOfDistanceCalcsViaCanopies(boolean c) Sets whether to use canopies to reduce the number of distance computations required. |
java.lang.String | toString() Returns a string describing this clusterer. |
Methods inherited from class RandomizableClusterer: getSeed, seedTipText, setSeed
Methods inherited from class AbstractClusterer: debugTipText, distributionForInstance, doNotCheckCapabilitiesTipText, forName, getDebug, getDoNotCheckCapabilities, makeCopies, makeCopy, postExecution, preExecution, run, runClusterer, setDebug, setDoNotCheckCapabilities
Methods inherited from class java.lang.Object: equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
makeCopy
public static final int RANDOM
public static final int KMEANS_PLUS_PLUS
public static final int CANOPY
public static final int FARTHEST_FIRST
public static final Tag[] TAGS_SELECTION
public TechnicalInformation getTechnicalInformation()
TechnicalInformationHandler
Specified by: getTechnicalInformation in interface TechnicalInformationHandler
public java.lang.String globalInfo()
public Capabilities getCapabilities()
Specified by: getCapabilities in interface Clusterer
Specified by: getCapabilities in interface CapabilitiesHandler
Overrides: getCapabilities in class AbstractClusterer
See Also: Capabilities
public void buildClusterer(Instances data) throws java.lang.Exception
Specified by: buildClusterer in interface Clusterer
Specified by: buildClusterer in class AbstractClusterer
Parameters: data - set of instances serving as training data
Throws: java.lang.Exception - if the clusterer has not been generated successfully
public int clusterInstance(Instance instance) throws java.lang.Exception
Specified by: clusterInstance in interface Clusterer
Overrides: clusterInstance in class AbstractClusterer
Parameters: instance - the instance to be assigned to a cluster
Throws: java.lang.Exception - if instance could not be classified successfully
public int numberOfClusters() throws java.lang.Exception
Specified by: numberOfClusters in interface Clusterer
Specified by: numberOfClusters in class AbstractClusterer
Throws: java.lang.Exception - if number of clusters could not be returned successfully
public java.util.Enumeration<Option> listOptions()
Specified by: listOptions in interface OptionHandler
Overrides: listOptions in class RandomizableClusterer
public java.lang.String numClustersTipText()
public void setNumClusters(int n) throws java.lang.Exception
Specified by: setNumClusters in interface NumberOfClustersRequestable
Parameters: n - the number of clusters to generate
Throws: java.lang.Exception - if number of clusters is negative
public int getNumClusters()
public java.lang.String initializationMethodTipText()
public void setInitializationMethod(SelectedTag method)
Parameters: method - the initialization method to use
public SelectedTag getInitializationMethod()
public java.lang.String reduceNumberOfDistanceCalcsViaCanopiesTipText()
public void setReduceNumberOfDistanceCalcsViaCanopies(boolean c)
Parameters: c - true if canopies are to be used to reduce the number of distance computations
public boolean getReduceNumberOfDistanceCalcsViaCanopies()
public java.lang.String canopyPeriodicPruningRateTipText()
public void setCanopyPeriodicPruningRate(int p)
Parameters: p - how often (every p instances) to prune low density canopies
public int getCanopyPeriodicPruningRate()
public java.lang.String canopyMinimumCanopyDensityTipText()
public void setCanopyMinimumCanopyDensity(double dens)
Parameters: dens - the minimum canopy density
public double getCanopyMinimumCanopyDensity()
public java.lang.String canopyMaxNumCanopiesToHoldInMemoryTipText()
public void setCanopyMaxNumCanopiesToHoldInMemory(int max)
Parameters: max - the maximum number of candidate canopies to retain in memory during training
public int getCanopyMaxNumCanopiesToHoldInMemory()
public java.lang.String canopyT2TipText()
public void setCanopyT2(double t2)
Parameters: t2 - the t2 radius to use
public double getCanopyT2()
public java.lang.String canopyT1TipText()
public void setCanopyT1(double t1)
Parameters: t1 - the t1 radius to use
public double getCanopyT1()
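The option list states that a T1 value below zero is taken as a positive multiplier for T2, so the default of -1.5 means a T1 of 1.5 times T2. A minimal sketch of that rule follows. `CanopyRadii.resolveT1` is a hypothetical helper, not a Weka method, and it assumes T2 has already been resolved to a positive value (for example by the standard-deviation heuristic used when `-t2` is negative).

```java
/** Illustrative helper; hypothetical name, not part of the Weka API. */
final class CanopyRadii {

    /**
     * Resolves the effective T1 radius.
     * A negative t1 is interpreted as a multiplier: effective T1 = |t1| * t2.
     * A non-negative t1 is used as given.
     */
    static double resolveT1(double t1, double t2) {
        if (t2 <= 0.0) {
            // Assumption: the caller has already replaced a negative T2
            // with a concrete positive distance.
            throw new IllegalArgumentException("t2 must be positive");
        }
        return t1 < 0.0 ? -t1 * t2 : t1;
    }
}
```

For instance, `CanopyRadii.resolveT1(-1.5, 2.0)` yields 3.0, while `CanopyRadii.resolveT1(4.0, 2.0)` leaves the explicit radius at 4.0.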
public java.lang.String maxIterationsTipText()
public void setMaxIterations(int n) throws java.lang.Exception
Parameters: n - the maximum number of iterations
Throws: java.lang.Exception - if maximum number of iterations is smaller than 1
public int getMaxIterations()
public java.lang.String displayStdDevsTipText()
public void setDisplayStdDevs(boolean stdD)
Parameters: stdD - true if std. devs and counts should be displayed
public boolean getDisplayStdDevs()
public java.lang.String dontReplaceMissingValuesTipText()
public void setDontReplaceMissingValues(boolean r)
Parameters: r - true if missing values are not to be replaced with the mean/mode
public boolean getDontReplaceMissingValues()
public java.lang.String distanceFunctionTipText()
public DistanceFunction getDistanceFunction()
public void setDistanceFunction(DistanceFunction df) throws java.lang.Exception
Parameters: df - the new distance function to use
Throws: java.lang.Exception - if instances cannot be processed
public java.lang.String preserveInstancesOrderTipText()
public void setPreserveInstancesOrder(boolean r)
Parameters: r - true if the order of instances is to be preserved
public boolean getPreserveInstancesOrder()
public java.lang.String fastDistanceCalcTipText()
public void setFastDistanceCalc(boolean value)
Parameters: value - true if faster calculation is to be used
public boolean getFastDistanceCalc()
public java.lang.String numExecutionSlotsTipText()
public void setNumExecutionSlots(int slots)
Parameters: slots - the number of execution slots, i.e. tasks to run in parallel (1 means no parallelism)
public int getNumExecutionSlots()
public void setOptions(java.lang.String[] options) throws java.lang.Exception
-N <num> Number of clusters. (default 2).
-init Initialization method to use. 0 = random, 1 = k-means++, 2 = canopy, 3 = farthest first. (default = 0)
-C Use canopies to reduce the number of distance calculations.
-max-candidates <num> Maximum number of candidate canopies to retain in memory at any one time when using canopy clustering. The T2 distance, together with the data characteristics, determines how many candidate canopies are formed before periodic and final pruning are performed, which might result in excess memory consumption. This setting avoids large numbers of candidate canopies consuming memory. (default = 100)
-periodic-pruning <num> How often to prune low density canopies when using canopy clustering. (default = every 10,000 training instances)
-min-density Minimum canopy density, when using canopy clustering, below which a canopy will be pruned during periodic pruning. (default = 2 instances)
-t2 The T2 distance to use when using canopy clustering. Values < 0 indicate that a heuristic based on attribute std. deviation should be used to set this. (default = -1.0)
-t1 The T1 distance to use when using canopy clustering. A value < 0 is taken as a positive multiplier for T2. (default = -1.5)
-V Display std. deviations for centroids.
-M Don't replace missing values with mean/mode.
-A <classname and options> Distance function to use. (default: weka.core.EuclideanDistance)
-I <num> Maximum number of iterations.
-O Preserve order of instances.
-fast Enables faster distance calculations, using cut-off values. Disables the calculation/output of squared errors/distances.
-num-slots <num> Number of execution slots. (default 1 - i.e. no parallelism)
-S <num> Random number seed. (default 10)
-output-debug-info If set, clusterer is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, clusterer capabilities are not checked before clusterer is built (use with caution).
Specified by: setOptions in interface OptionHandler
Overrides: setOptions in class RandomizableClusterer
Parameters: options - the list of options as an array of strings
Throws: java.lang.Exception - if an option is not supported
public java.lang.String[] getOptions()
Specified by: getOptions in interface OptionHandler
Overrides: getOptions in class RandomizableClusterer
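Since `setOptions` takes the flags as an array of strings, it can help to assemble that array programmatically. The sketch below builds such an array from a few of the documented flags (`-N`, `-init`, `-I`, `-S`, `-O`). `KMeansOptionsSketch.buildOptions` is a hypothetical helper, not part of Weka, and it uses only the standard library so that it runs without Weka on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative option builder; hypothetical name, not part of the Weka API. */
final class KMeansOptionsSketch {

    /**
     * Builds an option array using flags from the option list above.
     * initMethod follows the documented coding:
     * 0 = random, 1 = k-means++, 2 = canopy, 3 = farthest first.
     */
    static String[] buildOptions(int numClusters, int initMethod,
                                 int maxIterations, int seed,
                                 boolean preserveOrder) {
        List<String> opts = new ArrayList<>();
        opts.add("-N");
        opts.add(Integer.toString(numClusters));
        opts.add("-init");
        opts.add(Integer.toString(initMethod));
        opts.add("-I");
        opts.add(Integer.toString(maxIterations));
        opts.add("-S");
        opts.add(Integer.toString(seed));
        if (preserveOrder) {
            opts.add("-O"); // flag without an argument
        }
        return opts.toArray(new String[0]);
    }
}
```

With Weka available, the resulting array would be passed to `setOptions(String[])` on a `SimpleKMeans` instance before calling `buildClusterer`. Preserving instance order (`-O`) matters if `getAssignments()` is to be called afterwards, since that method throws an exception when the order was not preserved.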
public java.lang.String toString()
Overrides: toString in class java.lang.Object
public Instances getClusterCentroids()
public Instances getClusterStandardDevs()
public double[][][] getClusterNominalCounts()
public double getSquaredError()
See Also: m_FastDistanceCalc
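To make the quantities behind `clusterInstance` and `getSquaredError` concrete, here is a minimal sketch that assigns each point to its nearest centroid and sums the squared distances to the assigned centroids. It is not Weka's code: `SquaredErrorSketch` is a hypothetical class, it assumes unweighted instances, plain squared Euclidean distance, and no attribute normalization, so its values need not match what `getSquaredError()` reports.

```java
/** Illustrative within-cluster squared error; hypothetical names, not Weka API. */
final class SquaredErrorSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = squaredDistance(point, centroids[c]);
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /** Sum over all points of the squared distance to the nearest centroid. */
    static double squaredError(double[][] points, double[][] centroids) {
        double total = 0.0;
        for (double[] p : points) {
            total += squaredDistance(p, centroids[nearestCentroid(p, centroids)]);
        }
        return total;
    }
}
```

For the points (0,0), (2,0), (10,0) and the centroids (1,0), (10,0), the first two points fall into cluster 0 with a squared distance of 1 each and the third coincides with centroid 1, giving a total squared error of 2.0.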
public double[] getClusterSizes()
public int[] getAssignments() throws java.lang.Exception
Throws: java.lang.Exception - if order of instances wasn't preserved or no assignments were made
public java.lang.String getRevision()
Specified by: getRevision in interface RevisionHandler
Overrides: getRevision in class AbstractClusterer
public static void main(java.lang.String[] args)
Parameters: args - use -h to list all parameters