public class Canopy extends RandomizableClusterer implements UpdateableClusterer, NumberOfClustersRequestable, OptionHandler, TechnicalInformationHandler
@inproceedings{McCallum2000,
author = {A. McCallum and K. Nigam and L.H. Ungar},
booktitle = {Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
pages = {169-178},
title = {Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching},
year = {2000}
}
Valid options are:
-N <num> Number of clusters. (default 2).
-max-candidates <num> Maximum number of candidate canopies to retain in memory at any one time. The T2 distance, together with the characteristics of the data, determines how many candidate canopies are formed before periodic and final pruning are performed, which might result in excess memory consumption. This setting prevents large numbers of candidate canopies from consuming memory. (default = 100)
-periodic-pruning <num> How often to prune low density canopies. (default = every 10,000 training instances)
-min-density Minimum canopy density, below which a canopy will be pruned during periodic pruning. (default = 2 instances)
-t2 The T2 distance to use. Values < 0 indicate that a heuristic based on attribute std. deviation should be used to set this. Note that this heuristic can only be used when batch training (default = -1.0)
-t1 The T1 distance to use. A value < 0 is taken as a positive multiplier for T2. (default = -1.5)
-M Don't replace missing values with mean/mode when running in batch mode.
-S <num> Random number seed. (default 1)
-output-debug-info If set, clusterer is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, clusterer capabilities are not checked before clusterer is built (use with caution).
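
The relationship between `-t1` and `-t2` described above can be made concrete with a small sketch. It assumes T2 has already been resolved to a positive value (either supplied directly or chosen by the batch heuristic). The class and method names are hypothetical and are not part of Weka; this only illustrates the documented rule that a negative T1 is read as a multiplier of T2.

```java
// Hypothetical helper, not part of Weka: illustrates how a T1 option value
// could be turned into an actual distance, following the option text above.
final class ThresholdSketch {

    /**
     * Resolves the T1 distance.
     * A non-negative t1 is used as-is; a negative t1 is treated as a
     * positive multiplier applied to the already-resolved T2 distance.
     */
    static double resolveT1(double t1, double actualT2) {
        return t1 < 0 ? -t1 * actualT2 : t1;
    }

    public static void main(String[] args) {
        // Default t1 = -1.5 with a resolved T2 of 2.0 gives 1.5 * 2.0.
        System.out.println(resolveT1(-1.5, 2.0)); // prints 3.0
        // An explicit, non-negative T1 is returned unchanged.
        System.out.println(resolveT1(4.0, 2.0));  // prints 4.0
    }
}
```

With the defaults (`-t1 -1.5`, `-t2 -1.0`), T2 would first be set by the heuristic and T1 would then be 1.5 times that value, which is why `getActualT1()` and `getActualT2()` may differ from the values passed in.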
| Modifier and Type | Field and Description |
|---|---|
| `static double` | `DEFAULT_T1` |
| `static double` | `DEFAULT_T2` |
| Constructor and Description |
|---|
| `Canopy()` |
| Modifier and Type | Method and Description |
|---|---|
| `static Canopy` | `aggregateCanopies(java.util.List<Canopy> canopies, double aggregationT1, double aggregationT2, NormalizableDistance finalDistanceFunction, Filter missingValuesReplacer, int finalNumCanopies)` Aggregate the canopies from a list of Canopy clusterers together into one final model. |
| `long[]` | `assignCanopies(Instance inst)` Uses T1 distance to assign canopies to the supplied instance. |
| `void` | `buildClusterer(Instances data)` Generates a clusterer. |
| `void` | `cleanUp()` Save memory. |
| `double[]` | `distributionForInstance(Instance instance)` Predicts the cluster memberships for a given instance. |
| `java.lang.String` | `dontReplaceMissingValuesTipText()` Returns the tip text for this property. |
| `double` | `getActualT1()` Get the actual value of T1 (which may be different from the initial value if the heuristic is used). |
| `double` | `getActualT2()` Get the actual value of T2 (which may be different from the initial value if the heuristic is used). |
| `Instances` | `getCanopies()` Get the canopies (cluster centers). |
| `Capabilities` | `getCapabilities()` Returns default capabilities of the clusterer. |
| `java.util.List<long[]>` | `getClusterCanopyAssignments()` Get the canopies that each canopy (cluster center) is within T1 distance of. |
| `boolean` | `getDontReplaceMissingValues()` Gets whether replacement of missing values is disabled. |
| `int` | `getMaxNumCandidateCanopiesToHoldInMemory()` Get the maximum number of candidate canopies to retain in memory during training. |
| `double` | `getMinimumCanopyDensity()` Get the minimum T2-based density below which a canopy will be pruned during periodic pruning. |
| `int` | `getNumClusters()` Get the number of clusters to generate. |
| `java.lang.String[]` | `getOptions()` Gets the current settings of Canopy. |
| `int` | `getPeriodicPruningRate()` Get how often low density canopies are pruned during training. |
| `double` | `getT1()` Get the T1 distance. |
| `double` | `getT2()` Get the T2 distance to use. |
| `TechnicalInformation` | `getTechnicalInformation()` Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on. |
| `java.lang.String` | `globalInfo()` Returns a string describing this clusterer. |
| `void` | `initializeDistanceFunction(Instances init)` Initialize the distance function (i.e., set min/max values for numeric attributes) with the supplied instances. |
| `java.util.Enumeration<Option>` | `listOptions()` Returns an enumeration describing the available options. |
| `static void` | `main(java.lang.String[] args)` |
| `java.lang.String` | `maxNumCandidateCanopiesToHoldInMemory()` Returns the tip text for this property. |
| `java.lang.String` | `minimumCanopyDensityTipText()` Returns the tip text for this property. |
| `static boolean` | `nonEmptyCanopySetIntersection(long[] first, long[] second)` Tests if two sets of canopies have a non-empty intersection. |
| `int` | `numberOfClusters()` Returns the number of clusters. |
| `java.lang.String` | `numClustersTipText()` Returns the tip text for this property. |
| `java.lang.String` | `periodicPruningRateTipText()` Returns the tip text for this property. |
| `static java.lang.String` | `printCanopyAssignments(Instances dataPoints, java.util.List<long[]> canopyAssignments)` Print the supplied instances and their canopies. |
| `static java.lang.String` | `printSingleAssignment(long[] assignments)` |
| `void` | `setCanopies(Instances canopies)` Set the canopies to use (replaces any learned by this clusterer already). |
| `void` | `setClusterCanopyAssignments(java.util.List<long[]> clusterCanopies)` Set the canopies that each canopy (cluster center) is within T1 distance of. |
| `void` | `setDontReplaceMissingValues(boolean r)` Sets whether replacement of missing values is disabled. |
| `void` | `setMaxNumCandidateCanopiesToHoldInMemory(int max)` Set the maximum number of candidate canopies to retain in memory during training. |
| `void` | `setMinimumCanopyDensity(double dens)` Set the minimum T2-based density below which a canopy will be pruned during periodic pruning. |
| `void` | `setMissingValuesReplacer(Filter missingReplacer)` Set a ready-to-use missing values replacement filter. |
| `void` | `setNumClusters(int numClusters)` Set the number of clusters to generate. |
| `void` | `setOptions(java.lang.String[] options)` Parses a given list of options. |
| `void` | `setPeriodicPruningRate(int p)` Set how often low density canopies are pruned during training. |
| `void` | `setT1(double t1)` Set the T1 distance. |
| `void` | `setT2(double t2)` Set the T2 distance to use. |
| `java.lang.String` | `t1TipText()` Tip text for this property. |
| `java.lang.String` | `t2TipText()` Tip text for this property. |
| `java.lang.String` | `toString()` |
| `java.lang.String` | `toString(boolean header)` Return a textual description of this clusterer. |
| `void` | `updateClusterer(Instance newInstance)` Adds an instance to the clusterer. |
| `void` | `updateFinished()` Signals the end of the updating. |
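
Several methods in the summary above (`assignCanopies`, `nonEmptyCanopySetIntersection`, `getClusterCanopyAssignments`) pass canopy sets around as `long[]`. The sketch below shows one way such a set can work, assuming each `long` packs 64 membership flags so that bit `i` marks canopy `i`. That bit layout, the plain Euclidean distance, and all class and method names here are assumptions made for illustration; the real class works with a `NormalizableDistance` and its internal layout is not stated on this page.

```java
// Hypothetical sketch, not Weka code: canopy sets stored as packed bit flags.
final class CanopySetSketch {

    /** Creates an empty set able to hold indices 0..numCanopies-1. */
    static long[] emptySet(int numCanopies) {
        return new long[(numCanopies + 63) / 64];
    }

    /** Marks the canopy with the given index as a member of the set. */
    static void add(long[] set, int canopyIndex) {
        set[canopyIndex / 64] |= 1L << (canopyIndex % 64);
    }

    /** Returns true if the two sets share at least one canopy. */
    static boolean intersects(long[] first, long[] second) {
        if (first.length != second.length) {
            throw new IllegalArgumentException("Canopy sets must have the same length");
        }
        for (int i = 0; i < first.length; i++) {
            if ((first[i] & second[i]) != 0L) {
                return true;
            }
        }
        return false;
    }

    /** Plain Euclidean distance, standing in for the real distance function. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Assigns to the point every canopy whose center lies within T1 of it. */
    static long[] assign(double[] point, double[][] centers, double t1) {
        long[] set = emptySet(centers.length);
        for (int c = 0; c < centers.length; c++) {
            if (euclidean(point, centers[c]) < t1) {
                add(set, c);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        double[][] centers = {{0, 0}, {3, 0}, {10, 10}};
        long[] p = assign(new double[] {1, 0}, centers, 2.5);  // canopies 0 and 1
        long[] q = assign(new double[] {4, 0}, centers, 2.5);  // canopy 1 only
        long[] r = assign(new double[] {10, 9}, centers, 2.5); // canopy 2 only
        System.out.println(intersects(p, q)); // prints true
        System.out.println(intersects(p, r)); // prints false
    }
}
```

Two points whose sets do not intersect share no canopy, which is the property the canopy method uses to skip expensive distance computations between them.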
Methods inherited from class RandomizableClusterer: getSeed, seedTipText, setSeed

Methods inherited from class AbstractClusterer: clusterInstance, debugTipText, doNotCheckCapabilitiesTipText, forName, getDebug, getDoNotCheckCapabilities, getRevision, makeCopies, makeCopy, postExecution, preExecution, run, runClusterer, setDebug, setDoNotCheckCapabilities

Methods inherited from class java.lang.Object: equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

Also inherited: makeCopy

public static final double DEFAULT_T2

public static final double DEFAULT_T1

public java.lang.String globalInfo()

public TechnicalInformation getTechnicalInformation()
Specified by: getTechnicalInformation in interface TechnicalInformationHandler

public Capabilities getCapabilities()
Specified by: getCapabilities in interface Clusterer, getCapabilities in interface CapabilitiesHandler
Overrides: getCapabilities in class AbstractClusterer
See Also: Capabilities

public java.util.Enumeration<Option> listOptions()
Specified by: listOptions in interface OptionHandler
Overrides: listOptions in class RandomizableClusterer

public void setOptions(java.lang.String[] options) throws java.lang.Exception
Specified by: setOptions in interface OptionHandler
Overrides: setOptions in class RandomizableClusterer
Parameters: options - the list of options as an array of strings
Throws: java.lang.Exception - if an option is not supported

public java.lang.String[] getOptions()
Specified by: getOptions in interface OptionHandler
Overrides: getOptions in class RandomizableClusterer

public static boolean nonEmptyCanopySetIntersection(long[] first, long[] second) throws java.lang.Exception
Parameters:
  first - the first canopy set
  second - the second canopy set
Throws: java.lang.Exception - if a problem occurs

public long[] assignCanopies(Instance inst) throws java.lang.Exception
Parameters: inst - the instance to find covering canopies for
Throws: java.lang.Exception - if a problem occurs

public void updateClusterer(Instance newInstance) throws java.lang.Exception
Specified by: updateClusterer in interface UpdateableClusterer
Parameters: newInstance - the instance to be added
Throws: java.lang.Exception - if something goes wrong

public double[] distributionForInstance(Instance instance) throws java.lang.Exception
Specified by: distributionForInstance in interface Clusterer
Overrides: distributionForInstance in class AbstractClusterer
Parameters: instance - the instance to be assigned a cluster
Throws: java.lang.Exception - if distribution could not be computed successfully

public void updateFinished()
Specified by: updateFinished in interface UpdateableClusterer

public void initializeDistanceFunction(Instances init) throws java.lang.Exception
Parameters: init - the instances to initialize with
Throws: java.lang.Exception - if a problem occurs

public void buildClusterer(Instances data) throws java.lang.Exception
Specified by: buildClusterer in interface Clusterer
Overrides: buildClusterer in class AbstractClusterer
Parameters: data - set of instances serving as training data
Throws: java.lang.Exception - if the clusterer has not been generated successfully

public int numberOfClusters() throws java.lang.Exception
Specified by: numberOfClusters in interface Clusterer
Overrides: numberOfClusters in class AbstractClusterer
Throws: java.lang.Exception - if number of clusters could not be returned successfully

public void setMissingValuesReplacer(Filter missingReplacer)
Parameters: missingReplacer - the missing values replacement filter to use

public Instances getCanopies()

public void setCanopies(Instances canopies)
Parameters: canopies - the canopies to use

public java.util.List<long[]> getClusterCanopyAssignments()

public void setClusterCanopyAssignments(java.util.List<long[]> clusterCanopies)
Parameters: clusterCanopies - the list of canopies for each cluster center

public double getActualT2()

public double getActualT1()

public java.lang.String t1TipText()

public void setT1(double t1)
Parameters: t1 - the T1 distance to use

public double getT1()

public java.lang.String t2TipText()

public void setT2(double t2)
Parameters: t2 - the T2 distance to use

public double getT2()

public java.lang.String numClustersTipText()

public void setNumClusters(int numClusters) throws java.lang.Exception
Specified by: setNumClusters in interface NumberOfClustersRequestable
Parameters: numClusters - the number of clusters to generate
Throws: java.lang.Exception - if the requested number of clusters is inappropriate

public int getNumClusters()

public java.lang.String periodicPruningRateTipText()

public void setPeriodicPruningRate(int p)
Parameters: p - how often (every p instances) to prune low density canopies

public int getPeriodicPruningRate()

public java.lang.String minimumCanopyDensityTipText()

public void setMinimumCanopyDensity(double dens)
Parameters: dens - the minimum canopy density

public double getMinimumCanopyDensity()

public java.lang.String maxNumCandidateCanopiesToHoldInMemory()

public void setMaxNumCandidateCanopiesToHoldInMemory(int max)
Parameters: max - the maximum number of candidate canopies to retain in memory during training

public int getMaxNumCandidateCanopiesToHoldInMemory()

public java.lang.String dontReplaceMissingValuesTipText()

public void setDontReplaceMissingValues(boolean r)
Parameters: r - true if missing values are not to be replaced

public boolean getDontReplaceMissingValues()

public static java.lang.String printSingleAssignment(long[] assignments)

public static java.lang.String printCanopyAssignments(Instances dataPoints, java.util.List<long[]> canopyAssignments)
Parameters:
  dataPoints - the instances to print
  canopyAssignments - the canopy assignments, one assignment array for each instance

public java.lang.String toString(boolean header)
Parameters: header - true if the header should be printed

public java.lang.String toString()
Overrides: toString in class java.lang.Object

public void cleanUp()

public static Canopy aggregateCanopies(java.util.List<Canopy> canopies, double aggregationT1, double aggregationT2, NormalizableDistance finalDistanceFunction, Filter missingValuesReplacer, int finalNumCanopies)
Parameters:
  canopies - the list of Canopy clusterers to aggregate
  aggregationT1 - the T1 distance to use for the aggregated clusterer
  aggregationT2 - the T2 distance to use when aggregating canopies
  finalDistanceFunction - the distance function to use with the final Canopy clusterer
  missingValuesReplacer - the missing value replacement filter to use with the final clusterer (can be null for no missing value replacement)
  finalNumCanopies - the final number of canopies

public static void main(java.lang.String[] args)
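
The pruning behaviour controlled by `setPeriodicPruningRate` and `setMinimumCanopyDensity` can be pictured with a simplified streaming loop. The sketch below is not the library's implementation: it assumes that an instance within T2 of an existing candidate adds to that candidate's density, that any other instance founds a new candidate with density 1, and that every `pruningRate` instances all candidates below `minDensity` are dropped. The class, its fields, and the use of plain Euclidean distance are hypothetical, and the cap on candidates held in memory is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Weka code: incremental candidate canopies with periodic pruning.
final class PruningSketch {

    /** A candidate canopy: a center plus a count of instances seen within T2 of it. */
    static final class Candidate {
        final double[] center;
        double density = 1.0; // the founding instance counts toward the density

        Candidate(double[] center) {
            this.center = center;
        }
    }

    /** Plain Euclidean distance, standing in for the real distance function. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Processes the stream one instance at a time, pruning every pruningRate instances. */
    static List<Candidate> train(double[][] stream, double t2, int pruningRate, double minDensity) {
        List<Candidate> candidates = new ArrayList<>();
        int seen = 0;
        for (double[] inst : stream) {
            boolean covered = false;
            for (Candidate c : candidates) {
                if (euclidean(inst, c.center) < t2) {
                    c.density += 1.0; // instance falls inside this candidate's T2 radius
                    covered = true;
                }
            }
            if (!covered) {
                candidates.add(new Candidate(inst)); // start a new candidate canopy
            }
            seen++;
            if (seen % pruningRate == 0) {
                candidates.removeIf(c -> c.density < minDensity); // periodic pruning
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        double[][] stream = {{0.0}, {0.1}, {10.0}, {0.2}};
        List<Candidate> kept = train(stream, 1.0, 4, 2.0);
        // The isolated point at 10.0 has density 1 and is pruned after the 4th instance.
        System.out.println(kept.size());         // prints 1
        System.out.println(kept.get(0).density); // prints 3.0
    }
}
```

A lower pruning rate removes sparse candidates sooner and keeps memory use down, at the cost of possibly discarding a canopy that would have gained density later in the stream.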