public class Canopy extends RandomizableClusterer implements UpdateableClusterer, NumberOfClustersRequestable, OptionHandler, TechnicalInformationHandler
@inproceedings{McCallum2000,
  author = {A. McCallum and K. Nigam and L.H. Ungar},
  booktitle = {Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining},
  pages = {169-178},
  title = {Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching},
  year = {2000}
}
Valid options are:
-N <num> Number of clusters. (default 2).
-max-candidates <num> Maximum number of candidate canopies to retain in memory at any one time. The T2 distance, together with the characteristics of the data, determines how many candidate canopies are formed before periodic and final pruning are performed, which might result in excess memory consumption. This setting prevents large numbers of candidate canopies from consuming memory. (default = 100)
-periodic-pruning <num> How often to prune low density canopies. (default = every 10,000 training instances)
-min-density Minimum canopy density, below which a canopy will be pruned during periodic pruning. (default = 2 instances)
-t2 The T2 distance to use. Values < 0 indicate that a heuristic based on attribute std. deviation should be used to set this. Note that this heuristic can only be used when batch training (default = -1.0)
-t1 The T1 distance to use. A value < 0 is taken as a positive multiplier for T2. (default = -1.5)
-M Don't replace missing values with mean/mode when running in batch mode.
-S <num> Random number seed. (default 1)
-output-debug-info If set, clusterer is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, clusterer capabilities are not checked before clusterer is built (use with caution).
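The roles of the two distances in the options above can be illustrated with a small, self-contained sketch. It is not Weka's implementation: the class `CanopySketch` and its methods are hypothetical, the data are one-dimensional, and points are visited in order rather than at random. It assumes the usual canopy scheme in which the tight threshold T2 decides whether a point may start a new canopy and the loose threshold T1 decides which canopies a point belongs to.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal 1-D canopy sketch (hypothetical helper, not part of Weka). */
final class CanopySketch {

    /**
     * Returns canopy centers for the given points, visiting them in order.
     * A point that is not within t2 of any existing center starts a new canopy.
     */
    static List<Double> buildCenters(double[] points, double t2) {
        List<Double> centers = new ArrayList<>();
        for (double p : points) {
            boolean covered = false;
            for (double c : centers) {
                if (Math.abs(p - c) < t2) { // tight threshold: p cannot become a center
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                centers.add(p);
            }
        }
        return centers;
    }

    /** Returns the indices of all canopies whose center lies within t1 of p. */
    static List<Integer> assign(double p, List<Double> centers, double t1) {
        List<Integer> members = new ArrayList<>();
        for (int i = 0; i < centers.size(); i++) {
            if (Math.abs(p - centers.get(i)) < t1) { // loose threshold: canopies may overlap
                members.add(i);
            }
        }
        return members;
    }

    public static void main(String[] args) {
        double[] points = {0.0, 0.5, 4.0, 4.5, 2.0};
        List<Double> centers = buildCenters(points, 1.0);
        System.out.println("centers = " + centers);
        System.out.println("canopies for 1.0 = " + assign(1.0, centers, 1.5));
    }
}
```

Because T1 is larger than T2 in this sketch, a point can fall inside several canopies at once, which is why the assignment returns a list rather than a single index.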
Modifier and Type | Field and Description |
---|---|
static double | DEFAULT_T1 |
static double | DEFAULT_T2 |
Constructor and Description |
---|
Canopy() |
Modifier and Type | Method and Description |
---|---|
static Canopy | aggregateCanopies(java.util.List<Canopy> canopies, double aggregationT1, double aggregationT2, NormalizableDistance finalDistanceFunction, Filter missingValuesReplacer, int finalNumCanopies): Aggregate the canopies from a list of Canopy clusterers together into one final model. |
long[] | assignCanopies(Instance inst): Uses T1 distance to assign canopies to the supplied instance. |
void | buildClusterer(Instances data): Generates a clusterer. |
void | cleanUp(): Save memory. |
double[] | distributionForInstance(Instance instance): Predicts the cluster memberships for a given instance. |
java.lang.String | dontReplaceMissingValuesTipText(): Returns the tip text for this property. |
double | getActualT1(): Get the actual value of T1 (which may be different from the initial value if the heuristic is used). |
double | getActualT2(): Get the actual value of T2 (which may be different from the initial value if the heuristic is used). |
Instances | getCanopies(): Get the canopies (cluster centers). |
Capabilities | getCapabilities(): Returns default capabilities of the clusterer. |
java.util.List<long[]> | getClusterCanopyAssignments(): Get the canopies that each canopy (cluster center) is within T1 distance of. |
boolean | getDontReplaceMissingValues(): Gets whether replacement of missing values is turned off. |
int | getMaxNumCandidateCanopiesToHoldInMemory(): Get the maximum number of candidate canopies to retain in memory during training. |
double | getMinimumCanopyDensity(): Get the minimum T2-based density below which a canopy will be pruned during periodic pruning. |
int | getNumClusters(): Get the number of clusters to generate. |
java.lang.String[] | getOptions(): Gets the current settings of Canopy. |
int | getPeriodicPruningRate(): Get how often to prune low density canopies during training. |
double | getT1(): Get the T1 distance. |
double | getT2(): Get the T2 distance to use. |
TechnicalInformation | getTechnicalInformation(): Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on. |
java.lang.String | globalInfo(): Returns a string describing this clusterer. |
void | initializeDistanceFunction(Instances init): Initialize the distance function (i.e., set min/max values for numeric attributes) with the supplied instances. |
java.util.Enumeration<Option> | listOptions(): Returns an enumeration describing the available options. |
static void | main(java.lang.String[] args) |
java.lang.String | maxNumCandidateCanopiesToHoldInMemory(): Returns the tip text for this property. |
java.lang.String | minimumCanopyDensityTipText(): Returns the tip text for this property. |
static boolean | nonEmptyCanopySetIntersection(long[] first, long[] second): Tests if two sets of canopies have a non-empty intersection. |
int | numberOfClusters(): Returns the number of clusters. |
java.lang.String | numClustersTipText(): Returns the tip text for this property. |
java.lang.String | periodicPruningRateTipText(): Returns the tip text for this property. |
static java.lang.String | printCanopyAssignments(Instances dataPoints, java.util.List<long[]> canopyAssignments): Print the supplied instances and their canopies. |
static java.lang.String | printSingleAssignment(long[] assignments) |
void | setCanopies(Instances canopies): Set the canopies to use (replaces any learned by this clusterer already). |
void | setClusterCanopyAssignments(java.util.List<long[]> clusterCanopies): Set the canopies that each canopy (cluster center) is within T1 distance of. |
void | setDontReplaceMissingValues(boolean r): Sets whether replacement of missing values is turned off. |
void | setMaxNumCandidateCanopiesToHoldInMemory(int max): Set the maximum number of candidate canopies to retain in memory during training. |
void | setMinimumCanopyDensity(double dens): Set the minimum T2-based density below which a canopy will be pruned during periodic pruning. |
void | setMissingValuesReplacer(Filter missingReplacer): Set a ready-to-use missing values replacement filter. |
void | setNumClusters(int numClusters): Set the number of clusters to generate. |
void | setOptions(java.lang.String[] options): Parses a given list of options. |
void | setPeriodicPruningRate(int p): Set how often to prune low density canopies during training. |
void | setT1(double t1): Set the T1 distance. |
void | setT2(double t2): Set the T2 distance to use. |
java.lang.String | t1TipText(): Tip text for this property. |
java.lang.String | t2TipText(): Tip text for this property. |
java.lang.String | toString() |
java.lang.String | toString(boolean header): Return a textual description of this clusterer. |
void | updateClusterer(Instance newInstance): Adds an instance to the clusterer. |
void | updateFinished(): Signals the end of the updating. |
Methods inherited from class RandomizableClusterer: getSeed, seedTipText, setSeed
Methods inherited from class AbstractClusterer: clusterInstance, debugTipText, doNotCheckCapabilitiesTipText, forName, getDebug, getDoNotCheckCapabilities, getRevision, makeCopies, makeCopy, postExecution, preExecution, run, runClusterer, setDebug, setDoNotCheckCapabilities
Methods inherited from class java.lang.Object: equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
makeCopy
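Several methods in the summary above exchange canopy memberships as `long[]` values (`assignCanopies`, `getClusterCanopyAssignments`, `nonEmptyCanopySetIntersection`). This page does not state the encoding. The sketch below assumes, purely for illustration, that each array is a bitset with one bit per canopy packed into 64-bit words; the class `CanopyBitsSketch` and its methods are hypothetical and are not part of Weka.

```java
/** Sketch of canopy membership stored as packed bits (an assumed encoding). */
final class CanopyBitsSketch {

    /** Packs the given canopy indices into a long[] with one bit per canopy. */
    static long[] pack(int numCanopies, int... memberOf) {
        long[] bits = new long[numCanopies / 64 + 1];
        for (int idx : memberOf) {
            bits[idx / 64] |= 1L << (idx % 64);
        }
        return bits;
    }

    /** Returns true if the two sets share at least one canopy. */
    static boolean intersects(long[] first, long[] second) {
        if (first.length != second.length) {
            throw new IllegalArgumentException("Canopy sets must have the same length");
        }
        for (int i = 0; i < first.length; i++) {
            if ((first[i] & second[i]) != 0L) { // any common bit means a shared canopy
                return true;
            }
        }
        return false;
    }
}
```

Under this assumption, testing whether two instances share a canopy costs one bitwise AND per 64 canopies, which is what makes canopies useful as a cheap pre-filter before an expensive distance computation.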
public static final double DEFAULT_T2
public static final double DEFAULT_T1
public java.lang.String globalInfo()
public TechnicalInformation getTechnicalInformation()
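Description copied from interface: TechnicalInformationHandler
Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
Specified by: getTechnicalInformation in interface TechnicalInformationHandler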
public Capabilities getCapabilities()
Specified by: getCapabilities in interface Clusterer
Specified by: getCapabilities in interface CapabilitiesHandler
Overrides: getCapabilities in class AbstractClusterer
See Also: Capabilities
public java.util.Enumeration<Option> listOptions()
Specified by: listOptions in interface OptionHandler
Overrides: listOptions in class RandomizableClusterer
public void setOptions(java.lang.String[] options) throws java.lang.Exception
-N <num> Number of clusters. (default 2).
-max-candidates <num> Maximum number of candidate canopies to retain in memory at any one time. The T2 distance, together with the characteristics of the data, determines how many candidate canopies are formed before periodic and final pruning are performed, which might result in excess memory consumption. This setting prevents large numbers of candidate canopies from consuming memory. (default = 100)
-periodic-pruning <num> How often to prune low density canopies. (default = every 10,000 training instances)
-min-density Minimum canopy density, below which a canopy will be pruned during periodic pruning. (default = 2 instances)
-t2 The T2 distance to use. Values < 0 indicate that a heuristic based on attribute std. deviation should be used to set this. Note that this heuristic can only be used when batch training (default = -1.0)
-t1 The T1 distance to use. A value < 0 is taken as a positive multiplier for T2. (default = -1.5)
-M Don't replace missing values with mean/mode when running in batch mode.
-S <num> Random number seed. (default 1)
-output-debug-info If set, clusterer is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, clusterer capabilities are not checked before clusterer is built (use with caution).
Specified by: setOptions in interface OptionHandler
Overrides: setOptions in class RandomizableClusterer
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
public java.lang.String[] getOptions()
Specified by: getOptions in interface OptionHandler
Overrides: getOptions in class RandomizableClusterer
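`setOptions` accepts, and `getOptions` returns, a flat `String[]` in which each flag and its value are separate elements. As a hedged illustration of that layout only, the hypothetical helper below (not part of Weka) assembles such an array from the flags documented above; the parameter choices are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles a Canopy-style option array. */
final class CanopyOptionsSketch {

    static String[] options(int numClusters, double t1, double t2, boolean keepMissing) {
        List<String> opts = new ArrayList<>();
        opts.add("-N");
        opts.add(Integer.toString(numClusters));
        opts.add("-t1");
        opts.add(Double.toString(t1));
        opts.add("-t2");
        opts.add(Double.toString(t2));
        if (keepMissing) {
            opts.add("-M"); // flag option: present or absent, carries no value
        }
        return opts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Three clusters with the documented default thresholds.
        System.out.println(String.join(" ", options(3, -1.5, -1.0, true)));
    }
}
```

An array built this way is the kind of value one would pass to `setOptions`; the sketch does not call Weka itself, so it runs without the library on the classpath.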
public static boolean nonEmptyCanopySetIntersection(long[] first, long[] second) throws java.lang.Exception
Parameters:
first - the first canopy set
second - the second canopy set
Throws:
java.lang.Exception - if a problem occurs
public long[] assignCanopies(Instance inst) throws java.lang.Exception
Parameters:
inst - the instance to find covering canopies for
Throws:
java.lang.Exception - if a problem occurs
public void updateClusterer(Instance newInstance) throws java.lang.Exception
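Description copied from interface: UpdateableClusterer
Adds an instance to the clusterer.
Specified by: updateClusterer in interface UpdateableClusterer
Parameters:
newInstance - the instance to be added
Throws:
java.lang.Exception - if something goes wrong
public double[] distributionForInstance(Instance instance) throws java.lang.Exception
Description copied from class: AbstractClusterer
Predicts the cluster memberships for a given instance.
Specified by: distributionForInstance in interface Clusterer
Overrides: distributionForInstance in class AbstractClusterer
Parameters:
instance - the instance to be assigned a cluster.
Throws:
java.lang.Exception - if distribution could not be computed successfully
public void updateFinished()
Description copied from interface: UpdateableClusterer
Signals the end of the updating.
Specified by: updateFinished in interface UpdateableClusterer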
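The incremental methods above (`updateClusterer`, then `updateFinished`) interact with the `-periodic-pruning` and `-min-density` options. The following sketch shows one way such a scheme could work. It is an assumption-laden illustration, not Weka's code: `StreamingCanopySketch` is a hypothetical class, the data are one-dimensional, and density is taken to be the number of instances that fell within T2 of a canopy center.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical streaming canopy builder with periodic density pruning. */
final class StreamingCanopySketch {
    private final double t2;
    private final int pruneEvery;
    private final int minDensity;
    private final List<Double> centers = new ArrayList<>();
    private final List<Integer> densities = new ArrayList<>();
    private long seen = 0;

    StreamingCanopySketch(double t2, int pruneEvery, int minDensity) {
        this.t2 = t2;
        this.pruneEvery = pruneEvery;
        this.minDensity = minDensity;
    }

    /** Processes one value: counts it toward a nearby canopy or starts a new one. */
    void update(double x) {
        boolean covered = false;
        for (int i = 0; i < centers.size(); i++) {
            if (Math.abs(x - centers.get(i)) < t2) {
                densities.set(i, densities.get(i) + 1);
                covered = true;
                break;
            }
        }
        if (!covered) {
            centers.add(x);
            densities.add(1);
        }
        seen++;
        if (seen % pruneEvery == 0) {
            prune(); // periodic pruning bounds the number of candidates held
        }
    }

    /** Final pruning once the stream has ended. */
    void finished() {
        prune();
    }

    /** Drops canopies whose density is below the minimum. */
    private void prune() {
        for (int i = centers.size() - 1; i >= 0; i--) {
            if (densities.get(i) < minDensity) {
                centers.remove(i);
                densities.remove(i);
            }
        }
    }

    List<Double> centers() {
        return new ArrayList<>(centers);
    }
}
```

In this sketch a sparse canopy created shortly before a pruning pass is discarded even if later data would have filled it, which is one reason the pruning interval and minimum density are worth tuning together.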
public void initializeDistanceFunction(Instances init) throws java.lang.Exception
Parameters:
init - the instances to initialize with
Throws:
java.lang.Exception - if a problem occurs
public void buildClusterer(Instances data) throws java.lang.Exception
Description copied from class: AbstractClusterer
Generates a clusterer.
Specified by: buildClusterer in interface Clusterer
Specified by: buildClusterer in class AbstractClusterer
Parameters:
data - set of instances serving as training data
Throws:
java.lang.Exception - if the clusterer has not been generated successfully
public int numberOfClusters() throws java.lang.Exception
Description copied from class: AbstractClusterer
Returns the number of clusters.
Specified by: numberOfClusters in interface Clusterer
Specified by: numberOfClusters in class AbstractClusterer
Throws:
java.lang.Exception - if number of clusters could not be returned successfully
public void setMissingValuesReplacer(Filter missingReplacer)
Parameters:
missingReplacer - the missing values replacement filter to use
public Instances getCanopies()
public void setCanopies(Instances canopies)
Parameters:
canopies - the canopies to use
public java.util.List<long[]> getClusterCanopyAssignments()
public void setClusterCanopyAssignments(java.util.List<long[]> clusterCanopies)
Parameters:
clusterCanopies - the list of canopies for each cluster center
public double getActualT2()
public double getActualT1()
public java.lang.String t1TipText()
public void setT1(double t1)
Parameters:
t1 - the T1 distance to use
public double getT1()
public java.lang.String t2TipText()
public void setT2(double t2)
Parameters:
t2 - the T2 distance to use
public double getT2()
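The option text says that a negative T1 "is taken as a positive multiplier for T2". A minimal sketch of that rule follows; `ThresholdSketch.effectiveT1` is a hypothetical helper, and whether `getActualT1()` computes its value in exactly this way is an assumption.

```java
/** Hypothetical helper illustrating the documented T1 multiplier rule. */
final class ThresholdSketch {

    /**
     * Resolves the T1 distance to use.
     * A negative t1 is read as a multiplier applied to the resolved T2;
     * a non-negative t1 is used as given.
     */
    static double effectiveT1(double t1, double actualT2) {
        return t1 < 0 ? -t1 * actualT2 : t1;
    }

    public static void main(String[] args) {
        // With the default t1 of -1.5 and a resolved T2 of 2.0, T1 becomes 3.0.
        System.out.println(effectiveT1(-1.5, 2.0));
    }
}
```

Expressing T1 relative to T2 keeps the loose threshold larger than the tight one even when T2 is chosen by the standard-deviation heuristic rather than set by hand.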
public java.lang.String numClustersTipText()
public void setNumClusters(int numClusters) throws java.lang.Exception
Description copied from interface: NumberOfClustersRequestable
Set the number of clusters to generate.
Specified by: setNumClusters in interface NumberOfClustersRequestable
Parameters:
numClusters - the number of clusters to generate
Throws:
java.lang.Exception - if the requested number of clusters is inappropriate
public int getNumClusters()
public java.lang.String periodicPruningRateTipText()
public void setPeriodicPruningRate(int p)
Parameters:
p - how often (every p instances) to prune low density canopies
public int getPeriodicPruningRate()
public java.lang.String minimumCanopyDensityTipText()
public void setMinimumCanopyDensity(double dens)
Parameters:
dens - the minimum canopy density
public double getMinimumCanopyDensity()
public java.lang.String maxNumCandidateCanopiesToHoldInMemory()
public void setMaxNumCandidateCanopiesToHoldInMemory(int max)
Parameters:
max - the maximum number of candidate canopies to retain in memory during training
public int getMaxNumCandidateCanopiesToHoldInMemory()
public java.lang.String dontReplaceMissingValuesTipText()
public void setDontReplaceMissingValues(boolean r)
Parameters:
r - true if missing values are not to be replaced
public boolean getDontReplaceMissingValues()
public static java.lang.String printSingleAssignment(long[] assignments)
public static java.lang.String printCanopyAssignments(Instances dataPoints, java.util.List<long[]> canopyAssignments)
Parameters:
dataPoints - the instances to print
canopyAssignments - the canopy assignments, one assignment array for each instance
public java.lang.String toString(boolean header)
Parameters:
header - true if the header should be printed
public java.lang.String toString()
Overrides: toString in class java.lang.Object
public void cleanUp()
public static Canopy aggregateCanopies(java.util.List<Canopy> canopies, double aggregationT1, double aggregationT2, NormalizableDistance finalDistanceFunction, Filter missingValuesReplacer, int finalNumCanopies)
Parameters:
canopies - the list of Canopy clusterers to aggregate
aggregationT1 - the T1 distance to use for the aggregated clusterer
aggregationT2 - the T2 distance to use when aggregating canopies
finalDistanceFunction - the distance function to use with the final Canopy clusterer
missingValuesReplacer - the missing value replacement filter to use with the final clusterer (can be null for no missing value replacement)
finalNumCanopies - the final number of canopies
public static void main(java.lang.String[] args)