DictionaryBuilder

java.lang.Object
- weka.core.DictionaryBuilder

All Implemented Interfaces:

java.io.Serializable, Aggregateable<DictionaryBuilder>, OptionHandler
```
public class DictionaryBuilder
extends java.lang.Object
implements Aggregateable<DictionaryBuilder>, OptionHandler, java.io.Serializable
```
Class for building and maintaining a dictionary of terms. It has methods for loading, saving, and aggregating dictionaries, and supports loading and saving in both binary and textual formats. The textual format is expected to have one or two comma-separated values per line, in the format
```
 term [,doc_count]
```
where `doc_count` is the number of documents in which the term has occurred.
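As an illustration of the textual format only, the following sketch reads such lines into a map using nothing but the Java standard library. It is not Weka's own parser: the class name `TextDictionarySketch`, the `MISSING_DOC_COUNT` placeholder, and the assumption that terms contain no commas are all hypothetical choices made for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative reader for "term[,doc_count]" lines; not Weka's own parser. */
final class TextDictionarySketch {

  /** Placeholder stored when a line omits doc_count (an assumption of this sketch). */
  static final int MISSING_DOC_COUNT = -1;

  /** Maps each term to its document count, preserving the order of the lines. */
  static Map<String, Integer> parse(List<String> lines) {
    Map<String, Integer> dictionary = new LinkedHashMap<>();
    for (String raw : lines) {
      String line = raw.trim();
      if (line.isEmpty()) {
        continue; // skip blank lines
      }
      // Assumes terms contain no commas: everything after the first comma is doc_count.
      String[] parts = line.split(",", 2);
      int docCount = parts.length == 2
          ? Integer.parseInt(parts[1].trim())
          : MISSING_DOC_COUNT;
      dictionary.put(parts[0].trim(), docCount);
    }
    return dictionary;
  }

  public static void main(String[] args) {
    // One line with a document count, one without.
    System.out.println(parse(Arrays.asList("weka,42", "dictionary")));
  }
}
```

To read or write real dictionary files, use the `loadDictionary` and `saveDictionary` methods of this class with `plainText` set to true.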
Version:

$Revision: 15574 $

Author:

Mark Hall (mhall{[at]}pentaho{[dot]}com)

See Also:

Serialized Form

Constructor Summary

Constructors
Constructor and Description

DictionaryBuilder()


Method Summary

Modifier and Type	Method and Description
`DictionaryBuilder`	`aggregate(DictionaryBuilder toAgg)` Aggregate an object with this one
`java.lang.String`	`attributeIndicesTipText()` Returns the tip text for this property.
`java.lang.String`	`attributeNamePrefixTipText()` Returns the tip text for this property.
`java.lang.String`	`doNotOperateOnPerClassBasisTipText()` Returns the tip text for this property.
`void`	`finalizeAggregation()` Call to complete the aggregation process.
`java.util.Map<java.lang.String,int[]>`	`finalizeDictionary()` Performs final pruning and consolidation according to the number of words to keep property.
`java.lang.String`	`getAttributeIndices()` Gets the current range selection.
`java.lang.String`	`getAttributeNamePrefix()` Get the attribute name prefix.
`double`	`getAverageDocLength()` Get the average document length to use when normalizing
`java.util.Map<java.lang.String,int[]>[]`	`getDictionaries(boolean minFrequencyPrune)` Get the current dictionary(s) (one per class for nominal class, if set).
`boolean`	`getDoNotOperateOnPerClassBasis()` Get the DoNotOperateOnPerClassBasis value.
`boolean`	`getIDFTransform()` Gets whether the word frequencies in a document should be transformed into: fij*log(num of Docs/num of Docs with word i) where fij is the frequency of word i in document(instance) j.
`Instances`	`getInputFormat()` Gets the currently set input format
`boolean`	`getInvertSelection()` Gets whether the supplied columns are to be processed or skipped.
`boolean`	`getLowerCaseTokens()` Gets whether the tokens are to be downcased or not.
`int`	`getMinTermFreq()` Get the MinTermFreq value.
`boolean`	`getNormalize()` Get whether word frequencies for a document should be normalized
`java.lang.String[]`	`getOptions()` Gets the current settings of the DictionaryBuilder
`boolean`	`getOutputWordCounts()` Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
`long`	`getPeriodicPruning()` Gets the rate (number of instances) at which the dictionary is periodically pruned.
`Range`	`getSelectedRange()` Get the value of m_SelectedRange.
`boolean`	`getSortDictionary()` Get whether to keep the dictionary sorted alphabetically as it is built.
`Stemmer`	`getStemmer()` Returns the current stemming algorithm, null if none is used.
`StopwordsHandler`	`getStopwordsHandler()` Gets the stopwords handler.
`boolean`	`getTFTransform()` Gets whether the word frequencies should be transformed into log(1+fij) where fij is the frequency of word i in document(instance) j.
`Tokenizer`	`getTokenizer()` Returns the current tokenizer algorithm.
`Instances`	`getVectorizedFormat()` Get the output format
`int`	`getWordsToKeep()` Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
`java.lang.String`	`IDFTransformTipText()` Returns the tip text for this property.
`java.lang.String`	`invertSelectionTipText()` Returns the tip text for this property.
`java.util.Enumeration<Option>`	`listOptions()` Returns an enumeration describing the available options.
`void`	`loadDictionary(java.io.File toLoad, boolean plainText)` Load a dictionary from a file
`void`	`loadDictionary(java.io.InputStream is)` Load a binary dictionary from an input stream
`void`	`loadDictionary(java.io.Reader reader)` Load a textual dictionary from a reader
`void`	`loadDictionary(java.lang.String filename, boolean plainText)` Load a dictionary from a file
`java.lang.String`	`lowerCaseTokensTipText()` Returns the tip text for this property.
`java.lang.String`	`minTermFreqTipText()` Returns the tip text for this property.
`java.lang.String`	`normalizeDocLengthTipText()` Returns the tip text for this property.
`java.lang.String`	`normalizeTipText()` Tip text for this property
`java.lang.String`	`outputWordCountsTipText()` Returns the tip text for this property.
`java.lang.String`	`periodicPruningTipText()` Returns the tip text for this property.
`void`	`processInstance(Instance inst)` Process an instance by tokenizing string attributes and updating the dictionary.
`boolean`	`readyToVectorize()` Returns true if this DictionaryBuilder is ready to vectorize incoming instances
`void`	`reset()` Clear the dictionary(s)
`void`	`saveDictionary(java.io.File toSave, boolean plainText)` Save a dictionary
`void`	`saveDictionary(java.io.OutputStream os)` Save the dictionary in binary form
`void`	`saveDictionary(java.lang.String filename, boolean plainText)` Save the dictionary
`void`	`saveDictionary(java.io.Writer writer)` Save the dictionary in textual format
`void`	`setAttributeIndices(java.lang.String rangeList)` Sets which attributes are to be worked on.
`void`	`setAttributeIndicesArray(int[] attributes)` Sets which attributes are to be processed.
`void`	`setAttributeNamePrefix(java.lang.String newPrefix)` Set the attribute name prefix.
`void`	`setAverageDocLength(double averageDocLength)` Set the average document length to use when normalizing
`void`	`setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)` Set the DoNotOperateOnPerClassBasis value.
`void`	`setIDFTransform(boolean IDFTransform)` Sets whether the word frequencies in a document should be transformed into: fij*log(num of Docs/num of Docs with word i) where fij is the frequency of word i in document(instance) j.
`void`	`setInvertSelection(boolean invert)` Sets whether selected columns should be processed or skipped.
`void`	`setLowerCaseTokens(boolean downCaseTokens)` Sets whether the tokens are to be downcased or not.
`void`	`setMinTermFreq(int newMinTermFreq)` Set the MinTermFreq value.
`void`	`setNormalize(boolean n)` Set whether word frequencies for a document should be normalized
`void`	`setOptions(java.lang.String[] options)` Parses a given list of options.
`void`	`setOutputWordCounts(boolean outputWordCounts)` Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
`void`	`setPeriodicPruning(long newPeriodicPruning)` Sets the rate (number of instances) at which the dictionary is periodically pruned
`void`	`setSelectedRange(java.lang.String newSelectedRange)` Set the value of m_SelectedRange.
`void`	`setSortDictionary(boolean sortDictionary)` Set whether to keep the dictionary sorted alphabetically as it is built.
`void`	`setStemmer(Stemmer value)` Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
`void`	`setStopwordsHandler(StopwordsHandler value)` Sets the stopwords handler to use.
`void`	`setTFTransform(boolean TFTransform)` Sets whether the word frequencies should be transformed into log(1+fij) where fij is the frequency of word i in document(instance) j.
`void`	`setTokenizer(Tokenizer value)` Sets the tokenizer algorithm to use.
`void`	`setup(Instances inputFormat)`
`void`	`setWordsToKeep(int newWordsToKeep)` Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
`java.lang.String`	`sortDictionaryTipText()` Tip text for this property
`java.lang.String`	`stemmerTipText()` Returns the tip text for this property.
`java.lang.String`	`stopwordsHandlerTipText()` Returns the tip text for this property.
`java.lang.String`	`TFTransformTipText()` Returns the tip text for this property.
`java.lang.String`	`tokenizerTipText()` Returns the tip text for this property.
`Instances`	`vectorizeBatch(Instances batch, boolean setAvgDocLength)` Convert a batch of instances
`Instance`	`vectorizeInstance(Instance input)` Convert an input instance.
`Instance`	`vectorizeInstance(Instance input, boolean retainStringAttValuesInMemory)` Convert an input instance.
`java.lang.String`	`wordsToKeepTipText()` Returns the tip text for this property.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface weka.core.OptionHandler
makeCopy

- Constructor Detail
  - DictionaryBuilder
```
public DictionaryBuilder()
```
- Method Detail
  - setAverageDocLength
```
@ProgrammaticProperty
public void setAverageDocLength(double averageDocLength)
```
    Set the average document length to use when normalizing
    
    Parameters:
    
    averageDocLength - the average document length to use
  - getAverageDocLength
```
public double getAverageDocLength()
```
    Get the average document length to use when normalizing
    
    Returns:
    
    the average document length
  - sortDictionaryTipText
```
public java.lang.String sortDictionaryTipText()
```
    Tip text for this property
    
    Returns:
    
    the tip text for this property
  - setSortDictionary
```
public void setSortDictionary(boolean sortDictionary)
```
    Set whether to keep the dictionary sorted alphabetically as it is built. Setting this to true uses a TreeMap internally (which is slower than the default unsorted LinkedHashMap).
    
    Parameters:
    
    sortDictionary - true to keep the dictionary sorted alphabetically
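    The trade-off between the two map types can be seen with the standard library alone. The sketch below is only an illustration of iteration order; the class `SortedDictionarySketch`, the `termOrder` helper, and the single-element count array are hypothetical and do not reflect how this class stores its entries.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Shows how the choice of map affects the order in which terms are iterated. */
final class SortedDictionarySketch {

  /** Counts the given terms and returns the dictionary keys in iteration order. */
  static List<String> termOrder(boolean sortDictionary, String... terms) {
    // TreeMap keeps keys in natural String order; LinkedHashMap keeps insertion order.
    Map<String, int[]> dictionary = sortDictionary
        ? new TreeMap<String, int[]>()
        : new LinkedHashMap<String, int[]>();
    for (String term : terms) {
      // Create the counter on first sight of a term, then increment it.
      dictionary.computeIfAbsent(term, k -> new int[1])[0]++;
    }
    return new ArrayList<>(dictionary.keySet());
  }

  public static void main(String[] args) {
    System.out.println(termOrder(true, "pear", "apple", "fig", "apple"));
    System.out.println(termOrder(false, "pear", "apple", "fig", "apple"));
  }
}
```

    A TreeMap pays a logarithmic cost per lookup or insert, whereas a hash-based map is constant time on average, which is consistent with the note that the sorted option is slower.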
  - getSortDictionary
```
public boolean getSortDictionary()
```
    Get whether to keep the dictionary sorted alphabetically as it is built. Setting this to true uses a TreeMap internally (which is slower than the default unsorted LinkedHashMap).
    
    Returns:
    
    true to keep the dictionary sorted alphabetically
  - getOutputWordCounts
```
public boolean getOutputWordCounts()
```
    Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
    
    Returns:
    
    true if word counts should be output.
  - setOutputWordCounts
```
public void setOutputWordCounts(boolean outputWordCounts)
```
    Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
    
    Parameters:
    
    outputWordCounts - true if word counts should be output.
  - outputWordCountsTipText
```
public java.lang.String outputWordCountsTipText()
```
    Returns the tip text for this property.
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - getSelectedRange
```
public Range getSelectedRange()
```
    Get the value of m_SelectedRange.
    
    Returns:
    
    Value of m_SelectedRange.
  - setSelectedRange
```
public void setSelectedRange(java.lang.String newSelectedRange)
```
    Set the value of m_SelectedRange.
    
    Parameters:
    
    newSelectedRange - Value to assign to m_SelectedRange.
  - attributeIndicesTipText
```
public java.lang.String attributeIndicesTipText()
```
    Returns the tip text for this property.
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - getAttributeIndices
```
public java.lang.String getAttributeIndices()
```
    Gets the current range selection.
    
    Returns:
    
    a string containing a comma separated list of ranges
  - setAttributeIndices
```
public void setAttributeIndices(java.lang.String rangeList)
```
    Sets which attributes are to be worked on.
    
    Parameters:
    
    rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
    eg: first-3,5,6-last
    
    Throws:
    
    java.lang.IllegalArgumentException - if an invalid range list is supplied
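    To make the 1-based convention concrete, the sketch below expands a range list such as `first-3,5,6-last` into 0-based indices. It is a simplified, hypothetical stand-in written for this example (the class `RangeListSketch` and its methods are not part of Weka), and it accepts only ascending ranges.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified illustration of 1-based range lists; not Weka's Range class. */
final class RangeListSketch {

  /** Expands a 1-based range list into sorted 0-based attribute indices. */
  static List<Integer> toZeroBased(String rangeList, int numAttributes) {
    boolean[] selected = new boolean[numAttributes];
    for (String part : rangeList.split(",")) {
      String[] ends = part.trim().split("-");
      if (ends.length > 2) {
        throw new IllegalArgumentException("Invalid range: " + part);
      }
      int low = position(ends[0], numAttributes);
      int high = ends.length == 2 ? position(ends[1], numAttributes) : low;
      if (low < 1 || high > numAttributes || low > high) {
        throw new IllegalArgumentException("Invalid range: " + part);
      }
      for (int i = low; i <= high; i++) {
        selected[i - 1] = true; // shift from 1-based to 0-based
      }
    }
    List<Integer> indices = new ArrayList<>();
    for (int i = 0; i < numAttributes; i++) {
      if (selected[i]) {
        indices.add(i);
      }
    }
    return indices;
  }

  /** Resolves "first", "last", or a number to a 1-based position. */
  private static int position(String token, int numAttributes) {
    String t = token.trim();
    if (t.equals("first")) {
      return 1;
    }
    if (t.equals("last")) {
      return numAttributes;
    }
    // NumberFormatException is a subclass of IllegalArgumentException.
    return Integer.parseInt(t);
  }

  public static void main(String[] args) {
    System.out.println(toZeroBased("first-3,5,6-last", 8));
  }
}
```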
  - setAttributeIndicesArray
```
public void setAttributeIndicesArray(int[] attributes)
```
    Sets which attributes are to be processed.
    
    Parameters:
    
    attributes - an array containing indexes of attributes to process. Since the array will typically come from a program, attributes are indexed from 0.
    
    Throws:
    
    java.lang.IllegalArgumentException - if an invalid set of ranges is supplied
  - invertSelectionTipText
```
public java.lang.String invertSelectionTipText()
```
    Returns the tip text for this property.
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - getInvertSelection
```
public boolean getInvertSelection()
```
    Gets whether the supplied columns are to be processed or skipped.
    
    Returns:
    
    true if the supplied columns will be kept
  - setInvertSelection
```
public void setInvertSelection(boolean invert)
```
    Sets whether selected columns should be processed or skipped.
    
    Parameters:
    
    invert - the new invert setting
  - getWordsToKeep
```
public int getWordsToKeep()
```
    Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
    
    Returns:
    
    the target number of words in the output vector (per class if assigned).
  - setWordsToKeep
```
public void setWordsToKeep(int newWordsToKeep)
```
    Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
    
    Parameters:
    
    newWordsToKeep - the target number of words in the output vector (per class if assigned).
  - wordsToKeepTipText
```
public java.lang.String wordsToKeepTipText()
```
    Returns the tip text for this property.
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - getPeriodicPruning
```
public long getPeriodicPruning()
```
    Gets the rate (number of instances) at which the dictionary is periodically pruned.
    
    Returns:
    
    the rate at which the dictionary is periodically pruned
  - setPeriodicPruning
```
public void setPeriodicPruning(long newPeriodicPruning)
```
    Sets the rate (number of instances) at which the dictionary is periodically pruned
    
    Parameters:
    
    newPeriodicPruning - the rate at which the dictionary is periodically pruned
  - periodicPruningTipText
```
public java.lang.String periodicPruningTipText()
```
    Returns the tip text for this property.
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - getTFTransform
```
public boolean getTFTransform()
```
    Gets whether the word frequencies should be transformed into log(1+fij) where fij is the frequency of word i in document(instance) j.
    
    Returns:
    
    true if word frequencies are to be transformed.
  - setTFTransform
```
public void setTFTransform(boolean TFTransform)
```
    Sets whether the word frequencies should be transformed into log(1+fij) where fij is the frequency of word i in document(instance) j.
    
    Parameters:
    
    TFTransform - true if word frequencies are to be transformed.
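    The formula can be tried out directly. The sketch below assumes the natural logarithm, since the text does not state a base; `TfTransformSketch` is a hypothetical name used only for this example.

```java
/** Illustrates the TF transform log(1+fij); assumes the natural logarithm. */
final class TfTransformSketch {

  /** Returns log(1 + fij) for a non-negative raw frequency fij. */
  static double transform(double fij) {
    if (fij < 0) {
      throw new IllegalArgumentException("frequency must be non-negative");
    }
    return Math.log(1.0 + fij);
  }

  public static void main(String[] args) {
    // The transform grows slowly, damping the effect of very frequent words.
    for (double f : new double[] {0, 1, 10, 100}) {
      System.out.println(f + " -> " + transform(f));
    }
  }
}
```

    A frequency of zero maps to zero, so absent words stay absent after the transform.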
  - TFTransformTipText
```
public java.lang.String TFTransformTipText()
```
    Returns the tip text for this property.
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - getAttributeNamePrefix
```
public java.lang.String getAttributeNamePrefix()
```
    Get the attribute name prefix.
    
    Returns:
    
    The current attribute name prefix.
  - setAttributeNamePrefix
```
public void setAttributeNamePrefix(java.lang.String newPrefix)
```
    Set the attribute name prefix.
    
    Parameters:
    
    newPrefix - String to use as the attribute name prefix.
  - attributeNamePrefixTipText
```
public java.lang.String attributeNamePrefixTipText()
```
    Returns the tip text for this property.
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - getIDFTransform
```
public boolean getIDFTransform()
```
    Gets whether the word frequencies in a document should be transformed into:
    fij*log(num of Docs/num of Docs with word i)
    where fij is the frequency of word i in document(instance) j.
    
    Returns:
    
    true if the word frequencies are to be transformed.
  - setIDFTransform
```
public void setIDFTransform(boolean IDFTransform)
```
    Sets whether the word frequencies in a document should be transformed into:
    fij*log(num of Docs/num of Docs with word i)
    where fij is the frequency of word i in document(instance) j.
    
    Parameters:
    
    IDFTransform - true if the word frequencies are to be transformed
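    As a worked illustration of the formula, the sketch below computes fij*log(num of Docs/num of Docs with word i) for a single word. It assumes the natural logarithm, which the text does not specify, and `IdfTransformSketch` is a hypothetical name used only here.

```java
/** Illustrates fij * log(numDocs / docsWithWord); assumes the natural logarithm. */
final class IdfTransformSketch {

  /**
   * @param fij frequency of word i in document j
   * @param numDocs total number of documents
   * @param docsWithWord number of documents containing word i
   */
  static double transform(double fij, int numDocs, int docsWithWord) {
    if (docsWithWord <= 0 || docsWithWord > numDocs) {
      throw new IllegalArgumentException("docsWithWord must be in 1..numDocs");
    }
    // Cast to double so the ratio is not truncated by integer division.
    return fij * Math.log((double) numDocs / docsWithWord);
  }

  public static void main(String[] args) {
    System.out.println(transform(3.0, 100, 100)); // word occurs in every document
    System.out.println(transform(3.0, 100, 2));   // word occurs in few documents
  }
}
```

    A word that occurs in every document gets a weight of zero, while rarer words are weighted up.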
  - IDFTransformTipText
```
public java.lang.String IDFTransformTipText()
```
    Returns the tip text for this property.
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - getNormalize
```
public boolean getNormalize()
```
    Get whether word frequencies for a document should be normalized
    
    Returns:
    
    true if word frequencies should be normalized
  - setNormalize
```
public void setNormalize(boolean n)
```
    Set whether word frequencies for a document should be normalized
    
    Parameters:
    
    n - true if word frequencies should be normalized
  - normalizeTipText
```
public java.lang.String normalizeTipText()
```
    Tip text for this property
    
    Returns:
    
    the tip text for this property
  - normalizeDocLengthTipText
```
public java.lang.String normalizeDocLengthTipText()
```
    Returns the tip text for this property.
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - getLowerCaseTokens
```
public boolean getLowerCaseTokens()
```
    Gets whether the tokens are to be downcased or not.
    
    Returns:
    
    true if the tokens are to be downcased.
  - setLowerCaseTokens
```
public void setLowerCaseTokens(boolean downCaseTokens)
```
    Sets whether the tokens are to be downcased or not. (Doesn't affect non-alphabetic characters in tokens).
    
    Parameters:
    
    downCaseTokens - should be true if only lower case tokens are to be formed.
  - lowerCaseTokensTipText
```
public java.lang.String lowerCaseTokensTipText()
```
    Returns the tip text for this property.
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - doNotOperateOnPerClassBasisTipText
```
public java.lang.String doNotOperateOnPerClassBasisTipText()
```
    Returns the tip text for this property.
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - getDoNotOperateOnPerClassBasis
```
public boolean getDoNotOperateOnPerClassBasis()
```
    Get the DoNotOperateOnPerClassBasis value.
    
    Returns:
    
    the DoNotOperateOnPerClassBasis value.
  - setDoNotOperateOnPerClassBasis
```
public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
```
    Set the DoNotOperateOnPerClassBasis value.
    
    Parameters:
    
    newDoNotOperateOnPerClassBasis - The new DoNotOperateOnPerClassBasis value.
  - minTermFreqTipText
```
public java.lang.String minTermFreqTipText()
```
    Returns the tip text for this property.
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - getMinTermFreq
```
public int getMinTermFreq()
```
    Get the MinTermFreq value.
    
    Returns:
    
    the MinTermFreq value.
  - setMinTermFreq
```
public void setMinTermFreq(int newMinTermFreq)
```
    Set the MinTermFreq value.
    
    Parameters:
    
    newMinTermFreq - The new MinTermFreq value.
  - getStemmer
```
public Stemmer getStemmer()
```
    Returns the current stemming algorithm, null if none is used.
    
    Returns:
    
    the current stemming algorithm, null if none set
  - setStemmer
```
public void setStemmer(Stemmer value)
```
    Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
    
    Parameters:
    
    value - the configured stemming algorithm, or null
    
    See Also:
    
    NullStemmer
  - stemmerTipText
```
public java.lang.String stemmerTipText()
```
    Returns the tip text for this property.
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - getStopwordsHandler
```
public StopwordsHandler getStopwordsHandler()
```
    Gets the stopwords handler.
    
    Returns:
    
    the stopwords handler
  - setStopwordsHandler
```
public void setStopwordsHandler(StopwordsHandler value)
```
    Sets the stopwords handler to use.
    
    Parameters:
    
    value - the stopwords handler; if null, the Null stopwords handler is used
  - stopwordsHandlerTipText
```
public java.lang.String stopwordsHandlerTipText()
```
    Returns the tip text for this property.
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - getTokenizer
```
public Tokenizer getTokenizer()
```
    Returns the current tokenizer algorithm.
    
    Returns:
    
    the current tokenizer algorithm
  - setTokenizer
```
public void setTokenizer(Tokenizer value)
```
    Sets the tokenizer algorithm to use.
    
    Parameters:
    
    value - the configured tokenizing algorithm
  - tokenizerTipText
```
public java.lang.String tokenizerTipText()
```
    Returns the tip text for this property.
    
    Returns:
    
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - listOptions
```
public java.util.Enumeration<Option> listOptions()
```
    Returns an enumeration describing the available options.
    
    Specified by:
    
    listOptions in interface OptionHandler
    
    Returns:
    
    an enumeration of all the available options
  - getOptions
```
public java.lang.String[] getOptions()
```
    Gets the current settings of the DictionaryBuilder
    
    Specified by:
    
    getOptions in interface OptionHandler
    
    Returns:
    
    an array of strings suitable for passing to setOptions
  - setOptions
```
public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
```
    Parses a given list of options.
    Valid options are:
```
 -C
  Output word counts rather than boolean word presence.
 
```
```
 -R <index1,index2-index4,...>
  Specify list of string attributes to convert to words (as weka Range).
  (default: select all string attributes)
 
```
```
 -V
  Invert matching sense of column indexes.
 
```
```
 -P <attribute name prefix>
  Specify a prefix for the created attribute names.
  (default: "")
 
```
```
 -W <number of words to keep>
  Specify approximate number of word fields to create.
  Surplus words will be discarded.
  (default: 1000)
 
```
```
 -prune-rate <rate as a percentage of dataset>
  Specify the rate (e.g., every 10% of the input dataset) at which to periodically prune the dictionary.
  -W prunes after creating a full dictionary. You may not have enough memory for this approach.
  (default: no periodic pruning)
 
```
```
 -T
  Transform the word frequencies into log(1+fij)
  where fij is the frequency of word i in jth document(instance).
 
```
```
 -I
  Transform each word frequency into:
  fij*log(num of Documents/num of documents containing word i)
    where fij is the frequency of word i in jth document(instance)
 
```
```
 -N
  Whether to 0=not normalize/1=normalize all data/2=normalize test data only
  to average length of training documents (default 0=don't normalize).
 
```
```
 -L
  Convert all tokens to lowercase before adding to the dictionary.
 
```
```
 -stopwords-handler
  The stopwords handler to use (default Null).
 
```
```
 -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.
 
```
```
 -M <int>
  The minimum term frequency (default = 1).
 
```
```
 -O
  If this is set, the maximum number of words and the
  minimum term frequency is not enforced on a per-class
  basis but based on the documents in all the classes
  (even if a class attribute is set).
 
```
```
 -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)
 
```
    Specified by:
    
    setOptions in interface OptionHandler
    
    Parameters:
    
    options - the list of options as an array of strings
    
    Throws:
    
    java.lang.Exception - if an option is not supported
  - setup
```
public void setup(Instances inputFormat)
           throws java.lang.Exception
```
    Throws:
    
    java.lang.Exception
  - getInputFormat
```
public Instances getInputFormat()
```
    Gets the currently set input format
    
    Returns:
    
    the current input format
  - readyToVectorize
```
public boolean readyToVectorize()
```
    Returns true if this DictionaryBuilder is ready to vectorize incoming instances
    
    Returns:
    
    true if we can vectorize incoming instances
  - getVectorizedFormat
```
public Instances getVectorizedFormat()
                              throws java.lang.Exception
```
    Get the output format
    
    Returns:
    
    the output format
    
    Throws:
    
    java.lang.Exception - if there is no input format set and/or the dictionary has not been constructed yet.
  - vectorizeBatch
```
public Instances vectorizeBatch(Instances batch,
                                boolean setAvgDocLength)
                         throws java.lang.Exception
```
    Convert a batch of instances
    
    Parameters:
    
    batch - the batch to convert.
    
    setAvgDocLength - true to compute and set the average document length for this DictionaryBuilder from the batch - this uses the final pruned dictionary when computing doc lengths. When vectorizing non-training batches, and normalization has been turned on, this should be set to false.
    
    Returns:
    
    the converted batch
    
    Throws:
    
    java.lang.Exception - if there is no input format set and/or the dictionary has not been constructed yet.
  - vectorizeInstance
```
public Instance vectorizeInstance(Instance input)
                           throws java.lang.Exception
```
    Convert an input instance. Any string attributes not being vectorized do not have their values retained in memory (i.e. only the string values for the instance being vectorized are held in memory).
    
    Parameters:
    
    input - the input instance
    
    Returns:
    
    a converted instance
    
    Throws:
    
    java.lang.Exception - if there is no input format set and/or the dictionary has not been constructed yet.
  - vectorizeInstance
```
public Instance vectorizeInstance(Instance input,
                                  boolean retainStringAttValuesInMemory)
                           throws java.lang.Exception
```
    Convert an input instance.
    
    Parameters:
    
    input - the input instance
    
    retainStringAttValuesInMemory - true if the values of string attributes not being vectorized should be retained in memory
    
    Returns:
    
    a converted instance
    
    Throws:
    
    java.lang.Exception - if there is no input format set and/or the dictionary has not been constructed yet
  - processInstance
```
public void processInstance(Instance inst)
```
    Process an instance by tokenizing string attributes and updating the dictionary.
    
    Parameters:
    
    inst - the instance to process
  - reset
```
public void reset()
```
    Clear the dictionary(s)
  - getDictionaries
```
public java.util.Map<java.lang.String,int[]>[] getDictionaries(boolean minFrequencyPrune)
                                                        throws WekaException
```
    Get the current dictionary(s) (one per class for nominal class, if set). These are the dictionaries that are built/updated when processInstance() is called. The finalized dictionary (used for vectorization) can be obtained by calling finalizeDictionary() - this returns a consolidated (over classes) and pruned final dictionary.
    
    Parameters:
    
    minFrequencyPrune - prune the dictionaries of low frequency terms before returning them
    
    Returns:
    
    the dictionaries
    
    Throws:
    
    WekaException
  - aggregate
```
public DictionaryBuilder aggregate(DictionaryBuilder toAgg)
                            throws java.lang.Exception
```
    Description copied from interface: Aggregateable
    
    Aggregate an object with this one
    
    Specified by:
    
    aggregate in interface Aggregateable<DictionaryBuilder>
    
    Parameters:
    
    toAgg - the object to aggregate
    
    Returns:
    
    the result of aggregation
    
    Throws:
    
    java.lang.Exception - if the supplied object can't be aggregated for some reason
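    Conceptually, aggregating two dictionaries means combining the counts of terms they share. The sketch below shows one way this could look with plain maps. It is not the implementation of this method: the `int[]` layout of {term count, document count} is an assumption (the signatures here only show the type `java.util.Map<java.lang.String,int[]>`), and `AggregateSketch` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrates merging two term dictionaries by summing their counts. */
final class AggregateSketch {

  /** Returns a new dictionary holding the element-wise sum of both inputs. */
  static Map<String, int[]> merge(Map<String, int[]> first, Map<String, int[]> second) {
    Map<String, int[]> merged = new LinkedHashMap<>();
    addAll(merged, first);
    addAll(merged, second);
    return merged;
  }

  /** Adds every entry of source into target without modifying source. */
  private static void addAll(Map<String, int[]> target, Map<String, int[]> source) {
    for (Map.Entry<String, int[]> entry : source.entrySet()) {
      // Assumed layout: [0] = term count, [1] = document count.
      int[] counts = target.computeIfAbsent(entry.getKey(), k -> new int[2]);
      counts[0] += entry.getValue()[0];
      counts[1] += entry.getValue()[1];
    }
  }

  public static void main(String[] args) {
    Map<String, int[]> a = new LinkedHashMap<>();
    a.put("cat", new int[] {3, 2});
    Map<String, int[]> b = new LinkedHashMap<>();
    b.put("cat", new int[] {1, 1});
    b.put("dog", new int[] {4, 3});
    merge(a, b).forEach((term, c) -> System.out.println(term + " " + c[0] + " " + c[1]));
  }
}
```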
  - finalizeAggregation
```
public void finalizeAggregation()
                         throws java.lang.Exception
```
    Description copied from interface: Aggregateable
    
    Call to complete the aggregation process. Allows implementers to do any final processing based on how many objects were aggregated.
    
    Specified by:
    
    finalizeAggregation in interface Aggregateable<DictionaryBuilder>
    
    Throws:
    
    java.lang.Exception - if the aggregation can't be finalized for some reason
  - finalizeDictionary
```
public java.util.Map<java.lang.String,int[]> finalizeDictionary()
                                                         throws java.lang.Exception
```
    Performs final pruning and consolidation according to the number of words to keep property. Finalization is performed just once, subsequent calls to this method return the finalized dictionary computed on the first call (unless reset() has been called in between).
    
    Returns:
    
    the consolidated and pruned final dictionary, or null if the input format did not contain any string attributes within the selected range to process
    
    Throws:
    
    java.lang.Exception - if a problem occurs
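    The description above does not spell out the pruning rule, so the following is only a sketch of one plausible reading: keep every term whose count reaches both the minimum term frequency and the count of the N-th most frequent term. Under that assumption, ties can leave slightly more than N terms, which would fit the "attempt to keep" wording. `PruneSketch` and its parameters are hypothetical names for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrates threshold-based pruning to roughly wordsToKeep terms. */
final class PruneSketch {

  /** Keeps terms whose count is at least the computed threshold. */
  static Map<String, Integer> prune(Map<String, Integer> counts, int wordsToKeep, int minTermFreq) {
    if (wordsToKeep < 1) {
      throw new IllegalArgumentException("wordsToKeep must be at least 1");
    }
    // Sort all counts in ascending order to find the count of the N-th most frequent term.
    int[] sorted = counts.values().stream().mapToInt(Integer::intValue).sorted().toArray();
    int threshold = sorted.length < wordsToKeep
        ? minTermFreq
        : Math.max(minTermFreq, sorted[sorted.length - wordsToKeep]);
    Map<String, Integer> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      if (entry.getValue() >= threshold) {
        kept.put(entry.getKey(), entry.getValue());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    counts.put("a", 5);
    counts.put("b", 3);
    counts.put("c", 3);
    counts.put("d", 1);
    System.out.println(prune(counts, 2, 1).keySet());
  }
}
```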
  - loadDictionary
```
public void loadDictionary(java.lang.String filename,
                           boolean plainText)
                    throws java.io.IOException
```
    Load a dictionary from a file
    
    Parameters:
    
    filename - the file to load from
    
    plainText - true if the dictionary is in text format
    
    Throws:
    
    java.io.IOException - if a problem occurs
  - loadDictionary
```
public void loadDictionary(java.io.File toLoad,
                           boolean plainText)
                    throws java.io.IOException
```
    Load a dictionary from a file
    
    Parameters:
    
    toLoad - the file to load from
    
    plainText - true if the dictionary is in text format
    
    Throws:
    
    java.io.IOException - if a problem occurs
  - loadDictionary
```
public void loadDictionary(java.io.Reader reader)
                    throws java.io.IOException
```
    Load a textual dictionary from a reader
    
    Parameters:
    
    reader - the reader to read from
    
    Throws:
    
    java.io.IOException - if a problem occurs
  - loadDictionary
```
public void loadDictionary(java.io.InputStream is)
                    throws java.io.IOException
```
    Load a binary dictionary from an input stream
    
    Parameters:
    
    is - the input stream to read from
    
    Throws:
    
    java.io.IOException - if a problem occurs
  - saveDictionary
```
public void saveDictionary(java.lang.String filename,
                           boolean plainText)
                    throws java.io.IOException
```
    Save the dictionary
    
    Parameters:
    
    filename - the file to save to
    
    plainText - true if the dictionary should be saved in text format
    
    Throws:
    
    java.io.IOException - if a problem occurs
  - saveDictionary
```
public void saveDictionary(java.io.File toSave,
                           boolean plainText)
                    throws java.io.IOException
```
    Save a dictionary
    
    Parameters:
    
    toSave - the file to save to
    
    plainText - true if the dictionary should be saved in text format
    
    Throws:
    
    java.io.IOException - if a problem occurs
  - saveDictionary
```
public void saveDictionary(java.io.Writer writer)
                    throws java.io.IOException
```
    Save the dictionary in textual format
    
    Parameters:
    
    writer - the writer to write to
    
    Throws:
    
    java.io.IOException - if a problem occurs
  - saveDictionary
```
public void saveDictionary(java.io.OutputStream os)
                    throws java.io.IOException
```
    Save the dictionary in binary form
    
    Parameters:
    
    os - the output stream to write to
    
    Throws:
    
    java.io.IOException - if a problem occurs
