DictionarySaver

java.lang.Object
- weka.core.converters.AbstractSaver
  - weka.core.converters.AbstractFileSaver
    - weka.core.converters.DictionarySaver

All Implemented Interfaces:: java.io.Serializable, CapabilitiesHandler, CapabilitiesIgnorer, BatchConverter, FileSourcedConverter, IncrementalConverter, Saver, EnvironmentHandler, OptionHandler, RevisionHandler

public class DictionarySaver
extends AbstractFileSaver
implements BatchConverter, IncrementalConverter

Writes a dictionary constructed from string attributes in incoming instances to a destination.

Valid options are:

 -binary-dict
  Save as a binary serialized dictionary

 -R <range>
  Specify range of attributes to act on. This is a comma separated list of attribute
  indices, with "first" and "last" valid values.

 -V
  Set attributes selection mode. If false, only selected attributes in the range will
  be worked on. If true, only non-selected attributes will be processed

 -L
  Convert all tokens to lowercase when matching against dictionary entries.

 -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.

 -stopwords-handler <spec>
  The stopwords handler to use (default = Null)

 -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)

 -P <integer>
  Prune the dictionary every x instances
  (default = 0 - i.e. no periodic pruning)

 -W <integer>
  The number of words (per class if there is a class attribute assigned) to attempt to keep.

 -M <integer>
  The minimum term frequency to use when pruning the dictionary
  (default = 1).

 -O
  If this is set, the maximum number of words and the
  minimum term frequency is not enforced on a per-class
  basis but based on the documents in all the classes
  (even if a class attribute is set).

 -sort
  Sort the dictionary alphabetically

 -i <the input file>
  The input file

 -o <the output file>
  The output file

Version:: $Revision: 12690 $
Author:: Mark Hall (mhall{[at]}pentaho{[dot]}com)
See Also:: Serialized Form

Field Summary
- Fields inherited from interface weka.core.converters.Saver
  BATCH, INCREMENTAL, NONE

Constructor Summary

Constructors
Constructor and Description

DictionarySaver()

Method Summary

Modifier and Type	Method and Description
`java.lang.String`	`getAttributeIndices()` Gets the current range selection.
`Capabilities`	`getCapabilities()` Returns the Capabilities of this saver.
`boolean`	`getDoNotOperateOnPerClassBasis()` Get the DoNotOperateOnPerClassBasis value.
`java.lang.String`	`getFileDescription()` Returns a description of the file type.
`boolean`	`getInvertSelection()` Gets whether the supplied columns are to be processed or skipped.
`boolean`	`getKeepDictionarySorted()` Get whether to keep the dictionary sorted alphabetically or not
`boolean`	`getLowerCaseTokens()` Gets whether the tokens are to be downcased or not.
`int`	`getMinTermFreq()` Get the MinTermFreq value.
`long`	`getPeriodicPruning()` Gets the rate at which the dictionary is periodically pruned, as a number of instances (0 means no periodic pruning).
`java.lang.String`	`getRevision()` Returns the revision string.
`boolean`	`getSaveBinaryDictionary()` Get whether to save the dictionary as a binary serialized dictionary, rather than a plain text one
`Stemmer`	`getStemmer()` Returns the current stemming algorithm, null if none is used.
`StopwordsHandler`	`getStopwordsHandler()` Gets the stopwords handler.
`Tokenizer`	`getTokenizer()` Returns the current tokenizer algorithm.
`int`	`getWordsToKeep()` Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
`java.lang.String`	`globalInfo()` Returns a string describing this Saver.
`static void`	`main(java.lang.String[] args)`
`void`	`resetOptions()` resets the options
`void`	`resetWriter()` Sets the writer to null.
`void`	`setAttributeIndices(java.lang.String rangeList)` Sets which attributes are to be worked on.
`void`	`setDestination(java.io.OutputStream output)` Sets the destination output stream.
`void`	`setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)` Set the DoNotOperateOnPerClassBasis value.
`void`	`setInvertSelection(boolean invert)` Sets whether selected columns should be processed or skipped.
`void`	`setKeepDictionarySorted(boolean sorted)` Set whether to keep the dictionary sorted alphabetically or not
`void`	`setLowerCaseTokens(boolean downCaseTokens)` Sets whether the tokens are to be downcased or not.
`void`	`setMinTermFreq(int newMinTermFreq)` Set the MinTermFreq value.
`void`	`setPeriodicPruning(long newPeriodicPruning)` Sets the rate at which the dictionary is periodically pruned, as a number of instances (0 means no periodic pruning).
`void`	`setSaveBinaryDictionary(boolean binary)` Set whether to save the dictionary as a binary serialized dictionary, rather than a plain text one
`void`	`setStemmer(Stemmer value)` Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
`void`	`setStopwordsHandler(StopwordsHandler value)` Sets the stopwords handler to use.
`void`	`setTokenizer(Tokenizer value)` Sets the tokenizer algorithm to use.
`void`	`setWordsToKeep(int newWordsToKeep)` Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
`void`	`writeBatch()` Writes to a file in batch mode. To be overridden.
`void`	`writeIncremental(Instance inst)` Method for incremental saving.

Methods inherited from class weka.core.converters.AbstractFileSaver
cancel, filePrefix, getFileExtension, getFileExtensions, getOptions, getUseRelativePath, getWriter, listOptions, retrieveDir, retrieveFile, runFileSaver, setDestination, setDir, setDirAndPrefix, setEnvironment, setFile, setFilePrefix, setOptions, setUseRelativePath, useRelativePathTipText

Methods inherited from class weka.core.converters.AbstractSaver
doNotCheckCapabilitiesTipText, getDoNotCheckCapabilities, getInstances, getWriteMode, resetStructure, setDoNotCheckCapabilities, setInstances, setRetrieval, setStructure

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface weka.core.OptionHandler
makeCopy

Constructor Detail
DictionarySaver
```
public DictionarySaver()
```

Method Detail

globalInfo
```
public java.lang.String globalInfo()
```
Returns a string describing this Saver.

Returns:

a description of the Saver suitable for displaying in the explorer/experimenter gui

setSaveBinaryDictionary

@OptionMetadata(displayName="Save dictionary in binary form",
                description="Save as a binary serialized dictionary",
                commandLineParamName="binary-dict",
                commandLineParamSynopsis="-binary-dict",
                commandLineParamIsFlag=true,
                displayOrder=2)
public void setSaveBinaryDictionary(boolean binary)

Set whether to save the dictionary as a binary serialized dictionary, rather than a plain text one

Parameters:: binary - true if the dictionary is to be saved as binary rather than plain text
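
The difference between the two output modes can be illustrated with a minimal, standard-library-only sketch. It is not Weka's on-disk format: the class `DictionaryFormatSketch`, the `word,count` text layout, and the use of a `TreeMap` as the dictionary are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: the layouts below are hypothetical, not Weka's formats.
class DictionaryFormatSketch {

  // Plain text: one "word,count" line per entry, in map iteration order.
  static String toText(Map<String, Integer> dict) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Integer> e : dict.entrySet()) {
      sb.append(e.getKey()).append(',').append(e.getValue()).append('\n');
    }
    return sb.toString();
  }

  // Binary: standard Java serialization of the whole map object.
  static byte[] toBinary(TreeMap<String, Integer> dict) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(dict);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Reads back a map written by toBinary.
  @SuppressWarnings("unchecked")
  static TreeMap<String, Integer> fromBinary(byte[] data) {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
      return (TreeMap<String, Integer>) in.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

A text file of this kind can be inspected and edited by hand, while a serialized object can only be read back by Java code that has the same classes available.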

getSaveBinaryDictionary
```
public boolean getSaveBinaryDictionary()
```
Get whether to save the dictionary as a binary serialized dictionary, rather than a plain text one

Returns:

true if the dictionary is to be saved as binary rather than plain text

getAttributeIndices
```
public java.lang.String getAttributeIndices()
```
Gets the current range selection.

Returns:

a string containing a comma separated list of ranges

setAttributeIndices

@OptionMetadata(displayName="Range of attributes to operate on",
                description="Specify range of attributes to act on. This is a comma separated list of attribute\nindices, with \"first\" and \"last\" valid values.",
                commandLineParamName="R",
                commandLineParamSynopsis="-R <range>",
                displayOrder=4)
public void setAttributeIndices(java.lang.String rangeList)

Sets which attributes are to be worked on.

Parameters:: rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
eg: first-3,5,6-last
Throws:: java.lang.IllegalArgumentException - if an invalid range list is supplied
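
To make the range-list syntax concrete, the following sketch expands a string such as `first-3,5,6-last` into 0-based indices. It is a stand-alone illustration, not Weka's own range parser; `RangeListSketch`, `toIndex`, and `parse` are hypothetical names, and rejecting reversed ranges is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative sketch only; this is not Weka's own range parser.
class RangeListSketch {

  // Resolves "first", "last", or a 1-based number to a 0-based index.
  static int toIndex(String token, int numAttributes) {
    String t = token.trim();
    int oneBased;
    if (t.equalsIgnoreCase("first")) {
      oneBased = 1;
    } else if (t.equalsIgnoreCase("last")) {
      oneBased = numAttributes;
    } else {
      // NumberFormatException is itself an IllegalArgumentException.
      oneBased = Integer.parseInt(t);
    }
    if (oneBased < 1 || oneBased > numAttributes) {
      throw new IllegalArgumentException("Index out of range: " + token);
    }
    return oneBased - 1;
  }

  // Expands a list such as "first-3,5,6-last" into sorted 0-based indices.
  static List<Integer> parse(String rangeList, int numAttributes) {
    TreeSet<Integer> selected = new TreeSet<>();
    for (String part : rangeList.split(",")) {
      int dash = part.indexOf('-');
      int from = toIndex(dash < 0 ? part : part.substring(0, dash), numAttributes);
      int to = toIndex(dash < 0 ? part : part.substring(dash + 1), numAttributes);
      if (from > to) {
        throw new IllegalArgumentException("Reversed range: " + part);
      }
      for (int i = from; i <= to; i++) {
        selected.add(i);
      }
    }
    return new ArrayList<>(selected);
  }
}
```

With eight attributes, `RangeListSketch.parse("first-3,5,6-last", 8)` selects every attribute except the fourth, since user-facing indices start at 1.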

getInvertSelection
```
public boolean getInvertSelection()
```
Gets whether the supplied columns are to be processed or skipped.

Returns:

true if the supplied columns will be kept

setInvertSelection

@OptionMetadata(displayName="Invert selection",
                description="Set attributes selection mode. If false, only selected attributes in the range will\nbe worked on. If true, only non-selected attributes will be processed",
                commandLineParamName="V",
                commandLineParamSynopsis="-V",
                commandLineParamIsFlag=true,
                displayOrder=5)
public void setInvertSelection(boolean invert)

Sets whether selected columns should be processed or skipped.

Parameters:: invert - the new invert setting

getLowerCaseTokens
```
public boolean getLowerCaseTokens()
```
Gets whether the tokens are to be downcased or not.

Returns:

true if the tokens are to be downcased.

setLowerCaseTokens

@OptionMetadata(displayName="Lower case tokens",
                description="Convert all tokens to lowercase when matching against dictionary entries.",
                commandLineParamName="L",
                commandLineParamSynopsis="-L",
                commandLineParamIsFlag=true,
                displayOrder=10)
public void setLowerCaseTokens(boolean downCaseTokens)

Sets whether the tokens are to be downcased or not. (Doesn't affect non-alphabetic characters in tokens).

Parameters:: downCaseTokens - should be true if only lower case tokens are to be formed.

setStemmer

@OptionMetadata(displayName="Stemmer to use",
                description="The stemming algorithm (classname plus parameters) to use.",
                commandLineParamName="stemmer",
                commandLineParamSynopsis="-stemmer <spec>",
                displayOrder=11)
public void setStemmer(Stemmer value)

Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).

Parameters:: value - the configured stemming algorithm, or null
See Also:: NullStemmer

getStemmer
```
public Stemmer getStemmer()
```
Returns the current stemming algorithm, null if none is used.

Returns:

the current stemming algorithm, null if none set

setStopwordsHandler

@OptionMetadata(displayName="Stop words handler",
                description="The stopwords handler to use (default = Null)",
                commandLineParamName="stopwords-handler",
                commandLineParamSynopsis="-stopwords-handler <spec>",
                displayOrder=12)
public void setStopwordsHandler(StopwordsHandler value)

Sets the stopwords handler to use.

Parameters:: value - the stopwords handler; if null, the Null handler is used

getStopwordsHandler
```
public StopwordsHandler getStopwordsHandler()
```
Gets the stopwords handler.

Returns:

the stopwords handler

setTokenizer

@OptionMetadata(displayName="Tokenizer",
                description="The tokenizing algorithm (classname plus parameters) to use.\n(default: weka.core.tokenizers.WordTokenizer)",
                commandLineParamName="tokenizer",
                commandLineParamSynopsis="-tokenizer <spec>",
                displayOrder=13)
public void setTokenizer(Tokenizer value)

Sets the tokenizer algorithm to use.

Parameters:: value - the configured tokenizing algorithm
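
The lowercase, stemmer, stopwords, and tokenizer options together describe a small text-processing pipeline. The sketch below shows one way such a pipeline could be composed using only the standard library. The class `TokenPipelineSketch`, the delimiter string, and the order of the steps (lowercase, then stem, then stopword check) are assumptions for illustration and do not come from this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.function.UnaryOperator;

// Illustrative pipeline; the step order is an assumption, not Weka's code.
class TokenPipelineSketch {

  static List<String> process(String text, String delimiters, boolean lowerCase,
                              Set<String> stopwords, UnaryOperator<String> stemmer) {
    List<String> terms = new ArrayList<>();
    StringTokenizer tokens = new StringTokenizer(text, delimiters);
    while (tokens.hasMoreTokens()) {
      String word = tokens.nextToken();
      if (lowerCase) {
        word = word.toLowerCase(Locale.ROOT);
      }
      word = stemmer.apply(word);      // identity function means "no stemming"
      if (stopwords.contains(word)) {
        continue;                      // stopwords never reach the dictionary
      }
      terms.add(word);
    }
    return terms;
  }
}
```

Passing `w -> w` as the stemmer and an empty set as the stopword list reduces the pipeline to plain tokenizing, which corresponds to leaving those options at their defaults.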

getTokenizer
```
public Tokenizer getTokenizer()
```
Returns the current tokenizer algorithm.

Returns:

the current tokenizer algorithm

getPeriodicPruning
```
public long getPeriodicPruning()
```
Gets the rate at which the dictionary is periodically pruned, as a number of instances (0 means no periodic pruning).

Returns:

the rate at which the dictionary is periodically pruned

setPeriodicPruning

@OptionMetadata(displayName="Periodic pruning rate",
                description="Prune the dictionary every x instances\n(default = 0 - i.e. no periodic pruning)",
                commandLineParamName="P",
                commandLineParamSynopsis="-P <integer>",
                displayOrder=14)
public void setPeriodicPruning(long newPeriodicPruning)

Sets the rate at which the dictionary is periodically pruned, as a number of instances (0 means no periodic pruning).

Parameters:: newPeriodicPruning - the rate at which the dictionary is periodically pruned
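
The `-P` option text ("prune the dictionary every x instances") can be pictured with a minimal sketch. `PeriodicPruningSketch` is a hypothetical class, and the choice to drop terms below a minimum frequency at each pruning pass is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: what gets removed at each pass is an assumption.
class PeriodicPruningSketch {
  private final Map<String, Integer> counts = new HashMap<>();
  private final long pruneEvery;  // 0 disables periodic pruning
  private final int minTermFreq;
  private long instancesSeen = 0;

  PeriodicPruningSketch(long pruneEvery, int minTermFreq) {
    this.pruneEvery = pruneEvery;
    this.minTermFreq = minTermFreq;
  }

  // Adds the terms of one instance, then prunes if the period has elapsed.
  void add(Iterable<String> terms) {
    for (String term : terms) {
      counts.merge(term, 1, Integer::sum);
    }
    instancesSeen++;
    if (pruneEvery > 0 && instancesSeen % pruneEvery == 0) {
      counts.values().removeIf(c -> c < minTermFreq);
    }
  }

  Map<String, Integer> counts() {
    return counts;
  }
}
```

Periodic pruning trades accuracy for memory: a term that is rare early on but common later can be discarded before its count has a chance to grow.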

getWordsToKeep
```
public int getWordsToKeep()
```
Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.

Returns:

the target number of words in the output vector (per class if assigned).

setWordsToKeep

@OptionMetadata(displayName="Number of words to attempt to keep",
                description="The number of words (per class if there is a class attribute assigned) to attempt to keep.",
                commandLineParamName="W",
                commandLineParamSynopsis="-W <integer>",
                displayOrder=15)
public void setWordsToKeep(int newWordsToKeep)

Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.

Parameters:: newWordsToKeep - the target number of words in the output vector (per class if assigned).

getMinTermFreq
```
public int getMinTermFreq()
```
Get the MinTermFreq value.

Returns:

the MinTermFreq value.

setMinTermFreq

@OptionMetadata(displayName="Minimum term frequency",
                description="The minimum term frequency to use when pruning the dictionary\n(default = 1).",
                commandLineParamName="M",
                commandLineParamSynopsis="-M <integer>",
                displayOrder=16)
public void setMinTermFreq(int newMinTermFreq)

Set the MinTermFreq value.

Parameters:: newMinTermFreq - The new MinTermFreq value.
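
The interaction between the words-to-keep target and the minimum term frequency can be sketched as follows. This is one plausible reading of "attempt to keep", in which words tied at the cutoff count are all retained, so the result may exceed the target. `WordsToKeepSketch` and its `prune` method are hypothetical and are not taken from Weka.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: tie handling at the cutoff is an assumption.
class WordsToKeepSketch {

  static Map<String, Integer> prune(Map<String, Integer> counts,
                                    int wordsToKeep, int minTermFreq) {
    // Sort the counts in descending order to find the cutoff frequency.
    List<Integer> sorted = new ArrayList<>(counts.values());
    sorted.sort(Collections.reverseOrder());

    int threshold = minTermFreq;
    if (wordsToKeep > 0 && sorted.size() > wordsToKeep) {
      // The count of the N-th most frequent word, but never below the minimum.
      threshold = Math.max(minTermFreq, sorted.get(wordsToKeep - 1));
    }

    Map<String, Integer> kept = new TreeMap<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() >= threshold) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }
}
```

With counts `a=5, b=3, c=3, d=1` and a target of two words, this sketch keeps three words because `b` and `c` tie at the cutoff.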

getDoNotOperateOnPerClassBasis
```
public boolean getDoNotOperateOnPerClassBasis()
```
Get the DoNotOperateOnPerClassBasis value.

Returns:

the DoNotOperateOnPerClassBasis value.

setDoNotOperateOnPerClassBasis

@OptionMetadata(displayName="Do not operate on a per-class basis",
                description="If this is set, the maximum number of words and the\nminimum term frequency is not enforced on a per-class\nbasis but based on the documents in all the classes\n(even if a class attribute is set).",
                commandLineParamName="O",
                commandLineParamSynopsis="-O",
                commandLineParamIsFlag=true,
                displayOrder=17)
public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)

Set the DoNotOperateOnPerClassBasis value.

Parameters:: newDoNotOperateOnPerClassBasis - The new DoNotOperateOnPerClassBasis value.
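
The effect of this flag can be illustrated with a small sketch that applies a minimum term frequency either within each class or across all classes combined. `PerClassSketch` is a hypothetical class, and keeping the union of the per-class survivors is an assumption of this sketch rather than documented behaviour.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: combining per-class results by union is an assumption.
class PerClassSketch {

  static Set<String> select(Map<String, Map<String, Integer>> countsByClass,
                            int minTermFreq, boolean doNotOperateOnPerClassBasis) {
    Set<String> kept = new TreeSet<>();
    if (doNotOperateOnPerClassBasis) {
      // Pool the counts from every class, then apply the threshold once.
      Map<String, Integer> total = new HashMap<>();
      for (Map<String, Integer> counts : countsByClass.values()) {
        counts.forEach((word, c) -> total.merge(word, c, Integer::sum));
      }
      total.forEach((word, c) -> {
        if (c >= minTermFreq) kept.add(word);
      });
    } else {
      // Apply the threshold inside each class and keep any survivor.
      for (Map<String, Integer> counts : countsByClass.values()) {
        counts.forEach((word, c) -> {
          if (c >= minTermFreq) kept.add(word);
        });
      }
    }
    return kept;
  }
}
```

A word spread thinly across several classes can fail every per-class threshold yet pass the pooled one, which is the case the flag is meant to cover.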

setKeepDictionarySorted

@OptionMetadata(displayName="Sort dictionary",
                description="Sort the dictionary alphabetically",
                commandLineParamName="sort",
                commandLineParamSynopsis="-sort",
                commandLineParamIsFlag=true,
                displayOrder=18)
public void setKeepDictionarySorted(boolean sorted)

Set whether to keep the dictionary sorted alphabetically or not

Parameters:: sorted - true to keep the dictionary sorted

getKeepDictionarySorted
```
public boolean getKeepDictionarySorted()
```
Get whether to keep the dictionary sorted alphabetically or not

Returns:

true to keep the dictionary sorted

getCapabilities
```
public Capabilities getCapabilities()
```
Returns the Capabilities of this saver.

Specified by:

getCapabilities in interface CapabilitiesHandler

Overrides:

getCapabilities in class AbstractSaver

Returns:

the capabilities of this object

See Also:

Capabilities

getFileDescription
```
public java.lang.String getFileDescription()
```
Description copied from class: AbstractFileSaver

To be overridden.

Specified by:

getFileDescription in interface FileSourcedConverter

Specified by:

getFileDescription in class AbstractFileSaver

Returns:

the file type description.

writeIncremental
```
public void writeIncremental(Instance inst)
                      throws java.io.IOException
```
Description copied from class: AbstractSaver

Method for incremental saving. Standard behaviour: no incremental saving is possible, therefore throw an IOException. An incremental saving process is stopped by calling this method with null.

Specified by:

writeIncremental in interface Saver

Overrides:

writeIncremental in class AbstractSaver

Parameters:

inst - the instance to be saved

Throws:

java.io.IOException - if the instance cannot be written to the specified destination

writeBatch
```
public void writeBatch()
                throws java.io.IOException
```
Description copied from class: AbstractSaver

Writes to a file in batch mode. To be overridden.

Specified by:

writeBatch in interface Saver

Specified by:

writeBatch in class AbstractSaver

Throws:

java.io.IOException - if writing is not possible

resetOptions
```
public void resetOptions()
```
Description copied from class: AbstractFileSaver

Resets the options.

Overrides:

resetOptions in class AbstractFileSaver

resetWriter
```
public void resetWriter()
```
Description copied from class: AbstractFileSaver

Sets the writer to null.

Overrides:

resetWriter in class AbstractFileSaver

setDestination
```
public void setDestination(java.io.OutputStream output)
                    throws java.io.IOException
```
Description copied from class: AbstractFileSaver

Sets the destination output stream.

Specified by:

setDestination in interface Saver

Overrides:

setDestination in class AbstractFileSaver

Parameters:

output - the output stream.

Throws:

java.io.IOException - if the destination cannot be set

getRevision
```
public java.lang.String getRevision()
```
Description copied from interface: RevisionHandler

Returns the revision string.

Specified by:

getRevision in interface RevisionHandler

Returns:

the revision

main

public static void main(java.lang.String[] args)
