public class DictionarySaver extends AbstractFileSaver implements BatchConverter, IncrementalConverter
-binary-dict Save as a binary serialized dictionary
-R <range> Specify range of attributes to act on. This is a comma separated list of attribute indices, with "first" and "last" valid values.
-V Set attributes selection mode. If false, only selected attributes in the range will be worked on. If true, only non-selected attributes will be processed
-L Convert all tokens to lowercase when matching against dictionary entries.
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-stopwords-handler <spec> The stopwords handler to use (default = Null)
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-P <integer> Prune the dictionary every x instances (default = 0 - i.e. no periodic pruning)
-W <integer> The number of words (per class if there is a class attribute assigned) to attempt to keep.
-M <integer> The minimum term frequency to use when pruning the dictionary (default = 1).
-O If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but based on the documents in all the classes (even if a class attribute is set).
-sort Sort the dictionary alphabetically
-i <the input file> The input file
-o <the output file> The output file
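To make the interplay of `-L`, `-M`, `-W` and `-sort` concrete, here is a minimal, self-contained sketch of dictionary pruning. It is not Weka's implementation: the class name `DictionaryPruningSketch`, the regex-based tokenizer, and the alphabetical tie-breaking are assumptions made for illustration, and per-class handling, stemming and stopword removal are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: a simplified stand-in for the pruning options, not Weka code. */
final class DictionaryPruningSketch {

    /**
     * Counts tokens across all documents and prunes the result.
     *
     * @param lowerCase   analogous to -L
     * @param minTermFreq analogous to -M
     * @param wordsToKeep analogous to -W
     * @param sort        analogous to -sort
     */
    static Map<String, Integer> build(List<String> docs, boolean lowerCase,
            int minTermFreq, int wordsToKeep, boolean sort) {
        // Count how often each token occurs across all documents.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String doc : docs) {
            String text = lowerCase ? doc.toLowerCase(Locale.ROOT) : doc;
            // Assumption: anything that is not a letter or digit separates tokens.
            for (String token : text.split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }

        // Drop terms below the minimum frequency.
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minTermFreq) {
                kept.add(e);
            }
        }

        // Rank by descending frequency; ties are broken alphabetically (an assumption).
        kept.sort((a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue());
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });
        if (kept.size() > wordsToKeep) {
            kept = kept.subList(0, wordsToKeep);
        }

        // A TreeMap yields alphabetical order; a LinkedHashMap keeps frequency order.
        Map<String, Integer> result = sort ? new TreeMap<>() : new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : kept) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("The cat sat", "the cat ran", "A dog");
        // With lowercasing and a minimum frequency of 2, only "cat" and "the" survive.
        System.out.println(build(docs, true, 2, 10, true)); // prints {cat=2, the=2}
    }
}
```

Without lowercasing, "The" and "the" are counted separately, so only "cat" reaches the minimum frequency of 2 in this toy input.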
BATCH, INCREMENTAL, NONE
Constructor and Description |
---|
DictionarySaver() |
Modifier and Type | Method and Description |
---|---|
java.lang.String | `getAttributeIndices()` Gets the current range selection. |
Capabilities | `getCapabilities()` Returns the Capabilities of this saver. |
boolean | `getDoNotOperateOnPerClassBasis()` Gets the DoNotOperateOnPerClassBasis value. |
java.lang.String | `getFileDescription()` To be overridden. |
boolean | `getInvertSelection()` Gets whether the supplied columns are to be processed or skipped. |
boolean | `getKeepDictionarySorted()` Gets whether to keep the dictionary sorted alphabetically. |
boolean | `getLowerCaseTokens()` Gets whether the tokens are to be converted to lowercase. |
int | `getMinTermFreq()` Gets the MinTermFreq value. |
long | `getPeriodicPruning()` Gets the rate at which the dictionary is periodically pruned, in instances (per the -P option; 0 means no periodic pruning). |
java.lang.String | `getRevision()` Returns the revision string. |
boolean | `getSaveBinaryDictionary()` Gets whether to save the dictionary as a binary serialized dictionary rather than a plain text one. |
Stemmer | `getStemmer()` Returns the current stemming algorithm, null if none is used. |
StopwordsHandler | `getStopwordsHandler()` Gets the stopwords handler. |
Tokenizer | `getTokenizer()` Returns the current tokenizer algorithm. |
int | `getWordsToKeep()` Gets the number of words (per class if there is a class attribute assigned) to attempt to keep. |
java.lang.String | `globalInfo()` Returns a string describing this Saver. |
static void | `main(java.lang.String[] args)` |
void | `resetOptions()` Resets the options. |
void | `resetWriter()` Sets the writer to null. |
void | `setAttributeIndices(java.lang.String rangeList)` Sets which attributes are to be worked on. |
void | `setDestination(java.io.OutputStream output)` Sets the destination output stream. |
void | `setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)` Sets the DoNotOperateOnPerClassBasis value. |
void | `setInvertSelection(boolean invert)` Sets whether selected columns should be processed or skipped. |
void | `setKeepDictionarySorted(boolean sorted)` Sets whether to keep the dictionary sorted alphabetically. |
void | `setLowerCaseTokens(boolean downCaseTokens)` Sets whether the tokens are to be converted to lowercase. |
void | `setMinTermFreq(int newMinTermFreq)` Sets the MinTermFreq value. |
void | `setPeriodicPruning(long newPeriodicPruning)` Sets the rate at which the dictionary is periodically pruned, in instances (per the -P option; 0 means no periodic pruning). |
void | `setSaveBinaryDictionary(boolean binary)` Sets whether to save the dictionary as a binary serialized dictionary rather than a plain text one. |
void | `setStemmer(Stemmer value)` Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used). |
void | `setStopwordsHandler(StopwordsHandler value)` Sets the stopwords handler to use. |
void | `setTokenizer(Tokenizer value)` Sets the tokenizer algorithm to use. |
void | `setWordsToKeep(int newWordsToKeep)` Sets the number of words (per class if there is a class attribute assigned) to attempt to keep. |
void | `writeBatch()` Writes to a file in batch mode. To be overridden. |
void | `writeIncremental(Instance inst)` Method for incremental saving. |
cancel, filePrefix, getFileExtension, getFileExtensions, getOptions, getUseRelativePath, getWriter, listOptions, retrieveDir, retrieveFile, runFileSaver, setDestination, setDir, setDirAndPrefix, setEnvironment, setFile, setFilePrefix, setOptions, setUseRelativePath, useRelativePathTipText
doNotCheckCapabilitiesTipText, getDoNotCheckCapabilities, getInstances, getWriteMode, resetStructure, setDoNotCheckCapabilities, setInstances, setRetrieval, setStructure
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
makeCopy
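The attribute range accepted by `setAttributeIndices` is 1-based and allows `first` and `last`, while `setInvertSelection` flips which attributes are processed. The following is a minimal sketch of how such a range could be resolved to 0-based indices. The class `AttributeRangeSketch` and its methods are hypothetical, and support for dash spans such as `2-4` is an assumption not stated in this page; the real parsing is done inside Weka.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Illustrative only: a hypothetical resolver for 1-based attribute ranges. */
final class AttributeRangeSketch {

    /** Resolves a range list such as "first,3,last" to sorted 0-based indices. */
    static List<Integer> resolve(String rangeList, int numAttributes, boolean invert) {
        TreeSet<Integer> selected = new TreeSet<>();
        for (String part : rangeList.split(",")) {
            // Assumption: "a-b" denotes an inclusive span of attributes.
            String[] bounds = part.trim().split("-");
            if (bounds.length < 1 || bounds.length > 2) {
                throw new IllegalArgumentException("Invalid range element: " + part);
            }
            int from = toIndex(bounds[0], numAttributes);
            int to = bounds.length == 2 ? toIndex(bounds[1], numAttributes) : from;
            if (from > to) {
                throw new IllegalArgumentException("Invalid range element: " + part);
            }
            for (int i = from; i <= to; i++) {
                selected.add(i);
            }
        }

        // With invert set, keep exactly the attributes that were NOT selected.
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < numAttributes; i++) {
            if (selected.contains(i) != invert) {
                result.add(i);
            }
        }
        return result;
    }

    /** Converts "first", "last" or a 1-based number to a 0-based index. */
    private static int toIndex(String token, int numAttributes) {
        String t = token.trim();
        int oneBased;
        if (t.equalsIgnoreCase("first")) {
            oneBased = 1;
        } else if (t.equalsIgnoreCase("last")) {
            oneBased = numAttributes;
        } else {
            try {
                oneBased = Integer.parseInt(t);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("Invalid range element: " + token, e);
            }
        }
        if (oneBased < 1 || oneBased > numAttributes) {
            throw new IllegalArgumentException("Index out of range: " + token);
        }
        return oneBased - 1;
    }

    public static void main(String[] args) {
        System.out.println(resolve("first,3,last", 5, false)); // prints [0, 2, 4]
        System.out.println(resolve("first,3,last", 5, true));  // prints [1, 3]
    }
}
```

Like the documented setter, the sketch rejects malformed input with an `IllegalArgumentException`.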
public java.lang.String globalInfo()
@OptionMetadata(displayName="Save dictionary in binary form", description="Save as a binary serialized dictionary", commandLineParamName="binary-dict", commandLineParamSynopsis="-binary-dict", commandLineParamIsFlag=true, displayOrder=2)
public void setSaveBinaryDictionary(boolean binary)
Parameters: binary - true if the dictionary is to be saved as binary rather than plain text
public boolean getSaveBinaryDictionary()
public java.lang.String getAttributeIndices()
@OptionMetadata(displayName="Range of attributes to operate on", description="Specify range of attributes to act on. This is a comma separated list of attribute\nindices, with \"first\" and \"last\" valid values.", commandLineParamName="R", commandLineParamSynopsis="-R <range>", displayOrder=4)
public void setAttributeIndices(java.lang.String rangeList)
Parameters: rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
Throws: java.lang.IllegalArgumentException - if an invalid range list is supplied
public boolean getInvertSelection()
@OptionMetadata(displayName="Invert selection", description="Set attributes selection mode. If false, only selected attributes in the range will\nbe worked on. If true, only non-selected attributes will be processed", commandLineParamName="V", commandLineParamSynopsis="-V", commandLineParamIsFlag=true, displayOrder=5)
public void setInvertSelection(boolean invert)
Parameters: invert - the new invert setting
public boolean getLowerCaseTokens()
@OptionMetadata(displayName="Lower case tokens", description="Convert all tokens to lowercase when matching against dictionary entries.", commandLineParamName="L", commandLineParamSynopsis="-L", commandLineParamIsFlag=true, displayOrder=10)
public void setLowerCaseTokens(boolean downCaseTokens)
Parameters: downCaseTokens - should be true if only lower case tokens are to be formed.
@OptionMetadata(displayName="Stemmer to use", description="The stemming algorithm (classname plus parameters) to use.", commandLineParamName="stemmer", commandLineParamSynopsis="-stemmer <spec>", displayOrder=11)
public void setStemmer(Stemmer value)
Parameters: value - the configured stemming algorithm, or null
See Also: NullStemmer
public Stemmer getStemmer()
@OptionMetadata(displayName="Stop words handler", description="The stopwords handler to use (default = Null)", commandLineParamName="stopwords-handler", commandLineParamSynopsis="-stopwords-handler <spec>", displayOrder=12)
public void setStopwordsHandler(StopwordsHandler value)
Parameters: value - the stopwords handler; if null, Null is used
public StopwordsHandler getStopwordsHandler()
@OptionMetadata(displayName="Tokenizer", description="The tokenizing algorithm (classname plus parameters) to use.\n(default: weka.core.tokenizers.WordTokenizer)", commandLineParamName="tokenizer", commandLineParamSynopsis="-tokenizer <spec>", displayOrder=13)
public void setTokenizer(Tokenizer value)
Parameters: value - the configured tokenizing algorithm
public Tokenizer getTokenizer()
public long getPeriodicPruning()
@OptionMetadata(displayName="Periodic pruning rate", description="Prune the dictionary every x instances\n(default = 0 - i.e. no periodic pruning)", commandLineParamName="P", commandLineParamSynopsis="-P <integer>", displayOrder=14)
public void setPeriodicPruning(long newPeriodicPruning)
Parameters: newPeriodicPruning - the rate at which the dictionary is periodically pruned
public int getWordsToKeep()
@OptionMetadata(displayName="Number of words to attempt to keep", description="The number of words (per class if there is a class attribute assigned) to attempt to keep.", commandLineParamName="W", commandLineParamSynopsis="-W <integer>", displayOrder=15)
public void setWordsToKeep(int newWordsToKeep)
Parameters: newWordsToKeep - the target number of words in the output vector (per class if assigned).
public int getMinTermFreq()
@OptionMetadata(displayName="Minimum term frequency", description="The minimum term frequency to use when pruning the dictionary\n(default = 1).", commandLineParamName="M", commandLineParamSynopsis="-M <integer>", displayOrder=16)
public void setMinTermFreq(int newMinTermFreq)
Parameters: newMinTermFreq - The new MinTermFreq value.
public boolean getDoNotOperateOnPerClassBasis()
@OptionMetadata(displayName="Do not operate on a per-class basis", description="If this is set, the maximum number of words and the\nminimum term frequency is not enforced on a per-class\nbasis but based on the documents in all the classes\n(even if a class attribute is set).", commandLineParamName="O", commandLineParamSynopsis="-O", commandLineParamIsFlag=true, displayOrder=17)
public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
Parameters: newDoNotOperateOnPerClassBasis - The new DoNotOperateOnPerClassBasis value.
@OptionMetadata(displayName="Sort dictionary", description="Sort the dictionary alphabetically", commandLineParamName="sort", commandLineParamSynopsis="-sort", commandLineParamIsFlag=true, displayOrder=18)
public void setKeepDictionarySorted(boolean sorted)
Parameters: sorted - true to keep the dictionary sorted
public boolean getKeepDictionarySorted()
public Capabilities getCapabilities()
Specified by: getCapabilities in interface CapabilitiesHandler
Overrides: getCapabilities in class AbstractSaver
See Also: Capabilities
public java.lang.String getFileDescription()
Description copied from class: AbstractFileSaver
Specified by: getFileDescription in interface FileSourcedConverter
Specified by: getFileDescription in class AbstractFileSaver
public void writeIncremental(Instance inst) throws java.io.IOException
Description copied from class: AbstractSaver
Specified by: writeIncremental in interface Saver
Overrides: writeIncremental in class AbstractSaver
Parameters: inst - the instance to be saved
Throws: java.io.IOException - if the instance cannot be written to the specified destination
public void writeBatch() throws java.io.IOException
Description copied from class: AbstractSaver
Specified by: writeBatch in interface Saver
Specified by: writeBatch in class AbstractSaver
Throws: java.io.IOException - if writing is not possible
public void resetOptions()
Description copied from class: AbstractFileSaver
Overrides: resetOptions in class AbstractFileSaver
public void resetWriter()
Description copied from class: AbstractFileSaver
Overrides: resetWriter in class AbstractFileSaver
public void setDestination(java.io.OutputStream output) throws java.io.IOException
Description copied from class: AbstractFileSaver
Specified by: setDestination in interface Saver
Overrides: setDestination in class AbstractFileSaver
Parameters: output - the output stream.
Throws: java.io.IOException - if the destination cannot be set
public java.lang.String getRevision()
Description copied from interface: RevisionHandler
Specified by: getRevision in interface RevisionHandler
public static void main(java.lang.String[] args)