public class FixedDictionaryStringToWordVector extends SimpleStreamFilter implements UnsupervisedFilter, EnvironmentHandler, WeightedInstancesHandler
-dictionary <path to dictionary file> The path to the dictionary to use
-binary-dict Dictionary file contains a binary serialized dictionary
-C Output word counts rather than boolean 0 or 1 (indicating presence or absence of a word)
-R <range> Specify range of attributes to act on. This is a comma separated list of attribute indices, with "first" and "last" valid values.
-V Set attributes selection mode. If false, only selected attributes in the range will be worked on. If true, only non-selected attributes will be processed
-P <attribute name prefix> Specify a prefix for the created attribute names (default: "")
-T Set whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
-I Set whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of docs with word i), where fij is the frequency of word i in document (instance) j.
-N Whether to normalize to average length of documents seen during dictionary construction
-L Convert all tokens to lowercase when matching against dictionary entries.
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-stopwords-handler <spec> The stopwords handler to use (default = Null)
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-output-debug-info If set, filter is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, filter capabilities are not checked before filter is built (use with caution).
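
The value written for each dictionary word depends on the -C, -T and -I options above. The following is a minimal sketch of the two formulas as stated in the option descriptions, not the filter's own code. The class and method names (`TransformSketch`, `tfTransform`, `idfTransform`) are hypothetical, and the use of the natural logarithm is an assumption, since the option text only says "log".

```java
import java.util.Locale;

// Hypothetical helper class; it does not exist in Weka.
final class TransformSketch {

    // -T: log(1 + fij). Natural log is an assumption of this sketch.
    static double tfTransform(double fij) {
        return Math.log(1.0 + fij);
    }

    // -I: fij * log(num of Docs / num of docs with word i).
    static double idfTransform(double fij, int numDocs, int numDocsWithWord) {
        return fij * Math.log((double) numDocs / numDocsWithWord);
    }

    public static void main(String[] args) {
        double count = 3.0;       // word i occurs 3 times in document j
        int numDocs = 100;        // documents seen when the dictionary was built
        int numDocsWithWord = 10; // of those, documents containing word i

        System.out.printf(Locale.ROOT, "presence (default): %d%n", count > 0 ? 1 : 0);
        System.out.printf(Locale.ROOT, "count (-C): %.0f%n", count);
        System.out.printf(Locale.ROOT, "TF (-T): %.4f%n", tfTransform(count));
        System.out.printf(Locale.ROOT, "IDF (-I): %.4f%n",
            idfTransform(count, numDocs, numDocsWithWord));
    }
}
```

Under these formulas, a word that occurs in every document gets an IDF-transformed value of 0, because log(1) = 0.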
| Constructor and Description |
|---|
| FixedDictionaryStringToWordVector() |
| Modifier and Type | Method and Description |
|---|---|
| java.lang.String | getAttributeIndices(): Gets the current range selection. |
| java.lang.String | getAttributeNamePrefix(): Get the attribute name prefix. |
| Capabilities | getCapabilities(): Returns the Capabilities of this filter. |
| java.io.File | getDictionaryFile(): Get the dictionary file to read from. |
| DictionaryBuilder | getDictionaryHandler(): Get the dictionary builder used to manage the dictionary and perform the actual vectorization. |
| boolean | getDictionaryIsBinary() |
| boolean | getIDFTransform(): Gets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j. |
| boolean | getInvertSelection(): Gets whether the supplied columns are to be processed or skipped. |
| boolean | getLowerCaseTokens(): Gets whether the tokens are to be downcased. |
| boolean | getNormalizeDocLength(): Gets whether the word frequencies for a document (instance) should be normalized. |
| boolean | getOutputWordCounts(): Gets whether output instances contain 0 or 1 indicating word presence, or word counts. |
| Stemmer | getStemmer(): Returns the current stemming algorithm, null if none is used. |
| StopwordsHandler | getStopwordsHandler(): Gets the stopwords handler. |
| boolean | getTFTransform(): Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j. |
| Tokenizer | getTokenizer(): Returns the current tokenizer algorithm. |
| java.lang.String | globalInfo(): Returns a string describing this filter. |
| static void | main(java.lang.String[] args) |
| void | setAttributeIndices(java.lang.String rangeList): Sets which attributes are to be worked on. |
| void | setAttributeNamePrefix(java.lang.String newPrefix): Set the attribute name prefix. |
| void | setDictionaryFile(java.io.File file): Set the dictionary file to read from. |
| void | setDictionaryIsBinary(boolean binary): Set whether the dictionary file contains a binary serialized dictionary, rather than a plain text one. |
| void | setDictionarySource(java.io.InputStream source): Set an input stream to load a binary serialized dictionary from, rather than source it from a file. |
| void | setDictionarySource(java.io.Reader source): Set an input reader to load a textual dictionary from, rather than source it from a file. |
| void | setEnvironment(Environment env): Set environment variables to use. |
| void | setIDFTransform(boolean IDFTransform): Sets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j. |
| void | setInvertSelection(boolean invert): Sets whether selected columns should be processed or skipped. |
| void | setLowerCaseTokens(boolean downCaseTokens): Sets whether the tokens are to be downcased. |
| void | setNormalizeDocLength(boolean normalize): Sets whether the word frequencies for a document (instance) should be normalized. |
| void | setOutputWordCounts(boolean outputWordCounts): Sets whether output instances contain 0 or 1 indicating word presence, or word counts. |
| void | setStemmer(Stemmer value): Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used). |
| void | setStopwordsHandler(StopwordsHandler value): Sets the stopwords handler to use. |
| void | setTFTransform(boolean TFTransform): Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j. |
| void | setTokenizer(Tokenizer value): Sets the tokenizer algorithm to use. |
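
The range list taken by setAttributeIndices is 1-based and accepts "first" and "last", while setInvertSelection flips which attributes are processed. The sketch below illustrates that selection logic under stated assumptions; it is not Weka's own range handling. The class `RangeSketch` and its methods are hypothetical, and support for hyphenated sub-ranges such as "2-last" is an assumption that goes beyond what this page states.

```java
import java.util.Arrays;

// Hypothetical helper class; it does not exist in Weka.
final class RangeSketch {

    // Converts one 1-based token ("first", "last" or a number) to a 0-based index.
    static int toIndex(String token, int numAttributes) {
        String t = token.trim();
        if (t.equalsIgnoreCase("first")) {
            return 0;
        }
        if (t.equalsIgnoreCase("last")) {
            return numAttributes - 1;
        }
        int oneBased;
        try {
            oneBased = Integer.parseInt(t);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Invalid index: " + token);
        }
        if (oneBased < 1 || oneBased > numAttributes) {
            throw new IllegalArgumentException("Index out of range: " + token);
        }
        return oneBased - 1;
    }

    // Returns, per attribute, whether it would be processed.
    static boolean[] selected(String rangeList, int numAttributes, boolean invert) {
        boolean[] chosen = new boolean[numAttributes];
        for (String part : rangeList.split(",")) {
            // Assumption: "a-b" denotes an inclusive sub-range.
            String[] ends = part.split("-", 2);
            int lo = toIndex(ends[0], numAttributes);
            int hi = ends.length == 2 ? toIndex(ends[1], numAttributes) : lo;
            if (lo > hi) {
                throw new IllegalArgumentException("Invalid range: " + part);
            }
            for (int i = lo; i <= hi; i++) {
                chosen[i] = true;
            }
        }
        if (invert) {
            for (int i = 0; i < chosen.length; i++) {
                chosen[i] = !chosen[i];
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        // prints [true, false, true, false]
        System.out.println(Arrays.toString(selected("first,3", 4, false)));
        // prints [false, true, false, true]
        System.out.println(Arrays.toString(selected("first,3", 4, true)));
    }
}
```

An invalid token raises IllegalArgumentException in this sketch, mirroring the exception documented for setAttributeIndices.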

Methods inherited from class SimpleStreamFilter: batchFinished, input

Methods inherited from class SimpleFilter: setInputFormat

Methods inherited from class Filter: batchFilterFile, debugTipText, doNotCheckCapabilitiesTipText, filterFile, getCapabilities, getCopyOfInputFormat, getDebug, getDoNotCheckCapabilities, getOptions, getOutputFormat, getRevision, isFirstBatchDone, isNewBatch, isOutputFormatDefined, listOptions, makeCopies, makeCopy, mayRemoveInstanceAfterFirstBatchDone, numPendingOutput, output, outputPeek, postExecution, preExecution, run, runFilter, setDebug, setDoNotCheckCapabilities, setOptions, toString, useFilter, wekaStaticWrapper

Methods inherited from class java.lang.Object: equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from implemented interfaces: makeCopy

public Capabilities getCapabilities()
Specified by: getCapabilities in interface CapabilitiesHandler
Overrides: getCapabilities in class Filter
See Also: Capabilities

public DictionaryBuilder getDictionaryHandler()
public void setDictionarySource(java.io.InputStream source)
Parameters: source - the input stream to read the dictionary from

public void setDictionarySource(java.io.Reader source)
Parameters: source - the input reader to read the dictionary from

@OptionMetadata(displayName="Dictionary file", description="The path to the dictionary to use", commandLineParamName="dictionary", commandLineParamSynopsis="-dictionary <path to dictionary file>", displayOrder=1)
@FilePropertyMetadata(fileChooserDialogType=0, directoriesOnly=false)
public void setDictionaryFile(java.io.File file)
Parameters: file - the file to read from

public java.io.File getDictionaryFile()

@OptionMetadata(displayName="Dictionary is binary", description="Dictionary file contains a binary serialized dictionary", commandLineParamName="binary-dict", commandLineParamSynopsis="-binary-dict", commandLineParamIsFlag=true, displayOrder=2)
public void setDictionaryIsBinary(boolean binary)
Parameters: binary - true if the dictionary is a binary serialized one

public boolean getDictionaryIsBinary()
public boolean getOutputWordCounts()

@OptionMetadata(displayName="Output word counts", description="Output word counts rather than boolean 0 or 1 (indicating presence or absence of a word", commandLineParamName="C", commandLineParamSynopsis="-C", commandLineParamIsFlag=true, displayOrder=3)
public void setOutputWordCounts(boolean outputWordCounts)
Parameters: outputWordCounts - true if word counts should be output.

public java.lang.String getAttributeIndices()

@OptionMetadata(displayName="Range of attributes to operate on", description="Specify range of attributes to act on. This is a comma separated list of attribute\nindices, with \"first\" and \"last\" valid values.", commandLineParamName="R", commandLineParamSynopsis="-R <range>", displayOrder=4)
public void setAttributeIndices(java.lang.String rangeList)
Parameters: rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
Throws: java.lang.IllegalArgumentException - if an invalid range list is supplied

public boolean getInvertSelection()
@OptionMetadata(displayName="Invert selection", description="Set attributes selection mode. If false, only selected attributes in the range will\nbe worked on. If true, only non-selected attributes will be processed", commandLineParamName="V", commandLineParamSynopsis="-V", commandLineParamIsFlag=true, displayOrder=5)
public void setInvertSelection(boolean invert)
Parameters: invert - the new invert setting

public java.lang.String getAttributeNamePrefix()

@OptionMetadata(displayName="Prefix for created attribute names", description="Specify a prefix for the created attribute names (default: \"\")", commandLineParamName="P", commandLineParamSynopsis="-P <attribute name prefix>", displayOrder=6)
public void setAttributeNamePrefix(java.lang.String newPrefix)
Parameters: newPrefix - String to use as the attribute name prefix.

public boolean getTFTransform()

@OptionMetadata(displayName="TFT transform", description="Set whether the word frequencies should be transformed into\nlog(1+fij), where fij is the frequency of word i in document (instance) j.", commandLineParamName="T", commandLineParamSynopsis="-T", commandLineParamIsFlag=true, displayOrder=7)
public void setTFTransform(boolean TFTransform)
Parameters: TFTransform - true if word frequencies are to be transformed.

public boolean getIDFTransform()

@OptionMetadata(displayName="IDF transform", description="Set whether the word frequencies in a document should be transformed into\nfij*log(num of Docs/num of docs with word i), where fij is the frequency\nof word i in document (instance) j.", commandLineParamName="I", commandLineParamSynopsis="-I", commandLineParamIsFlag=true, displayOrder=8)
public void setIDFTransform(boolean IDFTransform)
Parameters: IDFTransform - true if the word frequencies are to be transformed

@OptionMetadata(displayName="Normalize word frequencies", description="Whether to normalize to average length of documents seen during dictionary construction", commandLineParamName="N", commandLineParamSynopsis="-N", commandLineParamIsFlag=true, displayOrder=9)
public void setNormalizeDocLength(boolean normalize)
Parameters: normalize - the new type.

public boolean getNormalizeDocLength()
public boolean getLowerCaseTokens()

@OptionMetadata(displayName="Lower case tokens", description="Convert all tokens to lowercase when matching against dictionary entries.", commandLineParamName="L", commandLineParamSynopsis="-L", commandLineParamIsFlag=true, displayOrder=10)
public void setLowerCaseTokens(boolean downCaseTokens)
Parameters: downCaseTokens - should be true if only lower case tokens are to be formed.

@OptionMetadata(displayName="Stemmer to use", description="The stemming algorithm (classname plus parameters) to use.", commandLineParamName="stemmer", commandLineParamSynopsis="-stemmer <spec>", displayOrder=11)
public void setStemmer(Stemmer value)
Parameters: value - the configured stemming algorithm, or null
See Also: NullStemmer

public Stemmer getStemmer()

@OptionMetadata(displayName="Stop words handler", description="The stopwords handler to use (default = Null)", commandLineParamName="stopwords-handler", commandLineParamSynopsis="-stopwords-handler <spec>", displayOrder=12)
public void setStopwordsHandler(StopwordsHandler value)
Parameters: value - the stopwords handler; if null, Null is used

public StopwordsHandler getStopwordsHandler()

@OptionMetadata(displayName="Tokenizer", description="The tokenizing algorithm (classname plus parameters) to use.\n(default: weka.core.tokenizers.WordTokenizer)", commandLineParamName="tokenizer", commandLineParamSynopsis="-tokenizer <spec>", displayOrder=13)
public void setTokenizer(Tokenizer value)
Parameters: value - the configured tokenizing algorithm

public Tokenizer getTokenizer()
public java.lang.String globalInfo()
Specified by: globalInfo in class SimpleFilter

public void setEnvironment(Environment env)
Specified by: setEnvironment in interface EnvironmentHandler
Parameters: env - the environment variables to use

public static void main(java.lang.String[] args)
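
To show how the lower-casing and word-count settings interact with a fixed dictionary, here is a minimal, Weka-free sketch. It only illustrates the idea of matching tokens against a predefined word list and is not the filter's actual implementation. The class `FixedDictionarySketch` and its `vectorize` method are hypothetical, and whitespace splitting stands in for the configurable tokenizer as a simplifying assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class; it does not exist in Weka.
final class FixedDictionarySketch {

    // Maps a document onto one value per dictionary word.
    static double[] vectorize(String document, List<String> dictionary,
                              boolean lowerCase, boolean outputCounts) {
        // Assign each distinct dictionary word a fixed position.
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String word : dictionary) {
            index.putIfAbsent(word, index.size());
        }
        double[] vector = new double[index.size()];
        // Assumption: whitespace splitting instead of a real tokenizer.
        for (String token : document.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String key = lowerCase ? token.toLowerCase(Locale.ROOT) : token;
            Integer pos = index.get(key);
            if (pos == null) {
                continue; // words outside the dictionary are ignored
            }
            // Counts accumulate; presence is capped at 1.
            vector[pos] = outputCounts ? vector[pos] + 1 : 1;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("the", "cat", "dog");
        String doc = "The cat saw the other cat";
        // prints [2.0, 2.0, 0.0]
        System.out.println(Arrays.toString(vectorize(doc, dictionary, true, true)));
        // prints [1.0, 1.0, 0.0]
        System.out.println(Arrays.toString(vectorize(doc, dictionary, true, false)));
    }
}
```

Without lower-casing, the capitalized "The" in this example would not match the dictionary entry "the", so its count would drop from 2 to 1.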