public class FixedDictionaryStringToWordVector extends SimpleStreamFilter implements UnsupervisedFilter, EnvironmentHandler, WeightedInstancesHandler
-dictionary <path to dictionary file> The path to the dictionary to use
-binary-dict Dictionary file contains a binary serialized dictionary
-C Output word counts rather than boolean 0 or 1 (indicating presence or absence of a word)
-R <range> Specify range of attributes to act on. This is a comma separated list of attribute indices, with "first" and "last" valid values.
-V Set attributes selection mode. If false, only selected attributes in the range will be worked on. If true, only non-selected attributes will be processed
-P <attribute name prefix> Specify a prefix for the created attribute names (default: "")
-T Set whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
-I Set whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of docs with word i), where fij is the frequency of word i in document (instance) j.
-N Whether to normalize to average length of documents seen during dictionary construction
-L Convert all tokens to lowercase when matching against dictionary entries.
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-stopwords-handler <spec> The stopwords handler to use (default = Null)
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-output-debug-info If set, filter is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, filter capabilities are not checked before filter is built (use with caution).
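The value options above (-C, -T and -I) can be illustrated with a minimal, self-contained sketch. It is not Weka code: the class and method names are hypothetical, and it assumes the natural logarithm, which the option descriptions do not state.

```java
/** Illustrative sketch only; TermWeightSketch is a hypothetical class, not part of Weka. */
final class TermWeightSketch {

    /** Default output (no -C): 1 if the word occurs in the document, otherwise 0. */
    static double presence(int count) {
        return count > 0 ? 1.0 : 0.0;
    }

    /** -T: log(1 + fij), where fij is the frequency of word i in document j. */
    static double tfTransform(double fij) {
        // Assumption: natural logarithm.
        return Math.log(1.0 + fij);
    }

    /** -I: fij * log(num of docs / num of docs with word i). */
    static double idfTransform(double fij, int numDocs, int numDocsWithWord) {
        return fij * Math.log((double) numDocs / numDocsWithWord);
    }

    public static void main(String[] args) {
        int count = 3;          // word i occurs 3 times in document j
        int numDocs = 100;      // documents seen when the dictionary was built
        int docsWithWord = 10;  // documents that contained word i

        System.out.println("presence = " + presence(count));
        System.out.println("count    = " + count);
        System.out.println("-T       = " + tfTransform(count));
        System.out.println("-I       = " + idfTransform(count, numDocs, docsWithWord));
    }
}
```

A word that appears in every document gets an -I value of 0, because log(1) = 0.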
Constructor and Description |
---|
FixedDictionaryStringToWordVector() |
Modifier and Type | Method and Description |
---|---|
java.lang.String |
getAttributeIndices()
Gets the current range selection.
|
java.lang.String |
getAttributeNamePrefix()
Get the attribute name prefix.
|
Capabilities |
getCapabilities()
Returns the Capabilities of this filter.
|
java.io.File |
getDictionaryFile()
Get the dictionary file to read from
|
DictionaryBuilder |
getDictionaryHandler()
Get the dictionary builder used to manage the dictionary and perform the
actual vectorization
|
boolean |
getDictionaryIsBinary() |
boolean |
getIDFTransform()
Gets whether the word frequencies in a document should be transformed
into:
fij*log(num of Docs/num of Docs with word i) where fij is the frequency of word i in document (instance) j. |
boolean |
getInvertSelection()
Gets whether the supplied columns are to be processed or skipped.
|
boolean |
getLowerCaseTokens()
Gets whether the tokens are to be downcased.
|
boolean |
getNormalizeDocLength()
Gets whether the word frequencies for a document (instance) should be
normalized.
|
boolean |
getOutputWordCounts()
Gets whether output instances contain 0 or 1 indicating word presence, or
word counts.
|
Stemmer |
getStemmer()
Returns the current stemming algorithm, null if none is used.
|
StopwordsHandler |
getStopwordsHandler()
Gets the stopwords handler.
|
boolean |
getTFTransform()
Gets whether the word frequencies should be transformed into log(1+fij)
where fij is the frequency of word i in document (instance) j.
|
Tokenizer |
getTokenizer()
Returns the current tokenizer algorithm.
|
java.lang.String |
globalInfo()
Returns a string describing this filter.
|
static void |
main(java.lang.String[] args) |
void |
setAttributeIndices(java.lang.String rangeList)
Sets which attributes are to be worked on.
|
void |
setAttributeNamePrefix(java.lang.String newPrefix)
Set the attribute name prefix.
|
void |
setDictionaryFile(java.io.File file)
Set the dictionary file to read from
|
void |
setDictionaryIsBinary(boolean binary)
Set whether the dictionary file contains a binary serialized dictionary,
rather than a plain text one
|
void |
setDictionarySource(java.io.InputStream source)
Set an input stream to load a binary serialized dictionary from, rather
than source it from a file
|
void |
setDictionarySource(java.io.Reader source)
Set an input reader to load a textual dictionary from, rather than source
it from a file
|
void |
setEnvironment(Environment env)
Set environment variables to use.
|
void |
setIDFTransform(boolean IDFTransform)
Sets whether the word frequencies in a document should be transformed
into:
fij*log(num of Docs/num of Docs with word i) where fij is the frequency of word i in document (instance) j. |
void |
setInvertSelection(boolean invert)
Sets whether selected columns should be processed or skipped.
|
void |
setLowerCaseTokens(boolean downCaseTokens)
Sets whether the tokens are to be downcased.
|
void |
setNormalizeDocLength(boolean normalize)
Sets whether the word frequencies for a document (instance) should be
normalized.
|
void |
setOutputWordCounts(boolean outputWordCounts)
Sets whether output instances contain 0 or 1 indicating word presence, or
word counts.
|
void |
setStemmer(Stemmer value)
Sets the stemming algorithm to use; null means no stemming at all (i.e., the
NullStemmer is used).
|
void |
setStopwordsHandler(StopwordsHandler value)
Sets the stopwords handler to use.
|
void |
setTFTransform(boolean TFTransform)
Sets whether the word frequencies should be transformed into log(1+fij)
where fij is the frequency of word i in document (instance) j.
|
void |
setTokenizer(Tokenizer value)
Sets the tokenizer algorithm to use.
|
batchFinished, input
setInputFormat
batchFilterFile, debugTipText, doNotCheckCapabilitiesTipText, filterFile, getCapabilities, getCopyOfInputFormat, getDebug, getDoNotCheckCapabilities, getOptions, getOutputFormat, getRevision, isFirstBatchDone, isNewBatch, isOutputFormatDefined, listOptions, makeCopies, makeCopy, mayRemoveInstanceAfterFirstBatchDone, numPendingOutput, output, outputPeek, postExecution, preExecution, run, runFilter, setDebug, setDoNotCheckCapabilities, setOptions, toString, useFilter, wekaStaticWrapper
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
makeCopy
public Capabilities getCapabilities()
Specified by: getCapabilities in interface CapabilitiesHandler
Overrides: getCapabilities in class Filter
Capabilities
public DictionaryBuilder getDictionaryHandler()
public void setDictionarySource(java.io.InputStream source)
source - the input stream to read the dictionary from
public void setDictionarySource(java.io.Reader source)
source - the input reader to read the dictionary from
@OptionMetadata(displayName="Dictionary file", description="The path to the dictionary to use", commandLineParamName="dictionary", commandLineParamSynopsis="-dictionary <path to dictionary file>", displayOrder=1) @FilePropertyMetadata(fileChooserDialogType=0, directoriesOnly=false) public void setDictionaryFile(java.io.File file)
file - the file to read from
public java.io.File getDictionaryFile()
@OptionMetadata(displayName="Dictionary is binary", description="Dictionary file contains a binary serialized dictionary", commandLineParamName="binary-dict", commandLineParamSynopsis="-binary-dict", commandLineParamIsFlag=true, displayOrder=2) public void setDictionaryIsBinary(boolean binary)
binary - true if the dictionary is a binary serialized one
public boolean getDictionaryIsBinary()
public boolean getOutputWordCounts()
@OptionMetadata(displayName="Output word counts", description="Output word counts rather than boolean 0 or 1 (indicating presence or absence of a word", commandLineParamName="C", commandLineParamSynopsis="-C", commandLineParamIsFlag=true, displayOrder=3) public void setOutputWordCounts(boolean outputWordCounts)
outputWordCounts - true if word counts should be output.
public java.lang.String getAttributeIndices()
@OptionMetadata(displayName="Range of attributes to operate on", description="Specify range of attributes to act on. This is a comma separated list of attribute\nindices, with \"first\" and \"last\" valid values.", commandLineParamName="R", commandLineParamSynopsis="-R <range>", displayOrder=4) public void setAttributeIndices(java.lang.String rangeList)
rangeList - a string representing the list of attributes. Since the
string will typically come from a user, attributes are indexed
from 1.
java.lang.IllegalArgumentException - if an invalid range list is supplied
public boolean getInvertSelection()
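The attribute range (-R) and invert selection (-V) behaviour described for this filter can be sketched as follows. This is a simplified illustration, not Weka's own range parser: the class and method names are hypothetical, and only comma-separated 1-based indices plus "first" and "last" are handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; RangeListSketch is a hypothetical class, not part of Weka. */
final class RangeListSketch {

    /**
     * Returns the 0-based indices of the attributes to process.
     * rangeList uses 1-based indices, e.g. "first,3,last".
     * With invert == true, the attributes NOT named in the list are returned.
     */
    static List<Integer> select(String rangeList, int numAttributes, boolean invert) {
        boolean[] picked = new boolean[numAttributes];
        for (String part : rangeList.split(",")) {
            picked[toZeroBased(part, numAttributes)] = true;
        }
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < numAttributes; i++) {
            if (picked[i] != invert) {
                result.add(i);
            }
        }
        return result;
    }

    /** Converts one user-facing token to a 0-based index, rejecting invalid input. */
    private static int toZeroBased(String token, int numAttributes) {
        String t = token.trim();
        int oneBased;
        if (t.equalsIgnoreCase("first")) {
            oneBased = 1;
        } else if (t.equalsIgnoreCase("last")) {
            oneBased = numAttributes;
        } else {
            try {
                oneBased = Integer.parseInt(t);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("Invalid attribute index: " + token, e);
            }
        }
        if (oneBased < 1 || oneBased > numAttributes) {
            throw new IllegalArgumentException("Attribute index out of range: " + token);
        }
        return oneBased - 1;
    }

    public static void main(String[] args) {
        System.out.println(select("first,3,last", 5, false));
        System.out.println(select("first,3,last", 5, true));
    }
}
```

Because the list is expected to come from a user, the sketch converts from 1-based to 0-based indices in one place and throws IllegalArgumentException for anything it cannot interpret.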
@OptionMetadata(displayName="Invert selection", description="Set attributes selection mode. If false, only selected attributes in the range will\nbe worked on. If true, only non-selected attributes will be processed", commandLineParamName="V", commandLineParamSynopsis="-V", commandLineParamIsFlag=true, displayOrder=5) public void setInvertSelection(boolean invert)
invert - the new invert setting
public java.lang.String getAttributeNamePrefix()
@OptionMetadata(displayName="Prefix for created attribute names", description="Specify a prefix for the created attribute names (default: \"\")", commandLineParamName="P", commandLineParamSynopsis="-P <attribute name prefix>", displayOrder=6) public void setAttributeNamePrefix(java.lang.String newPrefix)
newPrefix - String to use as the attribute name prefix.
public boolean getTFTransform()
@OptionMetadata(displayName="TFT transform", description="Set whether the word frequencies should be transformed into\nlog(1+fij), where fij is the frequency of word i in document (instance) j.", commandLineParamName="T", commandLineParamSynopsis="-T", commandLineParamIsFlag=true, displayOrder=7) public void setTFTransform(boolean TFTransform)
TFTransform - true if word frequencies are to be transformed.
public boolean getIDFTransform()
@OptionMetadata(displayName="IDF transform", description="Set whether the word frequencies in a document should be transformed into\nfij*log(num of Docs/num of docs with word i), where fij is the frequency\nof word i in document (instance) j.", commandLineParamName="I", commandLineParamSynopsis="-I", commandLineParamIsFlag=true, displayOrder=8) public void setIDFTransform(boolean IDFTransform)
IDFTransform - true if the word frequencies are to be transformed
@OptionMetadata(displayName="Normalize word frequencies", description="Whether to normalize to average length of documents seen during dictionary construction", commandLineParamName="N", commandLineParamSynopsis="-N", commandLineParamIsFlag=true, displayOrder=9) public void setNormalizeDocLength(boolean normalize)
normalize - the new type.
public boolean getNormalizeDocLength()
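One plausible reading of the -N option is sketched below. It assumes that "length" means the Euclidean norm of a document's word-value vector and that each vector is rescaled so its norm equals the average length recorded while the dictionary was built. That definition is an assumption; the description above does not spell it out. The class and method names are hypothetical.

```java
/** Illustrative sketch only; DocLengthNormalizationSketch is a hypothetical class, not part of Weka. */
final class DocLengthNormalizationSketch {

    /**
     * Rescales the word values of one document so that the vector's
     * Euclidean norm equals avgDocLength (assumed definition of "length").
     */
    static double[] normalize(double[] values, double avgDocLength) {
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += v * v;
        }
        double docLength = Math.sqrt(sumOfSquares);

        double[] out = values.clone();
        if (docLength == 0.0) {
            return out; // no dictionary words matched; nothing to rescale
        }
        for (int i = 0; i < out.length; i++) {
            out[i] = values[i] * avgDocLength / docLength;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] scaled = normalize(new double[] {3.0, 4.0}, 10.0);
        System.out.println(java.util.Arrays.toString(scaled));
    }
}
```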
public boolean getLowerCaseTokens()
@OptionMetadata(displayName="Lower case tokens", description="Convert all tokens to lowercase when matching against dictionary entries.", commandLineParamName="L", commandLineParamSynopsis="-L", commandLineParamIsFlag=true, displayOrder=10) public void setLowerCaseTokens(boolean downCaseTokens)
downCaseTokens - should be true if only lower case tokens are to be
formed.
@OptionMetadata(displayName="Stemmer to use", description="The stemming algorithm (classname plus parameters) to use.", commandLineParamName="stemmer", commandLineParamSynopsis="-stemmer <spec>", displayOrder=11) public void setStemmer(Stemmer value)
value - the configured stemming algorithm, or null
NullStemmer
public Stemmer getStemmer()
@OptionMetadata(displayName="Stop words handler", description="The stopwords handler to use (default = Null)", commandLineParamName="stopwords-handler", commandLineParamSynopsis="-stopwords-handler <spec>", displayOrder=12) public void setStopwordsHandler(StopwordsHandler value)
value - the stopwords handler, if null, Null is used
public StopwordsHandler getStopwordsHandler()
@OptionMetadata(displayName="Tokenizer", description="The tokenizing algorithm (classname plus parameters) to use.\n(default: weka.core.tokenizers.WordTokenizer)", commandLineParamName="tokenizer", commandLineParamSynopsis="-tokenizer <spec>", displayOrder=13) public void setTokenizer(Tokenizer value)
value - the configured tokenizing algorithm
public Tokenizer getTokenizer()
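Taken together, the tokenizer, lowercase, stopwords and dictionary options suggest a pipeline along the following lines. This is a rough sketch in plain Java, not the filter's implementation: the class name is hypothetical, a simple split on whitespace and punctuation stands in for the configured Tokenizer, stemming is omitted, and the order of the steps is an assumption.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; FixedDictionaryVectorizerSketch is a hypothetical class, not part of Weka. */
final class FixedDictionaryVectorizerSketch {

    /**
     * Maps a document onto a fixed dictionary. Words outside the dictionary
     * are ignored, so the output always has dictionary.size() entries.
     */
    static int[] vectorize(String text, List<String> dictionary, Set<String> stopwords,
                           boolean lowerCase, boolean outputCounts) {
        // Position of each dictionary word in the output vector.
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < dictionary.size(); i++) {
            index.put(dictionary.get(i), i);
        }

        int[] vector = new int[dictionary.size()];
        // Stand-in tokenizer: split on whitespace and common punctuation.
        for (String token : text.split("[\\s.,;:'\"()?!]+")) {
            if (token.isEmpty()) {
                continue;
            }
            String word = lowerCase ? token.toLowerCase(Locale.ROOT) : token; // -L
            if (stopwords.contains(word)) {
                continue; // dropped by the stopwords handler
            }
            Integer pos = index.get(word);
            if (pos == null) {
                continue; // not in the fixed dictionary
            }
            vector[pos] = outputCounts ? vector[pos] + 1 : 1; // -C: counts vs. presence
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("cat", "dog", "saw");
        Set<String> stopwords = Set.of("the");
        int[] v = vectorize("The cat saw the Cat.", dictionary, stopwords, true, true);
        System.out.println(java.util.Arrays.toString(v));
    }
}
```

Since the dictionary is fixed in advance, the output attributes do not depend on the data passed through the filter, which is why a streaming (instance-at-a-time) filter is sufficient here.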
public java.lang.String globalInfo()
SimpleFilter
Specified by: globalInfo in class SimpleFilter
public void setEnvironment(Environment env)
EnvironmentHandler
Specified by: setEnvironment in interface EnvironmentHandler
env - the environment variables to use
public static void main(java.lang.String[] args)