public class StringToWordVector extends Filter implements UnsupervisedFilter, OptionHandler, WeightedInstancesHandler
-C Output word counts rather than boolean word presence.
-R <index1,index2-index4,...> Specify list of string attributes to convert to words (as weka Range). (default: select all string attributes)
-V Invert matching sense of column indexes.
-P <attribute name prefix> Specify a prefix for the created attribute names. (default: "")
-W <number of words to keep> Specify approximate number of word fields to create. Surplus words will be discarded. (default: 1000)
-prune-rate <rate as a percentage of dataset> Specify the rate (e.g., every 10% of the input dataset) at which to periodically prune the dictionary. -W prunes after creating a full dictionary. You may not have enough memory for this approach. (default: no periodic pruning)
-T Transform the word frequencies into log(1+fij) where fij is the frequency of word i in jth document(instance).
-I Transform each word frequency into: fij*log(num of Documents/num of documents containing word i) where fij is the frequency of word i in the jth document (instance).
-N Whether to 0=not normalize/1=normalize all data/2=normalize test data only to average length of training documents (default 0=don't normalize).
-L Convert all tokens to lowercase before adding to the dictionary.
-stopwords-handler The stopwords handler to use (default Null).
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-M <int> The minimum term frequency (default = 1).
-O If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but based on the documents in all the classes (even if a class attribute is set).
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-dictionary <path to save to> The file to save the dictionary to. (default is not to save the dictionary)
-binary-dict Save the dictionary file as a binary serialized object instead of in plain text form. Use in conjunction with -dictionary
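
As a rough illustration of how the switches above are laid out when passed as an option array, the sketch below uses plain Java only: each flag and each flag argument occupies its own array element. The class `OptionArraySketch` and its helpers `valueOf` and `hasFlag` are hypothetical and not part of Weka; only the flag names are taken from the list above.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of Weka.
class OptionArraySketch {

    // Word counts (-C), lowercase tokens (-L), keep about 2000 words (-W),
    // normalize all data (-N 1), minimum term frequency of 2 (-M).
    static String[] exampleOptions() {
        return new String[] {"-C", "-L", "-W", "2000", "-N", "1", "-M", "2"};
    }

    // Returns the element following a flag that takes an argument,
    // or null if the flag is absent or is the last element.
    static String valueOf(String[] options, String flag) {
        for (int i = 0; i + 1 < options.length; i++) {
            if (options[i].equals(flag)) {
                return options[i + 1];
            }
        }
        return null;
    }

    // Returns true if a switch such as -C is present in the array.
    static boolean hasFlag(String[] options, String flag) {
        return Arrays.asList(options).contains(flag);
    }
}
```

An array of this shape is what `setOptions(java.lang.String[] options)` expects and what `getOptions()` returns.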
Modifier and Type | Field and Description |
---|---|
static int |
FILTER_NONE
normalization: No normalization.
|
static int |
FILTER_NORMALIZE_ALL
normalization: Normalize all data.
|
static int |
FILTER_NORMALIZE_TEST_ONLY
normalization: Normalize test data only.
|
static Tag[] |
TAGS_FILTER
Specifies whether document's (instance's) word frequencies are to be
normalized.
|
Constructor and Description |
---|
StringToWordVector()
Default constructor.
|
StringToWordVector(int wordsToKeep)
Constructor that allows specification of the target number of words in the
output.
|
Modifier and Type | Method and Description |
---|---|
java.lang.String |
attributeIndicesTipText()
Returns the tip text for this property.
|
java.lang.String |
attributeNamePrefixTipText()
Returns the tip text for this property.
|
boolean |
batchFinished()
Signify that this batch of input to the filter is finished.
|
java.lang.String |
dictionaryFileToSaveToTipText()
Tip text for this property
|
java.lang.String |
doNotOperateOnPerClassBasisTipText()
Returns the tip text for this property.
|
java.lang.String |
getAttributeIndices()
Gets the current range selection.
|
java.lang.String |
getAttributeNamePrefix()
Get the attribute name prefix.
|
Capabilities |
getCapabilities()
Returns the Capabilities of this filter.
|
java.io.File |
getDictionaryFileToSaveTo()
Get the dictionary file to save the dictionary to.
|
boolean |
getDoNotOperateOnPerClassBasis()
Get the DoNotOperateOnPerClassBasis value.
|
boolean |
getIDFTransform()
Gets whether the word frequencies in a document should be transformed
into:
fij*log(num of Docs/num of Docs with word i) where fij is the frequency of word i in document (instance) j. |
boolean |
getInvertSelection()
Gets whether the supplied columns are to be processed or skipped.
|
boolean |
getLowerCaseTokens()
Gets whether the tokens are to be downcased or not.
|
int |
getMinTermFreq()
Get the MinTermFreq value.
|
SelectedTag |
getNormalizeDocLength()
Gets whether the word frequencies for a document (instance) should be
normalized or not.
|
java.lang.String[] |
getOptions()
Gets the current settings of the filter.
|
boolean |
getOutputWordCounts()
Gets whether output instances contain 0 or 1 indicating word presence, or
word counts.
|
double |
getPeriodicPruning()
Gets the rate at which the dictionary is periodically pruned, as a
percentage of the dataset size.
|
java.lang.String |
getRevision()
Returns the revision string.
|
boolean |
getSaveDictionaryInBinaryForm()
Get whether to save the dictionary in binary serialized form rather than
as plain text.
|
Range |
getSelectedRange()
Get the value of m_SelectedRange.
|
Stemmer |
getStemmer()
Returns the current stemming algorithm, null if none is used.
|
StopwordsHandler |
getStopwordsHandler()
Gets the stopwords handler.
|
boolean |
getTFTransform()
Gets whether the word frequencies should be transformed into log(1+fij)
where fij is the frequency of word i in document (instance) j.
|
Tokenizer |
getTokenizer()
Returns the current tokenizer algorithm.
|
int |
getWordsToKeep()
Gets the number of words (per class if there is a class attribute assigned)
to attempt to keep.
|
java.lang.String |
globalInfo()
Returns a string describing this filter.
|
java.lang.String |
IDFTransformTipText()
Returns the tip text for this property.
|
boolean |
input(Instance instance)
Input an instance for filtering.
|
java.lang.String |
invertSelectionTipText()
Returns the tip text for this property.
|
java.util.Enumeration<Option> |
listOptions()
Returns an enumeration describing the available options.
|
java.lang.String |
lowerCaseTokensTipText()
Returns the tip text for this property.
|
static void |
main(java.lang.String[] argv)
Main method for testing this class.
|
java.lang.String |
minTermFreqTipText()
Returns the tip text for this property.
|
java.lang.String |
normalizeDocLengthTipText()
Returns the tip text for this property.
|
java.lang.String |
outputWordCountsTipText()
Returns the tip text for this property.
|
java.lang.String |
periodicPruningTipText()
Returns the tip text for this property.
|
java.lang.String |
saveDictionaryInBinaryFormTipText() |
void |
setAttributeIndices(java.lang.String rangeList)
Sets which attributes are to be worked on.
|
void |
setAttributeIndicesArray(int[] attributes)
Sets which attributes are to be processed.
|
void |
setAttributeNamePrefix(java.lang.String newPrefix)
Set the attribute name prefix.
|
void |
setDictionaryFileToSaveTo(java.io.File toSaveTo)
Set the dictionary file to save the dictionary to.
|
void |
setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
Set the DoNotOperateOnPerClassBasis value.
|
void |
setIDFTransform(boolean IDFTransform)
Sets whether the word frequencies in a document should be transformed
into:
fij*log(num of Docs/num of Docs with word i) where fij is the frequency of word i in document (instance) j. |
boolean |
setInputFormat(Instances instanceInfo)
Sets the format of the input instances.
|
void |
setInvertSelection(boolean invert)
Sets whether selected columns should be processed or skipped.
|
void |
setLowerCaseTokens(boolean downCaseTokens)
Sets whether the tokens are to be downcased or not.
|
void |
setMinTermFreq(int newMinTermFreq)
Set the MinTermFreq value.
|
void |
setNormalizeDocLength(SelectedTag newType)
Sets whether the word frequencies for a document (instance) should be
normalized or not.
|
void |
setOptions(java.lang.String[] options)
Parses a given list of options.
|
void |
setOutputWordCounts(boolean outputWordCounts)
Sets whether output instances contain 0 or 1 indicating word presence, or
word counts.
|
void |
setPeriodicPruning(double newPeriodicPruning)
Sets the rate at which the dictionary is periodically pruned, as a
percentage of the dataset size.
|
void |
setSaveDictionaryInBinaryForm(boolean saveAsBinary)
Set whether to save the dictionary in binary serialized form rather than
as plain text
|
void |
setSelectedRange(java.lang.String newSelectedRange)
Set the value of m_SelectedRange.
|
void |
setStemmer(Stemmer value)
Sets the stemming algorithm to use; null means no stemming at all (i.e., the
NullStemmer is used).
|
void |
setStopwordsHandler(StopwordsHandler value)
Sets the stopwords handler to use.
|
void |
setTFTransform(boolean TFTransform)
Sets whether the word frequencies should be transformed into log(1+fij)
where fij is the frequency of word i in document (instance) j.
|
void |
setTokenizer(Tokenizer value)
Sets the tokenizer algorithm to use.
|
void |
setWordsToKeep(int newWordsToKeep)
Sets the number of words (per class if there is a class attribute assigned)
to attempt to keep.
|
java.lang.String |
stemmerTipText()
Returns the tip text for this property.
|
java.lang.String |
stopwordsHandlerTipText()
Returns the tip text for this property.
|
java.lang.String |
TFTransformTipText()
Returns the tip text for this property.
|
java.lang.String |
tokenizerTipText()
Returns the tip text for this property.
|
java.lang.String |
wordsToKeepTipText()
Returns the tip text for this property.
|
batchFilterFile, debugTipText, doNotCheckCapabilitiesTipText, filterFile, getCapabilities, getCopyOfInputFormat, getDebug, getDoNotCheckCapabilities, getOutputFormat, isFirstBatchDone, isNewBatch, isOutputFormatDefined, makeCopies, makeCopy, mayRemoveInstanceAfterFirstBatchDone, numPendingOutput, output, outputPeek, postExecution, preExecution, run, runFilter, setDebug, setDoNotCheckCapabilities, toString, useFilter, wekaStaticWrapper
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
makeCopy
public static final int FILTER_NONE
public static final int FILTER_NORMALIZE_ALL
public static final int FILTER_NORMALIZE_TEST_ONLY
public static final Tag[] TAGS_FILTER
public StringToWordVector()
public StringToWordVector(int wordsToKeep)
wordsToKeep
- the number of words in the output vector (per class if
assigned).
public java.util.Enumeration<Option> listOptions()
listOptions
in interface OptionHandler
listOptions
in class Filter
public void setOptions(java.lang.String[] options) throws java.lang.Exception
-C Output word counts rather than boolean word presence.
-R <index1,index2-index4,...> Specify list of string attributes to convert to words (as weka Range). (default: select all string attributes)
-V Invert matching sense of column indexes.
-P <attribute name prefix> Specify a prefix for the created attribute names. (default: "")
-W <number of words to keep> Specify approximate number of word fields to create. Surplus words will be discarded. (default: 1000)
-prune-rate <rate as a percentage of dataset> Specify the rate (e.g., every 10% of the input dataset) at which to periodically prune the dictionary. -W prunes after creating a full dictionary. You may not have enough memory for this approach. (default: no periodic pruning)
-T Transform the word frequencies into log(1+fij) where fij is the frequency of word i in jth document(instance).
-I Transform each word frequency into: fij*log(num of Documents/num of documents containing word i) where fij is the frequency of word i in the jth document (instance).
-N Whether to 0=not normalize/1=normalize all data/2=normalize test data only to average length of training documents (default 0=don't normalize).
-L Convert all tokens to lowercase before adding to the dictionary.
-stopwords-handler The stopwords handler to use (default Null).
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-M <int> The minimum term frequency (default = 1).
-O If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but based on the documents in all the classes (even if a class attribute is set).
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-dictionary <path to save to> The file to save the dictionary to. (default is not to save the dictionary)
-binary-dict Save the dictionary file as a binary serialized object instead of in plain text form. Use in conjunction with -dictionary
setOptions
in interface OptionHandler
setOptions
in class Filter
options
- the list of options as an array of strings
java.lang.Exception
- if an option is not supported
public java.lang.String[] getOptions()
getOptions
in interface OptionHandler
getOptions
in class Filter
public Capabilities getCapabilities()
getCapabilities
in interface CapabilitiesHandler
getCapabilities
in class Filter
Capabilities
public boolean setInputFormat(Instances instanceInfo) throws java.lang.Exception
setInputFormat
in class Filter
instanceInfo
- an Instances object containing the input instance
structure (any instances contained in the object are ignored -
only the structure is required).
java.lang.Exception
- if the input format can't be set successfully
public boolean input(Instance instance) throws java.lang.Exception
input
in class Filter
instance
- the input instance.
java.lang.IllegalStateException
- if no input structure has been defined.
java.lang.NullPointerException
- if the input format has not been defined.
java.lang.Exception
- if the input instance was not of the correct format or if
there was a problem with the filtering.
public boolean batchFinished() throws java.lang.Exception
batchFinished
in class Filter
java.lang.IllegalStateException
- if no input structure has been defined.
java.lang.NullPointerException
- if no input structure has been defined.
java.lang.Exception
- if there was a problem finishing the batch.
public java.lang.String dictionaryFileToSaveToTipText()
public void setDictionaryFileToSaveTo(java.io.File toSaveTo)
toSaveTo
- the path to save the dictionary to
public java.io.File getDictionaryFileToSaveTo()
public java.lang.String saveDictionaryInBinaryFormTipText()
public void setSaveDictionaryInBinaryForm(boolean saveAsBinary)
saveAsBinary
- true to save the dictionary in binary form
public boolean getSaveDictionaryInBinaryForm()
public java.lang.String globalInfo()
public boolean getOutputWordCounts()
public void setOutputWordCounts(boolean outputWordCounts)
outputWordCounts
- true if word counts should be output.
public java.lang.String outputWordCountsTipText()
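
The difference between boolean word presence and word counts, as controlled by `setOutputWordCounts`, can be sketched in plain Java as follows. This is only an illustration of the two output modes, not Weka's implementation: the class `WordCountSketch` is hypothetical, and splitting on whitespace stands in for a real tokenizer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not Weka's implementation.
class WordCountSketch {

    // Counts tokens in one document. Splitting on whitespace is an
    // assumption of this sketch and stands in for a real tokenizer.
    static Map<String, Integer> counts(String document) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String token : document.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                result.merge(token, 1, Integer::sum);
            }
        }
        return result;
    }

    // Value stored for a word: the raw count when outputWordCounts is true,
    // otherwise 1 if the word is present and 0 if it is absent.
    static int value(Map<String, Integer> counts, String word, boolean outputWordCounts) {
        int count = counts.getOrDefault(word, 0);
        if (outputWordCounts) {
            return count;
        }
        return count > 0 ? 1 : 0;
    }
}
```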
public Range getSelectedRange()
public void setSelectedRange(java.lang.String newSelectedRange)
newSelectedRange
- Value to assign to m_SelectedRange.
public java.lang.String attributeIndicesTipText()
public java.lang.String getAttributeIndices()
public void setAttributeIndices(java.lang.String rangeList)
rangeList
- a string representing the list of attributes. Since the
string will typically come from a user, attributes are indexed
from 1.
java.lang.IllegalArgumentException
- if an invalid range list is supplied
public void setAttributeIndicesArray(int[] attributes)
attributes
- an array containing indexes of attributes to process.
Since the array will typically come from a program, attributes are
indexed from 0.
java.lang.IllegalArgumentException
- if an invalid set of ranges is supplied
public java.lang.String invertSelectionTipText()
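
Because `setAttributeIndices` takes 1-based indexes in a string while `setAttributeIndicesArray` takes 0-based indexes in an array, it is easy to be off by one when switching between them. The following plain-Java sketch converts a 0-based array into the equivalent 1-based, comma-separated list; `AttributeIndexSketch` and `toRangeList` are hypothetical names used only for illustration.

```java
// Hypothetical helper for illustration; not part of Weka.
class AttributeIndexSketch {

    // Converts 0-based attribute indexes (array convention) into a
    // 1-based, comma-separated list (string convention).
    static String toRangeList(int[] zeroBased) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < zeroBased.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(zeroBased[i] + 1);
        }
        return sb.toString();
    }
}
```

For example, the array `{0, 2, 3}` and the string `"1,3,4"` refer to the same three attributes.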
public boolean getInvertSelection()
public void setInvertSelection(boolean invert)
invert
- the new invert setting
public java.lang.String getAttributeNamePrefix()
public void setAttributeNamePrefix(java.lang.String newPrefix)
newPrefix
- String to use as the attribute name prefix.
public java.lang.String attributeNamePrefixTipText()
public int getWordsToKeep()
public void setWordsToKeep(int newWordsToKeep)
newWordsToKeep
- the target number of words in the output vector (per
class if assigned).
public java.lang.String wordsToKeepTipText()
public double getPeriodicPruning()
public void setPeriodicPruning(double newPeriodicPruning)
newPeriodicPruning
- the rate at which the dictionary is periodically
pruned
public java.lang.String periodicPruningTipText()
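
One plausible reading of how a target number of words and a minimum term frequency interact when a dictionary is pruned is sketched below in plain Java. This is not Weka's code: the class `DictionaryPruneSketch` is hypothetical, and keeping every word that ties with the cut-off count is an assumption of this sketch (it is one way the resulting number of words could end up only approximately equal to the target).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not Weka's implementation.
class DictionaryPruneSketch {

    // Keeps words whose count reaches the cut-off. The cut-off is the count
    // of the wordsToKeep-th most frequent word, but never below minTermFreq.
    // Words tied with the cut-off are all kept (assumption of this sketch).
    static Map<String, Integer> prune(Map<String, Integer> counts,
                                      int wordsToKeep, int minTermFreq) {
        List<Integer> sorted = new ArrayList<>(counts.values());
        sorted.sort(Collections.reverseOrder());

        int threshold = minTermFreq;
        if (wordsToKeep > 0 && sorted.size() > wordsToKeep) {
            threshold = Math.max(minTermFreq, sorted.get(wordsToKeep - 1));
        }

        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= threshold) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

With counts `a=5, b=3, c=3, d=1`, a target of 2 words and a minimum frequency of 1, the sketch keeps `a`, `b` and `c`, because `b` and `c` tie at the cut-off.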
public boolean getTFTransform()
public void setTFTransform(boolean TFTransform)
TFTransform
- true if word frequencies are to be transformed.
public java.lang.String TFTransformTipText()
public boolean getIDFTransform()
public void setIDFTransform(boolean IDFTransform)
IDFTransform
- true if the word frequencies are to be transformed
public java.lang.String IDFTransformTipText()
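
The two transforms named by `setTFTransform` and `setIDFTransform` are simple enough to write out directly. The sketch below evaluates log(1+fij) and fij*log(num of Docs/num of Docs with word i) in plain Java. The class `TermWeightSketch` is hypothetical, and the use of the natural logarithm is an assumption, since the formulas above do not state a base.

```java
// Hypothetical illustration of the documented formulas; not Weka's code.
class TermWeightSketch {

    // TF transform: log(1 + fij), where fij is the frequency of word i
    // in document j. Natural logarithm assumed.
    static double tf(double fij) {
        return Math.log(1.0 + fij);
    }

    // IDF transform: fij * log(numDocs / numDocsWithWord).
    // numDocsWithWord must be at least 1. Natural logarithm assumed.
    static double idf(double fij, int numDocs, int numDocsWithWord) {
        return fij * Math.log((double) numDocs / numDocsWithWord);
    }
}
```

A word that occurs in every document gets an IDF-transformed value of 0, because log(1) = 0, whatever its frequency in the individual document.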
public SelectedTag getNormalizeDocLength()
public void setNormalizeDocLength(SelectedTag newType)
newType
- the new type.
public java.lang.String normalizeDocLengthTipText()
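
Normalizing to the average length of the training documents can be pictured as rescaling each document's word vector so that its length equals that average. The plain-Java sketch below shows the arithmetic. The class `DocLengthSketch` is hypothetical, and measuring document length as the Euclidean norm of the word vector is an assumption of this sketch rather than something stated in this reference.

```java
// Hypothetical illustration; the Euclidean length measure is an assumption.
class DocLengthSketch {

    // Euclidean length of one document's word vector.
    static double length(double[] values) {
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += v * v;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Average length over a non-empty set of training documents.
    static double averageLength(double[][] trainingDocs) {
        double total = 0.0;
        for (double[] doc : trainingDocs) {
            total += length(doc);
        }
        return total / trainingDocs.length;
    }

    // Returns a rescaled copy whose length equals avgLength.
    // An all-zero vector is returned unchanged.
    static double[] normalize(double[] values, double avgLength) {
        double len = length(values);
        double[] out = values.clone();
        if (len == 0.0) {
            return out;
        }
        for (int i = 0; i < out.length; i++) {
            out[i] = values[i] * avgLength / len;
        }
        return out;
    }
}
```

Under this reading, `FILTER_NORMALIZE_ALL` would apply `normalize` to every document and `FILTER_NORMALIZE_TEST_ONLY` only to documents that arrive after the training batch, in both cases using the average computed from the training documents.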
public boolean getLowerCaseTokens()
public void setLowerCaseTokens(boolean downCaseTokens)
downCaseTokens
- should be true if only lower case tokens are to be
formed.
public java.lang.String doNotOperateOnPerClassBasisTipText()
public boolean getDoNotOperateOnPerClassBasis()
public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
newDoNotOperateOnPerClassBasis
- The new DoNotOperateOnPerClassBasis
value.
public java.lang.String minTermFreqTipText()
public int getMinTermFreq()
public void setMinTermFreq(int newMinTermFreq)
newMinTermFreq
- The new MinTermFreq value.
public java.lang.String lowerCaseTokensTipText()
public void setStemmer(Stemmer value)
value
- the configured stemming algorithm, or null
NullStemmer
public Stemmer getStemmer()
public java.lang.String stemmerTipText()
public void setStopwordsHandler(StopwordsHandler value)
value
- the stopwords handler; if null, Null is used
public StopwordsHandler getStopwordsHandler()
public java.lang.String stopwordsHandlerTipText()
public void setTokenizer(Tokenizer value)
value
- the configured tokenizing algorithm
public Tokenizer getTokenizer()
public java.lang.String tokenizerTipText()
public java.lang.String getRevision()
getRevision
in interface RevisionHandler
getRevision
in class Filter
public static void main(java.lang.String[] argv)
argv
- should contain arguments to the filter: use -h for help