public class StringToWordVector extends Filter implements UnsupervisedFilter, OptionHandler, WeightedInstancesHandler
-C Output word counts rather than boolean word presence.
-R <index1,index2-index4,...> Specify list of string attributes to convert to words (as weka Range). (default: select all string attributes)
-V Invert matching sense of column indexes.
-P <attribute name prefix> Specify a prefix for the created attribute names. (default: "")
-W <number of words to keep> Specify approximate number of word fields to create. Surplus words will be discarded. (default: 1000)
-prune-rate <rate as a percentage of dataset> Specify the rate (e.g., every 10% of the input dataset) at which to periodically prune the dictionary. -W prunes after creating a full dictionary. You may not have enough memory for this approach. (default: no periodic pruning)
-T Transform the word frequencies into log(1+fij) where fij is the frequency of word i in jth document(instance).
-I Transform each word frequency into fij*log(num of Documents/num of documents containing word i), where fij is the frequency of word i in jth document(instance).
-N Whether to 0=not normalize/1=normalize all data/2=normalize test data only to average length of training documents (default 0=don't normalize).
-L Convert all tokens to lowercase before adding to the dictionary.
-stopwords-handler The stopwords handler to use (default Null).
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-M <int> The minimum term frequency (default = 1).
-O If this is set, the maximum number of words and the minimum term frequency is not enforced on a per-class basis but based on the documents in all the classes (even if a class attribute is set).
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-dictionary <path to save to> The file to save the dictionary to. (default is not to save the dictionary)
-binary-dict Save the dictionary file as a binary serialized object instead of in plain text form. Use in conjunction with -dictionary
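The -R option takes a range list such as index1,index2-index4, and -V inverts the selection. The following is a minimal sketch of how such a 1-based range string could be resolved to 0-based attribute indices. The class RangeSketch and its methods are hypothetical helpers written for illustration; they are not the weka Range class, and support for the keywords "first" and "last" is included here as an assumption about typical range strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative parser for 1-based range strings such as "1,3-5"; not Weka's Range class. */
final class RangeSketch {

    /**
     * Returns the 0-based indices selected by a 1-based range list.
     * numAttributes is the total attribute count; invert mimics the -V flag.
     */
    static List<Integer> select(String rangeList, int numAttributes, boolean invert) {
        boolean[] chosen = new boolean[numAttributes];
        for (String part : rangeList.split(",")) {
            String[] ends = part.trim().split("-");
            int lo = toIndex(ends[0], numAttributes);
            int hi = toIndex(ends[ends.length - 1], numAttributes);
            for (int i = lo; i <= hi; i++) {
                chosen[i] = true;
            }
        }
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < numAttributes; i++) {
            // Without invert keep the chosen indices, with invert keep the others.
            if (chosen[i] != invert) {
                result.add(i);
            }
        }
        return result;
    }

    /** Converts "first", "last", or a 1-based number into a 0-based index. */
    private static int toIndex(String token, int numAttributes) {
        String t = token.trim();
        int oneBased;
        if (t.equals("first")) {
            oneBased = 1;
        } else if (t.equals("last")) {
            oneBased = numAttributes;
        } else {
            oneBased = Integer.parseInt(t);
        }
        if (oneBased < 1 || oneBased > numAttributes) {
            throw new IllegalArgumentException("Index out of range: " + token);
        }
        return oneBased - 1;
    }

    public static void main(String[] args) {
        System.out.println(select("1,3-5", 6, false)); // prints [0, 2, 3, 4]
        System.out.println(select("1,3-5", 6, true));  // prints [1, 5]
    }
}
```

The shift from 1-based input to 0-based output mirrors the difference between setAttributeIndices (indexed from 1) and setAttributeIndicesArray (indexed from 0).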
| Modifier and Type | Field | Description |
|---|---|---|
| static int | FILTER_NONE | normalization: No normalization. |
| static int | FILTER_NORMALIZE_ALL | normalization: Normalize all data. |
| static int | FILTER_NORMALIZE_TEST_ONLY | normalization: Normalize test data only. |
| static Tag[] | TAGS_FILTER | Specifies whether document's (instance's) word frequencies are to be normalized. |
| Constructor | Description |
|---|---|
| StringToWordVector() | Default constructor. |
| StringToWordVector(int wordsToKeep) | Constructor that allows specification of the target number of words in the output. |
| Modifier and Type | Method | Description |
|---|---|---|
| java.lang.String | attributeIndicesTipText() | Returns the tip text for this property. |
| java.lang.String | attributeNamePrefixTipText() | Returns the tip text for this property. |
| boolean | batchFinished() | Signify that this batch of input to the filter is finished. |
| java.lang.String | dictionaryFileToSaveToTipText() | Tip text for this property. |
| java.lang.String | doNotOperateOnPerClassBasisTipText() | Returns the tip text for this property. |
| java.lang.String | getAttributeIndices() | Gets the current range selection. |
| java.lang.String | getAttributeNamePrefix() | Get the attribute name prefix. |
| Capabilities | getCapabilities() | Returns the Capabilities of this filter. |
| java.io.File | getDictionaryFileToSaveTo() | Get the dictionary file to save the dictionary to. |
| boolean | getDoNotOperateOnPerClassBasis() | Get the DoNotOperateOnPerClassBasis value. |
| boolean | getIDFTransform() | Gets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document(instance) j. |
| boolean | getInvertSelection() | Gets whether the supplied columns are to be processed or skipped. |
| boolean | getLowerCaseTokens() | Gets whether the tokens are to be downcased or not. |
| int | getMinTermFreq() | Get the MinTermFreq value. |
| SelectedTag | getNormalizeDocLength() | Gets whether the word frequencies for a document (instance) should be normalized or not. |
| java.lang.String[] | getOptions() | Gets the current settings of the filter. |
| boolean | getOutputWordCounts() | Gets whether output instances contain 0 or 1 indicating word presence, or word counts. |
| double | getPeriodicPruning() | Gets the rate at which the dictionary is periodically pruned, as a percentage of the dataset size. |
| java.lang.String | getRevision() | Returns the revision string. |
| boolean | getSaveDictionaryInBinaryForm() | Get whether to save the dictionary in binary serialized form rather than as plain text. |
| Range | getSelectedRange() | Get the value of m_SelectedRange. |
| Stemmer | getStemmer() | Returns the current stemming algorithm, null if none is used. |
| StopwordsHandler | getStopwordsHandler() | Gets the stopwords handler. |
| boolean | getTFTransform() | Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document(instance) j. |
| Tokenizer | getTokenizer() | Returns the current tokenizer algorithm. |
| int | getWordsToKeep() | Gets the number of words (per class if there is a class attribute assigned) to attempt to keep. |
| java.lang.String | globalInfo() | Returns a string describing this filter. |
| java.lang.String | IDFTransformTipText() | Returns the tip text for this property. |
| boolean | input(Instance instance) | Input an instance for filtering. |
| java.lang.String | invertSelectionTipText() | Returns the tip text for this property. |
| java.util.Enumeration<Option> | listOptions() | Returns an enumeration describing the available options. |
| java.lang.String | lowerCaseTokensTipText() | Returns the tip text for this property. |
| static void | main(java.lang.String[] argv) | Main method for testing this class. |
| java.lang.String | minTermFreqTipText() | Returns the tip text for this property. |
| java.lang.String | normalizeDocLengthTipText() | Returns the tip text for this property. |
| java.lang.String | outputWordCountsTipText() | Returns the tip text for this property. |
| java.lang.String | periodicPruningTipText() | Returns the tip text for this property. |
| java.lang.String | saveDictionaryInBinaryFormTipText() | |
| void | setAttributeIndices(java.lang.String rangeList) | Sets which attributes are to be worked on. |
| void | setAttributeIndicesArray(int[] attributes) | Sets which attributes are to be processed. |
| void | setAttributeNamePrefix(java.lang.String newPrefix) | Set the attribute name prefix. |
| void | setDictionaryFileToSaveTo(java.io.File toSaveTo) | Set the dictionary file to save the dictionary to. |
| void | setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis) | Set the DoNotOperateOnPerClassBasis value. |
| void | setIDFTransform(boolean IDFTransform) | Sets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document(instance) j. |
| boolean | setInputFormat(Instances instanceInfo) | Sets the format of the input instances. |
| void | setInvertSelection(boolean invert) | Sets whether selected columns should be processed or skipped. |
| void | setLowerCaseTokens(boolean downCaseTokens) | Sets whether the tokens are to be downcased or not. |
| void | setMinTermFreq(int newMinTermFreq) | Set the MinTermFreq value. |
| void | setNormalizeDocLength(SelectedTag newType) | Sets whether the word frequencies for a document (instance) should be normalized or not. |
| void | setOptions(java.lang.String[] options) | Parses a given list of options. |
| void | setOutputWordCounts(boolean outputWordCounts) | Sets whether output instances contain 0 or 1 indicating word presence, or word counts. |
| void | setPeriodicPruning(double newPeriodicPruning) | Sets the rate at which the dictionary is periodically pruned, as a percentage of the dataset size. |
| void | setSaveDictionaryInBinaryForm(boolean saveAsBinary) | Set whether to save the dictionary in binary serialized form rather than as plain text. |
| void | setSelectedRange(java.lang.String newSelectedRange) | Set the value of m_SelectedRange. |
| void | setStemmer(Stemmer value) | Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used). |
| void | setStopwordsHandler(StopwordsHandler value) | Sets the stopwords handler to use. |
| void | setTFTransform(boolean TFTransform) | Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document(instance) j. |
| void | setTokenizer(Tokenizer value) | Sets the tokenizer algorithm to use. |
| void | setWordsToKeep(int newWordsToKeep) | Sets the number of words (per class if there is a class attribute assigned) to attempt to keep. |
| java.lang.String | stemmerTipText() | Returns the tip text for this property. |
| java.lang.String | stopwordsHandlerTipText() | Returns the tip text for this property. |
| java.lang.String | TFTransformTipText() | Returns the tip text for this property. |
| java.lang.String | tokenizerTipText() | Returns the tip text for this property. |
| java.lang.String | wordsToKeepTipText() | Returns the tip text for this property. |
Methods inherited from class Filter: batchFilterFile, debugTipText, doNotCheckCapabilitiesTipText, filterFile, getCapabilities, getCopyOfInputFormat, getDebug, getDoNotCheckCapabilities, getOutputFormat, isFirstBatchDone, isNewBatch, isOutputFormatDefined, makeCopies, makeCopy, mayRemoveInstanceAfterFirstBatchDone, numPendingOutput, output, outputPeek, postExecution, preExecution, run, runFilter, setDebug, setDoNotCheckCapabilities, toString, useFilter, wekaStaticWrapper
Methods inherited from class java.lang.Object: equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from implemented interfaces: makeCopy
public static final int FILTER_NONE
public static final int FILTER_NORMALIZE_ALL
public static final int FILTER_NORMALIZE_TEST_ONLY
public static final Tag[] TAGS_FILTER
public StringToWordVector()
public StringToWordVector(int wordsToKeep)
Parameters: wordsToKeep - the number of words in the output vector (per class if assigned).
public java.util.Enumeration<Option> listOptions()
Specified by: listOptions in interface OptionHandler
Overrides: listOptions in class Filter
public void setOptions(java.lang.String[] options) throws java.lang.Exception
Specified by: setOptions in interface OptionHandler
Overrides: setOptions in class Filter
Parameters: options - the list of options as an array of strings
Throws: java.lang.Exception - if an option is not supported
public java.lang.String[] getOptions()
Specified by: getOptions in interface OptionHandler
Overrides: getOptions in class Filter
public Capabilities getCapabilities()
Specified by: getCapabilities in interface CapabilitiesHandler
Overrides: getCapabilities in class Filter
See Also: Capabilities
public boolean setInputFormat(Instances instanceInfo) throws java.lang.Exception
Overrides: setInputFormat in class Filter
Parameters: instanceInfo - an Instances object containing the input instance structure (any instances contained in the object are ignored; only the structure is required).
Throws: java.lang.Exception - if the input format can't be set successfully
public boolean input(Instance instance) throws java.lang.Exception
Overrides: input in class Filter
Parameters: instance - the input instance.
Throws:
java.lang.IllegalStateException - if no input structure has been defined.
java.lang.NullPointerException - if the input format has not been defined.
java.lang.Exception - if the input instance was not of the correct format or if there was a problem with the filtering.
public boolean batchFinished() throws java.lang.Exception
Overrides: batchFinished in class Filter
Throws:
java.lang.IllegalStateException - if no input structure has been defined.
java.lang.NullPointerException - if no input structure has been defined.
java.lang.Exception - if there was a problem finishing the batch.
public java.lang.String dictionaryFileToSaveToTipText()
public void setDictionaryFileToSaveTo(java.io.File toSaveTo)
Parameters: toSaveTo - the path to save the dictionary to
public java.io.File getDictionaryFileToSaveTo()
public java.lang.String saveDictionaryInBinaryFormTipText()
public void setSaveDictionaryInBinaryForm(boolean saveAsBinary)
Parameters: saveAsBinary - true to save the dictionary in binary form
public boolean getSaveDictionaryInBinaryForm()
public java.lang.String globalInfo()
public boolean getOutputWordCounts()
public void setOutputWordCounts(boolean outputWordCounts)
Parameters: outputWordCounts - true if word counts should be output.
public java.lang.String outputWordCountsTipText()
public Range getSelectedRange()
public void setSelectedRange(java.lang.String newSelectedRange)
Parameters: newSelectedRange - Value to assign to m_SelectedRange.
public java.lang.String attributeIndicesTipText()
public java.lang.String getAttributeIndices()
public void setAttributeIndices(java.lang.String rangeList)
Parameters: rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
Throws: java.lang.IllegalArgumentException - if an invalid range list is supplied
public void setAttributeIndicesArray(int[] attributes)
Parameters: attributes - an array containing indexes of attributes to process. Since the array will typically come from a program, attributes are indexed from 0.
Throws: java.lang.IllegalArgumentException - if an invalid set of ranges is supplied
public java.lang.String invertSelectionTipText()
public boolean getInvertSelection()
public void setInvertSelection(boolean invert)
Parameters: invert - the new invert setting
public java.lang.String getAttributeNamePrefix()
public void setAttributeNamePrefix(java.lang.String newPrefix)
Parameters: newPrefix - String to use as the attribute name prefix.
public java.lang.String attributeNamePrefixTipText()
public int getWordsToKeep()
public void setWordsToKeep(int newWordsToKeep)
Parameters: newWordsToKeep - the target number of words in the output vector (per class if assigned).
public java.lang.String wordsToKeepTipText()
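The words-to-keep setting is described as approximate, and it interacts with the minimum term frequency. A minimal sketch of one way such a cutoff could work follows. It assumes the threshold is the count of the wordsToKeep-th most frequent word (or the minimum term frequency, whichever is larger) and that every word at or above the threshold is kept, so ties at the cutoff can push the result above the target. DictionaryPruneSketch is a hypothetical class for illustration and is not taken from the filter's source.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative dictionary cutoff; an assumption about how an approximate word limit may behave. */
final class DictionaryPruneSketch {

    /** Keeps words whose count reaches both the min term frequency and the top-N cutoff. */
    static Map<String, Integer> prune(Map<String, Integer> counts, int wordsToKeep, int minTermFreq) {
        int[] sorted = counts.values().stream().mapToInt(Integer::intValue).sorted().toArray();
        int threshold = minTermFreq;
        if (wordsToKeep > 0 && sorted.length > wordsToKeep) {
            // Count of the wordsToKeep-th most frequent word (ascending sort, so index from the end).
            threshold = Math.max(minTermFreq, sorted[sorted.length - wordsToKeep]);
        }
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= threshold) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("the", 5);
        counts.put("cat", 3);
        counts.put("dog", 3);
        counts.put("emu", 1);
        // Target of 2 words, but "cat" and "dog" tie at the cutoff, so 3 survive.
        System.out.println(prune(counts, 2, 1)); // prints {cat=3, dog=3, the=5}
    }
}
```

When a class attribute is set and per-class operation is not disabled, the same kind of cutoff would be applied separately to the counts of each class.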
public double getPeriodicPruning()
public void setPeriodicPruning(double newPeriodicPruning)
Parameters: newPeriodicPruning - the rate at which the dictionary is periodically pruned
public java.lang.String periodicPruningTipText()
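Periodic pruning trades accuracy for memory: the dictionary is trimmed while it is still being built, rather than only once at the end. The sketch below illustrates the idea under two stated assumptions that the text above does not confirm: the pruning interval is the rate (as a percentage) times the number of documents, and a pruning pass drops every word seen at most once so far. PeriodicPruneSketch is a hypothetical class.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative periodic dictionary pruning; the drop rule (count <= 1) is an assumption. */
final class PeriodicPruneSketch {

    /** Counts words over tokenized documents, pruning rare words at a fixed interval. */
    static Map<String, Integer> build(List<List<String>> docs, double ratePercent) {
        // Number of documents between pruning passes; 0 disables pruning.
        long interval = Math.round(ratePercent / 100.0 * docs.size());
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {
            for (String word : docs.get(i)) {
                counts.merge(word, 1, Integer::sum);
            }
            if (interval > 0 && (i + 1) % interval == 0) {
                counts.values().removeIf(c -> c <= 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("a", "b"),
            Arrays.asList("a", "c"),
            Arrays.asList("a", "d"),
            Arrays.asList("b", "d"));
        System.out.println(build(docs, 50.0)); // prints {a=3, d=2}
        System.out.println(build(docs, 0.0));  // prints {a=3, b=2, c=1, d=2}
    }
}
```

In the first call the word "b" occurs twice overall but is lost, because each occurrence falls in a different pruning window. This is the kind of loss that makes periodic pruning approximate.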
public boolean getTFTransform()
public void setTFTransform(boolean TFTransform)
Parameters: TFTransform - true if word frequencies are to be transformed.
public java.lang.String TFTransformTipText()
public boolean getIDFTransform()
public void setIDFTransform(boolean IDFTransform)
Parameters: IDFTransform - true if the word frequencies are to be transformed
public java.lang.String IDFTransformTipText()
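The two transforms can be written out directly from the formulas given for the -T and -I options: log(1+fij) and fij*log(num of Documents/num of documents containing word i). The sketch below is a worked example only. TfIdfSketch and its method names are hypothetical, and the use of the natural logarithm is an assumption, since the text says only "log".

```java
/** Worked example of the TF and IDF formulas; natural log is assumed. */
final class TfIdfSketch {

    /** TF transform: log(1 + fij). */
    static double tfTransform(double fij) {
        return Math.log(1.0 + fij);
    }

    /** IDF transform: fij * log(numDocs / docsWithWord). */
    static double idfTransform(double fij, int numDocs, int docsWithWord) {
        return fij * Math.log((double) numDocs / docsWithWord);
    }

    public static void main(String[] args) {
        // A word that never occurs stays at zero under the TF transform.
        System.out.println(tfTransform(0.0)); // prints 0.0
        // A word found in every document carries no weight under the IDF transform.
        System.out.println(idfTransform(3.0, 10, 10)); // prints 0.0
        // A word with frequency 2 that appears in 10 of 100 documents.
        System.out.println(idfTransform(2.0, 100, 10));
    }
}
```

The last call evaluates 2*log(10), so rarer words receive larger weights for the same in-document frequency.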
public SelectedTag getNormalizeDocLength()
public void setNormalizeDocLength(SelectedTag newType)
Parameters: newType - the new type.
public java.lang.String normalizeDocLengthTipText()
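Normalization rescales each document's word values to the average length of the training documents (the -N option, values 0, 1 and 2). The sketch below shows the arithmetic under the assumption that a document's "length" is the Euclidean norm of its word values; the text above does not define length, so this is an illustrative choice. DocLengthNormSketch is a hypothetical class.

```java
import java.util.Arrays;

/** Illustrative document-length normalization; Euclidean length is an assumption. */
final class DocLengthNormSketch {

    /** Euclidean length of one document's word-value vector. */
    static double length(double[] doc) {
        double sumSquares = 0.0;
        for (double v : doc) {
            sumSquares += v * v;
        }
        return Math.sqrt(sumSquares);
    }

    /** Average length over the training documents. */
    static double averageLength(double[][] trainingDocs) {
        double total = 0.0;
        for (double[] doc : trainingDocs) {
            total += length(doc);
        }
        return total / trainingDocs.length;
    }

    /** Rescales a document so that its length equals avgLength. */
    static double[] normalize(double[] doc, double avgLength) {
        double len = length(doc);
        double[] scaled = doc.clone();
        if (len == 0.0) {
            return scaled; // an empty document cannot be rescaled
        }
        for (int i = 0; i < scaled.length; i++) {
            scaled[i] = doc[i] * avgLength / len;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[][] train = { {3, 4}, {6, 8} };   // lengths 5 and 10
        double avg = averageLength(train);       // 7.5
        System.out.println(Arrays.toString(normalize(train[0], avg))); // prints [4.5, 6.0]
    }
}
```

With FILTER_NORMALIZE_TEST_ONLY, only later batches would be rescaled this way, while the average is still taken from the training batch.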
public boolean getLowerCaseTokens()
public void setLowerCaseTokens(boolean downCaseTokens)
Parameters: downCaseTokens - should be true if only lower case tokens are to be formed.
public java.lang.String doNotOperateOnPerClassBasisTipText()
public boolean getDoNotOperateOnPerClassBasis()
public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
Parameters: newDoNotOperateOnPerClassBasis - The new DoNotOperateOnPerClassBasis value.
public java.lang.String minTermFreqTipText()
public int getMinTermFreq()
public void setMinTermFreq(int newMinTermFreq)
Parameters: newMinTermFreq - The new MinTermFreq value.
public java.lang.String lowerCaseTokensTipText()
public void setStemmer(Stemmer value)
Parameters: value - the configured stemming algorithm, or null
See Also: NullStemmer
public Stemmer getStemmer()
public java.lang.String stemmerTipText()
public void setStopwordsHandler(StopwordsHandler value)
Parameters: value - the stopwords handler, if null, Null is used
public StopwordsHandler getStopwordsHandler()
public java.lang.String stopwordsHandlerTipText()
public void setTokenizer(Tokenizer value)
Parameters: value - the configured tokenizing algorithm
public Tokenizer getTokenizer()
public java.lang.String tokenizerTipText()
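The tokenizer, lowercasing, stopword, and word-count settings together determine what ends up in the word vector. The sketch below walks one string through such a pipeline using only the Java standard library. It is a simplified illustration, not the filter's implementation: the delimiter set, the order of the steps, and the class TokenCountSketch are assumptions, and stemming is left out.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Illustrative text-to-word-vector pipeline; delimiters and step order are assumptions. */
final class TokenCountSketch {

    /**
     * Splits text into tokens, optionally lowercases them, drops stopwords,
     * and returns either word counts or 0/1 presence values (1 for present words).
     */
    static Map<String, Integer> vectorize(String text, boolean lowerCase,
                                          Set<String> stopwords, boolean outputCounts) {
        Map<String, Integer> vector = new TreeMap<>();
        StringTokenizer tokens = new StringTokenizer(text, " \r\n\t.,;:'\"()?!");
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (lowerCase) {
                token = token.toLowerCase();
            }
            if (stopwords.contains(token)) {
                continue;
            }
            if (outputCounts) {
                vector.merge(token, 1, Integer::sum); // word count
            } else {
                vector.put(token, 1);                 // boolean presence
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        String text = "The cat saw the other cat.";
        Set<String> stop = Collections.singleton("the");
        System.out.println(vectorize(text, true, stop, true));  // prints {cat=2, other=1, saw=1}
        System.out.println(vectorize(text, true, stop, false)); // prints {cat=1, other=1, saw=1}
    }
}
```

Without lowercasing, "The" and "the" would be distinct dictionary entries and only the lowercase form would match the stopword in this example.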
public java.lang.String getRevision()
Specified by: getRevision in interface RevisionHandler
Overrides: getRevision in class Filter
public static void main(java.lang.String[] argv)
Parameters: argv - should contain arguments to the filter: use -h for help