public class StringToWordVector extends Filter implements UnsupervisedFilter, OptionHandler
-C Output word counts rather than boolean word presence.
-R <index1,index2-index4,...> Specify list of string attributes to convert to words (as weka Range). (default: select all string attributes)
-V Invert matching sense of column indexes.
-P <attribute name prefix> Specify a prefix for the created attribute names. (default: "")
-W <number of words to keep> Specify the approximate number of word fields to create. Surplus words will be discarded. (default: 1000)
-prune-rate <rate as a percentage of dataset> Specify the rate (e.g., every 10% of the input dataset) at which to periodically prune the dictionary. -W prunes only after the full dictionary has been created, which may need more memory than is available. (default: no periodic pruning)
-T Transform the word frequencies into log(1+fij), where fij is the frequency of word i in the jth document (instance).
-I Transform each word frequency into fij*log(num of documents/num of documents containing word i), where fij is the frequency of word i in the jth document (instance).
-N Whether to 0=not normalize/1=normalize all data/2=normalize test data only to average length of training documents (default 0=don't normalize).
-L Convert all tokens to lowercase before adding to the dictionary.
-S Ignore words that are in the stoplist.
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-M <int> The minimum term frequency (default = 1).
-O If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but are based on the documents in all the classes (even if a class attribute is set).
-stopwords <file> A file containing stopwords to override the default ones. Using this option automatically sets the flag ('-S') to use the stoplist if the file exists. Format: one stopword per line, lines starting with '#' are interpreted as comments and ignored.
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
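The -T and -I transforms listed above can be checked by hand. The following is a minimal sketch of the two formulas as the option descriptions state them, not the filter's own code. The class and method names (`TransformSketch`, `tf`, `idf`) are hypothetical, and the natural logarithm is an assumption because the option text does not name a base.

```java
/** Hypothetical helper illustrating the -T and -I formulas; not part of Weka. */
final class TransformSketch {

    /** -T: log(1 + fij), where fij is the frequency of word i in document j. */
    static double tf(double fij) {
        return Math.log(1.0 + fij);
    }

    /** -I: fij * log(numDocs / numDocsWithWord). Natural log is an assumption. */
    static double idf(double fij, int numDocs, int numDocsWithWord) {
        if (numDocsWithWord <= 0 || numDocsWithWord > numDocs) {
            throw new IllegalArgumentException("need 0 < numDocsWithWord <= numDocs");
        }
        return fij * Math.log((double) numDocs / numDocsWithWord);
    }

    public static void main(String[] args) {
        System.out.println(tf(3));             // log(4)
        System.out.println(idf(3, 100, 10));   // 3 * log(10)
        System.out.println(idf(3, 100, 100));  // a word found in every document scores 0.0
    }
}
```

A word that occurs in every document gets an -I weight of zero under this formula, which is why the transform downweights very common words.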
See Also: Stopwords, Serialized Form

Modifier and Type | Field and Description |
---|---|
static int |
FILTER_NONE
normalization: No normalization.
|
static int |
FILTER_NORMALIZE_ALL
normalization: Normalize all data.
|
static int |
FILTER_NORMALIZE_TEST_ONLY
normalization: Normalize test data only.
|
static Tag[] |
TAGS_FILTER
Specifies whether a document's (instance's) word frequencies are
to be normalized.
|
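The normalization constants above select which data gets rescaled "to average length of training documents". The sketch below shows one way such a rescaling could work. It is an illustration under stated assumptions, not the filter's implementation: it assumes that a document's length is the Euclidean norm of its word-frequency vector, and the class and method names (`DocLengthSketch`, `length`, `averageLength`, `normalize`) are hypothetical.

```java
/** Hypothetical sketch of document-length normalization; not part of Weka. */
final class DocLengthSketch {

    /** Assumed definition of document length: Euclidean norm of the frequencies. */
    static double length(double[] freqs) {
        double sumSq = 0.0;
        for (double f : freqs) {
            sumSq += f * f;
        }
        return Math.sqrt(sumSq);
    }

    /** Average length over the training documents. */
    static double averageLength(double[][] trainingDocs) {
        if (trainingDocs.length == 0) {
            throw new IllegalArgumentException("no training documents");
        }
        double total = 0.0;
        for (double[] doc : trainingDocs) {
            total += length(doc);
        }
        return total / trainingDocs.length;
    }

    /** Rescales a document so that its length equals targetLength. */
    static double[] normalize(double[] freqs, double targetLength) {
        double len = length(freqs);
        double[] out = freqs.clone();
        if (len == 0.0) {
            return out; // a document with no words stays all zeros
        }
        for (int i = 0; i < out.length; i++) {
            out[i] = freqs[i] * targetLength / len;
        }
        return out;
    }
}
```

Under this reading, FILTER_NORMALIZE_ALL would apply `normalize` to every document, while FILTER_NORMALIZE_TEST_ONLY would apply it only to test documents, with the target length computed from the training batch in both cases.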
Constructor and Description |
---|
StringToWordVector()
Default constructor.
|
StringToWordVector(int wordsToKeep)
Constructor that allows specification of the target number of words
in the output.
|
Modifier and Type | Method and Description |
---|---|
java.lang.String |
attributeIndicesTipText()
Returns the tip text for this property.
|
java.lang.String |
attributeNamePrefixTipText()
Returns the tip text for this property.
|
boolean |
batchFinished()
Signify that this batch of input to the filter is finished.
|
java.lang.String |
doNotOperateOnPerClassBasisTipText()
Returns the tip text for this property.
|
java.lang.String |
getAttributeIndices()
Gets the current range selection.
|
java.lang.String |
getAttributeNamePrefix()
Get the attribute name prefix.
|
Capabilities |
getCapabilities()
Returns the Capabilities of this filter.
|
boolean |
getDoNotOperateOnPerClassBasis()
Get the DoNotOperateOnPerClassBasis value.
|
boolean |
getIDFTransform()
Gets whether the word frequencies in a document should be transformed
into:
fij*log(num of Docs/num of Docs with word i) where fij is the frequency of word i in document (instance) j. |
boolean |
getInvertSelection()
Gets whether the supplied columns are to be processed or skipped.
|
boolean |
getLowerCaseTokens()
Gets whether the tokens are to be converted to lowercase.
|
int |
getMinTermFreq()
Get the MinTermFreq value.
|
SelectedTag |
getNormalizeDocLength()
Gets whether the word frequencies for a document (instance) should
be normalized.
|
java.lang.String[] |
getOptions()
Gets the current settings of the filter.
|
boolean |
getOutputWordCounts()
Gets whether output instances contain 0 or 1 indicating word
presence, or word counts.
|
double |
getPeriodicPruning()
Gets the rate at which the dictionary is periodically pruned, as a
percentage of the dataset size.
|
java.lang.String |
getRevision()
Returns the revision string.
|
Range |
getSelectedRange()
Get the value of m_SelectedRange.
|
Stemmer |
getStemmer()
Returns the current stemming algorithm, null if none is used.
|
java.io.File |
getStopwords()
Returns the file used for obtaining the stopwords; if the file represents
a directory, the default stopwords are used.
|
boolean |
getTFTransform()
Gets whether the word frequencies should be transformed into
log(1+fij) where fij is the frequency of word i in document (instance) j.
|
Tokenizer |
getTokenizer()
Returns the current tokenizer algorithm.
|
boolean |
getUseStoplist()
Gets whether the words on the stoplist are to be ignored (the stoplist
is in weka.core.StopWords).
|
int |
getWordsToKeep()
Gets the number of words (per class if there is a class attribute
assigned) to attempt to keep.
|
java.lang.String |
globalInfo()
Returns a string describing this filter.
|
java.lang.String |
IDFTransformTipText()
Returns the tip text for this property.
|
boolean |
input(Instance instance)
Input an instance for filtering.
|
java.lang.String |
invertSelectionTipText()
Returns the tip text for this property.
|
java.util.Enumeration |
listOptions()
Returns an enumeration describing the available options.
|
java.lang.String |
lowerCaseTokensTipText()
Returns the tip text for this property.
|
static void |
main(java.lang.String[] argv)
Main method for testing this class.
|
java.lang.String |
minTermFreqTipText()
Returns the tip text for this property.
|
java.lang.String |
normalizeDocLengthTipText()
Returns the tip text for this property.
|
java.lang.String |
outputWordCountsTipText()
Returns the tip text for this property.
|
java.lang.String |
periodicPruningTipText()
Returns the tip text for this property.
|
void |
setAttributeIndices(java.lang.String rangeList)
Sets which attributes are to be worked on.
|
void |
setAttributeIndicesArray(int[] attributes)
Sets which attributes are to be processed.
|
void |
setAttributeNamePrefix(java.lang.String newPrefix)
Set the attribute name prefix.
|
void |
setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
Set the DoNotOperateOnPerClassBasis value.
|
void |
setIDFTransform(boolean IDFTransform)
Sets whether the word frequencies in a document should be transformed
into:
fij*log(num of Docs/num of Docs with word i) where fij is the frequency of word i in document (instance) j. |
boolean |
setInputFormat(Instances instanceInfo)
Sets the format of the input instances.
|
void |
setInvertSelection(boolean invert)
Sets whether selected columns should be processed or skipped.
|
void |
setLowerCaseTokens(boolean downCaseTokens)
Sets whether the tokens are to be converted to lowercase.
|
void |
setMinTermFreq(int newMinTermFreq)
Set the MinTermFreq value.
|
void |
setNormalizeDocLength(SelectedTag newType)
Sets whether the word frequencies for a document (instance) should
be normalized.
|
void |
setOptions(java.lang.String[] options)
Parses a given list of options.
|
void |
setOutputWordCounts(boolean outputWordCounts)
Sets whether output instances contain 0 or 1 indicating word
presence, or word counts.
|
void |
setPeriodicPruning(double newPeriodicPruning)
Sets the rate at which the dictionary is periodically pruned, as a
percentage of the dataset size.
|
void |
setSelectedRange(java.lang.String newSelectedRange)
Set the value of m_SelectedRange.
|
void |
setStemmer(Stemmer value)
Sets the stemming algorithm to use; null means no stemming at all (i.e., the
NullStemmer is used).
|
void |
setStopwords(java.io.File value)
Sets the file containing the stopwords; null or a directory unsets the
stopwords.
|
void |
setTFTransform(boolean TFTransform)
Sets whether the word frequencies should be transformed into
log(1+fij) where fij is the frequency of word i in document (instance) j.
|
void |
setTokenizer(Tokenizer value)
Sets the tokenizer algorithm to use.
|
void |
setUseStoplist(boolean useStoplist)
Sets whether the words that are on a stoplist are to be ignored (the
stoplist is in weka.core.StopWords).
|
void |
setWordsToKeep(int newWordsToKeep)
Sets the number of words (per class if there is a class attribute
assigned) to attempt to keep.
|
java.lang.String |
stemmerTipText()
Returns the tip text for this property.
|
java.lang.String |
stopwordsTipText()
Returns the tip text for this property.
|
java.lang.String |
TFTransformTipText()
Returns the tip text for this property.
|
java.lang.String |
tokenizerTipText()
Returns the tip text for this property.
|
java.lang.String |
useStoplistTipText()
Returns the tip text for this property.
|
java.lang.String |
wordsToKeepTipText()
Returns the tip text for this property.
|
Methods inherited from class Filter: batchFilterFile, filterFile, getCapabilities, getOutputFormat, isFirstBatchDone, isNewBatch, isOutputFormatDefined, makeCopies, makeCopy, numPendingOutput, output, outputPeek, toString, useFilter, wekaStaticWrapper
public static final int FILTER_NONE
public static final int FILTER_NORMALIZE_ALL
public static final int FILTER_NORMALIZE_TEST_ONLY
public static final Tag[] TAGS_FILTER
public StringToWordVector()
public StringToWordVector(int wordsToKeep)
wordsToKeep
- the number of words in the output vector (per class if assigned).
public java.util.Enumeration listOptions()
listOptions
in interface OptionHandler
public void setOptions(java.lang.String[] options) throws java.lang.Exception
setOptions
in interface OptionHandler
options
- the list of options as an array of strings
java.lang.Exception
- if an option is not supported
public java.lang.String[] getOptions()
getOptions
in interface OptionHandler
public Capabilities getCapabilities()
getCapabilities
in interface CapabilitiesHandler
getCapabilities
in class Filter
Capabilities
public boolean setInputFormat(Instances instanceInfo) throws java.lang.Exception
setInputFormat
in class Filter
instanceInfo
- an Instances object containing the input
instance structure (any instances contained in the object are
ignored - only the structure is required).
java.lang.Exception
- if the input format can't be set successfully
public boolean input(Instance instance) throws java.lang.Exception
input
in class Filter
instance
- the input instance.
java.lang.IllegalStateException
- if no input structure has been defined.
java.lang.NullPointerException
- if the input format has not been defined.
java.lang.Exception
- if the input instance was not of the correct format or if there was a problem with the filtering.
public boolean batchFinished() throws java.lang.Exception
batchFinished
in class Filter
java.lang.IllegalStateException
- if no input structure has been defined.
java.lang.NullPointerException
- if no input structure has been defined.
java.lang.Exception
- if there was a problem finishing the batch.
public java.lang.String globalInfo()
public boolean getOutputWordCounts()
public void setOutputWordCounts(boolean outputWordCounts)
outputWordCounts
- true if word counts should be output.
public java.lang.String outputWordCountsTipText()
public Range getSelectedRange()
public void setSelectedRange(java.lang.String newSelectedRange)
newSelectedRange
- Value to assign to m_SelectedRange.
public java.lang.String attributeIndicesTipText()
public java.lang.String getAttributeIndices()
public void setAttributeIndices(java.lang.String rangeList)
rangeList
- a string representing the list of attributes. Since
the string will typically come from a user, attributes are indexed from 1.
java.lang.IllegalArgumentException
- if an invalid range list is supplied
public void setAttributeIndicesArray(int[] attributes)
attributes
- an array containing indexes of attributes to process.
Since the array will typically come from a program, attributes are indexed from 0.
java.lang.IllegalArgumentException
- if an invalid set of ranges is supplied
public java.lang.String invertSelectionTipText()
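setAttributeIndices takes a 1-based range list, while setAttributeIndicesArray takes 0-based indexes, so the two are easy to mix up. The sketch below converts a simple range list such as `1,3-5` into the equivalent 0-based array. It is a simplified stand-in for illustration only: `RangeSketch` and `toZeroBased` are hypothetical names, and the parser handles only plain numbers and `lo-hi` pairs, not any other syntax the weka Range class may accept.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical converter from a 1-based range list to 0-based indexes; not part of Weka. */
final class RangeSketch {

    static int[] toZeroBased(String rangeList) {
        List<Integer> indexes = new ArrayList<>();
        for (String part : rangeList.split(",")) {
            String p = part.trim();
            int dash = p.indexOf('-');
            try {
                int lo = Integer.parseInt(dash < 0 ? p : p.substring(0, dash).trim());
                int hi = dash < 0 ? lo : Integer.parseInt(p.substring(dash + 1).trim());
                if (lo < 1 || hi < lo) {
                    throw new IllegalArgumentException("Invalid range: " + p);
                }
                for (int i = lo; i <= hi; i++) {
                    indexes.add(i - 1); // user-facing indexes start at 1
                }
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("Invalid range: " + p, e);
            }
        }
        int[] result = new int[indexes.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = indexes.get(i);
        }
        return result;
    }
}
```

With this sketch, `toZeroBased("1,3-5")` yields the indexes 0, 2, 3 and 4, which is the form setAttributeIndicesArray expects.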
public boolean getInvertSelection()
public void setInvertSelection(boolean invert)
invert
- the new invert setting
public java.lang.String getAttributeNamePrefix()
public void setAttributeNamePrefix(java.lang.String newPrefix)
newPrefix
- String to use as the attribute name prefix.
public java.lang.String attributeNamePrefixTipText()
public int getWordsToKeep()
public void setWordsToKeep(int newWordsToKeep)
newWordsToKeep
- the target number of words in the output
vector (per class if assigned).
public java.lang.String wordsToKeepTipText()
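The words-to-keep value is described as a target that the filter attempts to honor, and the -W option calls it approximate. One plausible reason is tie handling: if the cut-off falls among words with equal counts, keeping all of them overshoots the target. The sketch below illustrates that reading together with the minimum term frequency. It is an assumption-laden illustration, not the filter's algorithm, and `PruneSketch` and `keep` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical dictionary pruning by count threshold; not part of Weka. */
final class PruneSketch {

    static Set<String> keep(Map<String, Integer> counts, int wordsToKeep, int minTermFreq) {
        if (wordsToKeep < 1) {
            throw new IllegalArgumentException("wordsToKeep must be at least 1");
        }
        List<Integer> sorted = new ArrayList<>(counts.values());
        Collections.sort(sorted, Collections.reverseOrder());

        // Threshold is the count of the wordsToKeep-th most frequent word,
        // but never below the minimum term frequency.
        int threshold = minTermFreq;
        if (sorted.size() > wordsToKeep) {
            threshold = Math.max(minTermFreq, sorted.get(wordsToKeep - 1));
        }

        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= threshold) {
                kept.add(e.getKey()); // ties at the threshold are all kept
            }
        }
        return kept;
    }
}
```

With counts a=5, b=3, c=3, d=1 and a target of 2 words, the threshold is 3 and three words survive, which shows how the result can exceed the requested number.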
public double getPeriodicPruning()
public void setPeriodicPruning(double newPeriodicPruning)
newPeriodicPruning
- the rate at which the dictionary is periodically pruned
public java.lang.String periodicPruningTipText()
public boolean getTFTransform()
public void setTFTransform(boolean TFTransform)
TFTransform
- true if word frequencies are to be transformed.
public java.lang.String TFTransformTipText()
public boolean getIDFTransform()
public void setIDFTransform(boolean IDFTransform)
IDFTransform
- true if the word frequencies are to be transformed
public java.lang.String IDFTransformTipText()
public SelectedTag getNormalizeDocLength()
public void setNormalizeDocLength(SelectedTag newType)
newType
- the new type.
public java.lang.String normalizeDocLengthTipText()
public boolean getLowerCaseTokens()
public void setLowerCaseTokens(boolean downCaseTokens)
downCaseTokens
- should be true if only lower case tokens are
to be formed.
public java.lang.String doNotOperateOnPerClassBasisTipText()
public boolean getDoNotOperateOnPerClassBasis()
public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
newDoNotOperateOnPerClassBasis
- The new DoNotOperateOnPerClassBasis value.
public java.lang.String minTermFreqTipText()
public int getMinTermFreq()
public void setMinTermFreq(int newMinTermFreq)
newMinTermFreq
- The new MinTermFreq value.
public java.lang.String lowerCaseTokensTipText()
public boolean getUseStoplist()
public void setUseStoplist(boolean useStoplist)
useStoplist
- true if the tokens that are on a stoplist are to be
ignored.
public java.lang.String useStoplistTipText()
public void setStemmer(Stemmer value)
value
- the configured stemming algorithm, or null (see NullStemmer)
public Stemmer getStemmer()
public java.lang.String stemmerTipText()
public void setStopwords(java.io.File value)
value
- the file containing the stopwords
public java.io.File getStopwords()
public java.lang.String stopwordsTipText()
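The stopwords file format is stated in the -stopwords option: one stopword per line, with lines starting with '#' treated as comments. The sketch below parses lines in that format. It is a minimal illustration rather than the filter's own reader: `StopwordFileSketch` and `parse` are hypothetical names, and trimming whitespace and skipping blank lines are assumptions that the option text does not confirm.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical parser for the documented stopwords file format; not part of Weka. */
final class StopwordFileSketch {

    static Set<String> parse(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            if (line.startsWith("#")) {
                continue; // comment line, as documented for -stopwords
            }
            String word = line.trim(); // assumption: surrounding whitespace is not significant
            if (word.isEmpty()) {
                continue; // assumption: blank lines carry no stopword
            }
            words.add(word);
        }
        return words;
    }
}
```

To try it on a real file, the lines could be loaded with `java.nio.file.Files.readAllLines(path)` and passed to `parse`.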
public void setTokenizer(Tokenizer value)
value
- the configured tokenizing algorithm
public Tokenizer getTokenizer()
public java.lang.String tokenizerTipText()
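Several of the settings above (tokenizer, lowercase conversion, stoplist, and word counts versus word presence) act together on each document. The sketch below strings them together for a single string, as a rough mental model only. It is not how the filter is implemented: `WordVectorSketch` and `vectorize` are hypothetical names, the delimiter set is an assumed stand-in for whatever the configured tokenizer uses, and stemming is left out.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Hypothetical one-document word vectorizer; not part of Weka. */
final class WordVectorSketch {

    // Assumed delimiter set; the real tokenizer's delimiters may differ.
    private static final String DELIMITERS = " \r\n\t.,;:'\"()?!";

    static Map<String, Integer> vectorize(String doc, boolean lowerCase,
                                          Set<String> stopwords, boolean outputCounts) {
        Map<String, Integer> vector = new TreeMap<>();
        StringTokenizer tokens = new StringTokenizer(doc, DELIMITERS);
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken();
            if (lowerCase) {
                word = word.toLowerCase(Locale.ROOT);
            }
            if (stopwords.contains(word)) {
                continue; // ignored when the word is on the stoplist
            }
            if (outputCounts) {
                vector.merge(word, 1, Integer::sum); // word count
            } else {
                vector.put(word, 1); // boolean word presence
            }
        }
        return vector;
    }
}
```

For the input "The cat saw the Cat." with lowercase conversion, a stoplist containing "the", and counts enabled, the sketch maps "cat" to 2 and "saw" to 1; with counts disabled both map to 1.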
public java.lang.String getRevision()
getRevision
in interface RevisionHandler
getRevision
in class Filter
public static void main(java.lang.String[] argv)
argv
- should contain arguments to the filter:
use -h for help