public class DictionaryBuilder extends java.lang.Object implements Aggregateable<DictionaryBuilder>, OptionHandler, java.io.Serializable
term [,doc_count]
where doc_count is the number of documents in which the term has occurred.
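As an illustration of the `term [,doc_count]` line format, the following is a minimal sketch of a reader for such text. It is not Weka's own loader: the class name `TextDictionarySketch`, the `parse` method, and the choice to store a missing doc_count as 0 are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses "term [,doc_count]" lines. Not part of Weka. */
final class TextDictionarySketch {

  /** Returns term -> doc_count; a missing doc_count is stored as 0 (an assumption). */
  static Map<String, Integer> parse(String text) {
    Map<String, Integer> docCounts = new LinkedHashMap<>();
    for (String rawLine : text.split("\n")) {
      String line = rawLine.trim();
      if (line.isEmpty()) {
        continue; // skip blank lines
      }
      String term = line;
      int docCount = 0;
      int comma = line.lastIndexOf(',');
      if (comma > 0) {
        try {
          docCount = Integer.parseInt(line.substring(comma + 1).trim());
          term = line.substring(0, comma).trim();
        } catch (NumberFormatException e) {
          // No numeric suffix after the comma: keep the whole line as the term.
        }
      }
      docCounts.put(term, docCount);
    }
    return docCounts;
  }
}
```

For example, `TextDictionarySketch.parse("apple,3\nbanana\n")` yields a map with `apple` mapped to 3 and `banana` mapped to 0.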
Constructor and Description |
---|
DictionaryBuilder() |
Modifier and Type | Method and Description |
---|---|
DictionaryBuilder |
aggregate(DictionaryBuilder toAgg)
Aggregate an object with this one
|
java.lang.String |
attributeIndicesTipText()
Returns the tip text for this property.
|
java.lang.String |
attributeNamePrefixTipText()
Returns the tip text for this property.
|
java.lang.String |
doNotOperateOnPerClassBasisTipText()
Returns the tip text for this property.
|
void |
finalizeAggregation()
Call to complete the aggregation process.
|
java.util.Map<java.lang.String,int[]> |
finalizeDictionary()
Performs final pruning and consolidation according to the number of words
to keep property.
|
java.lang.String |
getAttributeIndices()
Gets the current range selection.
|
java.lang.String |
getAttributeNamePrefix()
Get the attribute name prefix.
|
double |
getAverageDocLength()
Get the average document length to use when normalizing
|
java.util.Map<java.lang.String,int[]>[] |
getDictionaries(boolean minFrequencyPrune)
Get the current dictionary(s) (one per class for nominal class, if set).
|
boolean |
getDoNotOperateOnPerClassBasis()
Get the DoNotOperateOnPerClassBasis value.
|
boolean |
getIDFTransform()
Gets whether the word frequencies in a document should be transformed
into:
fij*log(num of Docs/num of Docs with word i) where fij is the frequency of word i in document(instance) j.
|
Instances |
getInputFormat()
Gets the currently set input format
|
boolean |
getInvertSelection()
Gets whether the supplied columns are to be processed or skipped.
|
boolean |
getLowerCaseTokens()
Gets whether the tokens are to be downcased or not.
|
int |
getMinTermFreq()
Get the MinTermFreq value.
|
boolean |
getNormalize()
Get whether word frequencies for a document should be normalized
|
java.lang.String[] |
getOptions()
Gets the current settings of the DictionaryBuilder
|
boolean |
getOutputWordCounts()
Gets whether output instances contain 0 or 1 indicating word presence, or
word counts.
|
long |
getPeriodicPruning()
Gets the rate (number of instances) at which the dictionary is periodically
pruned.
|
Range |
getSelectedRange()
Get the value of m_SelectedRange.
|
boolean |
getSortDictionary()
Get whether to keep the dictionary sorted alphabetically as it is built.
|
Stemmer |
getStemmer()
Returns the current stemming algorithm, null if none is used.
|
StopwordsHandler |
getStopwordsHandler()
Gets the stopwords handler.
|
boolean |
getTFTransform()
Gets whether the word frequencies should be transformed into log(1+fij)
where fij is the frequency of word i in document(instance) j.
|
Tokenizer |
getTokenizer()
Returns the current tokenizer algorithm.
|
Instances |
getVectorizedFormat()
Get the output format
|
int |
getWordsToKeep()
Gets the number of words (per class if there is a class attribute assigned)
to attempt to keep.
|
java.lang.String |
IDFTransformTipText()
Returns the tip text for this property.
|
java.lang.String |
invertSelectionTipText()
Returns the tip text for this property.
|
java.util.Enumeration<Option> |
listOptions()
Returns an enumeration describing the available options.
|
void |
loadDictionary(java.io.File toLoad,
boolean plainText)
Load a dictionary from a file
|
void |
loadDictionary(java.io.InputStream is)
Load a binary dictionary from an input stream
|
void |
loadDictionary(java.io.Reader reader)
Load a textual dictionary from a reader
|
void |
loadDictionary(java.lang.String filename,
boolean plainText)
Load a dictionary from a file
|
java.lang.String |
lowerCaseTokensTipText()
Returns the tip text for this property.
|
java.lang.String |
minTermFreqTipText()
Returns the tip text for this property.
|
java.lang.String |
normalizeDocLengthTipText()
Returns the tip text for this property.
|
java.lang.String |
normalizeTipText()
Returns the tip text for this property.
|
java.lang.String |
outputWordCountsTipText()
Returns the tip text for this property.
|
java.lang.String |
periodicPruningTipText()
Returns the tip text for this property.
|
void |
processInstance(Instance inst)
Process an instance by tokenizing string attributes and updating the
dictionary.
|
boolean |
readyToVectorize()
Returns true if this DictionaryBuilder is ready to vectorize incoming
instances
|
void |
reset()
Clear the dictionary(s)
|
void |
saveDictionary(java.io.File toSave,
boolean plainText)
Save a dictionary
|
void |
saveDictionary(java.io.OutputStream os)
Save the dictionary in binary form
|
void |
saveDictionary(java.lang.String filename,
boolean plainText)
Save the dictionary
|
void |
saveDictionary(java.io.Writer writer)
Save the dictionary in textual format
|
void |
setAttributeIndices(java.lang.String rangeList)
Sets which attributes are to be worked on.
|
void |
setAttributeIndicesArray(int[] attributes)
Sets which attributes are to be processed.
|
void |
setAttributeNamePrefix(java.lang.String newPrefix)
Set the attribute name prefix.
|
void |
setAverageDocLength(double averageDocLength)
Set the average document length to use when normalizing
|
void |
setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
Set the DoNotOperateOnPerClassBasis value.
|
void |
setIDFTransform(boolean IDFTransform)
Sets whether the word frequencies in a document should be transformed
into:
fij*log(num of Docs/num of Docs with word i) where fij is the frequency of word i in document(instance) j.
|
void |
setInvertSelection(boolean invert)
Sets whether selected columns should be processed or skipped.
|
void |
setLowerCaseTokens(boolean downCaseTokens)
Sets whether the tokens are to be downcased or not.
|
void |
setMinTermFreq(int newMinTermFreq)
Set the MinTermFreq value.
|
void |
setNormalize(boolean n)
Set whether word frequencies for a document should be normalized
|
void |
setOptions(java.lang.String[] options)
Parses a given list of options.
|
void |
setOutputWordCounts(boolean outputWordCounts)
Sets whether output instances contain 0 or 1 indicating word presence, or
word counts.
|
void |
setPeriodicPruning(long newPeriodicPruning)
Sets the rate (number of instances) at which the dictionary is periodically
pruned
|
void |
setSelectedRange(java.lang.String newSelectedRange)
Set the value of m_SelectedRange.
|
void |
setSortDictionary(boolean sortDictionary)
Set whether to keep the dictionary sorted alphabetically as it is built.
|
void |
setStemmer(Stemmer value)
Sets the stemming algorithm to use; null means no stemming at all (i.e., the
NullStemmer is used).
|
void |
setStopwordsHandler(StopwordsHandler value)
Sets the stopwords handler to use.
|
void |
setTFTransform(boolean TFTransform)
Sets whether the word frequencies should be transformed into log(1+fij)
where fij is the frequency of word i in document(instance) j.
|
void |
setTokenizer(Tokenizer value)
Sets the tokenizer algorithm to use.
|
void |
setup(Instances inputFormat)
|
void |
setWordsToKeep(int newWordsToKeep)
Sets the number of words (per class if there is a class attribute assigned)
to attempt to keep.
|
java.lang.String |
sortDictionaryTipText()
Returns the tip text for this property.
|
java.lang.String |
stemmerTipText()
Returns the tip text for this property.
|
java.lang.String |
stopwordsHandlerTipText()
Returns the tip text for this property.
|
java.lang.String |
TFTransformTipText()
Returns the tip text for this property.
|
java.lang.String |
tokenizerTipText()
Returns the tip text for this property.
|
Instances |
vectorizeBatch(Instances batch,
boolean setAvgDocLength)
Convert a batch of instances
|
Instance |
vectorizeInstance(Instance input)
Convert an input instance.
|
Instance |
vectorizeInstance(Instance input,
boolean retainStringAttValuesInMemory)
Convert an input instance.
|
java.lang.String |
wordsToKeepTipText()
Returns the tip text for this property.
|
Methods inherited from class java.lang.Object: equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface OptionHandler: makeCopy
@ProgrammaticProperty public void setAverageDocLength(double averageDocLength)
Parameters:
averageDocLength - the average document length to use
public double getAverageDocLength()
public java.lang.String sortDictionaryTipText()
public void setSortDictionary(boolean sortDictionary)
Parameters:
sortDictionary - true to keep the dictionary sorted alphabetically
public boolean getSortDictionary()
public boolean getOutputWordCounts()
public void setOutputWordCounts(boolean outputWordCounts)
Parameters:
outputWordCounts - true if word counts should be output.
public java.lang.String outputWordCountsTipText()
public Range getSelectedRange()
public void setSelectedRange(java.lang.String newSelectedRange)
Parameters:
newSelectedRange - Value to assign to m_SelectedRange.
public java.lang.String attributeIndicesTipText()
public java.lang.String getAttributeIndices()
public void setAttributeIndices(java.lang.String rangeList)
Parameters:
rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
Throws:
java.lang.IllegalArgumentException - if an invalid range list is supplied
public void setAttributeIndicesArray(int[] attributes)
Parameters:
attributes - an array containing indexes of attributes to process. Since the array will typically come from a program, attributes are indexed from 0.
Throws:
java.lang.IllegalArgumentException - if an invalid set of ranges is supplied
public java.lang.String invertSelectionTipText()
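The two setters above differ in index base: the range string is 1-based, the int array is 0-based. The sketch below shows one way the conversion could look for simple lists such as `1,3-5`. It is an illustration only and not the behaviour of Weka's `Range` class; `RangeListSketch` and `toZeroBased` are hypothetical names, and keywords such as `first` or `last` are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: converts a 1-based range list into 0-based indices. */
final class RangeListSketch {

  /**
   * Accepts comma-separated entries that are either a single index ("2")
   * or an inclusive span ("3-5"), all counted from 1.
   */
  static int[] toZeroBased(String rangeList, int numAttributes) {
    List<Integer> out = new ArrayList<>();
    for (String part : rangeList.split(",")) {
      String[] bounds = part.trim().split("-");
      int lo = Integer.parseInt(bounds[0].trim());
      int hi = bounds.length > 1 ? Integer.parseInt(bounds[1].trim()) : lo;
      if (lo < 1 || hi > numAttributes || lo > hi) {
        throw new IllegalArgumentException("Invalid range: " + part);
      }
      for (int i = lo; i <= hi; i++) {
        out.add(i - 1); // shift from user-facing 1-based to program-facing 0-based
      }
    }
    return out.stream().mapToInt(Integer::intValue).toArray();
  }
}
```

With six attributes, `toZeroBased("1,3-5", 6)` gives the indices 0, 2, 3 and 4, which is the form `setAttributeIndicesArray` expects.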
public boolean getInvertSelection()
public void setInvertSelection(boolean invert)
Parameters:
invert - the new invert setting
public int getWordsToKeep()
public void setWordsToKeep(int newWordsToKeep)
Parameters:
newWordsToKeep - the target number of words in the output vector (per class if assigned).
public java.lang.String wordsToKeepTipText()
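The words-to-keep value is described as a target that the builder attempts to meet rather than an exact size. One plausible reason, sketched below, is threshold-based pruning: find the count of the N-th most frequent term and keep every term at or above it, so ties can push the result past N. This is an assumption for illustration, not a description of Weka's implementation; `PruneSketch` and `prune` are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: threshold-based pruning of a term-count map. */
final class PruneSketch {

  /** Keeps terms whose count reaches both minTermFreq and the wordsToKeep cut-off. */
  static Map<String, Integer> prune(Map<String, Integer> counts, int wordsToKeep, int minTermFreq) {
    int[] sorted = counts.values().stream().mapToInt(Integer::intValue).sorted().toArray();
    int threshold = minTermFreq;
    if (wordsToKeep > 0 && sorted.length > wordsToKeep) {
      // Count of the wordsToKeep-th most frequent term (array is ascending).
      threshold = Math.max(minTermFreq, sorted[sorted.length - wordsToKeep]);
    }
    Map<String, Integer> kept = new TreeMap<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() >= threshold) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }
}
```

With counts `a=5, b=3, c=3, d=1` and a target of 2, the cut-off count is 3, so `a`, `b` and `c` all survive: three words rather than two.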
public long getPeriodicPruning()
public void setPeriodicPruning(long newPeriodicPruning)
Parameters:
newPeriodicPruning - the rate at which the dictionary is periodically pruned
public java.lang.String periodicPruningTipText()
public boolean getTFTransform()
public void setTFTransform(boolean TFTransform)
Parameters:
TFTransform - true if word frequencies are to be transformed.
public java.lang.String TFTransformTipText()
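The TF transform replaces each raw frequency fij with log(1+fij). A minimal sketch of that formula follows, assuming the natural logarithm (the documentation does not state the base); `TfTransformSketch` is a hypothetical name used only here.

```java
/** Hypothetical helper: the log(1 + fij) term-frequency transform. */
final class TfTransformSketch {

  /** Dampens a raw count; natural log is an assumption. */
  static double tf(double fij) {
    return Math.log(1.0 + fij);
  }

  /** Applies the transform to every entry of a document vector. */
  static double[] tfAll(double[] frequencies) {
    double[] out = new double[frequencies.length];
    for (int i = 0; i < frequencies.length; i++) {
      out[i] = tf(frequencies[i]);
    }
    return out;
  }
}
```

A count of 0 stays 0, and larger counts grow only logarithmically, so a word occurring 100 times is not weighted 100 times as heavily as a word occurring once.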
public java.lang.String getAttributeNamePrefix()
public void setAttributeNamePrefix(java.lang.String newPrefix)
Parameters:
newPrefix - String to use as the attribute name prefix.
public java.lang.String attributeNamePrefixTipText()
public boolean getIDFTransform()
public void setIDFTransform(boolean IDFTransform)
Parameters:
IDFTransform - true if the word frequencies are to be transformed
public java.lang.String IDFTransformTipText()
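The IDF transform is given as fij*log(num of Docs/num of Docs with word i). The sketch below evaluates that expression directly, assuming the natural logarithm since the base is not stated; `IdfTransformSketch` and its parameter names are hypothetical.

```java
/** Hypothetical helper: fij * log(numDocs / docsWithTerm). */
final class IdfTransformSketch {

  /**
   * @param fij          frequency of word i in document j
   * @param numDocs      total number of documents
   * @param docsWithTerm number of documents containing word i
   */
  static double idf(double fij, int numDocs, int docsWithTerm) {
    if (numDocs <= 0 || docsWithTerm <= 0) {
      throw new IllegalArgumentException("Document counts must be positive");
    }
    return fij * Math.log((double) numDocs / docsWithTerm);
  }
}
```

A word that appears in every document gets weight 0 regardless of its frequency, while a word found in 10 of 100 documents is scaled by log(10).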
public boolean getNormalize()
public void setNormalize(boolean n)
Parameters:
n - true if word frequencies should be normalized
public java.lang.String normalizeTipText()
public java.lang.String normalizeDocLengthTipText()
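Normalization rescales a document's word frequencies relative to an average document length. The sketch below assumes that "length" means the Euclidean norm of the frequency vector, which this page does not confirm; `NormalizeSketch` and its methods are hypothetical names for illustration.

```java
/** Hypothetical helper: scales a document vector to a target average length. */
final class NormalizeSketch {

  /** Euclidean length of a frequency vector (an assumed definition of doc length). */
  static double length(double[] doc) {
    double sumOfSquares = 0.0;
    for (double v : doc) {
      sumOfSquares += v * v;
    }
    return Math.sqrt(sumOfSquares);
  }

  /** Mean length over a batch, e.g. the training documents. */
  static double averageLength(double[][] docs) {
    double total = 0.0;
    for (double[] doc : docs) {
      total += length(doc);
    }
    return docs.length == 0 ? 0.0 : total / docs.length;
  }

  /** Returns a copy of doc rescaled so that its length equals avgDocLength. */
  static double[] normalize(double[] doc, double avgDocLength) {
    double len = length(doc);
    double[] out = doc.clone();
    if (len == 0.0) {
      return out; // an empty document cannot be rescaled
    }
    for (int i = 0; i < out.length; i++) {
      out[i] = doc[i] * avgDocLength / len;
    }
    return out;
  }
}
```

Under this assumption the average would be computed once from the training batch and then reused unchanged for later batches, which matches the role of `setAverageDocLength`.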
public boolean getLowerCaseTokens()
public void setLowerCaseTokens(boolean downCaseTokens)
Parameters:
downCaseTokens - should be true if only lower case tokens are to be formed.
public java.lang.String lowerCaseTokensTipText()
public java.lang.String doNotOperateOnPerClassBasisTipText()
public boolean getDoNotOperateOnPerClassBasis()
public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
Parameters:
newDoNotOperateOnPerClassBasis - The new DoNotOperateOnPerClassBasis value.
public java.lang.String minTermFreqTipText()
public int getMinTermFreq()
public void setMinTermFreq(int newMinTermFreq)
Parameters:
newMinTermFreq - The new MinTermFreq value.
public Stemmer getStemmer()
public void setStemmer(Stemmer value)
Parameters:
value - the configured stemming algorithm, or null
See Also:
NullStemmer
public java.lang.String stemmerTipText()
public StopwordsHandler getStopwordsHandler()
public void setStopwordsHandler(StopwordsHandler value)
Parameters:
value - the stopwords handler, if null, Null is used
public java.lang.String stopwordsHandlerTipText()
public Tokenizer getTokenizer()
public void setTokenizer(Tokenizer value)
Parameters:
value - the configured tokenizing algorithm
public java.lang.String tokenizerTipText()
public java.util.Enumeration<Option> listOptions()
Specified by:
listOptions in interface OptionHandler
public java.lang.String[] getOptions()
Specified by:
getOptions in interface OptionHandler
public void setOptions(java.lang.String[] options) throws java.lang.Exception
-C Output word counts rather than boolean word presence.
-R <index1,index2-index4,...> Specify list of string attributes to convert to words (as weka Range). (default: select all string attributes)
-V Invert matching sense of column indexes.
-P <attribute name prefix> Specify a prefix for the created attribute names. (default: "")
-W <number of words to keep> Specify approximate number of word fields to create. Surplus words will be discarded. (default: 1000)
-prune-rate <rate as a percentage of dataset> Specify the rate (e.g., every 10% of the input dataset) at which to periodically prune the dictionary. -W prunes after creating a full dictionary. You may not have enough memory for this approach. (default: no periodic pruning)
-T Transform the word frequencies into log(1+fij) where fij is the frequency of word i in jth document(instance).
-I Transform each word frequency into: fij*log(num of Documents/num of documents containing word i) where fij is the frequency of word i in jth document(instance)
-N Whether to 0=not normalize/1=normalize all data/2=normalize test data only to average length of training documents (default 0=don't normalize).
-L Convert all tokens to lowercase before adding to the dictionary.
-stopwords-handler The stopwords handler to use (default Null).
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-M <int> The minimum term frequency (default = 1).
-O If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but based on the documents in all the classes (even if a class attribute is set).
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
public void setup(Instances inputFormat) throws java.lang.Exception
Throws:
java.lang.Exception
public Instances getInputFormat()
public boolean readyToVectorize()
public Instances getVectorizedFormat() throws java.lang.Exception
Throws:
java.lang.Exception - if there is no input format set and/or the dictionary has not been constructed yet.
public Instances vectorizeBatch(Instances batch, boolean setAvgDocLength) throws java.lang.Exception
Parameters:
batch - the batch to convert.
setAvgDocLength - true to compute and set the average document length for this DictionaryBuilder from the batch - this uses the final pruned dictionary when computing doc lengths. When vectorizing non-training batches, and normalization has been turned on, this should be set to false.
Throws:
java.lang.Exception - if there is no input format set and/or the dictionary has not been constructed yet.
public Instance vectorizeInstance(Instance input) throws java.lang.Exception
Parameters:
input - the input instance
Throws:
java.lang.Exception - if there is no input format set and/or the dictionary has not been constructed yet.
public Instance vectorizeInstance(Instance input, boolean retainStringAttValuesInMemory) throws java.lang.Exception
Parameters:
input - the input instance
retainStringAttValuesInMemory - true if the values of string attributes not being vectorized should be retained in memory
Throws:
java.lang.Exception - if there is no input format set and/or the dictionary has not been constructed yet
public void processInstance(Instance inst)
Parameters:
inst - the instance to process
public void reset()
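To make the idea behind `processInstance` concrete (tokenize string content, then update the dictionary), here is a simplified stand-alone sketch. It splits on whitespace instead of using a configurable Tokenizer, skips stemming and stopword handling, and assumes each dictionary entry holds two counters: total occurrences and number of documents containing the term. That layout and the name `ProcessDocumentSketch` are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: updates a term dictionary one document at a time. */
final class ProcessDocumentSketch {

  /** term -> {total occurrences, documents containing the term} (assumed layout). */
  final Map<String, int[]> dictionary = new HashMap<>();

  /** Tokenizes on whitespace and updates both counters for each term. */
  void process(String text, boolean lowerCaseTokens) {
    Set<String> seenInThisDocument = new HashSet<>();
    for (String token : text.trim().split("\\s+")) {
      if (token.isEmpty()) {
        continue;
      }
      String term = lowerCaseTokens ? token.toLowerCase(Locale.ROOT) : token;
      int[] counts = dictionary.computeIfAbsent(term, k -> new int[2]);
      counts[0]++; // every occurrence
      if (seenInThisDocument.add(term)) {
        counts[1]++; // first occurrence within this document
      }
    }
  }
}
```

Processing "The cat the" and then "cat" with lower-casing enabled leaves `the` at 2 occurrences in 1 document and `cat` at 2 occurrences in 2 documents.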
public java.util.Map<java.lang.String,int[]>[] getDictionaries(boolean minFrequencyPrune) throws WekaException
Parameters:
minFrequencyPrune - prune the dictionaries of low frequency terms before returning them
Throws:
WekaException
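Dictionaries are exposed as `Map<String,int[]>` values, and builders can be combined through aggregation. The sketch below shows one way two such maps could be merged by summing their counters element-wise. The meaning of the `int[]` slots is not specified on this page, so the example treats them as opaque counters of equal length; `MergeDictionariesSketch` is a hypothetical name, and this is not Weka's aggregation code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: merges two term dictionaries by summing counters. */
final class MergeDictionariesSketch {

  /** Returns a new map; neither input map nor its arrays are modified. */
  static Map<String, int[]> merge(Map<String, int[]> first, Map<String, int[]> second) {
    Map<String, int[]> merged = new HashMap<>();
    addAll(merged, first);
    addAll(merged, second);
    return merged;
  }

  /** Adds every counter in source to the matching slot in target. */
  private static void addAll(Map<String, int[]> target, Map<String, int[]> source) {
    for (Map.Entry<String, int[]> entry : source.entrySet()) {
      int[] src = entry.getValue();
      // Assumes all arrays for a given term have the same length.
      int[] dst = target.computeIfAbsent(entry.getKey(), k -> new int[src.length]);
      for (int i = 0; i < src.length; i++) {
        dst[i] += src[i];
      }
    }
  }
}
```

Merging `{x: [2, 1]}` with `{x: [3, 2], y: [1, 1]}` gives `{x: [5, 3], y: [1, 1]}`. Pruning would then run on the merged counts.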
public DictionaryBuilder aggregate(DictionaryBuilder toAgg) throws java.lang.Exception
Description copied from interface: Aggregateable
Aggregate an object with this one
Specified by:
aggregate in interface Aggregateable<DictionaryBuilder>
Parameters:
toAgg - the object to aggregate
Throws:
java.lang.Exception - if the supplied object can't be aggregated for some reason
public void finalizeAggregation() throws java.lang.Exception
Description copied from interface: Aggregateable
Call to complete the aggregation process.
Specified by:
finalizeAggregation in interface Aggregateable<DictionaryBuilder>
Throws:
java.lang.Exception - if the aggregation can't be finalized for some reason
public java.util.Map<java.lang.String,int[]> finalizeDictionary() throws java.lang.Exception
Throws:
java.lang.Exception - if a problem occurs
public void loadDictionary(java.lang.String filename, boolean plainText) throws java.io.IOException
Parameters:
filename - the file to load from
plainText - true if the dictionary is in text format
Throws:
java.io.IOException - if a problem occurs
public void loadDictionary(java.io.File toLoad, boolean plainText) throws java.io.IOException
Parameters:
toLoad - the file to load from
plainText - true if the dictionary is in text format
Throws:
java.io.IOException - if a problem occurs
public void loadDictionary(java.io.Reader reader) throws java.io.IOException
Parameters:
reader - the reader to read from
Throws:
java.io.IOException - if a problem occurs
public void loadDictionary(java.io.InputStream is) throws java.io.IOException
Parameters:
is - the input stream to read from
Throws:
java.io.IOException - if a problem occurs
public void saveDictionary(java.lang.String filename, boolean plainText) throws java.io.IOException
Parameters:
filename - the file to save to
plainText - true if the dictionary should be saved in text format
Throws:
java.io.IOException - if a problem occurs
public void saveDictionary(java.io.File toSave, boolean plainText) throws java.io.IOException
Parameters:
toSave - the file to save to
plainText - true if the dictionary should be saved in text format
Throws:
java.io.IOException - if a problem occurs
public void saveDictionary(java.io.Writer writer) throws java.io.IOException
Parameters:
writer - the writer to write to
Throws:
java.io.IOException - if a problem occurs
public void saveDictionary(java.io.OutputStream os) throws java.io.IOException
Parameters:
os - the output stream to write to
Throws:
java.io.IOException - if a problem occurs