FixedDictionaryStringToWordVector

java.lang.Object
- weka.filters.Filter
  - weka.filters.SimpleFilter
    - weka.filters.SimpleStreamFilter
      - weka.filters.unsupervised.attribute.FixedDictionaryStringToWordVector

All Implemented Interfaces: java.io.Serializable, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, EnvironmentHandler, OptionHandler, RevisionHandler, WeightedInstancesHandler, StreamableFilter, UnsupervisedFilter

public class FixedDictionaryStringToWordVector
extends SimpleStreamFilter
implements UnsupervisedFilter, EnvironmentHandler, WeightedInstancesHandler

Converts String attributes into a set of attributes representing word occurrence (depending on the tokenizer) information from the text contained in the strings. The set of words (attributes) is taken from a user-supplied dictionary, either in plain text form or as a serialized java object.
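As a rough illustration of what vectorizing against a fixed dictionary means, the sketch below maps a document onto a predefined word list. It is not Weka code: the class `FixedDictionarySketch`, the method `vectorize`, and the whitespace tokenization are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a stand-in for the idea of a fixed dictionary, not Weka's implementation.
class FixedDictionarySketch {

    /**
     * Maps a document onto a fixed word list.
     * With outputCounts == false each slot is 0 or 1; with true it holds the number of occurrences.
     */
    static int[] vectorize(String text, List<String> dictionary, boolean outputCounts) {
        int[] vector = new int[dictionary.size()];
        // Whitespace splitting is an assumption; the real filter delegates to a Tokenizer.
        for (String token : text.trim().split("\\s+")) {
            int slot = dictionary.indexOf(token);
            if (slot < 0) {
                continue; // words outside the dictionary never become attributes
            }
            vector[slot] = outputCounts ? vector[slot] + 1 : 1;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("cat", "dog", "fish");
        String doc = "dog bites dog near cat";
        System.out.println(Arrays.toString(vectorize(doc, dictionary, false))); // [1, 1, 0]
        System.out.println(Arrays.toString(vectorize(doc, dictionary, true)));  // [1, 2, 0]
    }
}
```

The output always has one slot per dictionary entry, regardless of which words occur in the input text.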

Valid options are:

  -dictionary <path to dictionary file>
  The path to the dictionary to use

  -binary-dict
  Dictionary file contains a binary serialized dictionary

  -C
  Output word counts rather than boolean 0 or 1 (indicating presence or absence of a word)

  -R <range>
  Specify range of attributes to act on. This is a comma separated list of attribute
  indices, with "first" and "last" valid values.

  -V
  Set attributes selection mode. If false, only selected attributes in the range will
  be worked on. If true, only non-selected attributes will be processed

  -P <attribute name prefix>
  Specify a prefix for the created attribute names (default: "")

  -T
  Set whether the word frequencies should be transformed into
  log(1+fij), where fij is the frequency of word i in document (instance) j.

  -I
  Set whether the word frequencies in a document should be transformed into
  fij*log(num of Docs/num of docs with word i), where fij is the frequency
  of word i in document (instance) j.

  -N
  Whether to normalize to average length of documents seen during dictionary construction

  -L
  Convert all tokens to lowercase when matching against dictionary entries.

  -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.

  -stopwords-handler <spec>
  The stopwords handler to use (default = Null)

  -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)

  -output-debug-info
  If set, filter is run in debug mode and
  may output additional info to the console

  -do-not-check-capabilities
  If set, filter capabilities are not checked before filter is built
  (use with caution).

Version: $Revision: 15574 $
Author: Mark Hall (mhall{[at]}pentaho{[dot]}com)
See Also: Serialized Form

Constructor Summary

Constructors
Constructor and Description
`FixedDictionaryStringToWordVector()`

Method Summary

Modifier and Type	Method and Description
`java.lang.String`	`getAttributeIndices()` Gets the current range selection.
`java.lang.String`	`getAttributeNamePrefix()` Get the attribute name prefix.
`Capabilities`	`getCapabilities()` Returns the Capabilities of this filter.
`java.io.File`	`getDictionaryFile()` Get the dictionary file to read from
`DictionaryBuilder`	`getDictionaryHandler()` Get the dictionary builder used to manage the dictionary and perform the actual vectorization
`boolean`	`getDictionaryIsBinary()` Gets whether the dictionary file contains a binary serialized dictionary, rather than a plain text one.
`boolean`	`getIDFTransform()` Gets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j.
`boolean`	`getInvertSelection()` Gets whether the supplied columns are to be processed or skipped.
`boolean`	`getLowerCaseTokens()` Gets whether the tokens are to be downcased.
`boolean`	`getNormalizeDocLength()` Gets whether the word frequencies for a document (instance) should be normalized.
`boolean`	`getOutputWordCounts()` Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
`Stemmer`	`getStemmer()` Returns the current stemming algorithm, null if none is used.
`StopwordsHandler`	`getStopwordsHandler()` Gets the stopwords handler.
`boolean`	`getTFTransform()` Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
`Tokenizer`	`getTokenizer()` Returns the current tokenizer algorithm.
`java.lang.String`	`globalInfo()` Returns a string describing this filter.
`static void`	`main(java.lang.String[] args)`
`void`	`setAttributeIndices(java.lang.String rangeList)` Sets which attributes are to be worked on.
`void`	`setAttributeNamePrefix(java.lang.String newPrefix)` Set the attribute name prefix.
`void`	`setDictionaryFile(java.io.File file)` Set the dictionary file to read from
`void`	`setDictionaryIsBinary(boolean binary)` Set whether the dictionary file contains a binary serialized dictionary, rather than a plain text one
`void`	`setDictionarySource(java.io.InputStream source)` Set an input stream to load a binary serialized dictionary from, rather than source it from a file
`void`	`setDictionarySource(java.io.Reader source)` Set an input reader to load a textual dictionary from, rather than source it from a file
`void`	`setEnvironment(Environment env)` Set environment variables to use.
`void`	`setIDFTransform(boolean IDFTransform)` Sets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j.
`void`	`setInvertSelection(boolean invert)` Sets whether selected columns should be processed or skipped.
`void`	`setLowerCaseTokens(boolean downCaseTokens)` Sets whether the tokens are to be downcased.
`void`	`setNormalizeDocLength(boolean normalize)` Sets whether the word frequencies for a document (instance) should be normalized.
`void`	`setOutputWordCounts(boolean outputWordCounts)` Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
`void`	`setStemmer(Stemmer value)` Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
`void`	`setStopwordsHandler(StopwordsHandler value)` Sets the stopwords handler to use.
`void`	`setTFTransform(boolean TFTransform)` Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
`void`	`setTokenizer(Tokenizer value)` Sets the tokenizer algorithm to use.

Methods inherited from class weka.filters.SimpleStreamFilter
batchFinished, input

Methods inherited from class weka.filters.SimpleFilter
setInputFormat

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface weka.core.OptionHandler
makeCopy

Constructor Detail
- FixedDictionaryStringToWordVector
```
public FixedDictionaryStringToWordVector()
```

Method Detail

getCapabilities
```
public Capabilities getCapabilities()
```
Returns the Capabilities of this filter.

Specified by:

getCapabilities in interface CapabilitiesHandler

Overrides:

getCapabilities in class Filter

Returns:

the capabilities of this object

See Also:

Capabilities

getDictionaryHandler
```
public DictionaryBuilder getDictionaryHandler()
```
Get the dictionary builder used to manage the dictionary and perform the actual vectorization

Returns:

the DictionaryBuilder in use

setDictionarySource
```
public void setDictionarySource(java.io.InputStream source)
```
Set an input stream to load a binary serialized dictionary from, rather than source it from a file

Parameters:

source - the input stream to read the dictionary from

setDictionarySource
```
public void setDictionarySource(java.io.Reader source)
```
Set an input reader to load a textual dictionary from, rather than source it from a file

Parameters:

source - the input reader to read the dictionary from

setDictionaryFile

@OptionMetadata(displayName="Dictionary file",
                description="The path to the dictionary to use",
                commandLineParamName="dictionary",
                commandLineParamSynopsis="-dictionary <path to dictionary file>",
                displayOrder=1)
 @FilePropertyMetadata(fileChooserDialogType=0,
                      directoriesOnly=false)
public void setDictionaryFile(java.io.File file)

Set the dictionary file to read from

Parameters:

file - the file to read from

getDictionaryFile
```
public java.io.File getDictionaryFile()
```
Get the dictionary file to read from

Returns:

the dictionary file to read from

setDictionaryIsBinary

@OptionMetadata(displayName="Dictionary is binary",
                description="Dictionary file contains a binary serialized dictionary",
                commandLineParamName="binary-dict",
                commandLineParamSynopsis="-binary-dict",
                commandLineParamIsFlag=true,
                displayOrder=2)
public void setDictionaryIsBinary(boolean binary)

Set whether the dictionary file contains a binary serialized dictionary, rather than a plain text one

Parameters:

binary - true if the dictionary is a binary serialized one

getDictionaryIsBinary

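public boolean getDictionaryIsBinary()

Get whether the dictionary file contains a binary serialized dictionary, rather than a plain text one

Returns:

true if the dictionary is a binary serialized one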

getOutputWordCounts
```
public boolean getOutputWordCounts()
```
Gets whether output instances contain 0 or 1 indicating word presence, or word counts.

Returns:

true if word counts should be output.

setOutputWordCounts

@OptionMetadata(displayName="Output word counts",
                description="Output word counts rather than boolean 0 or 1 (indicating presence or absence of a word",
                commandLineParamName="C",
                commandLineParamSynopsis="-C",
                commandLineParamIsFlag=true,
                displayOrder=3)
public void setOutputWordCounts(boolean outputWordCounts)

Sets whether output instances contain 0 or 1 indicating word presence, or word counts.

Parameters:

outputWordCounts - true if word counts should be output.

getAttributeIndices
```
public java.lang.String getAttributeIndices()
```
Gets the current range selection.

Returns:

a string containing a comma separated list of ranges

setAttributeIndices

@OptionMetadata(displayName="Range of attributes to operate on",
                description="Specify range of attributes to act on. This is a comma separated list of attribute\nindices, with \"first\" and \"last\" valid values.",
                commandLineParamName="R",
                commandLineParamSynopsis="-R <range>",
                displayOrder=4)
public void setAttributeIndices(java.lang.String rangeList)

Sets which attributes are to be worked on.

Parameters:

rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1, e.g. first-3,5,6-last

Throws:

java.lang.IllegalArgumentException - if an invalid range list is supplied
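To make the 1-based range syntax concrete, here is a minimal sketch that expands a list such as `first-3,5,6-last` into 0-based indices. It does not use Weka's own range handling; `RangeListSketch`, `expand`, and `toIndex` are hypothetical names, and the validation rules are assumptions.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: hypothetical helper, not the parser Weka uses internally.
class RangeListSketch {

    /** Expands a 1-based range list into sorted 0-based attribute indices. */
    static SortedSet<Integer> expand(String rangeList, int numAttributes) {
        SortedSet<Integer> indices = new TreeSet<>();
        for (String part : rangeList.split(",")) {
            String[] ends = part.trim().split("-");
            if (ends.length < 1 || ends.length > 2) {
                throw new IllegalArgumentException("Invalid range: " + part);
            }
            int from = toIndex(ends[0], numAttributes);
            int to = ends.length == 2 ? toIndex(ends[1], numAttributes) : from;
            if (from > to) {
                throw new IllegalArgumentException("Invalid range: " + part);
            }
            for (int i = from; i <= to; i++) {
                indices.add(i - 1); // user-facing indices start at 1
            }
        }
        return indices;
    }

    /** Resolves "first", "last", or a 1-based number to a 1-based index. */
    private static int toIndex(String text, int numAttributes) {
        String t = text.trim();
        if (t.equals("first")) {
            return 1;
        }
        if (t.equals("last")) {
            return numAttributes;
        }
        int value;
        try {
            value = Integer.parseInt(t);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Invalid index: " + text, e);
        }
        if (value < 1 || value > numAttributes) {
            throw new IllegalArgumentException("Index out of bounds: " + text);
        }
        return value;
    }

    public static void main(String[] args) {
        // With 7 attributes, attribute 4 is the only one left out.
        System.out.println(expand("first-3,5,6-last", 7)); // [0, 1, 2, 4, 5, 6]
    }
}
```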

getInvertSelection
```
public boolean getInvertSelection()
```
Gets whether the supplied columns are to be processed or skipped.

Returns:

true if the supplied columns will be kept

setInvertSelection

@OptionMetadata(displayName="Invert selection",
                description="Set attributes selection mode. If false, only selected attributes in the range will\nbe worked on. If true, only non-selected attributes will be processed",
                commandLineParamName="V",
                commandLineParamSynopsis="-V",
                commandLineParamIsFlag=true,
                displayOrder=5)
public void setInvertSelection(boolean invert)

Sets whether selected columns should be processed or skipped.

Parameters:

invert - the new invert setting

getAttributeNamePrefix
```
public java.lang.String getAttributeNamePrefix()
```
Get the attribute name prefix.

Returns:

The current attribute name prefix.

setAttributeNamePrefix

@OptionMetadata(displayName="Prefix for created attribute names",
                description="Specify a prefix for the created attribute names (default: \"\")",
                commandLineParamName="P",
                commandLineParamSynopsis="-P <attribute name prefix>",
                displayOrder=6)
public void setAttributeNamePrefix(java.lang.String newPrefix)

Set the attribute name prefix.

Parameters:

newPrefix - String to use as the attribute name prefix.

getTFTransform
```
public boolean getTFTransform()
```
Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.

Returns:

true if word frequencies are to be transformed.

setTFTransform

@OptionMetadata(displayName="TFT transform",
                description="Set whether the word frequencies should be transformed into\nlog(1+fij), where fij is the frequency of word i in document (instance) j.",
                commandLineParamName="T",
                commandLineParamSynopsis="-T",
                commandLineParamIsFlag=true,
                displayOrder=7)
public void setTFTransform(boolean TFTransform)

Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.

Parameters:

TFTransform - true if word frequencies are to be transformed.
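The TF transform can be sketched in a few lines. The example assumes the natural logarithm (Java's `Math.log`), which this page does not state explicitly; `TfTransformSketch` and `tf` are hypothetical names.

```java
// Illustrative only: the log(1 + fij) transform, assuming the natural logarithm.
class TfTransformSketch {

    /** Dampens a raw word count so that repeated occurrences add progressively less weight. */
    static double tf(double count) {
        return Math.log(1.0 + count);
    }

    public static void main(String[] args) {
        for (double count : new double[] {0, 1, 10, 100}) {
            System.out.println(count + " -> " + tf(count));
        }
    }
}
```

A count of 0 stays at 0, so absent words remain absent after the transform.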

getIDFTransform
```
public boolean getIDFTransform()
```
Gets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.

Returns:

true if the word frequencies are to be transformed.

setIDFTransform

@OptionMetadata(displayName="IDF transform",
                description="Set whether the word frequencies in a document should be transformed into\nfij*log(num of Docs/num of docs with word i), where fij is the frequency\nof word i in document (instance) j.",
                commandLineParamName="I",
                commandLineParamSynopsis="-I",
                commandLineParamIsFlag=true,
                displayOrder=8)
public void setIDFTransform(boolean IDFTransform)

Sets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.

Parameters:

IDFTransform - true if the word frequencies are to be transformed
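A minimal sketch of the IDF formula follows, again assuming the natural logarithm. `IdfTransformSketch`, `idf`, and the parameter names are hypothetical; in the filter itself the document counts would come from the dictionary rather than being passed in by hand.

```java
// Illustrative only: fij * log(num of Docs / num of Docs with word i), natural log assumed.
class IdfTransformSketch {

    /** Weights a word count by how rare the word is across the document collection. */
    static double idf(double count, int numDocs, int docsWithWord) {
        if (numDocs <= 0 || docsWithWord <= 0) {
            throw new IllegalArgumentException("Document counts must be positive");
        }
        return count * Math.log((double) numDocs / docsWithWord);
    }

    public static void main(String[] args) {
        // A word found in every document carries no weight; a rarer word keeps more of its count.
        System.out.println(idf(3, 100, 100));
        System.out.println(idf(3, 100, 10));
    }
}
```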

setNormalizeDocLength

@OptionMetadata(displayName="Normalize word frequencies",
                description="Whether to normalize to average length of documents seen during dictionary construction",
                commandLineParamName="N",
                commandLineParamSynopsis="-N",
                commandLineParamIsFlag=true,
                displayOrder=9)
public void setNormalizeDocLength(boolean normalize)

Sets whether the word frequencies for a document (instance) should be normalized.

Parameters:

normalize - true if word frequencies are to be normalized.
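This page does not spell out how normalization to the average document length is computed. One plausible reading, shown purely as an assumption, rescales each document vector so that its Euclidean length equals the average length; `DocLengthNormalizationSketch` and `normalize` are hypothetical names.

```java
// Illustrative only: assumes "length" means the Euclidean norm of the word-frequency vector.
class DocLengthNormalizationSketch {

    /** Rescales values so that the vector's Euclidean length equals avgDocLength. */
    static double[] normalize(double[] values, double avgDocLength) {
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += v * v;
        }
        double length = Math.sqrt(sumOfSquares);
        double[] result = values.clone();
        if (length == 0.0) {
            return result; // an all-zero document has nothing to rescale
        }
        for (int i = 0; i < result.length; i++) {
            result[i] = values[i] * avgDocLength / length;
        }
        return result;
    }

    public static void main(String[] args) {
        // A vector of length 5 stretched to an average length of 10.
        double[] scaled = normalize(new double[] {3, 4}, 10);
        System.out.println(scaled[0] + ", " + scaled[1]);
    }
}
```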

getNormalizeDocLength
```
public boolean getNormalizeDocLength()
```
Gets whether the word frequencies for a document (instance) should be normalized.

Returns:

true if word frequencies are to be normalized.

getLowerCaseTokens
```
public boolean getLowerCaseTokens()
```
Gets whether the tokens are to be downcased.

Returns:

true if the tokens are to be downcased.

setLowerCaseTokens

@OptionMetadata(displayName="Lower case tokens",
                description="Convert all tokens to lowercase when matching against dictionary entries.",
                commandLineParamName="L",
                commandLineParamSynopsis="-L",
                commandLineParamIsFlag=true,
                displayOrder=10)
public void setLowerCaseTokens(boolean downCaseTokens)

Sets whether the tokens are to be downcased. (Doesn't affect non-alphabetic characters in tokens.)

Parameters:

downCaseTokens - should be true if only lower case tokens are to be formed.
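The effect of lowercasing on dictionary lookups can be sketched as below. The helper `LowerCaseMatchSketch.matches` is hypothetical, and the use of `Locale.ROOT` is an assumption rather than something this page specifies.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: shows why a mixed-case token may miss a lower-case dictionary entry.
class LowerCaseMatchSketch {

    /** Looks a token up in the dictionary, optionally lowercasing it first. */
    static boolean matches(String token, Set<String> dictionary, boolean lowerCaseTokens) {
        String key = lowerCaseTokens ? token.toLowerCase(Locale.ROOT) : token;
        return dictionary.contains(key);
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("weka", "weka-3.8"));
        System.out.println(matches("Weka", dictionary, false));    // false
        System.out.println(matches("Weka", dictionary, true));     // true
        // Digits and punctuation are left untouched by lowercasing.
        System.out.println(matches("Weka-3.8", dictionary, true)); // true
    }
}
```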

setStemmer

@OptionMetadata(displayName="Stemmer to use",
                description="The stemming algorithm (classname plus parameters) to use.",
                commandLineParamName="stemmer",
                commandLineParamSynopsis="-stemmer <spec>",
                displayOrder=11)
public void setStemmer(Stemmer value)

Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).

Parameters:

value - the configured stemming algorithm, or null

See Also:

NullStemmer

getStemmer
```
public Stemmer getStemmer()
```
Returns the current stemming algorithm, null if none is used.

Returns:

the current stemming algorithm, null if none set

setStopwordsHandler

@OptionMetadata(displayName="Stop words handler",
                description="The stopwords handler to use (default = Null)",
                commandLineParamName="stopwords-handler",
                commandLineParamSynopsis="-stopwords-handler <spec>",
                displayOrder=12)
public void setStopwordsHandler(StopwordsHandler value)

Sets the stopwords handler to use.

Parameters:

value - the stopwords handler; if null, the Null handler is used

getStopwordsHandler
```
public StopwordsHandler getStopwordsHandler()
```
Gets the stopwords handler.

Returns:

the stopwords handler

setTokenizer

@OptionMetadata(displayName="Tokenizer",
                description="The tokenizing algorithm (classname plus parameters) to use.\n(default: weka.core.tokenizers.WordTokenizer)",
                commandLineParamName="tokenizer",
                commandLineParamSynopsis="-tokenizer <spec>",
                displayOrder=13)
public void setTokenizer(Tokenizer value)

Sets the tokenizer algorithm to use.

Parameters:

value - the configured tokenizing algorithm

getTokenizer
```
public Tokenizer getTokenizer()
```
Returns the current tokenizer algorithm.

Returns:

the current tokenizer algorithm

globalInfo
```
public java.lang.String globalInfo()
```
Description copied from class: SimpleFilter

Returns a string describing this filter.

Specified by:

globalInfo in class SimpleFilter

Returns:

a description of the filter suitable for displaying in the explorer/experimenter gui

setEnvironment
```
public void setEnvironment(Environment env)
```
Description copied from interface: EnvironmentHandler

Set environment variables to use.

Specified by:

setEnvironment in interface EnvironmentHandler

Parameters:

env - the environment variables to use

main

public static void main(java.lang.String[] args)
