public class CSVToARFFHeaderMapTask
extends java.lang.Object
implements weka.core.OptionHandler, java.io.Serializable
Modifier and Type | Class and Description |
---|---|
static class | CSVToARFFHeaderMapTask.HeaderAndQuantileDataHolder Container class for an Instances header with basic summary stats and a map of TDigest quantile estimators for numeric attributes |
Modifier and Type | Field and Description |
---|---|
static java.lang.String | ARFF_SUMMARY_ATTRIBUTE_PREFIX Attribute name prefix for a summary statistics attribute |
static int | MAX_PARSING_ERRORS |
Constructor and Description |
---|
CSVToARFFHeaderMapTask() Constructor |
CSVToARFFHeaderMapTask(boolean suppressQuantileOptions) Constructor |
CSVToARFFHeaderMapTask(boolean suppressQuantileOptions, boolean suppressCSVParsingOptions) Constructor |
Modifier and Type | Method and Description |
---|---|
static CSVToARFFHeaderMapTask | combine(java.util.List<CSVToARFFHeaderMapTask> tasks) Performs a "combine" operation using the supplied partial CSVToARFFHeaderMapTask tasks. |
java.lang.String | compressionLevelForQuartileEstimationTipText() Returns the tip text for this property. |
java.lang.String | computeQuartilesAsPartOfSummaryStatsTipText() Returns the tip text for this property. |
java.lang.String | dateAttributesTipText() Returns the tip text for this property. |
java.lang.String | dateFormatTipText() Returns the tip text for this property. |
void | deSerializeAllQuantileEstimators() Deserialize all TDigest quantile estimators in use |
java.lang.String | enclosureCharactersTipText() Returns the tip text for this property. |
java.lang.String | fieldSeparatorTipText() Returns the tip text for this property. |
void | fromHeader(weka.core.Instances headerWithSummary, java.util.Map<java.lang.String,TDigest> quantileEstimators) Initialize internal state using the supplied ARFF header with summary attributes. |
void | generateNames(int numAtts) Generate attribute names. |
void | generateNames(int initial, int numAtts) Generate attribute names. |
double | getCompressionLevelForQuartileEstimation() Get the compression level to use in the TDigest quantile estimators |
boolean | getComputeQuartilesAsPartOfSummaryStats() Get whether to include estimated quartiles in the profiling stats |
java.lang.String | getDateAttributes() Returns the current attribute range to be forced to type date. |
java.lang.String | getDateFormat() Get the format to use for parsing date values. |
java.lang.String | getDefaultValue(int attIndex) Get the default label for a given attribute. |
java.lang.String | getEnclosureCharacters() Get the character(s) to use/recognize as string enclosures |
java.lang.String | getFieldSeparator() Returns the character used as column separator. |
weka.core.Instances | getHeader() Get the header information (as an Instances object) from what has been seen so far by this map task |
weka.core.Instances | getHeader(int numFields, java.util.List<java.lang.String> attNames) Get a header constructed using the supplied attribute names. |
CSVToARFFHeaderMapTask.HeaderAndQuantileDataHolder | getHeaderAndQuantileEstimators() Get the header information and the encoded quantile estimators |
java.lang.String | getMissingValue() Returns the current placeholder for missing values. |
java.lang.String | getNominalAttributes() Returns the current attribute range to be forced to type nominal. |
java.lang.Object[] | getNominalDefaultLabelSpecs() Get the default label specifications for nominal attributes |
java.lang.Object[] | getNominalLabelSpecs() Get label specifications for nominal attributes. |
int | getNumDecimalPlaces() Get the number of decimal places for outputting summary stats |
java.lang.String[] | getOptions() |
java.lang.String | getStringAttributes() Returns the current attribute range to be forced to type string. |
boolean | getTreatUnparsableNumericValuesAsMissing() Get whether to treat unparsable values as missing in columns that have so far been assumed to be numeric. |
boolean | getTreatZerosAsMissing() Get whether to treat zeros as missing values for numeric attributes when computing summary statistics. |
boolean | headerAvailableImmediately(int numFields, java.util.List<java.lang.String> attNames, java.lang.StringBuffer problems) Check if the header can be produced immediately without having to do a pre-processing pass to determine and unify nominal attribute values. |
void | initParserOnly(java.util.List<java.lang.String> attNames) Initialize only as much state as is needed to parse rows and construct instances |
static java.util.List<java.lang.String> | instanceHeaderToAttributeNameList(weka.core.Instances header) |
java.util.Enumeration<weka.core.Option> | listOptions() |
static void | main(java.lang.String[] args) |
weka.core.Instance | makeInstance(weka.core.Instances trainingHeader, boolean setStringValues, java.lang.String[] parsed) Utility method for constructing a dense instance given an array of parsed CSV values |
weka.core.Instance | makeInstance(weka.core.Instances trainingHeader, boolean setStringValues, java.lang.String[] parsed, boolean sparse) Utility method for constructing an instance given an array of parsed CSV values |
weka.core.Instance | makeInstanceFromObjectRow(weka.core.Instances trainingHeader, boolean setStringValues, java.lang.Object[] row, boolean sparse) Utility method for constructing an instance given an array of Objects |
java.lang.String | missingValueTipText() Returns the tip text for this property. |
java.lang.String | nominalAttributesTipText() Returns the tip text for this property. |
java.lang.String | nominalDefaultLabelSpecsTipText() Returns the tip text for this property. |
java.lang.String | nominalLabelSpecsTipText() Returns the tip text for this property. |
java.lang.String[] | parseRowOnly(java.lang.String row) Just parse a row. |
void | processRow(java.lang.String row, java.util.List<java.lang.String> attNames) Process a row of data coming into the map. |
void | processRowValues(java.lang.Object[] fieldVals, java.util.List<java.lang.String> attNames) Process a tokenized row of values. |
void | serializeAllQuantileEstimators() Serialize all TDigest quantile estimators in use |
void | setCompressionLevelForQuartileEstimation(double compression) Set the compression level to use in the TDigest quantile estimators |
void | setComputeQuartilesAsPartOfSummaryStats(boolean c) Set whether to include estimated quartiles in the profiling stats |
void | setDateAttributes(java.lang.String value) Set the attribute range to be forced to type date. |
void | setDateFormat(java.lang.String value) Set the format to use for parsing date values. |
void | setEnclosureCharacters(java.lang.String enclosure) Set the character(s) to use/recognize as string enclosures |
void | setFieldSeparator(java.lang.String value) Sets the character used as column separator. |
void | setMissingValue(java.lang.String value) Sets the placeholder for missing values. |
void | setNominalAttributes(java.lang.String value) Sets the attribute range to be forced to type nominal. |
void | setNominalDefaultLabelSpecs(java.lang.Object[] specs) Set the default label specifications for nominal attributes |
void | setNominalLabelSpecs(java.lang.Object[] specs) Set label specifications for nominal attributes. |
void | setNumDecimalPlaces(int numDecimalPlaces) Set the number of decimal places for outputting summary stats |
void | setOptions(java.lang.String[] options) |
void | setStringAttributes(java.lang.String value) Sets the attribute range to be forced to type string. |
void | setTreatUnparsableNumericValuesAsMissing(boolean unparsableNumericValuesToMissing) Set whether to treat unparsable values as missing in columns that have so far been assumed to be numeric. |
void | setTreatZerosAsMissing(boolean t) Set whether to treat zeros as missing values for numeric attributes when computing summary statistics. |
java.lang.String | stringAttributesTipText() Returns the tip text for this property. |
static void | updateSummaryStats(java.util.Map<java.lang.String,Stats> summaryStats, java.util.Map<java.lang.String,StringStats> backupStringStats, java.lang.String attName, double value, java.lang.String nominalLabel, boolean isNominal, boolean isString, boolean treatZeroAsMissing, boolean estimateQuantiles, double quantileCompression) Update the summary statistics for a given attribute with the given value |
public static final java.lang.String ARFF_SUMMARY_ATTRIBUTE_PREFIX
public static final int MAX_PARSING_ERRORS
public CSVToARFFHeaderMapTask()
public CSVToARFFHeaderMapTask(boolean suppressQuantileOptions)
suppressQuantileOptions
- true if command line options relating to quantile estimation are to be suppressed
public CSVToARFFHeaderMapTask(boolean suppressQuantileOptions, boolean suppressCSVParsingOptions)
suppressQuantileOptions
- true if command line options relating to quantile estimation are to be suppressed
suppressCSVParsingOptions
- true if command line options relating to CSV parsing are to be suppressed
public static void updateSummaryStats(java.util.Map<java.lang.String,Stats> summaryStats, java.util.Map<java.lang.String,StringStats> backupStringStats, java.lang.String attName, double value, java.lang.String nominalLabel, boolean isNominal, boolean isString, boolean treatZeroAsMissing, boolean estimateQuantiles, double quantileCompression)
summaryStats
- the map of summary statistics
backupStringStats
- the temporary map of backup string stats kept for numeric fields (this can be null in cases where we are sure that there is no chance of unparsable numeric values occurring)
attName
- the name of the attribute being updated
value
- the value to update with (if the attribute is numeric)
nominalLabel
- holds the label/string for the attribute (if it is nominal or string)
isNominal
- true if the attribute is nominal
isString
- true if the attribute is a string attribute
treatZeroAsMissing
- true if zero is to be treated as a missing value for numeric attributes
estimateQuantiles
- true if we should estimate quantiles too
quantileCompression
- the compression level to use in the TDigest estimators
public static java.util.List<java.lang.String> instanceHeaderToAttributeNameList(weka.core.Instances header)
public static void main(java.lang.String[] args)
public static CSVToARFFHeaderMapTask combine(java.util.List<CSVToARFFHeaderMapTask> tasks) throws DistributedWekaException
tasks
- a list of CSVToARFFHeaderMapTasks to "combine"
DistributedWekaException
- if a problem occurs
public java.util.Enumeration<weka.core.Option> listOptions()
listOptions
in interface weka.core.OptionHandler
public java.lang.String[] getOptions()
getOptions
in interface weka.core.OptionHandler
public void setOptions(java.lang.String[] options) throws java.lang.Exception
setOptions
in interface weka.core.OptionHandler
java.lang.Exception
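The "combine" operation documented for combine(java.util.List<CSVToARFFHeaderMapTask> tasks) merges partial results from several map tasks into one. How this class merges its internal state is not described on this page, so the following is only a minimal sketch of the general idea for numeric summary statistics. PartialStats and all of its members are hypothetical names and are not part of distributed Weka.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for per-attribute numeric summary statistics. */
final class PartialStats {
  long count;
  double sum;
  double sumSq;
  double min = Double.POSITIVE_INFINITY;
  double max = Double.NEGATIVE_INFINITY;

  /** Fold one observed value into this partial result. */
  void add(double v) {
    count++;
    sum += v;
    sumSq += v * v;
    min = Math.min(min, v);
    max = Math.max(max, v);
  }

  /** Mean of the values seen so far, or NaN if there are none. */
  double mean() {
    return count == 0 ? Double.NaN : sum / count;
  }

  /** Merge partial results; each field combines independently of order. */
  static PartialStats combine(List<PartialStats> parts) {
    PartialStats out = new PartialStats();
    for (PartialStats p : parts) {
      out.count += p.count;
      out.sum += p.sum;
      out.sumSq += p.sumSq;
      out.min = Math.min(out.min, p.min);
      out.max = Math.max(out.max, p.max);
    }
    return out;
  }

  public static void main(String[] args) {
    PartialStats a = new PartialStats();
    a.add(1.0);
    a.add(2.0);
    PartialStats b = new PartialStats();
    b.add(6.0);
    PartialStats all = combine(Arrays.asList(a, b));
    System.out.println(all.count + " values, mean " + all.mean());
  }
}
```

Counts, sums, minima and maxima can be merged exactly in this way. Quantile estimators cannot be merged by simple addition, which is presumably why the class keeps TDigest estimators alongside the header.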
public void setNumDecimalPlaces(int numDecimalPlaces)
numDecimalPlaces
- number of decimal places to use
public int getNumDecimalPlaces()
public void setTreatUnparsableNumericValuesAsMissing(boolean unparsableNumericValuesToMissing)
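unparsableNumericValuesToMissing
- true if unparsable values in columns that have so far been assumed to be numeric are to be treated as missing
public boolean getTreatUnparsableNumericValuesAsMissing()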
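As an illustration of the treatUnparsableNumericValuesAsMissing setting, here is a minimal sketch of how a field in a numeric column might be handled. NumericFieldParser and parseNumericOrMissing are hypothetical names, and representing a missing value as NaN follows Weka's general convention rather than anything stated on this page about this class's internals.

```java
/** Hypothetical helper showing one way to handle unparsable numeric fields. */
final class NumericFieldParser {
  /** Marker for a missing numeric value in this sketch. */
  static final double MISSING = Double.NaN;

  /**
   * Parse a field from a column assumed to be numeric.
   *
   * @param field the raw field text
   * @param missingPlaceholder text that denotes a missing value, e.g. "?"
   * @param unparsableToMissing true to map unparsable text to MISSING
   * @return the parsed value, or MISSING
   */
  static double parseNumericOrMissing(String field, String missingPlaceholder,
      boolean unparsableToMissing) {
    String trimmed = field.trim();
    if (trimmed.isEmpty() || trimmed.equals(missingPlaceholder)) {
      return MISSING;
    }
    try {
      return Double.parseDouble(trimmed);
    } catch (NumberFormatException e) {
      if (unparsableToMissing) {
        return MISSING; // swallow the error and record a missing value
      }
      throw e; // the caller has to treat the column as non-numeric
    }
  }
}
```

With the flag off, the exception signals that the column cannot stay numeric; with it on, stray text such as "n/a" simply becomes a missing value.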
public boolean getTreatZerosAsMissing()
public void setTreatZerosAsMissing(boolean t)
t
- true if zeros are to be treated as missing values for the purposes of computing summary stats.
public double getCompressionLevelForQuartileEstimation()
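The effect of treatZerosAsMissing on summary statistics can be sketched as follows. ZeroAwareStats is a hypothetical class, not the Stats type used by updateSummaryStats, and it tracks only a count, a missing count and a sum.

```java
/** Hypothetical accumulator showing the effect of treating zeros as missing. */
final class ZeroAwareStats {
  long count;   // number of values that contributed to the statistics
  long missing; // number of values counted as missing
  double sum;

  /** Update with one numeric value; NaN always counts as missing. */
  void update(double value, boolean treatZeroAsMissing) {
    if (Double.isNaN(value) || (treatZeroAsMissing && value == 0.0)) {
      missing++;
      return;
    }
    count++;
    sum += value;
  }

  /** Mean of the non-missing values, or NaN if there are none. */
  double mean() {
    return count == 0 ? Double.NaN : sum / count;
  }
}
```

For the values 0, 4 and 8, the mean is 4.0 when zeros are ordinary values and 6.0 when they are treated as missing, so the setting matters for sparse data in which zero means "not recorded".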
public void setCompressionLevelForQuartileEstimation(double compression)
compression
- the compression level (smaller values give higher compression and less accurate estimates).
public java.lang.String compressionLevelForQuartileEstimationTipText()
public boolean getComputeQuartilesAsPartOfSummaryStats()
public void setComputeQuartilesAsPartOfSummaryStats(boolean c)
c
- true if quartiles are to be estimated
public java.lang.String computeQuartilesAsPartOfSummaryStatsTipText()
public java.lang.String getMissingValue()
public void setMissingValue(java.lang.String value)
value
- the placeholder
public java.lang.String missingValueTipText()
public java.lang.String getStringAttributes()
public void setStringAttributes(java.lang.String value)
value
- the range
public java.lang.String stringAttributesTipText()
public java.lang.String getNominalAttributes()
public void setNominalAttributes(java.lang.String value)
value
- the range
public java.lang.String nominalAttributesTipText()
public java.lang.String getDateFormat()
public void setDateFormat(java.lang.String value)
value
- the format to use.
public java.lang.String dateFormatTipText()
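This page does not say what syntax the date format string uses. Assuming it is a java.text.SimpleDateFormat pattern, as is usual for Weka date attributes, parsing a date field might look like the sketch below. DateFieldParser is a hypothetical helper, and the choice of UTC and strict (non-lenient) parsing is an assumption made here for reproducibility.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper that parses a date field with a SimpleDateFormat pattern. */
final class DateFieldParser {
  /** Parse a date field to milliseconds since the epoch, interpreting it in UTC. */
  static long parseToMillis(String field, String pattern) throws ParseException {
    SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ROOT);
    fmt.setLenient(false); // reject impossible dates such as 2020-02-31
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    Date parsed = fmt.parse(field);
    return parsed.getTime();
  }
}
```

A call such as DateFieldParser.parseToMillis("1970-01-02", "yyyy-MM-dd") yields the number of milliseconds in one day, and a value that does not match the pattern raises a ParseException.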
public java.lang.String getDateAttributes()
public void setDateAttributes(java.lang.String value)
value
- the range
public java.lang.String dateAttributesTipText()
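The string, nominal and date attribute setters each take a "range". Assuming this is a Weka-style range string of 1-based indices, such as first-3,5,last, the sketch below shows how one could be expanded into 0-based attribute indices. RangeSketch is a hypothetical class written for this illustration; it does not use weka.core.Range and does not handle attribute names.

```java
import java.util.TreeSet;

/** Hypothetical expansion of a 1-based range string into 0-based indices. */
final class RangeSketch {
  /** Resolve one token ("first", "last" or a 1-based number) to a 0-based index. */
  private static int resolve(String token, int numAtts) {
    String t = token.trim();
    int oneBased;
    if (t.equalsIgnoreCase("first")) {
      oneBased = 1;
    } else if (t.equalsIgnoreCase("last")) {
      oneBased = numAtts;
    } else {
      oneBased = Integer.parseInt(t);
    }
    if (oneBased < 1 || oneBased > numAtts) {
      throw new IllegalArgumentException("Index out of range: " + token);
    }
    return oneBased - 1;
  }

  /** Expand a range string into sorted, de-duplicated 0-based indices. */
  static int[] toIndices(String range, int numAtts) {
    TreeSet<Integer> selected = new TreeSet<>();
    for (String part : range.split(",")) {
      if (part.trim().isEmpty()) {
        continue; // tolerate empty entries such as a trailing comma
      }
      String[] ends = part.split("-");
      if (ends.length > 2) {
        throw new IllegalArgumentException("Bad range: " + part);
      }
      int lo = resolve(ends[0], numAtts);
      int hi = ends.length == 2 ? resolve(ends[1], numAtts) : lo;
      if (lo > hi) {
        throw new IllegalArgumentException("Bad range: " + part);
      }
      for (int i = lo; i <= hi; i++) {
        selected.add(i);
      }
    }
    int[] out = new int[selected.size()];
    int k = 0;
    for (int i : selected) {
      out[k++] = i;
    }
    return out;
  }
}
```

For six attributes, "first-3,5,last" selects the 0-based indices 0, 1, 2, 4 and 5.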
public java.lang.String enclosureCharactersTipText()
public java.lang.String getEnclosureCharacters()
public void setEnclosureCharacters(java.lang.String enclosure)
enclosure
- the characters to use as string enclosures
public java.lang.String getFieldSeparator()
public void setFieldSeparator(java.lang.String value)
value
- the character to use
public java.lang.String fieldSeparatorTipText()
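To show how a field separator and a set of enclosure characters interact when a row is split, here is a minimal sketch. RowParserSketch is a hypothetical class and not the parser used by parseRowOnly. It assumes a single-character separator, and it does not handle escaped enclosure characters or fields that span several lines.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical row splitter honouring a separator and enclosure characters. */
final class RowParserSketch {
  /**
   * Split a row on the separator, ignoring separators that appear between a
   * pair of matching enclosure characters. Enclosure characters are dropped.
   */
  static String[] parseRow(String row, char separator, String enclosures) {
    List<String> fields = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    char open = 0; // the enclosure that is currently open, or 0 if none
    for (int i = 0; i < row.length(); i++) {
      char c = row.charAt(i);
      if (open != 0) {
        if (c == open) {
          open = 0; // closing enclosure
        } else {
          current.append(c); // separators inside an enclosure are kept
        }
      } else if (enclosures.indexOf(c) >= 0) {
        open = c; // opening enclosure
      } else if (c == separator) {
        fields.add(current.toString());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    fields.add(current.toString()); // last field has no trailing separator
    return fields.toArray(new String[0]);
  }
}
```

With a comma separator and both quote characters as enclosures, a quoted field that contains a comma stays in one piece, while an empty field between two separators is returned as an empty string.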
public java.lang.String nominalDefaultLabelSpecsTipText()
public java.lang.Object[] getNominalDefaultLabelSpecs()
public void setNominalDefaultLabelSpecs(java.lang.Object[] specs)
specs
- an array of default label specifications
public java.lang.String nominalLabelSpecsTipText()
public java.lang.Object[] getNominalLabelSpecs()
public void setNominalLabelSpecs(java.lang.Object[] specs)
specs
- an array of label specifications
public void generateNames(int initial, int numAtts)
initial
- the number to use for the first attribute
numAtts
- the number of attributes to generate
public void generateNames(int numAtts)
numAtts
- the number of attribute names to generate
public void initParserOnly(java.util.List<java.lang.String> attNames)
attNames
- the names of the attributes to use
public java.lang.String[] parseRowOnly(java.lang.String row) throws java.io.IOException
row
- the row to parse
java.io.IOException
- if a problem occurs
public void processRowValues(java.lang.Object[] fieldVals, java.util.List<java.lang.String> attNames) throws DistributedWekaException, java.io.IOException
fieldVals
- the row values to process
attNames
- the names of the attributes (fields)
An exception is thrown if the number of fields in the current row does not match the number of attribute names.
DistributedWekaException
java.io.IOException
public void processRow(java.lang.String row, java.util.List<java.lang.String> attNames) throws DistributedWekaException, java.io.IOException
row
- the row to process
attNames
- the names of the attributes (fields)
An exception is thrown if the number of fields in the current row does not match the number of attribute names.
DistributedWekaException
java.io.IOException
public weka.core.Instances getHeader()
public CSVToARFFHeaderMapTask.HeaderAndQuantileDataHolder getHeaderAndQuantileEstimators() throws DistributedWekaException
DistributedWekaException
- if we are not computing summary statistics or we are computing statistics but not quantiles
public void serializeAllQuantileEstimators()
public void deSerializeAllQuantileEstimators()
public boolean headerAvailableImmediately(int numFields, java.util.List<java.lang.String> attNames, java.lang.StringBuffer problems)
numFields
- number of fields in the data
attNames
- the names of the attributes (in order)
problems
- a StringBuffer to hold problem descriptions (if any)
public weka.core.Instances getHeader(int numFields, java.util.List<java.lang.String> attNames) throws DistributedWekaException
numFields
- the number of attributes in the data
attNames
- the attribute names to use. May be null, in which case names are generated
DistributedWekaException
- if nominal attributes have been specified but there are one or more that have no user-supplied label specifications
public void fromHeader(weka.core.Instances headerWithSummary, java.util.Map<java.lang.String,TDigest> quantileEstimators) throws DistributedWekaException
headerWithSummary
- the ARFF header (with summary attributes) to initialize with
quantileEstimators
- a map (keyed by attribute name) of TDigest estimators for numeric attributes (can be null if quantiles are not being estimated)
DistributedWekaException
- if a problem occurs
public weka.core.Instance makeInstance(weka.core.Instances trainingHeader, boolean setStringValues, java.lang.String[] parsed) throws java.lang.Exception
trainingHeader
- the header to associate the instance with. Does not add the new instance to this data set; just gives the instance a reference to the header
setStringValues
- true if any string values should be set in the header as opposed to being added to the header (i.e. accumulating in the header).
parsed
- the array of parsed CSV values
java.lang.Exception
- if a problem occurs
public weka.core.Instance makeInstance(weka.core.Instances trainingHeader, boolean setStringValues, java.lang.String[] parsed, boolean sparse) throws java.lang.Exception
trainingHeader
- the header to associate the instance with. Does not add the new instance to this data set; just gives the instance a reference to the header
setStringValues
- true if any string values should be set in the header as opposed to being added to the header (i.e. accumulating in the header).
parsed
- the array of parsed CSV values
sparse
- true if the new instance is to be a sparse instance
java.lang.Exception
- if a problem occurs
public weka.core.Instance makeInstanceFromObjectRow(weka.core.Instances trainingHeader, boolean setStringValues, java.lang.Object[] row, boolean sparse) throws java.lang.Exception
trainingHeader
- the header to associate the instance with. Does not add the new instance to this data set; just gives the instance a reference to the header
setStringValues
- true if any string values should be set in the header as opposed to being added to the header (i.e. accumulating in the header).
row
- the array of Object values
sparse
- true if the new instance is to be a sparse instance
java.lang.Exception
- if a problem occurs
public java.lang.String getDefaultValue(int attIndex)
attIndex
- the index (0-based) of the attribute to get the default
value for