Author:Jesús M. Pérez <txus.perez{[at]}ehu.eus>
Category:Preprocessing, Experimenter
Date:2025-06-02
Depends:weka (>=3.8.6)
Description:Class, based on WEKA's simplest classifier (ZeroR), for extracting the main descriptive characteristics of a dataset.
When used as a classification algorithm in WEKA's Experimenter (where it appears in the "rules" group), it reports the descriptive features of each dataset (number of classes, number of attributes...) as if they were metrics (the Comparison field) for evaluating the classifier, alongside Percent_correct, Area_under_ROC, Elapsed_Time_training and so on.
To get correct results, in the Setup tab of the Experimenter set 'Experiment Type' to "Train/Test Percentage Split (order preserved)" with a 'Train Percentage' of 100%. This ensures that measures such as Number_of_training_instances or NumMissingValuesDataset are computed on the whole dataset rather than on the train/test partitions produced by the default 'Cross-validation' option.
To obtain research-ready results, set 'Results Destination' to 'CSV file' and provide a filename. After the experiment has run, the generated CSV can be opened in spreadsheet software, with datasets in rows and their complete features (plus the ZeroR metrics) in columns, similar to the dataset description tables commonly found in machine learning publications.

List of extracted characteristics (all prefixed with “measure”, following WEKA's naming convention):
  • NumAttributes: Number of attributes of the dataset (without class)
  • NumNumericAttributes: Number of numeric attributes of the dataset (without class)
  • NumNominalAttributes: Number of nominal attributes of the dataset (without class)
  • MissingValues: Whether there are missing values in the dataset (1.0) or not (0.0) (without class)
  • NumAttsMissingValues: Number of attributes with missing values (without class)
  • NumMissingValuesDataset: Number of examples with missing values in the dataset (without class)
  • PercentMissingValuesDataset: Percentage of examples with missing values in the dataset (without class)
  • NumClasses: Number of classes in the dataset
  • EmptyClass: Whether there are any empty classes (1.0) or not (0.0)
  • NumFirstClass: Number of examples of the first class (considered by WEKA as positive by default)
  • MinClassIndex: Index of the minority class (discarding empty classes)
  • NumMinClass: Number of examples of the minority class (discarding empty classes)
  • PercentMinClass: Percentage of examples of the minority class (discarding empty classes)
  • NumMajClass: Number of examples of the majority class
  • PercentMajClass: Percentage of examples of the majority class
  • ImbalancedRatio: Imbalanced Ratio (IR)
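
The class-related measures in the list above can be illustrated with a small standalone sketch. It is not the package's code: the class name ClassDistributionSketch and its methods are hypothetical, the input is a plain array of per-class example counts rather than a WEKA dataset, and the Imbalanced Ratio is assumed here to be the majority count divided by the minority count (a common definition that this package may or may not follow).

```java
import java.util.Arrays;

/** Hypothetical helper (not part of the package): class-distribution measures from per-class counts. */
final class ClassDistributionSketch {

    /** Index of the smallest non-empty class, or -1 if every class is empty. Ties go to the lowest index. */
    static int minClassIndex(int[] counts) {
        int best = -1;
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0 && (best < 0 || counts[i] < counts[best])) {
                best = i;
            }
        }
        return best;
    }

    /** Number of examples in the largest class. */
    static int majClassCount(int[] counts) {
        int max = 0;
        for (int c : counts) {
            max = Math.max(max, c);
        }
        return max;
    }

    /** True if at least one declared class has no examples. */
    static boolean hasEmptyClass(int[] counts) {
        for (int c : counts) {
            if (c == 0) {
                return true;
            }
        }
        return false;
    }

    /** Percentage of 'part' within 'total'; 0.0 when total is 0. */
    static double percent(int part, int total) {
        return total == 0 ? 0.0 : 100.0 * part / total;
    }

    /** Assumed definition: majority count / minority count, with empty classes discarded. */
    static double imbalanceRatio(int[] counts) {
        int min = minClassIndex(counts);
        return min < 0 ? Double.NaN : (double) majClassCount(counts) / counts[min];
    }

    public static void main(String[] args) {
        int[] counts = {50, 10, 0, 40}; // made-up class distribution with one empty class
        int total = Arrays.stream(counts).sum();
        int min = minClassIndex(counts);
        System.out.println("NumClasses = " + counts.length);
        System.out.println("EmptyClass = " + (hasEmptyClass(counts) ? 1.0 : 0.0));
        System.out.println("NumFirstClass = " + counts[0]);
        System.out.println("MinClassIndex = " + min);
        System.out.println("NumMinClass = " + counts[min]);
        System.out.println("PercentMinClass = " + percent(counts[min], total));
        System.out.println("NumMajClass = " + majClassCount(counts));
        System.out.println("PercentMajClass = " + percent(majClassCount(counts), total));
        System.out.println("ImbalancedRatio = " + imbalanceRatio(counts));
    }
}
```

The empty class is skipped when searching for the minority class, which mirrors the "discarding empty classes" note on the minority-class measures; without that rule the ratio would divide by zero for any dataset that declares an unused class label.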
This class was used in the following paper, which reports extensive experiments on 96 different datasets:
Jesús M. Pérez and Olatz Arbelaitz.
"Multi-Criteria Node Selection in Direct PCTBagging: Balancing Interpretability and Accuracy with Bootstrap Sampling and Unrestricted Pruning". Information Sciences (2025), submitted.
doi:10.1016/j.ins.2025.XX.XXX
Enhances:
License:GPL 3.0
Maintainer:Jesús M. Pérez <txus.perez{[at]}ehu.eus>
PackageURL:http://www.aldapa.eus/res/weka-dataextractor/DatasetCharacteristicsExtractor-v1.0.zip
Related:ZeroR
URL:http://www.aldapa.eus/res/weka-dataextractor
Version:1.0