public abstract class SparkJob extends distributed.core.DistributedJob implements OptionHandler
Modifier and Type | Class and Description |
---|---|
static class | SparkJob.NoKeyTextOutputFormat<K,V> - Subclass of TextOutputFormat that does not write the key. |
Modifier and Type | Field and Description |
---|---|
static java.lang.String | TEST_DATA - The key for a test RDD |
static java.lang.String | TRAINING_DATA - The key for a training RDD |
Constructor and Description |
---|
SparkJob(java.lang.String jobName, java.lang.String jobDescription) - Constructor. |
Modifier and Type | Method and Description |
---|---|
static java.lang.String | addSubdirToPath(java.lang.String parent, java.lang.String subdirName) - Adds a subdirectory to a parent path. |
static boolean | checkFileExists(java.lang.String file) - Check that the named file exists on either the local file system or HDFS. |
org.apache.spark.api.java.JavaSparkContext | createSparkContextForJob(SparkJobConfig conf) - Create a SparkContext for this job. |
java.lang.String | debugTipText() - Tip text for this property. |
static void | deleteDirectory(java.lang.String path) - Delete a directory (and all contents). |
java.lang.String[] | getBaseOptionsOnly() - Return the base options only (not the subclass's options or the options specific to the configuration). |
CachingStrategy | getCachingStrategy() - Get the caching strategy to use for this job. |
Dataset<?> | getDataset(java.lang.String key) - Return a named dataset, or null if the name is unknown. |
java.util.Iterator<java.util.Map.Entry<java.lang.String,Dataset<?>>> | getDatasets() - Return an iterator over the named datasets for this job. |
boolean | getDebug() - Get whether to output debug info. |
static org.apache.hadoop.conf.Configuration | getFSConfigurationForPath(java.lang.String path, java.lang.String[] pathOnly) - Returns a Configuration object configured with the name node and port present in the supplied path (hdfs://host:port/path). |
java.lang.String[] | getOptions() |
static long | getSizeInBytesOfPath(java.lang.String path) - Get the size in bytes of a file/directory. |
org.apache.spark.api.java.JavaSparkContext | getSparkContext() - Get the current Spark context in use by this (and potentially other) jobs. |
SparkJobConfig | getSparkJobConfig() - Get the SparkJobConfig object for this job. |
org.apache.log4j.WriterAppender | initJob(org.apache.spark.api.java.JavaSparkContext context) - Initialize logging and set or create a context to use. |
org.apache.log4j.WriterAppender | initSparkLogAppender() - Initialize and return an appender for hooking into Spark's log4j logging and directing it to Weka's log. |
java.util.Enumeration<Option> | listOptions() |
org.apache.spark.api.java.JavaRDD<Instance> | loadCSVFile(java.lang.String path, Instances headerNoSummary, java.lang.String csvParseOptions, org.apache.spark.api.java.JavaSparkContext sparkContext, CachingStrategy strategy, int minPartitions, boolean enforceMaxPartitions) - Load a file/directory containing instances in CSV format. |
org.apache.spark.api.java.JavaRDD<Instance> | loadInput(java.lang.String inputPath, Instances headerNoSummary, java.lang.String csvParseOptions, org.apache.spark.api.java.JavaSparkContext sparkContext, CachingStrategy strategy, int minPartitions, boolean enforceMaxPartitions) - Load an input file/directory. |
org.apache.spark.api.java.JavaRDD<Instance> | loadInstanceObjectFile(java.lang.String path, org.apache.spark.api.java.JavaSparkContext sparkContext, CachingStrategy strategy, int minPartitions, boolean enforceMaxPartitions) - Load a file/directory of serialized instances (as stored in Spark object file format). |
static java.io.InputStream | openFileForRead(java.lang.String file) - Opens the named file for reading on either the local file system or HDFS. |
static java.io.OutputStream | openFileForWrite(java.lang.String file) - Open the named file for writing to on either the local file system or HDFS. |
static java.io.PrintWriter | openTextFileForWrite(java.lang.String file) - Open the named file as a text file for writing to on either the local file system or any other protocol-specific file system supported by Hadoop. |
void | removeSparkLogAppender(org.apache.log4j.WriterAppender appender) - Remove the supplied appender from Spark's logging. |
static java.lang.String | resolveLocalOrOtherFileSystemPath(java.lang.String original) - Takes an input path and returns a fully qualified absolute one. |
boolean | runJob() |
abstract boolean | runJobWithContext(org.apache.spark.api.java.JavaSparkContext sparkContext) - Clients (subclasses) implement this to do the actual work of the job. |
void | setCachingStrategy(CachingStrategy cs) - Set the caching strategy to use for this job. |
void | setDataset(java.lang.String key, Dataset dataset) - Set a dataset for this job to potentially make use of. |
void | setDebug(boolean d) - Set whether to output debug info. |
void | setOptions(java.lang.String[] options) |
void | shutdownJob(org.apache.log4j.WriterAppender logAppender) - Shuts down the context in use by this job and removes the supplied log appender object (if any) from the Spark logger. |
org.apache.spark.api.java.JavaRDD<Instance> | stringRDDToInstanceRDD(org.apache.spark.api.java.JavaRDD<java.lang.String> input, Instances headerNoSummary, java.lang.String csvParseOptions, CachingStrategy strategy, boolean enforceMaxPartitions) - Process an RDD<String> into an RDD<Instance>. |
Methods inherited from class distributed.core.DistributedJob: environmentSubstitute, getAdditionalWekaPackageNames, getJobName, getJobStatus, getLog, logMessage, logMessage, logMessage, makeOptionsStr, objectRowToInstance, parseInstance, postExecution, preExecution, run, setEnvironment, setJobDescription, setJobName, setJobStatus, setLog, setStatusMessagePrefix, stackTraceToString, statusMessage, stopJob
public static final java.lang.String TRAINING_DATA

The key for a training RDD.

public static final java.lang.String TEST_DATA

The key for a test RDD.
public SparkJob(java.lang.String jobName, java.lang.String jobDescription)

Constructor.

Parameters:
jobName - the name of this job
jobDescription - the description of this job
public static java.lang.String addSubdirToPath(java.lang.String parent, java.lang.String subdirName)

Adds a subdirectory to a parent path.

Parameters:
parent - the parent path (may include the hdfs://host:port part)
subdirName - the name of the subdirectory to add

public static org.apache.hadoop.conf.Configuration getFSConfigurationForPath(java.lang.String path, java.lang.String[] pathOnly)

Returns a Configuration object configured with the name node and port present in the supplied path (hdfs://host:port/path).

Parameters:
path - the URI or local path from which to configure
pathOnly - will hold the path-only part of the URI
public static java.lang.String resolveLocalOrOtherFileSystemPath(java.lang.String original) throws java.io.IOException

Takes an input path and returns a fully qualified absolute one.

Parameters:
original - the original path (either relative or absolute) on a file system

Throws:
java.io.IOException - if a problem occurs

public static void deleteDirectory(java.lang.String path) throws java.io.IOException

Delete a directory (and all contents).

Parameters:
path - the path to the directory to delete

Throws:
java.io.IOException - if the path is not a directory or a problem occurs
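The static path helpers above can be chained when preparing job output locations. The following is a minimal sketch; the weka.distributed.spark package name and the example paths are assumptions made for illustration, not something stated on this page.

```java
// Hedged sketch: package name and paths are assumptions.
import weka.distributed.spark.SparkJob;

public class PathHelperSketch {
  public static void main(String[] args) throws Exception {
    // Build an output location under a (hypothetical) HDFS parent directory
    String parent = "hdfs://namenode:8020/user/weka/experiment1";
    String outputDir = SparkJob.addSubdirToPath(parent, "model-output");

    // Resolve a relative local path to a fully qualified absolute one
    String resolved = SparkJob.resolveLocalOrOtherFileSystemPath("results/output.arff");
    System.out.println("Resolved path: " + resolved);

    // Remove any previous output before a job writes new results
    if (SparkJob.checkFileExists(outputDir)) {
      SparkJob.deleteDirectory(outputDir);
    }
  }
}
```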
"hdfs://host:port/<path>"
file
- the file to open for reading on either the local or HDFS file
systemjava.io.IOException
- if a problem occurspublic static java.io.OutputStream openFileForWrite(java.lang.String file) throws java.io.IOException
"hdfs://host:port/<path>"
. Note
that, on the local file system, the directory path must exist. Under HDFS,
the path is created automatically.file
- the file to write tojava.io.IOException
- if a problem occurspublic static java.io.PrintWriter openTextFileForWrite(java.lang.String file) throws java.io.IOException
protocol://host:port/<path>
."
Note that, on the local file system, the directory path must exist.file
- the file to write tojava.io.IOException
- if a problem occurspublic static boolean checkFileExists(java.lang.String file) throws java.io.IOException
public static boolean checkFileExists(java.lang.String file) throws java.io.IOException

Check that the named file exists on either the local file system or HDFS.

Parameters:
file - the file to check

Throws:
java.io.IOException - if a problem occurs

public static long getSizeInBytesOfPath(java.lang.String path) throws java.io.IOException

Get the size in bytes of a file/directory.

Parameters:
path - the path to the file/directory

Throws:
java.io.IOException - if a problem occurs
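Taken together, the static file helpers let code read and write data without caring whether a path is local or on HDFS. A hedged sketch (the package name and the example URI are assumptions):

```java
// Hedged sketch: package name and the HDFS URI are assumptions.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;

import weka.distributed.spark.SparkJob;

public class FileIOSketch {
  public static void main(String[] args) throws Exception {
    String path = "hdfs://namenode:8020/user/weka/notes.txt"; // hypothetical location

    // Write a small text file (works for local paths too; local parent dirs must exist)
    try (PrintWriter writer = SparkJob.openTextFileForWrite(path)) {
      writer.println("written via SparkJob.openTextFileForWrite");
    }

    // Check the file exists and report its size
    if (SparkJob.checkFileExists(path)) {
      System.out.println("size in bytes: " + SparkJob.getSizeInBytesOfPath(path));
    }

    // Read it back through the same abstraction
    try (BufferedReader reader =
           new BufferedReader(new InputStreamReader(SparkJob.openFileForRead(path)))) {
      System.out.println(reader.readLine());
    }
  }
}
```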
public org.apache.spark.api.java.JavaSparkContext createSparkContextForJob(SparkJobConfig conf) throws WekaException

Create a SparkContext for this job.

Parameters:
conf - the configuration for the job

Throws:
WekaException - if a problem occurs

public java.util.Enumeration<Option> listOptions()
Specified by:
listOptions in interface OptionHandler

public java.lang.String[] getOptions()

Specified by:
getOptions in interface OptionHandler
public void setOptions(java.lang.String[] options) throws java.lang.Exception

Specified by:
setOptions in interface OptionHandler

Throws:
java.lang.Exception

public java.lang.String[] getBaseOptionsOnly()

Return the base options only (not the subclass's options or the options specific to the configuration).

public java.lang.String debugTipText()

Tip text for this property.

public boolean getDebug()

Get whether to output debug info.

public void setDebug(boolean d)

Set whether to output debug info.

Parameters:
d - true if debug info is to be output.
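A small sketch of the option-handling and debug accessors. The static helper takes any concrete SparkJob; the package names are assumptions of this example rather than something stated on this page.

```java
// Hedged sketch: package names are assumptions.
import weka.core.Utils;
import weka.distributed.spark.SparkJob;

public class OptionsSketch {

  /** Round-trip a job's options and switch on debug output. */
  static void configure(SparkJob job) throws Exception {
    job.setDebug(true); // mirrored by getDebug()/debugTipText()

    String[] all = job.getOptions();          // full option array for the job
    String[] base = job.getBaseOptionsOnly(); // base options only (no subclass/config options)

    // getOptions() output is valid input to setOptions()
    job.setOptions(all);

    System.out.println("base options: " + Utils.joinOptions(base));
  }
}
```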
public org.apache.log4j.WriterAppender initSparkLogAppender()

Initialize and return an appender for hooking into Spark's log4j logging and directing it to Weka's log.

public void removeSparkLogAppender(org.apache.log4j.WriterAppender appender)

Remove the supplied appender from Spark's logging.

Parameters:
appender - the appender to remove
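A sketch of the intended pairing: attach the appender before work that triggers Spark logging and always detach it afterwards. The package name is an assumption.

```java
// Hedged sketch: package name is an assumption.
import org.apache.log4j.WriterAppender;

import weka.distributed.spark.SparkJob;

public class LoggingSketch {

  /** Redirect Spark's log4j output into Weka's log for the duration of some work. */
  static void withSparkLogging(SparkJob job, Runnable work) {
    WriterAppender appender = job.initSparkLogAppender();
    try {
      work.run();
    } finally {
      // Always detach the appender, even if the work fails
      job.removeSparkLogAppender(appender);
    }
  }
}
```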
public CachingStrategy getCachingStrategy()

Get the caching strategy to use for this job.

public void setCachingStrategy(CachingStrategy cs)

Set the caching strategy to use for this job.

Parameters:
cs - the caching strategy to use for this job
public org.apache.spark.api.java.JavaRDD<Instance> loadInstanceObjectFile(java.lang.String path, org.apache.spark.api.java.JavaSparkContext sparkContext, CachingStrategy strategy, int minPartitions, boolean enforceMaxPartitions) throws java.io.IOException

Load a file/directory of serialized instances (as stored in Spark object file format).

Parameters:
path - the path to the file or directory to load
sparkContext - the context to use
strategy - the optional caching strategy to use
minPartitions - the minimum number of partitions/slices to create (may be <= 0 to indicate that the default should be used)
enforceMaxPartitions - if true then any max partitions specified by the user will be enforced (this might trigger a shuffle operation)

Returns:
the JavaRDD<Instance> dataset

Throws:
java.io.IOException - if a problem occurs
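A sketch of calling loadInstanceObjectFile from helper code. It is an instance method, so it needs a SparkJob to call it on; the path and package names here are assumptions.

```java
// Hedged sketch: package names and path are assumptions.
import java.io.IOException;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import weka.core.Instance;
import weka.distributed.spark.SparkJob;

public class ObjectFileLoadSketch {

  /** Re-load an RDD of instances previously saved in Spark object file format. */
  static JavaRDD<Instance> loadSaved(SparkJob job, JavaSparkContext sc) throws IOException {
    return job.loadInstanceObjectFile(
      "hdfs://namenode:8020/user/weka/train-objects", // hypothetical path
      sc,
      job.getCachingStrategy(), // reuse the job's current (optional) caching strategy
      -1,                       // <= 0: use the default number of partitions
      true);                    // enforce any user-specified maximum partitions
  }
}
```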
public org.apache.spark.api.java.JavaRDD<Instance> stringRDDToInstanceRDD(org.apache.spark.api.java.JavaRDD<java.lang.String> input, Instances headerNoSummary, java.lang.String csvParseOptions, CachingStrategy strategy, boolean enforceMaxPartitions)

Process an RDD<String> into an RDD<Instance>.

Parameters:
input - the RDD<String> input
headerNoSummary - the header of the data without summary attributes
csvParseOptions - the options for the CSV parser
strategy - the optional caching strategy to use
enforceMaxPartitions - if true then any max partitions specified by the user will be enforced (this might trigger a shuffle operation)

Returns:
the JavaRDD<Instance> dataset
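A sketch showing how an RDD of CSV lines obtained elsewhere (for example via JavaSparkContext.textFile) might be converted. The attribute names, the empty csvParseOptions string, the path, and the package names are assumptions.

```java
// Hedged sketch: header layout, path, parse options, and package names are assumptions.
import java.util.ArrayList;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.distributed.spark.SparkJob;

public class StringRDDConversionSketch {

  /** Convert an already-loaded RDD of CSV lines into Weka instances. */
  static JavaRDD<Instance> toInstances(SparkJob job, JavaSparkContext sc) {
    // Header describing the columns of the CSV data (no summary attributes)
    ArrayList<Attribute> atts = new ArrayList<>();
    atts.add(new Attribute("sepal.length"));
    atts.add(new Attribute("sepal.width"));
    Instances header = new Instances("csv-data", atts, 0);

    // Hypothetical input location
    JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/user/weka/data.csv");

    return job.stringRDDToInstanceRDD(
      lines,
      header,
      "",                       // csvParseOptions: assumes the parser's defaults are acceptable
      job.getCachingStrategy(), // optional caching strategy
      true);                    // enforce any user-specified maximum partitions
  }
}
```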
public org.apache.spark.api.java.JavaRDD<Instance> loadCSVFile(java.lang.String path, Instances headerNoSummary, java.lang.String csvParseOptions, org.apache.spark.api.java.JavaSparkContext sparkContext, CachingStrategy strategy, int minPartitions, boolean enforceMaxPartitions) throws java.io.IOException

Load a file/directory containing instances in CSV format.

Parameters:
path - the path to the file or directory to load
headerNoSummary - the header to use (sans summary attributes)
csvParseOptions - options to the CSV parser
sparkContext - the context to use
strategy - the optional caching strategy to use
minPartitions - the minimum number of partitions/slices to create (may be <= 0 to indicate that the default should be used)
enforceMaxPartitions - if true then any max partitions specified by the user will be enforced (this might trigger a shuffle operation)

Returns:
the JavaRDD<Instance> dataset

Throws:
java.io.IOException - if a problem occurs
public org.apache.spark.api.java.JavaRDD<Instance> loadInput(java.lang.String inputPath, Instances headerNoSummary, java.lang.String csvParseOptions, org.apache.spark.api.java.JavaSparkContext sparkContext, CachingStrategy strategy, int minPartitions, boolean enforceMaxPartitions) throws java.io.IOException

Load an input file/directory.

Parameters:
inputPath - the path to the file or directory to load
headerNoSummary - the header of the data (sans summary attributes)
csvParseOptions - options to the CSV parser (used if the source is CSV)
sparkContext - the context to use
strategy - the caching strategy to use
minPartitions - the minimum number of partitions/slices to create (may be <= 0 to indicate that the default should be used)
enforceMaxPartitions - if true then any max partitions specified by the user will be enforced (this might trigger a shuffle operation)

Returns:
the JavaRDD<Instance> dataset

Throws:
java.io.IOException - if a problem occurs
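A sketch of loadInput, which picks the appropriate loader for the given path (a CSV file in this example). The header construction, path, empty csvParseOptions string, and package names are assumptions.

```java
// Hedged sketch: header layout, path, parse options, and package names are assumptions.
import java.io.IOException;
import java.util.ArrayList;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.distributed.spark.SparkJob;

public class LoadInputSketch {

  static JavaRDD<Instance> load(SparkJob job, JavaSparkContext sc) throws IOException {
    // Header (sans summary attributes) describing the incoming data
    ArrayList<Attribute> atts = new ArrayList<>();
    atts.add(new Attribute("x1"));
    atts.add(new Attribute("x2"));
    Instances header = new Instances("input", atts, 0);

    return job.loadInput(
      "hdfs://namenode:8020/user/weka/input.csv", // hypothetical input path
      header,
      "",                       // csvParseOptions (used only if the source is CSV)
      sc,
      job.getCachingStrategy(),
      -1,                       // <= 0: use the default number of partitions
      true);                    // enforce any user-specified maximum partitions
  }
}
```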
public SparkJobConfig getSparkJobConfig()

Get the SparkJobConfig object for this job.

public org.apache.spark.api.java.JavaSparkContext getSparkContext()

Get the current Spark context in use by this (and potentially other) jobs.

public void setDataset(java.lang.String key, Dataset dataset)

Set a dataset for this job to potentially make use of.

Parameters:
key - the name of the dataset
dataset - the dataset itself

public Dataset<?> getDataset(java.lang.String key)

Return a named dataset, or null if the name is unknown.

Parameters:
key - the name of the dataset to get

public java.util.Iterator<java.util.Map.Entry<java.lang.String,Dataset<?>>> getDatasets()

Return an iterator over the named datasets for this job.
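A sketch of how the dataset registry might be used to hand a dataset produced by one job to another, keyed by the TRAINING_DATA constant. Dataset's own API is not documented on this page, so the example only moves an already-registered dataset around; the package names are assumptions.

```java
// Hedged sketch: package names are assumptions; Dataset construction is not shown
// because its API is outside this page.
import java.util.Iterator;
import java.util.Map;

import weka.distributed.spark.Dataset;
import weka.distributed.spark.SparkJob;

public class DatasetRegistrySketch {

  static void shareTrainingData(SparkJob producer, SparkJob consumer) {
    // Fetch the dataset the producer registered under the standard training key
    Dataset<?> training = producer.getDataset(SparkJob.TRAINING_DATA);
    if (training != null) {
      consumer.setDataset(SparkJob.TRAINING_DATA, training);
    }

    // List everything the consumer now has access to
    Iterator<Map.Entry<String, Dataset<?>>> it = consumer.getDatasets();
    while (it.hasNext()) {
      System.out.println("dataset key: " + it.next().getKey());
    }
  }
}
```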
public org.apache.log4j.WriterAppender initJob(org.apache.spark.api.java.JavaSparkContext context) throws java.lang.Exception

Initialize logging and set or create a context to use.

Parameters:
context - the context to use (or null to create a new context)

Throws:
java.lang.Exception - if a problem occurs

public void shutdownJob(org.apache.log4j.WriterAppender logAppender)

Shuts down the context in use by this job and removes the supplied log appender object (if any) from the Spark logger.

Parameters:
logAppender - the log appender to remove from the Spark logger. May be null.
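One plausible way to drive a job from client code, pairing initJob with shutdownJob. Whether runJob() performs this setup itself is not stated on this page, so the sketch calls runJobWithContext directly; package names are assumptions.

```java
// Hedged sketch: package names are assumptions.
import org.apache.log4j.WriterAppender;

import weka.distributed.spark.SparkJob;

public class JobLifecycleSketch {

  static boolean execute(SparkJob job) throws Exception {
    // Passing null asks initJob to create a new context (from the job's SparkJobConfig)
    WriterAppender appender = job.initJob(null);
    try {
      return job.runJobWithContext(job.getSparkContext());
    } finally {
      // Stops the context and detaches the log appender; a null appender is allowed
      job.shutdownJob(appender);
    }
  }
}
```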
public abstract boolean runJobWithContext(org.apache.spark.api.java.JavaSparkContext sparkContext) throws java.io.IOException, weka.distributed.DistributedWekaException

Clients (subclasses) implement this to do the actual work of the job.

Parameters:
sparkContext - the context to use

Throws:
java.io.IOException - if an IO problem occurs
weka.distributed.DistributedWekaException - if any other problem occurs
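A minimal sketch of the contract this method asks clients to fulfil: a concrete subclass that loads its input and reports whether any instances were found. The package names, input path, header layout, and empty csvParseOptions string are assumptions.

```java
// Hedged sketch: package names, path, header layout, and parse options are assumptions.
import java.io.IOException;
import java.util.ArrayList;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.distributed.DistributedWekaException;
import weka.distributed.spark.SparkJob;

public class InstanceCountSparkJob extends SparkJob {

  public InstanceCountSparkJob() {
    super("Instance count job", "Counts the instances in the input data");
  }

  @Override
  public boolean runJobWithContext(JavaSparkContext sparkContext)
      throws IOException, DistributedWekaException {
    // Header (sans summary attributes) describing the expected input columns
    ArrayList<Attribute> atts = new ArrayList<>();
    atts.add(new Attribute("x1"));
    atts.add(new Attribute("x2"));
    Instances header = new Instances("input", atts, 0);

    JavaRDD<Instance> data = loadInput(
      "hdfs://namenode:8020/user/weka/input.csv", // hypothetical path
      header, "", sparkContext, getCachingStrategy(), -1, true);

    // Report success if anything was loaded
    return data.count() > 0;
  }
}
```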
public boolean runJob() throws weka.distributed.DistributedWekaException

runJob in class distributed.core.DistributedJob

Throws:
weka.distributed.DistributedWekaException