Creates a vector column of features from a collection of feature columns.
Model produced by AssembleFeatures.
Defines common inheritance and functions across auto trained models.
Defines common inheritance and parameters across trainers.
Basic constraints for generating a dataset.
Model produced by FindBestModel.
Blurs the image using a box filter. The params are a map of the dimensions of the blurring box. Please refer to OpenCV for more information.
Utility methods for manipulating the BrainScript and overrides configs output to disk.
Param for ByteArray. Needed because Spark has explicit params for many different types but not ByteArray.
Cache the dataset at this point to memory, or to memory and disk.
An estimator that calculates the weights for balancing a dataset. For example, if the negative class is half the size of the positive class, the weights will be 2 for rows with negative classes and 1 for rows with positive classes. These weights can be used in weighted classifiers and regressors to correct for heavily skewed datasets. The inputCol should contain the class labels, and the output column will hold the requisite weights.
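The balancing rule above can be sketched outside Spark. The following is a hypothetical, minimal Python version of the idea (the estimator itself operates on a DataFrame column, and its exact formula may differ): each class is weighted by largestClassCount / classCount, so the largest class gets weight 1.0.

```python
from collections import Counter

def balance_weights(labels):
    """Sketch of class-balancing weights: weight each row by
    largest_class_count / its_class_count."""
    counts = Counter(labels)
    largest = max(counts.values())
    return [largest / counts[y] for y in labels]

# A negative class half the size of the positive class gets weight 2.0:
balance_weights([1, 1, 1, 1, 0, 0])  # [1.0, 1.0, 1.0, 1.0, 2.0, 2.0]
```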
Defines the Booster parameters passed to the LightGBM classifier.
Removes missing values from the input dataset.
The following modes are supported:
Mean - replaces missing values with the mean of the fit column
Median - replaces missing values with an approximate median of the fit column
Custom - replaces missing values with a custom value specified by the user
For Mean and Median modes, only numeric column types are supported: Int, Long, Float, Double.
For Custom mode, the types above are supported and additionally: String, Boolean.
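The three replacement modes can be illustrated on a plain list. This is a hypothetical sketch of the per-column logic (the Spark estimator fits and transforms DataFrame columns; the real Median mode is approximate):

```python
import statistics

def clean_missing(values, mode="Mean", custom=None):
    # Sketch of the Mean / Median / Custom replacement modes on one column.
    present = [v for v in values if v is not None]
    if mode == "Mean":
        fill = statistics.mean(present)
    elif mode == "Median":
        fill = statistics.median(present)
    elif mode == "Custom":
        fill = custom
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [fill if v is None else v for v in values]

clean_missing([1.0, None, 3.0])                   # Mean fill -> [1.0, 2.0, 3.0]
clean_missing(["a", None], mode="Custom", custom="n/a")  # ["a", "n/a"]
```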
Model produced by CleanMissingData.
Converts an image from one color space to another, e.g. COLOR_BGR2GRAY. Refer to OpenCV for more information.
Class containing the list of column names to perform special featurization steps for.
colNamesToHash - List of column names to hash.
colNamesToDuplicateForMissings - List of column names containing doubles to duplicate so missing values can be removed from them.
colNamesToTypes - Map of column names to their types.
colNamesToCleanMissings - List of column names to clean missing values from (ignore).
colNamesToVectorize - List of column names to vectorize using FastVectorAssembler.
categoricalColumns - List of categorical columns to pass through or turn into indicator arrays.
conversionColumnNamesMap - Map from old column names to new.
addedColumnNamesMap - Map from old columns to newly generated columns for featurization.
Evaluates the given scored dataset.
Evaluates the given scored dataset with per-instance metrics.
The Regression metrics are: L1_loss, L2_loss.
The Classification metrics are: log_loss.
This trait allows you to easily add serialization to your Spark Models, assuming that they are completely parameterized by their constructor. The two main fields required are the TypeTag, which allows the writer to inspect the constructor to get the types that need to be serialized, and the field objectsToSave, in which the actual objects to be serialized need to be defined.
Crops the image for processing. The parameters are: "x" - first dimension, start of crop; "y" - second dimension, start of crop; "height" - height of the cropped image; "width" - width of the cropped image; "stageName" - "crop".
Converts the specified list of columns to the specified type. Returns a new DataFrame with the converted columns.
Options used to specify how a dataset will be generated. This contains information on what data and column types (specified as flags) the generated dataset will be limited to. It also contains options for generating all possible missing values, and options for how values will be generated.
Represents a distribution of values. T is the type of the values generated.
DropColumns takes a dataframe and a list of columns to drop as input, and returns a dataframe comprised of only those columns not listed in the input list.
Featurizes a dataset. Converts the specified columns to feature columns.
Evaluates and chooses the best model from a list of models.
Flips the image.
Applies gaussian kernel to blur the image. Please refer to OpenCV for detailed information about the parameters and their allowable values.
Generates the specified random data type.
Represents a parameter grid for tuning with discrete values. Can be generated with the ParamGridBuilder.
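The discrete grid is simply the cross product of each parameter's candidate values. A hypothetical Python sketch of what the builder enumerates (the real ParamGridBuilder works with Spark Param objects, not strings):

```python
from itertools import product

def grid_space(param_values):
    """Sketch of a discrete parameter grid: the cross product of all
    candidate values, one dict per parameter combination."""
    names = list(param_values)
    return [dict(zip(names, combo)) for combo in product(*param_values.values())]

grid = grid_space({"maxDepth": [3, 5], "numTrees": [10, 100]})
# 2 x 2 = 4 combinations, e.g. {'maxDepth': 3, 'numTrees': 10}
```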
Specifies the trait for constraints on generating a dataset.
Specifies the search space for hyperparameters.
The ImageFeaturizer relies on a CNTK model for featurization; this model can be set using the modelLocation parameter. To map the nodes of the CNTK model onto the standard "layers" structure of a feed-forward neural net, one needs to supply a list of node names that range from the output node back towards the input node of the CNTK Function. This list does not need to be exhaustive, and is provided for you if you use a model downloaded from the ModelDownloader; the layer list can be found in the schema of the downloaded model. The ImageFeaturizer takes an input column of images (the type returned by the ImageReader) and automatically resizes them to fit the CNTKModel's inputs. It then feeds them through a pre-trained CNTK model. One can truncate the model using the cutOutputLayers parameter, which determines how many layers to truncate from the output of the network. For example, layer=0 means that no layers are removed, and layer=2 means that the image featurizer returns the activations of the layer that is two layers from the output layer.
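The cutOutputLayers behavior can be sketched against the layer-name list described above. A hypothetical Python illustration (the node names are invented, and the real parameter selects a node in the CNTK Function rather than indexing a plain list):

```python
def select_output_node(layer_names, cut_output_layers):
    """Sketch of cutOutputLayers: layer_names runs from the output node
    back toward the input, so cutting k layers selects the k-th entry."""
    return layer_names[cut_output_layers]

layers = ["z", "h2", "h1", "conv1"]   # hypothetical node names, output first
select_output_node(layers, 0)  # 'z'  -- no layers removed
select_output_node(layers, 2)  # 'h1' -- two layers back from the output
```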
Distributed implementation of Local Interpretable Model-Agnostic Explanations (LIME)
https://arxiv.org/pdf/1602.04938v1.pdf
Image processing stage. Please refer to OpenCV for additional information.
Image processing stage.
This class takes in a categorical column with MML-style attributes and transforms it back to the original values. It extends MLlib's IndexToString by allowing the transformation back to any type of value.
Generate the internal wrapper for a given class. Used for complicated wrappers, where the basic functionality is auto-generated, and the rest is added in the inherited wrapper.
Represents a LightGBM Booster learner.
Model produced by LightGBMClassifier.
Trains a LightGBM Binary Classification model, a fast, distributed, high performance gradient boosting framework based on decision tree algorithms. For more information please see here: https://github.com/Microsoft/LightGBM. For parameter information see here: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
Defines common parameters across all LightGBM learners.
Model produced by LightGBMRegressor.
Trains a LightGBM Regression model, a fast, distributed, high performance gradient boosting framework based on decision tree algorithms. For more information please see here: https://github.com/Microsoft/LightGBM. For parameter information see here: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst Note: The application parameter supports the following values:
Helper class for logging metrics to log4j.
Class for downloading models from a server to Local or HDFS
Exception returned if a repo cannot find the file
Class representing the schema of a CNTK model.
name of the model architecture
dataset the model was trained on
type of problem the model is suited for, e.g. image, text, sound, sentiment, etc.
location of the underlying file (local, HDFS, or HTTP)
sha256 hash of the underlying file
size in bytes of the underlying file
the node which represents the input
the number of layers of the model
the names of the nodes that represent layers in the network
The MultiColumnAdapter takes a unary pipeline stage and a list of input/output column pairs, and applies the pipeline stage to each input column after being fit.
Extracts several ngrams.
Splits text into chunks of at most n characters.
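The chunking logic can be sketched in a few lines. A hypothetical Python version of the splitter's core (the Spark transformer operates on a string column and outputs an array column):

```python
def split_chunks(text, n):
    # Split text into consecutive chunks of at most n characters;
    # the final chunk may be shorter.
    return [text[i:i + n] for i in range(0, len(text), n)]

split_chunks("abcdefgh", 3)  # ['abc', 'def', 'gh']
```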
Constraints on generating a dataset where all parameters are randomly generated.
Base abstract class for random generation of data. T is the type of the data to generate.
Randomly generates a row given the data space and column options.
Combines an array of row generators into a single row generator.
Represents a generator of parameters with specified distributions added by the HyperparamBuilder.
Model from train validation split.
Defines the Booster parameters passed to the LightGBM regressor.
RenameColumn takes a dataframe with an input and an output column name, and returns a dataframe comprised of the original columns with the input column renamed as the output column name.
Partitions the dataset into n partitions.
Resizes the image. The parameters of the ParameterMap are: "height" - the height of the resized image; "width" - the width of the resized image; "stageName". Please refer to OpenCV for more information.
Smart Adaptive Recommendations (SAR) Algorithm
https://aka.ms/reco-sar
SAR is a fast, scalable, adaptive algorithm for personalized recommendations based on user transaction history and item descriptions. It produces easily explainable and interpretable recommendations. SAR has been shown to provide higher ranking measurements when compared to ALS: https://github.com/Microsoft/Recommenders
SAR Model
Abstract representation of a schema for an item that can be held in a repository
SelectColumns takes a dataframe and a list of columns to select as input, and returns a dataframe comprised of only those columns listed in the input list. The columns to select are specified as a list of column names.
Holds a variable shared among all workers that behaves like a local singleton. Useful to use non-serializable objects in Spark closures that maintain state across tasks.
Holds a variable shared among all workers. Useful for using non-serializable objects in Spark closures. Note: this code has been borrowed from https://www.nicolaferraro.me/2016/02/22/using-non-serializable-objects-in-apache-spark/
Compute summary statistics for the dataset. The following statistics are computed: counts, basic statistics, sample statistics, and percentiles. The errorThreshold parameter is the error threshold for quantile computation.
A transformer that decomposes an image into its superpixels.
Featurize text.
TextPreprocessor takes a dataframe and a dictionary that maps (text -> replacement text), scans each cell in the input col, and replaces all substring matches with the corresponding value. Priority is given to longer keys and from left to right.
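The priority rules can be sketched with a regular expression whose alternatives are ordered longest-first, so longer keys win and matching proceeds left to right. A hypothetical Python version of the cell-level logic (the Spark transformer builds a trie over the dictionary rather than a regex):

```python
import re

def preprocess(text, replacements):
    """Sketch of map-based substring replacement: longer keys take
    priority, and matching proceeds left to right."""
    pattern = "|".join(re.escape(k)
                       for k in sorted(replacements, key=len, reverse=True))
    return re.sub(pattern, lambda m: replacements[m.group(0)], text)

preprocess("catastrophe", {"cat": "dog", "cata": "meta"})
# 'metastrophe' -- the longer key 'cata' beats 'cat'
```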
Applies a threshold to each element of the image. Please refer to OpenCV's threshold function for more information.
Trains a classification model. Featurizes the given data into a vector of doubles.
Note the behavior of the reindex and labels parameters; they interact as follows:
reindex -> false, labels -> false (empty): assume all double values, don't use metadata, assume natural ordering
reindex -> true, labels -> false (empty): index, using the natural ordering of the string indexer
reindex -> false, labels -> true (specified): assume the user knows the indexing and apply the label values. Currently only the string type is supported.
reindex -> true, labels -> true (specified): validate that the labels match the column type, try to recast to the label type, and reindex the label column.
The currently supported classifiers are: Logistic Regression Classifier, Decision Tree Classifier, Random Forest Classifier, Gradient Boosted Trees Classifier, Naive Bayes Classifier, and Multilayer Perceptron Classifier, in addition to any generic learner that inherits from Predictor.
Defines the common Booster parameters passed to the LightGBM learners.
Trains a regression model.
Model produced by TrainClassifier.
Model produced by TrainRegressor.
Tunes model hyperparameters. Allows the user to specify multiple untrained models to tune using various search strategies. Currently supports cross validation with random grid search.
Model produced by TuneHyperparameters.
UDFTransformer takes an input column, an output column, and a UserDefinedFunction, and returns a dataframe comprised of the original columns with the output column as the result of the UDF applied to the input column.
UnicodeNormalize takes a dataframe and normalizes the unicode representation.
Converts the representation of an m X n pixel image to an m * n vector of Doubles. The input column name is assumed to be "image", and the output column name is "<uid>_output".
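The unrolling itself is a plain row-major flatten. A hypothetical Python sketch of the conversion (the transformer works on Spark image rows, not nested lists):

```python
def unroll(image):
    # Flatten an m x n image (a list of m rows of n pixels each)
    # into a single m * n vector of doubles, in row-major order.
    return [float(px) for row in image for px in row]

unroll([[1, 2, 3], [4, 5, 6]])  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```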
Fits a dictionary of values from the input column. The model then transforms a column to a categorical column of the given array of values. Similar to StringIndexer, except it can be used on any value types.
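The fit/transform split can be sketched on plain lists. This is a hypothetical illustration (the ordering shown here is an assumption; the real model may order levels differently, and it attaches categorical metadata to the output column):

```python
def fit_value_indexer(values):
    """Sketch of ValueIndexer: 'fit' collects the distinct values of the
    input column; 'transform' maps each value to its index in that array.
    Unlike StringIndexer, any value type can be indexed."""
    levels = sorted(set(values))                   # assumed ordering
    index = {v: i for i, v in enumerate(levels)}
    return levels, lambda col: [index[v] for v in col]

levels, transform = fit_value_indexer([10, 30, 10, 20])
transform([30, 10])  # indices into levels [10, 20, 30] -> [2, 0]
```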
Model produced by ValueIndexer.
Implicit conversion allows sparkSession.readImages(...) syntax. Example:
import com.microsoft.ml.spark.Readers.implicits._
sparkSession.readImages(path, recursive = false)
Utilities for casting values.
Cache the dataset to memory or memory and disk.
Specifies the column types supported in Spark dataframes and modules.
DataConversion object.
Specifies the data types supported in Spark dataframes and modules.
Utilities for reducing data to CNTK format and generating the text file output to disk.
Provides good default hyperparameter ranges and values for sweeping.
Provides good default hyperparameter ranges and values for sweeping. Publicly visible to users so they can easily select the parameters for sweeping.
Defines methods to generate a random spark DataFrame dataset based on given options.
Pipelined image processing.
Contains logic for loading classes.
Helper utilities for LightGBM learners.
Constants for PartitionSample.
The Resize object contains the information for resizing: "height", "width", "stageName" = "resize".
Based on "Superpixel algorithm implemented in Java" at popscan.blogspot.com/2014/12/superpixel-algorithm-implemented-in-java.html
Microsoft Machine Learning on Apache Spark (MMLSpark)