Modules¶

Abstract Runner¶

The general algorithm for all of the data-based variable importance methods is the same, regardless of whether the method is Sequential Selection or Permutation Importance or something else. This is represented in the abstract_variable_importance function. All of the different methods we provide use this function under the hood and the only difference between them is the selection_strategy object, which is detailed in PermutationImportance.selection_strategies. Typically, you will not need to use this method but can instead use one of the methods imported directly into the top package of PermutationImportance.

If you wish to implement your own variable importance method, you will need to devise your own selection_strategy. We recommend using PermutationImportance.selection_strategies as a template for implementing your own variable importance method.

PermutationImportance.abstract_runner._multithread_iteration(selection_iterator, scoring_fn, njobs)[source]¶

Handles a single pass of the abstract variable importance algorithm using multithreading

Parameters:

selection_iterator – an iterator which yields triples `(variable, training_data, scoring_data)`. Typically a `PermutationImportance.selection_strategies.SelectionStrategy`
scoring_fn – a function to be used for scoring. Should be of the form `(training_data, scoring_data) -> float`
njobs – number of processes to use
Returns:	a dict of `{var: score}`

PermutationImportance.abstract_runner._singlethread_iteration(selection_iterator, scoring_fn)[source]¶

Handles a single pass of the abstract variable importance algorithm, assuming a single worker thread

Parameters:

selection_iterator – an iterator which yields triples `(variable, training_data, scoring_data)`. Typically a `PermutationImportance.selection_strategies.SelectionStrategy`
scoring_fn – a function to be used for scoring. Should be of the form `(training_data, scoring_data) -> float`
Returns:	a dict of `{var: score}`
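
A single pass, as described above, amounts to scoring every triple produced by the iterator and collecting the scores by variable. The following is a minimal sketch of that idea, not the library's implementation; `single_pass` and the toy triples are hypothetical names introduced only for illustration.

```python
def single_pass(selection_iterator, scoring_fn):
    """Score each (variable, training_data, scoring_data) triple exactly once."""
    return {
        variable: scoring_fn(training_data, scoring_data)
        for variable, training_data, scoring_data in selection_iterator
    }


# Toy triples: the "data" are plain lists and the score is their combined length.
triples = [
    ("a", [1, 2], [3]),
    ("b", [1], [2]),
]
scores = single_pass(
    iter(triples),
    lambda training, scoring: float(len(training) + len(scoring)),
)
```

Because the iterator is consumed lazily, only one candidate dataset needs to exist in memory at a time in the single-worker case.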

PermutationImportance.abstract_runner.abstract_variable_importance(training_data, scoring_data, scoring_fn, scoring_strategy, selection_strategy, variable_names=None, nimportant_vars=None, method=None, njobs=1)[source]¶

Performs an abstract variable importance over data given a particular set of functions for scoring, determining optimal variables, and selecting data

Parameters:

training_data – a 2-tuple (inputs, outputs) for training in the scoring_fn
scoring_data – a 2-tuple (inputs, outputs) for scoring in the scoring_fn
scoring_fn – a function to be used for scoring. Should be of the form (training_data, scoring_data) -> some_value
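scoring_strategy – a function to be used for determining optimal variables. Should be of the form ([some_value]) -> index
selection_strategy – an object which produces the triples (variable, training_data, scoring_data) for each pass. Typically a PermutationImportance.selection_strategies.SelectionStrategy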
variable_names – an optional list for variable names. If not given, will use names of columns of data (if pandas dataframe) or column indices
nimportant_vars – number of variables to compute importance for. Defaults to all variables
method – a string for the name of the method used. Defaults to the name of the selection_strategy if not given
njobs – an integer for the number of threads to use. If negative, will use num_cpus + njobs. Defaults to 1

Returns:

PermutationImportance.result.ImportanceResult object which contains the results for each run
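
The overall loop can be sketched as repeated passes: score every remaining variable, let the scoring_strategy pick a winner, add it to the list of important variables, and repeat. The code below is an illustrative sketch under simplifying assumptions (a plain list is returned instead of an ImportanceResult, and everything runs in a single thread); `abstract_importance_sketch`, `toy_selection`, and `argmax` are hypothetical names, not part of the package.

```python
def abstract_importance_sketch(training_data, scoring_data, scoring_fn,
                               scoring_strategy, selection_strategy, num_vars):
    """Return [(variable, score), ...] in the order the variables were chosen."""
    important_vars = []
    ranking = []
    for _ in range(num_vars):
        # One pass: score every candidate variable given the current context.
        candidates, scores = [], []
        for variable, training, scoring in selection_strategy(
                training_data, scoring_data, num_vars, important_vars):
            candidates.append(variable)
            scores.append(scoring_fn(training, scoring))
        # The scoring strategy maps the list of scores to the winner's index.
        best = scoring_strategy(scores)
        important_vars.append(candidates[best])
        ranking.append((candidates[best], scores[best]))
    return ranking


def toy_selection(training_data, scoring_data, num_vars, important_vars):
    """Hypothetical strategy: offer each remaining variable with the chosen ones."""
    for var in range(num_vars):
        if var not in important_vars:
            yield var, training_data, [scoring_data[i] for i in important_vars + [var]]


def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)


signal = [0.2, 0.9, 0.5]  # toy "scoring data": one signal strength per variable
ranking = abstract_importance_sketch(
    None, signal, lambda training, scoring: sum(scoring),
    argmax, toy_selection, num_vars=3,
)
chosen_order = [variable for variable, _ in ranking]
```

Swapping `toy_selection` for a different generator changes the method without touching the loop, which is the design point the section describes.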

Data Verification¶

These utilities are designed to check whether the given data and variable names match the expected format. For the training or scoring data, we accept a pandas dataframe with the target column indicated, two different dataframes, or two numpy arrays.

PermutationImportance.data_verification.verify_data(data)[source]¶

Verifies that the data tuple is of the right format and coerces it to numpy arrays for the code under the hood

Parameters:	data – one of the following: (pandas dataframe, string for target column), (pandas dataframe for inputs, pandas dataframe for outputs), (numpy array for inputs, numpy array for outputs)
Returns:	(numpy array for input, numpy array for output) or (pandas dataframe for input, pandas dataframe for output)

PermutationImportance.data_verification.determine_variable_names(data, variable_names)[source]¶

Uses data and/or the variable_names to determine what the variable names are. If variable_names is not specified and data is not a pandas dataframe, defaults to the column indices

Parameters:

data – a 2-tuple where the input data is the first item
variable_names – either a list of variable names or None
Returns:	a list of variable names
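
The fallback order described above (explicit names, then dataframe columns, then column indices) can be sketched as follows. This is an assumption-laden illustration rather than the package's code: `variable_names_sketch` is a hypothetical name, and a dataframe is detected here simply by the presence of a `columns` attribute.

```python
def variable_names_sketch(data, variable_names=None):
    """Resolve variable names for a (inputs, outputs) tuple."""
    inputs = data[0]
    if variable_names is not None:
        return list(variable_names)
    # pandas DataFrames expose their column labels through `.columns`.
    if hasattr(inputs, "columns"):
        return list(inputs.columns)
    # Otherwise fall back to column indices, using the width of the first row.
    return list(range(len(inputs[0])))


toy_data = ([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], [0, 1])
default_names = variable_names_sketch(toy_data)
explicit_names = variable_names_sketch(toy_data, ["t", "p", "rh"])
```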

Error Handling¶

There are a handful of different errors and warnings that we can report. This module houses all of them and provides information regarding ways to fix them.

exception PermutationImportance.error_handling.AmbiguousProbabilisticForecastsException(truths, predictions, msg=None)[source]¶

Bases: Exception

Thrown when classes were not provided for converting probabilistic predictions to deterministic ones but are required

exception PermutationImportance.error_handling.FullImportanceResultWarning[source]¶

Bases: Warning

Thrown when we try to add a result to a full PermutationImportance.result.ImportanceResult

exception PermutationImportance.error_handling.InvalidDataException(data, msg=None)[source]¶

Bases: Exception

Thrown when the training or scoring data is not of the right type

exception PermutationImportance.error_handling.InvalidInputException(value, msg=None)[source]¶

Bases: Exception

Thrown when the input to the program does not match expectations

exception PermutationImportance.error_handling.InvalidStrategyException(strategy, msg=None, options=None)[source]¶

Bases: Exception

Thrown when a scoring strategy is invalid

exception PermutationImportance.error_handling.UnmatchedLengthPredictionsException(truths, predictions, msg=None)[source]¶

Bases: Exception

Thrown when the number of predictions doesn’t match truths

exception PermutationImportance.error_handling.UnmatchingProbabilisticForecastsException(truths, predictions, msg=None)[source]¶

Bases: Exception

Thrown when the shape of probabilistic predictions doesn’t match the truths

Metrics¶

These are metric functions which can be used to score model predictions against the true values. They are designed to be used either as a component of a scoring_fn of the method-specific variable importance methods or stand-alone as the evaluation_fn of a model-based variable importance method.

In addition to these metrics, all of the metrics and loss functions provided in sklearn.metrics should also work.

PermutationImportance.metrics.gerrity_score(truths, predictions, classes=None)[source]¶

Determines the Gerrity Score, returning a scalar. See here for more details on the Gerrity Score

Parameters:

truths – The true labels of these data
predictions – The predictions of the model
classes – an ordered set for the label possibilities. If not given, will be deduced from the truth values
Returns:	a single value for the gerrity score

PermutationImportance.metrics.peirce_skill_score(truths, predictions, classes=None)[source]¶

Determines the Peirce Skill Score (True Skill Score), returning a scalar. See here for more details on the Peirce Skill Score

Parameters:

truths – The true labels of these data
predictions – The predictions of the model
classes – an ordered set for the label possibilities. If not given, will be deduced from the truth values
Returns:	a single value for the peirce skill score

PermutationImportance.metrics.heidke_skill_score(truths, predictions, classes=None)[source]¶

Determines the Heidke Skill Score, returning a scalar. See here for more details on the Heidke Skill Score

Parameters:

truths – The true labels of these data
predictions – The predictions of the model
classes – an ordered set for the label possibilities. If not given, will be deduced from the truth values
Returns:	a single value for the heidke skill score
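
For orientation, the Peirce and Heidke skill scores can both be written in terms of a multi-category contingency table: the numerator is the proportion correct minus the proportion expected by chance, and only the denominator differs. The sketch below uses the standard textbook formulas and is not taken from the package source; `skill_scores_sketch` is a hypothetical name, and, unlike the functions above, it deduces classes from both truths and predictions.

```python
from collections import Counter


def skill_scores_sketch(truths, predictions):
    """Return (peirce, heidke) from the multi-category contingency table.

    Note: both scores are undefined (division by zero) when only one class
    occurs; that case is not handled in this sketch.
    """
    n = len(truths)
    classes = sorted(set(truths) | set(predictions))
    joint = Counter(zip(predictions, truths))
    p_forecast = {c: sum(1 for p in predictions if p == c) / n for c in classes}
    p_observed = {c: sum(1 for t in truths if t == c) / n for c in classes}

    p_correct = sum(joint[(c, c)] for c in classes) / n
    p_chance = sum(p_forecast[c] * p_observed[c] for c in classes)

    heidke = (p_correct - p_chance) / (1.0 - p_chance)
    peirce = (p_correct - p_chance) / (1.0 - sum(p ** 2 for p in p_observed.values()))
    return peirce, heidke


peirce, heidke = skill_scores_sketch([1, 1, 0, 0], [1, 0, 0, 0])
perfect = skill_scores_sketch([0, 1, 2, 1], [0, 1, 2, 1])
```

Both are skill scores, so with Permutation Importance the matching scoring_strategy would minimize them, and with Sequential Selection it would maximize them.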

Multiprocessing Utils¶

These are utilities designed for carefully handling communication between processes while multithreading.

The code for pool_imap_unordered is copied nearly wholesale from GrantJ’s Stack Overflow answer here. It allows for a lazy imap over an iterable and the return of very large objects.

PermutationImportance.multiprocessing_utils.pool_imap_unordered(func, iterable, procs=1)[source]¶

Lazily imaps in an unordered manner over an iterable in parallel as a generator

Author:	Grant Jenks <https://stackoverflow.com/users/232571/grantj>
Parameters:

func – function to perform on each item of the iterable
iterable – iterable which has items to map over
procs – number of workers in the pool. Defaults to the cpu count
Yields:	the results of the mapping
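
The idea of a lazy, unordered parallel map can be illustrated with the standard library alone. The sketch below is a thread-based stand-in, not the process-based implementation referenced above: `lazy_imap_unordered` and its `workers` argument are hypothetical names. It keeps at most `workers` items in flight, so the input iterable is consumed only as results are yielded.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from itertools import islice


def lazy_imap_unordered(func, iterable, workers=2):
    """Yield func(item) for each item, in completion order, pulling items lazily."""
    items = iter(iterable)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Prime the pool with at most `workers` tasks.
        pending = {pool.submit(func, item) for item in islice(items, workers)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                yield future.result()
            # Refill: submit one new task for each task that just finished.
            for item in islice(items, len(done)):
                pending.add(pool.submit(func, item))


squares = sorted(lazy_imap_unordered(lambda x: x * x, range(5)))
```

Results arrive in whatever order the workers finish, which is why the usage line sorts them before comparing.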

Permutation Importance¶

Permutation Importance determines which variables are important by comparing performance on a dataset where some of the variables are permuted in their individual columns to performance on the dataset without any permutation. The permutation of an individual variable in this manner has the effect of breaking any relationship between the input variable and the target. The variable which, when permuted, results in the worst performance is typically taken as the most important variable.

Typically, when using a performance metric or skill score with Permutation Importance, the scoring_strategy should be to minimize the performance. On the other hand, when using an error or loss function, the scoring_strategy should be to maximize the error or loss function.
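
A single pass of this procedure can be sketched in plain Python. The example is a toy under stated assumptions: `permuted_column`, `accuracy`, and `predict` are hypothetical helpers, the "model" is a fixed rule that only looks at column 0, and accuracy is a performance metric, so the most important variable is the one whose permutation gives the lowest score.

```python
import random


def permuted_column(rows, column, rng):
    """Return a copy of rows with the values of one column shuffled across rows."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [row[:column] + [v] + row[column + 1:] for row, v in zip(rows, values)]


def accuracy(truths, predictions):
    return sum(t == p for t, p in zip(truths, predictions)) / len(truths)


def predict(rows):
    # Toy stand-in for a trained model: it only ever looks at column 0.
    return [int(row[0] >= 4) for row in rows]


inputs = [[i, i % 3] for i in range(8)]
targets = [int(i >= 4) for i in range(8)]
rng = random.Random(0)

# Score the model once per variable, with only that variable's column permuted.
scores = {
    column: accuracy(targets, predict(permuted_column(inputs, column, rng)))
    for column in range(2)
}
# Performance metric, so the scoring strategy is to minimize.
most_important = min(scores, key=scores.get)
```

Permuting column 1 cannot change the score here because the toy model ignores it, while permuting column 0 can only leave the score unchanged or lower it.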

PermutationImportance.permutation_importance.permutation_importance(scoring_data, scoring_fn, scoring_strategy, variable_names=None, nimportant_vars=None, njobs=1)[source]¶

Performs permutation importance over data given a particular set of functions for scoring and determining optimal variables

Parameters:

scoring_data – a 2-tuple (inputs, outputs) for scoring in the scoring_fn
scoring_fn – a function to be used for scoring. Should be of the form (training_data, scoring_data) -> some_value
scoring_strategy – a function to be used for determining optimal variables. Should be of the form ([some_value]) -> index
variable_names – an optional list for variable names. If not given, will use names of columns of data (if pandas dataframe) or column indices
nimportant_vars – number of variables to compute multipass importance for. Defaults to all variables
njobs – an integer for the number of threads to use. If negative, will use num_cpus + njobs. Defaults to 1

Returns:

PermutationImportance.result.ImportanceResult object which contains the results for each run

PermutationImportance.permutation_importance.sklearn_permutation_importance(model, scoring_data, evaluation_fn, scoring_strategy, variable_names=None, nimportant_vars=None, njobs=1, nbootstrap=1, subsample=1, **kwargs)[source]¶

Performs permutation importance for a particular model, scoring_data, evaluation_fn, and strategy for determining optimal variables

Parameters:

model – a trained sklearn model
scoring_data – a 2-tuple (inputs, outputs) for scoring in the scoring_fn
evaluation_fn – a function which takes the deterministic or probabilistic model predictions and scores them against the true values. Must be of the form (truths, predictions) -> some_value. Probably one of the metrics in PermutationImportance.metrics or sklearn.metrics
scoring_strategy – a function to be used for determining optimal variables. Should be of the form ([some_value]) -> index
variable_names – an optional list for variable names. If not given, will use names of columns of data (if pandas dataframe) or column indices
nimportant_vars – number of variables to compute multipass importance for. Defaults to all variables
njobs – an integer for the number of threads to use. If negative, will use num_cpus + njobs. Defaults to 1
nbootstrap – number of times to perform scoring on each variable. Results over different bootstrap iterations are averaged. Defaults to 1
subsample – number of elements to sample (with replacement) per bootstrap round. If between 0 and 1, treated as a fraction of the total number of events (e.g. 0.5 means half the number of events). If not specified, subsampling will not be used and the entire data will be used (without replacement)
kwargs – all other kwargs will be passed on to the evaluation_fn

Returns:

PermutationImportance.result.ImportanceResult object which contains the results for each run

Result¶

The ImportanceResult is an object which keeps track of the full context and scoring determined by a variable importance method. Because the variable importance methods iteratively determine the next most important variable, this yields a sequence of pairs of “contexts” (i.e. the previous ranks/scores of variables) and “results” (i.e. the current ranks/scores of variables). This object keeps track of those pairs and additionally provides methods for the easy retrieval of both the results with empty context (singlepass, Breiman) and the most complete context (multipass, Lakshmanan). Further, it enables iteration over the (context, results) pairs and indexing into the list of pairs.

class PermutationImportance.result.ImportanceResult(method, variable_names, original_score)[source]¶

Bases: object

Houses the result of any importance method, which consists of a sequence of contexts and results. An individual result can only be truly interpreted correctly in light of the corresponding context. This object allows for indexing into the contexts and results and also provides convenience methods for retrieving the results with no context and the most complete context

__getitem__(index)[source]¶: Retrieves the ith pair of (context, result)

__init__(method, variable_names, original_score)[source]¶

Initializes the results object with the method used and a list of variable names

Parameters:

method – string for the type of variable importance used
variable_names – a list of names for variables
original_score – the score of the model when no variables are important

__iter__()[source]¶: Iterates over pairs of contexts and results

__len__()[source]¶: Returns the total number of results computed

__weakref__¶: list of weak references to the object (if defined)

add_new_results(new_results, next_important_variable=None)[source]¶

Adds a new round of results. Warns if the ImportanceResult is already complete

Parameters:

new_results – a dictionary with keys of variable names and values of `(rank, score)`
next_important_variable – variable name of the next most important variable. If not given, will select the variable with the smallest rank

retrieve_multipass()[source]¶: Returns the multipass results as a dictionary with keys of variable names and values of (rank, score).

retrieve_singlepass()[source]¶: Returns the singlepass results as a dictionary with keys of variable names and values of (rank, score).
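
The relationship between contexts, results, singlepass, and multipass can be made concrete with a small stand-in class. This is a hedged sketch of the bookkeeping described above, not the real ImportanceResult: `MiniImportanceResult` is a hypothetical name, the warning for a full result is omitted, and the internal attribute names are assumptions.

```python
class MiniImportanceResult:
    """Toy container for a sequence of (context, result) pairs."""

    def __init__(self, method, variable_names, original_score):
        self.method = method
        self.variable_names = list(variable_names)
        self.original_score = original_score
        self.contexts = []    # ranks/scores already fixed before each pass
        self.results = []     # ranks/scores computed during each pass
        self._complete = {}   # grows by one variable per pass

    def add_new_results(self, new_results, next_important_variable=None):
        if next_important_variable is None:
            # Default to the variable with the smallest rank in this pass.
            next_important_variable = min(new_results, key=lambda v: new_results[v][0])
        self.contexts.append(dict(self._complete))
        self.results.append(dict(new_results))
        score = new_results[next_important_variable][1]
        self._complete[next_important_variable] = (len(self._complete), score)

    def retrieve_singlepass(self):
        # Results computed with an empty context, i.e. the first pass.
        return self.results[0]

    def retrieve_multipass(self):
        # The most complete context: one (rank, score) per selected variable.
        return dict(self._complete)

    def __getitem__(self, index):
        return self.contexts[index], self.results[index]

    def __len__(self):
        return len(self.results)


result = MiniImportanceResult("toy", ["a", "b"], original_score=0.9)
result.add_new_results({"a": (1, 0.7), "b": (0, 0.4)})  # pass 1: "b" ranks first
result.add_new_results({"a": (0, 0.3)})                 # pass 2: only "a" remains
```

In this toy run the singlepass score of "a" (0.7) differs from its multipass score (0.3) because the second pass was evaluated in the context of "b" already being selected.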

Scoring Strategies¶

In a variable importance method, the scoring_strategy is a function which is used to determine which of the scores corresponding to a given variable indicates that the variable is “most important”. This will be dependent on the particular type of object which is returned as a score.

Here, we provide a few functions which can be used directly as scoring strategies as well as some utilities for constructing scoring strategies. Moreover, we also provide a dictionary of aliases for several commonly used strategies in VALID_SCORING_STRATEGIES.

PermutationImportance.scoring_strategies.verify_scoring_strategy(scoring_strategy)[source]¶

Asserts that the scoring strategy is valid and interprets various strings

Parameters:	scoring_strategy – a function to be used for determining optimal variables or a string. If a function, should be of the form `([some value]) -> index`. If a string, must be one of the options in `VALID_SCORING_STRATEGIES`
Returns:	a function to be used for determining optimal variables

class PermutationImportance.scoring_strategies.indexer_of_converter(indexer, converter)[source]¶

Bases: object

This object is designed to help construct a scoring strategy by breaking the process of determining an optimal score into two pieces: First, each of the scores is converted to a simpler representation. For instance, an array of scores resulting from a bootstrapped evaluation method may be converted to just its mean. Second, the simpler representations are compared to determine the index of the optimal one. This is typically just an argmin or argmax call.

__call__(scores)[source]¶: Finds the index of the most “optimal” score in a list

__init__(indexer, converter)[source]¶

Constructs a function which first converts all objects in a list to something simpler and then uses the indexer to determine the index of the most “optimal” one

Parameters:

indexer – a function which converts a list of probably simple values (like numbers) to a single index
converter – a function which converts a single more complex object to a simpler one (like a single number)

__weakref__¶: list of weak references to the object (if defined)
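
The two-step "convert, then index" pattern can be sketched without any dependencies. The class below mirrors the described behavior but is an illustration only; `IndexerOfConverterSketch`, `argmin`, and `mean` are hypothetical stand-ins for whatever indexer and converter are actually supplied.

```python
class IndexerOfConverterSketch:
    """Compose a converter (score -> simple value) with an indexer (values -> index)."""

    def __init__(self, indexer, converter):
        self.indexer = indexer
        self.converter = converter

    def __call__(self, scores):
        # Step 1: simplify each score. Step 2: pick the index of the optimal one.
        return self.indexer([self.converter(score) for score in scores])


def argmin(values):
    return min(range(len(values)), key=values.__getitem__)


def mean(values):
    return sum(values) / len(values)


argmin_of_mean = IndexerOfConverterSketch(argmin, mean)
# Three variables, each with two bootstrapped scores.
best = argmin_of_mean([[0.8, 0.6], [0.1, 0.3], [0.5, 0.5]])
```

A strategy built this way has the form ([some_value]) -> index, so it could be passed wherever a scoring_strategy function is expected.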

Selection Strategies¶

Each of the various variable importance methods uses the same code to compute successively important variables. The only difference between each of these methods is the data which is provided to the scoring function. The SelectionStrategy handles the process of converting the original training and scoring data to the form required for each of the individual variables. This is done by using the current list of important variables to generate a sequence of triples (variable, training_data, scoring_data), which will later be passed to the scoring function to determine the score for variable.

Below, SelectionStrategy encapsulates the base functionality which houses the parameters necessary to produce the generator as well as the default method for providing only the datasets which are necessary to be evaluated. Each of the other classes extends this base class to implement a particular variable importance method.

If you wish to design your own variable importance method, you will want to extend the SelectionStrategy base class in the same way as the other strategies.

class PermutationImportance.selection_strategies.SequentialForwardSelectionStrategy(training_data, scoring_data, num_vars, important_vars)[source]¶

Bases: PermutationImportance.selection_strategies.SelectionStrategy

Sequential Forward Selection tests all variables which are not yet considered important by adding that column to the other columns which are returned. This means that the shape of the training data will be (num_rows, num_important_vars + 1).

generate_datasets(important_variables)[source]¶

Check each of the non-important variables. Dataset is the columns which are important

Returns:	(training_data, scoring_data)

class PermutationImportance.selection_strategies.SequentialBackwardSelectionStrategy(training_data, scoring_data, num_vars, important_vars)[source]¶

Bases: PermutationImportance.selection_strategies.SelectionStrategy

Sequential Backward Selection tests all variables which are not yet considered important by removing that column from the data. This means that the shape of the training data will be (num_rows, num_vars - num_important_vars - 1).

generate_datasets(important_variables)[source]¶

Check each of the non-important variables. Dataset is the columns which are not important

Yields:	a sequence of (variable being evaluated, columns to include)

class PermutationImportance.selection_strategies.PermutationImportanceSelectionStrategy(training_data, scoring_data, num_vars, important_vars)[source]¶

Bases: PermutationImportance.selection_strategies.SelectionStrategy

Permutation Importance tests all variables which are not yet considered important by shuffling that column in addition to the columns of the variables which are considered important. The shape of the data will remain constant, but at each step, one additional column will be permuted.

__init__(training_data, scoring_data, num_vars, important_vars)[source]¶

Initializes the object by storing the data and keeping track of other important information

Parameters:

training_data – (training_inputs, training_outputs)
scoring_data – (scoring_inputs, scoring_outputs)
num_vars – integer for the total number of variables
important_vars – a list of the indices of variables which are already considered important

generate_datasets(important_variables)[source]¶

Check each of the non-important variables. Dataset has columns which are important shuffled

Returns:	(training_data, scoring_data)

class PermutationImportance.selection_strategies.SelectionStrategy(training_data, scoring_data, num_vars, important_vars)[source]¶

Bases: object

The base SelectionStrategy only provides the tools for storing the data and other important information as well as the convenience method for lazily iterating over the selection strategy's triples.

__init__(training_data, scoring_data, num_vars, important_vars)[source]¶

Initializes the object by storing the data and keeping track of other important information

Parameters:

training_data – (training_inputs, training_outputs)
scoring_data – (scoring_inputs, scoring_outputs)
num_vars – integer for the total number of variables
important_vars – a list of the indices of variables which are already considered important

__weakref__¶: list of weak references to the object (if defined)

generate_all_datasets()[source]¶: By default, loops over all variables not yet considered important

generate_datasets(important_variables)[source]¶: Generator which returns triples (variable, training_data, scoring_data)
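
To show what the triples look like in practice, here is a stand-alone generator in the spirit of Sequential Forward Selection. It is a sketch under assumptions: it works on lists of rows rather than numpy arrays, it does not extend the SelectionStrategy base class, and `select_columns` and `forward_selection_triples` are hypothetical names.

```python
def select_columns(rows, columns):
    """Keep only the given columns, in the given order, for every row."""
    return [[row[c] for c in columns] for row in rows]


def forward_selection_triples(training_data, scoring_data, num_vars, important_vars):
    """Lazily yield (variable, training_data, scoring_data) for each candidate."""
    train_in, train_out = training_data
    score_in, score_out = scoring_data
    for var in range(num_vars):
        if var in important_vars:
            continue  # already selected, so not a candidate
        columns = list(important_vars) + [var]
        yield (
            var,
            (select_columns(train_in, columns), train_out),
            (select_columns(score_in, columns), score_out),
        )


training = ([[1, 2, 3], [4, 5, 6]], [0, 1])
scoring = ([[7, 8, 9]], [1])
triples = list(
    forward_selection_triples(training, scoring, num_vars=3, important_vars=[1])
)
```

Each yielded training set has num_important_vars + 1 columns, matching the shape stated for Sequential Forward Selection above.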

Sequential Selection¶

Sequential Selection determines which variables are important by evaluating performance on a dataset where only some of the variables are present. Variables which, when present, greatly improve the performance are typically considered important, and variables which, when removed, degrade the performance only slightly or not at all are typically considered unimportant.

Sequential Forward Selection iteratively adds variables to the set of important variables, meaning that initially the dataset is empty and at each step the number of columns in the dataset increases by 1. A variable which, when added, results in the best performance is typically taken as the most important variable.

Sequential Backward Selection iteratively removes variables from the set of important variables, meaning that initially the dataset is complete and at each step the number of columns in the dataset decreases by 1. A variable which, when removed, results in the best performance is typically taken as the least important variable.

Typically, when using a performance metric or skill score with any Sequential Selection method, the scoring_strategy should be to maximize the performance. On the other hand, when using an error or loss function, the scoring_strategy should be to minimize the error or loss function.

PermutationImportance.sequential_selection.sequential_forward_selection(training_data, scoring_data, scoring_fn, scoring_strategy, variable_names=None, nimportant_vars=None, njobs=1)[source]¶

Performs sequential forward selection over data given a particular set of functions for scoring and determining optimal variables

Parameters:

training_data – a 2-tuple (inputs, outputs) for training in the scoring_fn
scoring_data – a 2-tuple (inputs, outputs) for scoring in the scoring_fn
scoring_fn – a function to be used for scoring. Should be of the form (training_data, scoring_data) -> some_value
scoring_strategy – a function to be used for determining optimal variables. Should be of the form ([some_value]) -> index
variable_names – an optional list for variable names. If not given, will use names of columns of data (if pandas dataframe) or column indices
nimportant_vars – number of variables to compute importance for. Defaults to all variables
njobs – an integer for the number of threads to use. If negative, will use num_cpus + njobs. Defaults to 1

Returns:

PermutationImportance.result.ImportanceResult object which contains the results for each run

PermutationImportance.sequential_selection.sklearn_sequential_forward_selection(model, training_data, scoring_data, evaluation_fn, scoring_strategy, variable_names=None, nimportant_vars=None, njobs=1, nbootstrap=None, subsample=1, **kwargs)[source]¶

Performs sequential forward selection for a particular model, scoring_data, evaluation_fn, and strategy for determining optimal variables

Parameters:

model – a sklearn model
training_data – a 2-tuple (inputs, outputs) for training in the scoring_fn
scoring_data – a 2-tuple (inputs, outputs) for scoring in the scoring_fn
evaluation_fn – a function which takes the deterministic or probabilistic model predictions and scores them against the true values. Must be of the form (truths, predictions) -> some_value. Probably one of the metrics in PermutationImportance.metrics or sklearn.metrics
scoring_strategy – a function to be used for determining optimal variables. Should be of the form ([some_value]) -> index
variable_names – an optional list for variable names. If not given, will use names of columns of data (if pandas dataframe) or column indices
nimportant_vars – number of variables to compute importance for. Defaults to all variables
njobs – an integer for the number of threads to use. If negative, will use num_cpus + njobs. Defaults to 1
nbootstrap – number of times to perform scoring on each variable. Results over different bootstrap iterations are averaged. Defaults to None, which will not perform bootstrapping
subsample – number of elements to sample (with replacement) per bootstrap round. If between 0 and 1, treated as a fraction of the total number of events (e.g. 0.5 means half the number of events). If not specified, subsampling will not be used and the entire data will be used (without replacement)
kwargs – all other kwargs will be passed on to the evaluation_fn

Returns:

PermutationImportance.result.ImportanceResult object which contains the results for each run

PermutationImportance.sequential_selection.sequential_backward_selection(training_data, scoring_data, scoring_fn, scoring_strategy, variable_names=None, nimportant_vars=None, njobs=1)[source]¶

Performs sequential backward selection over data given a particular set of functions for scoring and determining optimal variables

Parameters:

training_data – a 2-tuple (inputs, outputs) for training in the scoring_fn
scoring_data – a 2-tuple (inputs, outputs) for scoring in the scoring_fn
scoring_fn – a function to be used for scoring. Should be of the form (training_data, scoring_data) -> some_value
scoring_strategy – a function to be used for determining optimal variables. Should be of the form ([some_value]) -> index
variable_names – an optional list for variable names. If not given, will use names of columns of data (if pandas dataframe) or column indices
nimportant_vars – number of variables to compute importance for. Defaults to all variables
njobs – an integer for the number of threads to use. If negative, will use num_cpus + njobs. Defaults to 1

Returns:

PermutationImportance.result.ImportanceResult object which contains the results for each run

PermutationImportance.sequential_selection.sklearn_sequential_backward_selection(model, training_data, scoring_data, evaluation_fn, scoring_strategy, variable_names=None, nimportant_vars=None, njobs=1, nbootstrap=None, subsample=1, **kwargs)[source]¶

Performs sequential backward selection for a particular model, scoring_data, evaluation_fn, and strategy for determining optimal variables

Parameters:

model – a sklearn model
training_data – a 2-tuple (inputs, outputs) for training in the scoring_fn
scoring_data – a 2-tuple (inputs, outputs) for scoring in the scoring_fn
evaluation_fn – a function which takes the deterministic or probabilistic model predictions and scores them against the true values. Must be of the form (truths, predictions) -> some_value. Probably one of the metrics in PermutationImportance.metrics or sklearn.metrics
scoring_strategy – a function to be used for determining optimal variables. Should be of the form ([some_value]) -> index
variable_names – an optional list for variable names. If not given, will use names of columns of data (if pandas dataframe) or column indices
nimportant_vars – number of variables to compute importance for. Defaults to all variables
njobs – an integer for the number of threads to use. If negative, will use num_cpus + njobs. Defaults to 1
nbootstrap – number of times to perform scoring on each variable. Results over different bootstrap iterations are averaged. Defaults to None, which will not perform bootstrapping
subsample – number of elements to sample (with replacement) per bootstrap round. If between 0 and 1, treated as a fraction of the total number of events (e.g. 0.5 means half the number of events). If not specified, subsampling will not be used and the entire data will be used (without replacement)
kwargs – all other kwargs will be passed on to the evaluation_fn

Returns:

PermutationImportance.result.ImportanceResult object which contains the results for each run

Sklearn API¶

While the various variable importance methods can, in general, be used in many different situations, such as evaluating the model-agnostic presence of information within a dataset, the most typical application of the method is to determine the importance of variables as evaluated by a particular model. The tools provided here assist in the training and evaluation of sklearn models. This is done by wrapping the training and evaluation of the model into a single function which is then used as the scoring_fn of a generalized variable importance method.

All of the variable importance methods with a sklearn_ prefix use these tools to determine 1) whether to retrain a model at each step (as is necessary for Sequential Selection, but not for Permutation Importance) and 2) how to evaluate the resulting predictions of a model.

Here, the powerhouse is the model_scorer object, which handles all of the typical use-cases for any model by separately applying a training, prediction, and evaluation function. Supplied with proper functions for each of these, the model_scorer object could also be implemented to score other types of models, such as Keras models.

class PermutationImportance.sklearn_api.model_scorer(model, training_fn, prediction_fn, evaluation_fn, default_score=0.0, nbootstrap=None, subsample=1, **kwargs)[source]¶

Bases: object

General purpose scoring method which takes a particular model, trains the model over the given training data, uses the trained model to predict on the given scoring data, and then evaluates those predictions using some evaluation function. Additionally provides the tools for bootstrapping the scores and providing a distribution of scores to be used for statistics.

__call__(training_data, scoring_data)[source]¶

Uses the training, predicting, and evaluation functions to score the model given the training and scoring data

Parameters:

training_data – (training_input, training_output)
scoring_data – (scoring_input, scoring_output)
Returns:	either a single value or an array of values

__init__(model, training_fn, prediction_fn, evaluation_fn, default_score=0.0, nbootstrap=None, subsample=1, **kwargs)[source]¶

Initializes the scoring object by storing the training, predicting, and evaluation functions

Parameters:

model – a scikit-learn model
training_fn – a function for training a scikit-learn model. Must be of the form (model, training_inputs, training_outputs) -> trained_model | None. If the function returns None, then it is assumed that the model training failed. Probably PermutationImportance.sklearn_api.train_model() or PermutationImportance.sklearn_api.get_model()
prediction_fn – a function for predicting on scoring data using a scikit-learn model. Must be of the form (model, scoring_inputs) -> predictions. Predictions may be either deterministic or probabilistic, depending on what the evaluation_fn accepts. Probably PermutationImportance.sklearn_api.predict_model() or PermutationImportance.sklearn_api.predict_proba_model()
evaluation_fn –
a function which takes the deterministic or probabilistic model predictions and scores them against the true values. Must be of the form (truths, predictions) -> some_value Probably one of the metrics in PermutationImportance.metrics or sklearn.metrics
default_score – value to return if the model cannot be trained
nbootstrap – number of times to perform scoring on each variable. Results over different bootstrap iterations are averaged. Defaults to None, which will not perform bootstrapping
subsample – number of elements to sample (with replacement) per bootstrap round. If between 0 and 1, treated as a fraction of the number of total number of events (e.g. 0.5 means half the number of events). If not specified, subsampling will not be used and the entire data will be used (without replacement)

__weakref__¶: list of weak references to the object (if defined)
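The train, predict, evaluate flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's implementation: `SketchScorer`, `MeanModel`, and `mean_absolute_error` are hypothetical names, bootstrapping is left out, and extra keyword arguments are assumed to be forwarded to `evaluation_fn`.

```python
class MeanModel:
    """Hypothetical toy model: predicts the mean of the training outputs."""

    def fit(self, inputs, outputs):
        self.mean_ = sum(outputs) / len(outputs)
        return self

    def predict(self, inputs):
        return [self.mean_ for _ in inputs]


class SketchScorer:
    """Hypothetical stand-in for model_scorer, without bootstrapping."""

    def __init__(self, model, training_fn, prediction_fn, evaluation_fn,
                 default_score=0.0, **kwargs):
        self.model = model
        self.training_fn = training_fn
        self.prediction_fn = prediction_fn
        self.evaluation_fn = evaluation_fn
        self.default_score = default_score
        self.kwargs = kwargs  # assumption: forwarded to evaluation_fn

    def __call__(self, training_data, scoring_data):
        training_inputs, training_outputs = training_data
        scoring_inputs, scoring_outputs = scoring_data
        trained = self.training_fn(self.model, training_inputs, training_outputs)
        if trained is None:
            # Training failed, so fall back to the default score
            return self.default_score
        predictions = self.prediction_fn(trained, scoring_inputs)
        # evaluation_fn has the form (truths, predictions) -> some_value
        return self.evaluation_fn(scoring_outputs, predictions, **self.kwargs)


def mean_absolute_error(truths, predictions):
    return sum(abs(t - p) for t, p in zip(truths, predictions)) / len(truths)


scorer = SketchScorer(
    MeanModel(),
    training_fn=lambda model, x, y: model.fit(x, y),
    prediction_fn=lambda model, x: model.predict(x),
    evaluation_fn=mean_absolute_error,
)
score = scorer(([[0], [1], [2]], [1.0, 2.0, 3.0]), ([[3], [4]], [2.0, 4.0]))

failing = SketchScorer(
    MeanModel(),
    training_fn=lambda model, x, y: None,  # simulates a failed training run
    prediction_fn=lambda model, x: model.predict(x),
    evaluation_fn=mean_absolute_error,
    default_score=-1.0,
)
fallback = failing(([[0]], [1.0]), ([[1]], [1.0]))
```

Because the three steps are passed in as plain callables, swapping in a different framework only requires different `training_fn` and `prediction_fn` functions; the scorer itself stays the same.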

PermutationImportance.sklearn_api.score_untrained_sklearn_model(model, evaluation_fn, nbootstrap=None, subsample=1, **kwargs)[source]¶

A convenience method which uses the default training and the deterministic prediction methods for scikit-learn to evaluate a model

Parameters:

model – a scikit-learn model
evaluation_fn – a function which takes the deterministic or probabilistic model predictions and scores them against the true values. Must be of the form (truths, predictions) -> some_value. Probably one of the metrics in PermutationImportance.metrics or sklearn.metrics
nbootstrap – number of times to perform scoring on each variable. Results over different bootstrap iterations are averaged. Defaults to None, which will not perform bootstrapping
subsample – number of elements to sample (with replacement) per bootstrap round. If between 0 and 1, treated as a fraction of the total number of events (e.g. 0.5 means half the number of events). If not specified, subsampling will not be used and the entire dataset will be used (without replacement)
kwargs – all other kwargs passed on to the evaluation_fn

Returns:

a callable which accepts (training_data, scoring_data) and returns some value (probably a float or an array of floats)
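The interaction between `nbootstrap` and `subsample` described above can be illustrated with a short sketch. The helper names `resolve_subsample` and `bootstrap_scores` are hypothetical, and the handling of the boundary value is an assumption: values up to and including 1 are treated as fractions, so the default `subsample=1` means the full number of events. Averaging or otherwise summarizing the returned scores is left to the caller.

```python
import random


def resolve_subsample(subsample, n_events):
    """Number of rows to draw per bootstrap round.

    Assumption: values up to and including 1 are fractions of n_events,
    larger values are absolute counts.
    """
    if subsample <= 1:
        return int(n_events * subsample)
    return int(subsample)


def bootstrap_scores(score_once, inputs, outputs, nbootstrap, subsample, seed=0):
    """Scores nbootstrap resamples, each drawn with replacement."""
    rng = random.Random(seed)
    n_rows = resolve_subsample(subsample, len(inputs))
    scores = []
    for _ in range(nbootstrap):
        # Draw row indices with replacement for this bootstrap round
        rows = [rng.randrange(len(inputs)) for _ in range(n_rows)]
        scores.append(score_once([inputs[i] for i in rows],
                                 [outputs[i] for i in rows]))
    return scores


inputs = list(range(10))
outputs = [2 * x for x in inputs]
# Toy scoring function: the mean of the resampled outputs
scores = bootstrap_scores(lambda x, y: sum(y) / len(y),
                          inputs, outputs, nbootstrap=5, subsample=0.5)
```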

PermutationImportance.sklearn_api.score_untrained_sklearn_model_with_probabilities(model, evaluation_fn, nbootstrap=None, subsample=1, **kwargs)[source]¶

A convenience method which uses the default training and the probabilistic prediction methods for scikit-learn to evaluate a model

Parameters:

model – a scikit-learn model
evaluation_fn – a function which takes the deterministic or probabilistic model predictions and scores them against the true values. Must be of the form (truths, predictions) -> some_value. Probably one of the metrics in PermutationImportance.metrics or sklearn.metrics
nbootstrap – number of times to perform scoring on each variable. Results over different bootstrap iterations are averaged. Defaults to None, which will not perform bootstrapping
subsample – number of elements to sample (with replacement) per bootstrap round. If between 0 and 1, treated as a fraction of the total number of events (e.g. 0.5 means half the number of events). If not specified, subsampling will not be used and the entire dataset will be used (without replacement)
kwargs – all other kwargs passed on to the evaluation_fn

Returns:

a callable which accepts (training_data, scoring_data) and returns some value (probably a float or an array of floats)

PermutationImportance.sklearn_api.score_trained_sklearn_model(model, evaluation_fn, nbootstrap=None, subsample=1, **kwargs)[source]¶

A convenience method which does not retrain a scikit-learn model and uses deterministic prediction methods to evaluate the model

Parameters:

model – a scikit-learn model
evaluation_fn – a function which takes the deterministic or probabilistic model predictions and scores them against the true values. Must be of the form (truths, predictions) -> some_value. Probably one of the metrics in PermutationImportance.metrics or sklearn.metrics
nbootstrap – number of times to perform scoring on each variable. Results over different bootstrap iterations are averaged. Defaults to None, which will not perform bootstrapping
subsample – number of elements to sample (with replacement) per bootstrap round. If between 0 and 1, treated as a fraction of the total number of events (e.g. 0.5 means half the number of events). If not specified, subsampling will not be used and the entire dataset will be used (without replacement)
kwargs – all other kwargs passed on to the evaluation_fn

Returns:

a callable which accepts (training_data, scoring_data) and returns some value (probably a float or an array of floats)

PermutationImportance.sklearn_api.score_trained_sklearn_model_with_probabilities(model, evaluation_fn, nbootstrap=None, subsample=1, **kwargs)[source]¶

A convenience method which does not retrain a scikit-learn model and uses probabilistic prediction methods to evaluate the model

Parameters:

model – a scikit-learn model
evaluation_fn – a function which takes the deterministic or probabilistic model predictions and scores them against the true values. Must be of the form (truths, predictions) -> some_value. Probably one of the metrics in PermutationImportance.metrics or sklearn.metrics
nbootstrap – number of times to perform scoring on each variable. Results over different bootstrap iterations are averaged. Defaults to None, which will not perform bootstrapping
subsample – number of elements to sample (with replacement) per bootstrap round. If between 0 and 1, treated as a fraction of the total number of events (e.g. 0.5 means half the number of events). If not specified, subsampling will not be used and the entire dataset will be used (without replacement)
kwargs – all other kwargs passed on to the evaluation_fn

Returns:

a callable which accepts (training_data, scoring_data) and returns some value (probably a float or an array of floats)

PermutationImportance.sklearn_api.train_model(model, training_inputs, training_outputs)[source]¶: Trains a scikit-learn model and returns the trained model

PermutationImportance.sklearn_api.get_model(model, training_inputs, training_outputs)[source]¶: Returns the already-trained model as is, without retraining it

PermutationImportance.sklearn_api.predict_model(model, scoring_inputs)[source]¶: Uses a trained scikit-learn model to predict over the scoring data

PermutationImportance.sklearn_api.predict_proba_model(model, scoring_inputs)[source]¶: Uses a trained scikit-learn model to predict class probabilities for the scoring data
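These four helpers read as thin wrappers around the scikit-learn estimator interface. The following is a minimal sketch, assuming the model exposes `fit`, `predict`, and `predict_proba` as scikit-learn estimators do. The `_sketch` functions and `MajorityClassifier` are hypothetical stand-ins written for illustration, not the library's source.

```python
class MajorityClassifier:
    """Hypothetical estimator exposing fit, predict, and predict_proba."""

    def fit(self, inputs, outputs):
        self.classes_ = sorted(set(outputs))
        counts = [outputs.count(c) for c in self.classes_]
        self.probabilities_ = [count / len(outputs) for count in counts]
        return self

    def predict_proba(self, inputs):
        # One row of class frequencies per input row
        return [list(self.probabilities_) for _ in inputs]

    def predict(self, inputs):
        best = self.classes_[self.probabilities_.index(max(self.probabilities_))]
        return [best for _ in inputs]


def train_model_sketch(model, training_inputs, training_outputs):
    # Fit on the training data and hand back the fitted model
    return model.fit(training_inputs, training_outputs)


def get_model_sketch(model, training_inputs, training_outputs):
    # The model is assumed to be trained already, so the data is ignored
    return model


def predict_model_sketch(model, scoring_inputs):
    # Deterministic predictions
    return model.predict(scoring_inputs)


def predict_proba_model_sketch(model, scoring_inputs):
    # Probabilistic predictions
    return model.predict_proba(scoring_inputs)


pretrained = MajorityClassifier().fit([[0], [1], [2], [3]], [0, 1, 1, 1])
same_model = get_model_sketch(pretrained, [[9]], [0])
labels = predict_model_sketch(same_model, [[5], [6]])
probabilities = predict_proba_model_sketch(same_model, [[5]])
retrained = train_model_sketch(MajorityClassifier(), [[0], [1]], [0, 0])
```

Judging by their descriptions, the "trained" convenience methods pair a `get_model`-style function with one of the prediction helpers, while the "untrained" ones pair a `train_model`-style function with the same prediction helpers.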

Utils¶

Assorted helper functions for manipulating data or the results of the variable importance methods

PermutationImportance.utils.add_ranks_to_dict(result, variable_names, scoring_strategy)[source]¶

Takes a dict of {var_index: score} and converts it to a dictionary of {var: (rank, score)}

Parameters:

result – a dict of {var_index: score}
variable_names – a list of variable names
scoring_strategy – a function to be used for determining optimal variables. Should be of the form ([floats]) -> index
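One way to turn such a result into ranks is to apply the scoring strategy repeatedly, removing the chosen variable each time. The sketch below is a hedged illustration of that idea, not the library's code: `add_ranks_sketch` and `argmin` are hypothetical names, and zero-based ranks (with rank 0 as the first variable chosen) are an assumption.

```python
def add_ranks_sketch(result, variable_names, scoring_strategy):
    """Hypothetical ranking helper: {var_index: score} -> {name: (rank, score)}."""
    remaining = dict(result)  # copy so the caller's dict is left untouched
    ranked = {}
    rank = 0
    while remaining:
        # Sort the indices so the strategy always sees a fixed order
        indices = sorted(remaining)
        scores = [remaining[i] for i in indices]
        best = indices[scoring_strategy(scores)]
        ranked[variable_names[best]] = (rank, remaining.pop(best))
        rank += 1
    return ranked


def argmin(scores):
    """Example strategy of the form ([floats]) -> index."""
    return scores.index(min(scores))


ranks = add_ranks_sketch({0: 0.9, 1: 0.4, 2: 0.7},
                         ["temp", "wind", "rain"], argmin)
```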

PermutationImportance.utils.get_data_subset(data, rows=None, columns=None)[source]¶

Returns a subset of the data corresponding to the desired rows and columns

Parameters:

data – either a pandas dataframe or a numpy array
rows – a list of row indices
columns – a list of column indices

Returns:

data_subset (same type as data)

PermutationImportance.utils.make_data_from_columns(columns_list, index=None)[source]¶

Synthesizes a dataset out of a list of columns

Parameters:

columns_list – a list of either pandas series or numpy arrays

Returns:

a pandas dataframe or a numpy array
