Modules¶
Abstract Runner¶
The general algorithm for all of the data-based variable importance methods
is the same, regardless of whether the method is Sequential Selection or
Permutation Importance or something else. This is represented in the
abstract_variable_importance
function. All of the different methods we
provide use this function under the hood and the only difference between them is
the selection_strategy
object, which is detailed in
PermutationImportance.selection_strategies
. Typically, you will not need
to use this method but can instead use one of the methods imported directly into
the top package of PermutationImportance.
If you wish to implement your own variable importance method, you will need to
devise your own selection_strategy
. We recommend using
PermutationImportance.selection_strategies
as a template for implementing
your own variable importance method.
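The shared algorithm described above can be pictured as a loop that, on each pass, asks the selection strategy for one triple per unranked variable, scores each triple, and lets the scoring strategy pick the winner. The following is a minimal sketch in plain Python, not the library's implementation: `run_importance`, `forward_strategy`, `sum_error`, and `argmin` are hypothetical names, and the real `abstract_variable_importance` additionally verifies the data, supports multiple jobs, and returns an `ImportanceResult`.

```python
def run_importance(training_data, scoring_data, scoring_fn,
                   scoring_strategy, selection_strategy, num_vars):
    """Rank variables one pass at a time; returns indices, most important first."""
    important_vars = []
    for _ in range(num_vars):
        # One pass: the selection strategy yields a triple per unranked variable.
        variables, scores = [], []
        for var, train, score in selection_strategy(
                training_data, scoring_data, num_vars, important_vars):
            variables.append(var)
            scores.append(scoring_fn(train, score))
        # The scoring strategy maps the list of scores to the index of the winner.
        important_vars.append(variables[scoring_strategy(scores)])
    return important_vars


def forward_strategy(training_data, scoring_data, num_vars, important_vars):
    """Toy strategy: keep the already-ranked columns plus one candidate column."""
    def keep(data, columns):
        inputs, outputs = data
        return [[row[c] for c in columns] for row in inputs], outputs

    for var in range(num_vars):
        if var not in important_vars:
            columns = important_vars + [var]
            yield var, keep(training_data, columns), keep(scoring_data, columns)


def sum_error(training_data, scoring_data):
    """Toy scoring_fn: total absolute error of predicting the row sum (no training)."""
    inputs, outputs = scoring_data
    return sum(abs(sum(row) - y) for row, y in zip(inputs, outputs))


def argmin(scores):
    """Toy scoring_strategy: index of the smallest score."""
    return min(range(len(scores)), key=scores.__getitem__)


# The target is column 0 plus column 1; column 2 carries no information.
inputs = [[1, 10, 0], [2, 20, 0], [3, 30, 0], [4, 40, 0]]
outputs = [11, 22, 33, 44]
data = (inputs, outputs)
ranking = run_importance(data, data, sum_error, argmin, forward_strategy, num_vars=3)
print(ranking)
```

Only the selection strategy knows how the data is reshaped for each candidate, which is why swapping it out is enough to obtain a different variable importance method.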
-
PermutationImportance.abstract_runner.
_multithread_iteration
(selection_iterator, scoring_fn, njobs)[source]¶ Handles a single pass of the abstract variable importance algorithm using multithreading
Parameters: - selection_iterator – an iterator which yields triples
(variable, training_data, scoring_data)
. Typically a PermutationImportance.selection_strategies.SelectionStrategy
- scoring_fn – a function to be used for scoring. Should be of the form
(training_data, scoring_data) -> float
- njobs – number of processes to use
Returns: a dict of
{var: score}
-
PermutationImportance.abstract_runner.
_singlethread_iteration
(selection_iterator, scoring_fn)[source]¶ Handles a single pass of the abstract variable importance algorithm, assuming a single worker thread
Parameters: - selection_iterator – an iterator which yields triples
(variable, training_data, scoring_data)
. Typically a PermutationImportance.selection_strategies.SelectionStrategy
- scoring_fn – a function to be used for scoring. Should be of the form
(training_data, scoring_data) -> float
Returns: a dict of
{var: score}
-
PermutationImportance.abstract_runner.
abstract_variable_importance
(training_data, scoring_data, scoring_fn, scoring_strategy, selection_strategy, variable_names=None, nimportant_vars=None, method=None, njobs=1)[source]¶ Performs an abstract variable importance over data given a particular set of functions for scoring, determining optimal variables, and selecting data
Parameters: - training_data – a 2-tuple
(inputs, outputs)
for training in the scoring_fn
- scoring_data – a 2-tuple
(inputs, outputs)
for scoring in the scoring_fn
- scoring_fn – a function to be used for scoring. Should be of the form
(training_data, scoring_data) -> some_value
- scoring_strategy – a function to be used for determining optimal
variables. Should be of the form
([some_value]) -> index
- variable_names – an optional list for variable names. If not given, will use names of columns of data (if pandas dataframe) or column indices
- nimportant_vars – number of variables to compute importance for. Defaults to all variables
- method – a string for the name of the method used. Defaults to the
name of the
selection_strategy
if not given
- njobs – an integer for the number of threads to use. If negative, will
use
num_cpus + njobs
. Defaults to 1
Returns: PermutationImportance.result.ImportanceResult
object which contains the results for each run
Data Verification¶
These utilities are designed to check whether the given data and variable names match the expected format. For the training or scoring data, we accept either a pandas dataframe with the target column indicated, two different dataframes, or two numpy arrays.
-
PermutationImportance.data_verification.
verify_data
(data)[source]¶ Verifies that the data tuple is of the right format and coerces it to numpy arrays for the code under the hood
Parameters: data – one of the following: (pandas dataframe, string for target column), (pandas dataframe for inputs, pandas dataframe for outputs), (numpy array for inputs, numpy array for outputs)
Returns: (numpy array for input, numpy array for output) or (pandas dataframe for input, pandas dataframe for output)
-
PermutationImportance.data_verification.
determine_variable_names
(data, variable_names)[source]¶ Uses
data
and/or the variable_names
to determine what the variable names are. If variable_names
is not specified and data
is not a pandas dataframe, defaults to the column indices
Parameters: - data – a 2-tuple where the input data is the first item
- variable_names – either a list of variable names or None
Returns: a list of variable names
Error Handling¶
There are a handful of different errors and warnings that we can report. This module houses all of them and provides information on how to fix them.
-
exception
PermutationImportance.error_handling.
AmbiguousProbabilisticForecastsException
(truths, predictions, msg=None)[source]¶ Bases:
Exception
Thrown when classes were not provided for converting probabilistic predictions to deterministic ones but are required
-
exception
PermutationImportance.error_handling.
FullImportanceResultWarning
[source]¶ Bases:
Warning
Thrown when we try to add a result to a full
PermutationImportance.result.ImportanceResult
-
exception
PermutationImportance.error_handling.
InvalidDataException
(data, msg=None)[source]¶ Bases:
Exception
Thrown when the training or scoring data is not of the right type
-
exception
PermutationImportance.error_handling.
InvalidInputException
(value, msg=None)[source]¶ Bases:
Exception
Thrown when the input to the program does not match expectations
-
exception
PermutationImportance.error_handling.
InvalidStrategyException
(strategy, msg=None, options=None)[source]¶ Bases:
Exception
Thrown when a scoring strategy is invalid
Metrics¶
These are metric functions which can be used to score model predictions
against the true values. They are designed to be used either as a component of
a scoring_fn
of the method-specific variable importance methods or
stand-alone as the evaluation_fn
of a model-based variable importance
method.
In addition to these metrics, all of the metrics and loss functions provided in sklearn.metrics should also work.
-
PermutationImportance.metrics.
gerrity_score
(truths, predictions, classes=None)[source]¶ Determines the Gerrity Score, returning a scalar. See here for more details on the Gerrity Score
Parameters: - truths – The true labels of these data
- predictions – The predictions of the model
- classes – an ordered set for the label possibilities. If not given, will be deduced from the truth values
Returns: a single value for the gerrity score
-
PermutationImportance.metrics.
peirce_skill_score
(truths, predictions, classes=None)[source]¶ Determines the Peirce Skill Score (True Skill Score), returning a scalar. See here for more details on the Peirce Skill Score
Parameters: - truths – The true labels of these data
- predictions – The predictions of the model
- classes – an ordered set for the label possibilities. If not given, will be deduced from the truth values
Returns: a single value for the peirce skill score
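For the two-class case, the Peirce Skill Score reduces to the hit rate minus the false alarm rate. The sketch below covers only that binary case and assumes both classes occur in the truths; `binary_peirce_skill_score` and its `positive` argument are hypothetical names for illustration, not the library's `peirce_skill_score`, which also accepts an ordered set of classes.

```python
def binary_peirce_skill_score(truths, predictions, positive=1):
    """Hit rate minus false alarm rate for two-class deterministic predictions."""
    pairs = list(zip(truths, predictions))
    hits = sum(t == positive and p == positive for t, p in pairs)
    misses = sum(t == positive and p != positive for t, p in pairs)
    false_alarms = sum(t != positive and p == positive for t, p in pairs)
    correct_negatives = sum(t != positive and p != positive for t, p in pairs)
    # Assumes at least one positive and one negative event in the truths.
    hit_rate = hits / (hits + misses)
    false_alarm_rate = false_alarms / (false_alarms + correct_negatives)
    return hit_rate - false_alarm_rate


truths = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]
score = binary_peirce_skill_score(truths, predictions)
perfect = binary_peirce_skill_score([1, 0], [1, 0])
```

A function of this `(truths, predictions) -> some_value` shape is what the `evaluation_fn` arguments elsewhere in this package expect.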
-
PermutationImportance.metrics.
heidke_skill_score
(truths, predictions, classes=None)[source]¶ Determines the Heidke Skill Score, returning a scalar. See here for more details on the Heidke Skill Score
Parameters: - truths – The true labels of these data
- predictions – The predictions of the model
- classes – an ordered set for the label possibilities. If not given, will be deduced from the truth values
Returns: a single value for the heidke skill score
Multiprocessing Utils¶
These are utilities designed for carefully handling communication between processes while multithreading.
The code for pool_imap_unordered
is copied nearly wholesale from GrantJ’s
Stack Overflow answer here.
It allows for a lazy imap over an iterable and the return of very large objects.
-
PermutationImportance.multiprocessing_utils.
pool_imap_unordered
(func, iterable, procs=1)[source]¶ Lazily imaps in an unordered manner over an iterable in parallel as a generator
Author: Grant Jenks <https://stackoverflow.com/users/232571/grantj>
Parameters: - func – function to apply to each item of the iterable
- iterable – iterable which has items to map over
- procs – number of workers in the pool. Defaults to the cpu count
Yields: the results of the mapping
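The idea of a lazy, unordered map can be sketched with the standard library alone. Note that this sketch deliberately swaps in a thread pool from `concurrent.futures` rather than worker processes, so it is not the code used by `pool_imap_unordered`; `lazy_imap_unordered` and `workers` are hypothetical names. The point it illustrates is that only a bounded number of items is pulled from the iterable at a time and results are yielded as soon as any task finishes.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from itertools import islice


def lazy_imap_unordered(func, iterable, workers=2):
    """Yield func(item) for each item, in completion order, pulling items lazily."""
    items = iter(iterable)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Start with at most `workers` tasks in flight.
        pending = {pool.submit(func, item) for item in islice(items, workers)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                yield future.result()
            # Refill: one new task for every task that just finished.
            for item in islice(items, len(done)):
                pending.add(pool.submit(func, item))


# Results may arrive in any order, so sort them before comparing.
squares = sorted(lazy_imap_unordered(lambda x: x * x, range(5), workers=2))
```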
Permutation Importance¶
Permutation Importance determines which variables are important by comparing performance on a dataset where some of the variables are permuted in their individual columns to performance on the dataset without any permutation. The permutation of an individual variable in this manner has the effect of breaking any relationship between the input variable and the target. The variable which, when permuted, results in the worst performance is typically taken as the most important variable.
Typically, when using a performance metric or skill score with Permutation
Importance, the scoring_strategy
should be to minimize the performance. On
the other hand, when using an error or loss function, the scoring_strategy
should be to maximize the error or loss function.
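A single pass of this idea can be sketched in plain Python. This is an illustration rather than the library's code: `singlepass_permutation_scores` and `accuracy_of_fixed_rule` are hypothetical names, the data is a list of rows instead of a numpy array, and passing `None` as the training data is an assumption of the sketch. Because the toy score is a performance metric (accuracy), the strategy is to take the minimum.

```python
import random


def singlepass_permutation_scores(scoring_data, scoring_fn, seed=0):
    """Score the data once per variable, with only that variable's column shuffled."""
    inputs, outputs = scoring_data
    rng = random.Random(seed)
    scores = {}
    for var in range(len(inputs[0])):
        column = [row[var] for row in inputs]
        rng.shuffle(column)  # breaks the link between this variable and the target
        permuted = [row[:var] + [value] + row[var + 1:]
                    for row, value in zip(inputs, column)]
        scores[var] = scoring_fn(None, (permuted, outputs))
    return scores


def accuracy_of_fixed_rule(training_data, scoring_data):
    """Toy scoring_fn: a fixed 'model' that always predicts the first column."""
    inputs, outputs = scoring_data
    return sum(row[0] == y for row, y in zip(inputs, outputs)) / len(outputs)


# Column 0 determines the target; column 1 is constant and carries no information.
inputs = [[i % 2, 5] for i in range(20)]
outputs = [row[0] for row in inputs]
scores = singlepass_permutation_scores((inputs, outputs), accuracy_of_fixed_rule)
# Performance metric, so the most important variable has the lowest score.
most_important = min(scores, key=scores.get)
```

Shuffling the constant column leaves the accuracy untouched, while shuffling column 0 can only lower it, so column 0 is selected.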
-
PermutationImportance.permutation_importance.
permutation_importance
(scoring_data, scoring_fn, scoring_strategy, variable_names=None, nimportant_vars=None, njobs=1)[source]¶ Performs permutation importance over data given a particular set of functions for scoring and determining optimal variables
Parameters: - scoring_data – a 2-tuple
(inputs, outputs)
for scoring in the scoring_fn
- scoring_fn – a function to be used for scoring. Should be of the form
(training_data, scoring_data) -> some_value
- scoring_strategy – a function to be used for determining optimal
variables. Should be of the form
([some_value]) -> index
- variable_names – an optional list for variable names. If not given, will use names of columns of data (if pandas dataframe) or column indices
- nimportant_vars – number of variables to compute multipass importance for. Defaults to all variables
- njobs – an integer for the number of threads to use. If negative, will
use
num_cpus + njobs
. Defaults to 1
Returns: PermutationImportance.result.ImportanceResult
object which contains the results for each run
-
PermutationImportance.permutation_importance.
sklearn_permutation_importance
(model, scoring_data, evaluation_fn, scoring_strategy, variable_names=None, nimportant_vars=None, njobs=1, nbootstrap=1, subsample=1, **kwargs)[source]¶ Performs permutation importance for a particular model,
scoring_data
, evaluation_fn
, and strategy for determining optimal variables
Parameters: - model – a trained sklearn model
- scoring_data – a 2-tuple
(inputs, outputs)
for scoring in the scoring_fn
- evaluation_fn –
a function which takes the deterministic or probabilistic model predictions and scores them against the true values. Must be of the form
(truths, predictions) -> some_value
Probably one of the metrics in PermutationImportance.metrics
or sklearn.metrics
- scoring_strategy – a function to be used for determining optimal
variables. Should be of the form
([some_value]) -> index
- variable_names – an optional list for variable names. If not given, will use names of columns of data (if pandas dataframe) or column indices
- nimportant_vars – number of variables to compute multipass importance for. Defaults to all variables
- njobs – an integer for the number of threads to use. If negative, will
use
num_cpus + njobs
. Defaults to 1
- nbootstrap – number of times to perform scoring on each variable. Results over different bootstrap iterations are averaged. Defaults to 1
- subsample – number of elements to sample (with replacement) per bootstrap round. If between 0 and 1, treated as a fraction of the total number of events (e.g. 0.5 means half the number of events). If not specified, subsampling will not be used and the entire data will be used (without replacement)
- kwargs – all other kwargs will be passed on to the
evaluation_fn
Returns: PermutationImportance.result.ImportanceResult
object which contains the results for each run
Result¶
The ImportanceResult
is an object which keeps track of the full context
and scoring determined by a variable importance method. Because the variable
importance methods iteratively determine the next most important variable, this
yields a sequence of pairs of “contexts” (i.e. the previous ranks/scores of
variables) and “results” (i.e. the current ranks/scores of variables). This
object keeps track of those pairs and additionally provides methods for the easy
retrieval of both the results with empty context (singlepass, Breiman) and the
most complete context (multipass, Lakshmanan). Further, it enables iteration
over the (context, results)
pairs and for indexing into the list of pairs.
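The pairing of contexts and results can be pictured with a small container. This is a hypothetical stand-in written for illustration, not the real `ImportanceResult`: the class name, its attributes, and the way the context is accumulated are assumptions of the sketch. Only the constructor-like shape, the `(rank, score)` values, and the smallest-rank default follow the descriptions on this page.

```python
class TinyImportanceResult:
    """Hypothetical stand-in: stores (context, results) pairs in order."""

    def __init__(self, method, variable_names):
        self.method = method
        self.variable_names = variable_names
        self.contexts = []  # what was already ranked before each round
        self.results = []   # {variable: (rank, score)} produced by each round
        self._ranked = {}

    def add_new_results(self, new_results, next_important_variable=None):
        # Default: the variable with the smallest rank in this round.
        if next_important_variable is None:
            next_important_variable = min(
                new_results, key=lambda name: new_results[name][0])
        self.contexts.append(dict(self._ranked))
        self.results.append(dict(new_results))
        self._ranked[next_important_variable] = new_results[next_important_variable]

    def __len__(self):
        return len(self.results)

    def __getitem__(self, index):
        return self.contexts[index], self.results[index]

    def __iter__(self):
        return iter(zip(self.contexts, self.results))


result = TinyImportanceResult("toy method", ["a", "b"])
result.add_new_results({"a": (0, 0.4), "b": (1, 0.7)})
result.add_new_results({"b": (0, 0.3)})
first_context, first_results = result[0]   # empty context: the singlepass view
last_context, last_results = result[-1]    # fullest context seen so far
```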
-
class
PermutationImportance.result.
ImportanceResult
(method, variable_names, original_score)[source]¶ Bases:
object
Houses the result of any importance method, which consists of a sequence of contexts and results. An individual result can only be truly interpreted correctly in light of the corresponding context. This object allows for indexing into the contexts and results and also provides convenience methods for retrieving the results with no context and the most complete context
-
__init__
(method, variable_names, original_score)[source]¶ Initializes the results object with the method used and a list of variable names
Parameters: - method – string for the type of variable importance used
- variable_names – a list of names for variables
- original_score – the score of the model when no variables are important
-
__weakref__
¶ list of weak references to the object (if defined)
-
add_new_results
(new_results, next_important_variable=None)[source]¶ Adds a new round of results. Warns if the ImportanceResult is already complete
Parameters: - new_results – a dictionary with keys of variable names and values
of
(rank, score)
- next_important_variable – variable name of the next most important variable. If not given, will select the variable with the smallest rank
-
Scoring Strategies¶
In a variable importance method, the scoring_strategy
is a function which
is used to determine which of the scores corresponding to a given variable
indicates that the variable is “most important”. This will be dependent on the
particular type of object which is returned as a score.
Here, we provide a few functions which can be used directly as scoring
strategies as well as some utilities for constructing scoring strategies.
Moreover, we also provide a dictionary of aliases for several commonly used
strategies in VALID_SCORING_STRATEGIES
.
-
PermutationImportance.scoring_strategies.
verify_scoring_strategy
(scoring_strategy)[source]¶ Asserts that the scoring strategy is valid and interprets various strings
Parameters: scoring_strategy – a function to be used for determining optimal variables or a string. If a function, should be of the form ([some value]) -> index
. If a string, must be one of the options in VALID_SCORING_STRATEGIES
Returns: a function to be used for determining optimal variables
-
class
PermutationImportance.scoring_strategies.
indexer_of_converter
(indexer, converter)[source]¶ Bases:
object
This object is designed to help construct a scoring strategy by breaking the process of determining an optimal score into two pieces: First, each of the scores is converted to a simpler representation. For instance, an array of scores resulting from a bootstrapped evaluation method may be converted to just their mean. Second, the simpler representations are compared to determine the index of the optimal one. This is typically just an
argmin
or argmax
call.
-
__init__
(indexer, converter)[source]¶ Constructs a function which first converts all objects in a list to something simpler and then uses the indexer to determine the index of the most “optimal” one
Parameters: - indexer – a function which converts a list of probably simple values (like numbers) to a single index
- converter – a function which converts a single more complex object to a simpler one (like a single number)
-
__weakref__
¶ list of weak references to the object (if defined)
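The two-step idea behind `indexer_of_converter` (convert each score, then index into the converted list) can be sketched as a small callable class. `IndexerOfConverter`, `argmin`, and `argmin_of_mean` below are hypothetical names written for this illustration and are not imported from the package.

```python
from statistics import mean


class IndexerOfConverter:
    """Compose a per-score converter with an indexer over the converted list."""

    def __init__(self, indexer, converter):
        self.indexer = indexer
        self.converter = converter

    def __call__(self, scores):
        # Step 1: simplify every score. Step 2: pick the index of the best one.
        return self.indexer([self.converter(score) for score in scores])


def argmin(values):
    """Index of the smallest value in a list."""
    return min(range(len(values)), key=values.__getitem__)


argmin_of_mean = IndexerOfConverter(argmin, mean)

# One list of bootstrapped scores per variable; the means are 0.85, 0.3, 0.5.
bootstrapped_scores = [[0.9, 0.8], [0.2, 0.4], [0.5, 0.5]]
best_index = argmin_of_mean(bootstrapped_scores)
```

The resulting object has the `([some_value]) -> index` form required of a `scoring_strategy`.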
-
Selection Strategies¶
Each of the various variable importance methods uses the same code to compute
successively important variables. The only difference between each of these
methods is the data which is provided to the scoring function. The
SelectionStrategy
handles the process of converting the original training
and scoring data to the form required for each of the individual variables. This
is done by using the current list of important variables to generate a sequence
of triples (variable, training_data, scoring_data)
, which will later be
passed to the scoring function to determine the score for that variable.
Below, SelectionStrategy
encapsulates the base functionality which houses the
parameters necessary to produce the generator as well as the default method for
providing only the datasets which are necessary to be evaluated. Each of the
other classes extends this base class to implement a particular variable
importance method.
If you wish to design your own variable importance method, you will want to
extend the SelectionStrategy
base class in the same way as the other
strategies.
-
class
PermutationImportance.selection_strategies.
SequentialForwardSelectionStrategy
(training_data, scoring_data, num_vars, important_vars)[source]¶ Bases:
PermutationImportance.selection_strategies.SelectionStrategy
Sequential Forward Selection tests all variables which are not yet considered important by adding that column to the other columns which are returned. This means that the shape of the training data will be
(num_rows, num_important_vars + 1)
.
-
class
PermutationImportance.selection_strategies.
SequentialBackwardSelectionStrategy
(training_data, scoring_data, num_vars, important_vars)[source]¶ Bases:
PermutationImportance.selection_strategies.SelectionStrategy
Sequential Backward Selection tests all variables which are not yet considered important by removing that column from the data. This means that the shape of the training data will be
(num_rows, num_vars - num_important_vars - 1)
.
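The shape rule for backward selection can be checked with a small generator over lists of rows. This is a sketch under stated assumptions, not the library's `SequentialBackwardSelectionStrategy`: `backward_selection_triples` is a hypothetical free function that takes the same four arguments as the class constructor and yields `(variable, training_data, scoring_data)` triples.

```python
def backward_selection_triples(training_data, scoring_data, num_vars, important_vars):
    """Yield one triple per untested variable, with that column removed as well."""
    def drop(data, removed):
        inputs, outputs = data
        kept = [c for c in range(num_vars) if c not in removed]
        return [[row[c] for c in kept] for row in inputs], outputs

    for var in range(num_vars):
        if var not in important_vars:
            # Remove the columns already ranked plus the candidate column.
            removed = set(important_vars) | {var}
            yield var, drop(training_data, removed), drop(scoring_data, removed)


inputs = [[1, 2, 3, 4], [5, 6, 7, 8]]
data = (inputs, [0, 1])
triples = list(backward_selection_triples(data, data, num_vars=4, important_vars=[2]))
tested = [var for var, _, _ in triples]
widths = [len(train[0][0]) for _, train, _ in triples]
```

With four variables and one already ranked, every candidate dataset has 4 - 1 - 1 = 2 columns, matching `(num_rows, num_vars - num_important_vars - 1)`.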
-
class
PermutationImportance.selection_strategies.
PermutationImportanceSelectionStrategy
(training_data, scoring_data, num_vars, important_vars)[source]¶ Bases:
PermutationImportance.selection_strategies.SelectionStrategy
Permutation Importance tests all variables which are not yet considered important by shuffling that column in addition to the columns of the variables which are considered important. The shape of the data will remain constant, but at each step, one additional column will be permuted.
-
__init__
(training_data, scoring_data, num_vars, important_vars)[source]¶ Initializes the object by storing the data and keeping track of other important information
Parameters: - training_data – (training_inputs, training_outputs)
- scoring_data – (scoring_inputs, scoring_outputs)
- num_vars – integer for the total number of variables
- important_vars – a list of the indices of variables which are already considered important
-
-
class
PermutationImportance.selection_strategies.
SelectionStrategy
(training_data, scoring_data, num_vars, important_vars)[source]¶ Bases:
object
The base
SelectionStrategy
only provides the tools for storing the data and other important information as well as the convenience method for lazily iterating over the selection strategy's triples.
-
__init__
(training_data, scoring_data, num_vars, important_vars)[source]¶ Initializes the object by storing the data and keeping track of other important information
Parameters: - training_data – (training_inputs, training_outputs)
- scoring_data – (scoring_inputs, scoring_outputs)
- num_vars – integer for the total number of variables
- important_vars – a list of the indices of variables which are already considered important
-
__weakref__
¶ list of weak references to the object (if defined)
-
Sequential Selection¶
Sequential Selection determines which variables are important by evaluating performance on a dataset where only some of the variables are present. Variables which, when present, greatly improve the performance are typically considered important and variables which, when removed, degrade the performance only slightly or not at all are typically considered unimportant.
Sequential Forward Selection iteratively adds variables to the set of important variables, meaning that initially the dataset is empty and at each step the number of columns in the dataset increases by 1. A variable which, when added, results in the best performance is typically taken as the most important variable.
Sequential Backward Selection iteratively removes variables from the set of important variables, meaning that initially the dataset is complete and at each step the number of columns in the dataset decreases by 1. A variable which, when removed, results in the best performance is typically taken as the least important variable.
Typically, when using a performance metric or skill score with any Sequential
Selection method, the scoring_strategy
should be to maximize the
performance. On the other hand, when using an error or loss function, the
scoring_strategy
should be to minimize the error or loss function.
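The direction of the strategy follows from what the score measures. As a small illustration with made-up numbers, `argmax` and `argmin` below are hypothetical helpers of the `([some_value]) -> index` form: maximizing a skill score and minimizing the matching loss select the same variable.

```python
def argmax(scores):
    """Index of the largest score: use with a performance metric or skill score."""
    return max(range(len(scores)), key=scores.__getitem__)


def argmin(scores):
    """Index of the smallest score: use with an error or loss function."""
    return min(range(len(scores)), key=scores.__getitem__)


# Illustrative scores for three candidate variables in one pass.
skill_scores = [0.61, 0.74, 0.58]
losses = [0.39, 0.26, 0.42]

best_by_skill = argmax(skill_scores)
best_by_loss = argmin(losses)
```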
-
PermutationImportance.sequential_selection.
sequential_forward_selection
(training_data, scoring_data, scoring_fn, scoring_strategy, variable_names=None, nimportant_vars=None, njobs=1)[source]¶ Performs sequential forward selection over data given a particular set of functions for scoring and determining optimal variables
Parameters: - training_data – a 2-tuple
(inputs, outputs)
for training in the scoring_fn
- scoring_data – a 2-tuple
(inputs, outputs)
for scoring in the scoring_fn
- scoring_fn – a function to be used for scoring. Should be of the form
(training_data, scoring_data) -> some_value
- scoring_strategy – a function to be used for determining optimal
variables. Should be of the form
([some_value]) -> index
- variable_names – an optional list for variable names. If not given, will use names of columns of data (if pandas dataframe) or column indices
- nimportant_vars – number of variables to compute importance for. Defaults to all variables
- njobs – an integer for the number of threads to use. If negative, will
use
num_cpus + njobs
. Defaults to 1
Returns: PermutationImportance.result.ImportanceResult
object which contains the results for each run
-
PermutationImportance.sequential_selection.
sklearn_sequential_forward_selection
(model, training_data, scoring_data, evaluation_fn, scoring_strategy, variable_names=None, nimportant_vars=None, njobs=1, nbootstrap=None, subsample=1, **kwargs)[source]¶ Performs sequential forward selection for a particular model,
scoring_data
, evaluation_fn
, and strategy for determining optimal variables
Parameters: - model – a sklearn model
- training_data – a 2-tuple
(inputs, outputs)
for training in the scoring_fn
- scoring_data – a 2-tuple
(inputs, outputs)
for scoring in the scoring_fn
- evaluation_fn –
a function which takes the deterministic or probabilistic model predictions and scores them against the true values. Must be of the form
(truths, predictions) -> some_value
Probably one of the metrics in PermutationImportance.metrics
or sklearn.metrics
- scoring_strategy – a function to be used for determining optimal
variables. Should be of the form
([some_value]) -> index
- variable_names – an optional list for variable names. If not given, will use names of columns of data (if pandas dataframe) or column indices
- nimportant_vars – number of variables to compute importance for. Defaults to all variables
- njobs – an integer for the number of threads to use. If negative, will
use
num_cpus + njobs
. Defaults to 1
- nbootstrap – number of times to perform scoring on each variable. Results over different bootstrap iterations are averaged. Defaults to 1
- subsample – number of elements to sample (with replacement) per bootstrap round. If between 0 and 1, treated as a fraction of the total number of events (e.g. 0.5 means half the number of events). If not specified, subsampling will not be used and the entire data will be used (without replacement)
- kwargs – all other kwargs will be passed on to the
evaluation_fn
Returns: PermutationImportance.result.ImportanceResult
object which contains the results for each run
-
PermutationImportance.sequential_selection.
sequential_backward_selection
(training_data, scoring_data, scoring_fn, scoring_strategy, variable_names=None, nimportant_vars=None, njobs=1)[source]¶ Performs sequential backward selection over data given a particular set of functions for scoring and determining optimal variables
Parameters: - training_data – a 2-tuple
(inputs, outputs)
for training in the scoring_fn
- scoring_data – a 2-tuple
(inputs, outputs)
for scoring in the scoring_fn
- scoring_fn – a function to be used for scoring. Should be of the form
(training_data, scoring_data) -> some_value
- scoring_strategy – a function to be used for determining optimal
variables. Should be of the form
([some_value]) -> index
- variable_names – an optional list for variable names. If not given, will use names of columns of data (if pandas dataframe) or column indices
- nimportant_vars – number of variables to compute importance for. Defaults to all variables
- njobs – an integer for the number of threads to use. If negative, will
use
num_cpus + njobs
. Defaults to 1
Returns: PermutationImportance.result.ImportanceResult
object which contains the results for each run
-
PermutationImportance.sequential_selection.
sklearn_sequential_backward_selection
(model, training_data, scoring_data, evaluation_fn, scoring_strategy, variable_names=None, nimportant_vars=None, njobs=1, nbootstrap=None, subsample=1, **kwargs)[source]¶ Performs sequential backward selection for a particular model,
scoring_data
, evaluation_fn
, and strategy for determining optimal variables
Parameters: - model – a sklearn model
- training_data – a 2-tuple
(inputs, outputs)
for training in the scoring_fn
- scoring_data – a 2-tuple
(inputs, outputs)
for scoring in the scoring_fn
- evaluation_fn –
a function which takes the deterministic or probabilistic model predictions and scores them against the true values. Must be of the form
(truths, predictions) -> some_value
Probably one of the metrics in PermutationImportance.metrics
or sklearn.metrics
- scoring_strategy – a function to be used for determining optimal
variables. Should be of the form
([some_value]) -> index
- variable_names – an optional list for variable names. If not given, will use names of columns of data (if pandas dataframe) or column indices
- nimportant_vars – number of variables to compute importance for. Defaults to all variables
- njobs – an integer for the number of threads to use. If negative, will
use
num_cpus + njobs
. Defaults to 1
- nbootstrap – number of times to perform scoring on each variable. Results over different bootstrap iterations are averaged. Defaults to 1
- subsample – number of elements to sample (with replacement) per bootstrap round. If between 0 and 1, treated as a fraction of the total number of events (e.g. 0.5 means half the number of events). If not specified, subsampling will not be used and the entire data will be used (without replacement)
- kwargs – all other kwargs will be passed on to the
evaluation_fn
Returns: PermutationImportance.result.ImportanceResult
object which contains the results for each run
Sklearn API¶
While the various variable importance methods can, in general, be used in many
different situations, such as evaluating the model-agnostic presence of
information within a dataset, the most typical application of the method is to
determine the importance of variables as evaluated by a particular model. The
tools provided here are useful to assist in the training and evaluation of
sklearn models. This is done by wrapping the training and evaluation of the
model into a single function which is then used as the scoring_fn
of a
generalized variable importance method.
All of the variable importance methods with a sklearn_
prefix use these
tools to determine 1) whether to retrain a model at each step (as is necessary
for Sequential Selection, but not for Permutation Importance) and 2) how to
evaluate the resulting predictions of a model.
Here, the powerhouse is the model_scorer
object, which handles all of the
typical use-cases for any model by separately applying a training, prediction,
and evaluation function. Supplied with proper functions for each of these, the
model_scorer
object could also be implemented to score other types of
models, such as Keras models.
-
class
PermutationImportance.sklearn_api.
model_scorer
(model, training_fn, prediction_fn, evaluation_fn, default_score=0.0, nbootstrap=None, subsample=1, **kwargs)[source]¶ Bases:
object
General purpose scoring method which takes a particular model, trains the model over the given training data, uses the trained model to predict on the given scoring data, and then evaluates those predictions using some evaluation function. Additionally provides the tools for bootstrapping the scores and providing a distribution of scores to be used for statistics.
-
__call__
(training_data, scoring_data)[source]¶ Uses the training, predicting, and evaluation functions to score the model given the training and scoring data
Parameters: - training_data – (training_input, training_output)
- scoring_data – (scoring_input, scoring_output)
Returns: either a single value or an array of values
-
__init__
(model, training_fn, prediction_fn, evaluation_fn, default_score=0.0, nbootstrap=None, subsample=1, **kwargs)[source]¶ Initializes the scoring object by storing the training, predicting, and evaluation functions
Parameters: - model – a scikit-learn model
- training_fn – a function for training a scikit-learn model. Must
be of the form
(model, training_inputs, training_outputs) -> trained_model | None
. If the function returns None
, then it is assumed that the model training failed. Probably PermutationImportance.sklearn_api.train_model()
or PermutationImportance.sklearn_api.get_model()
- prediction_fn – a function for predicting on scoring data using a
scikit-learn model. Must be of the form
(model, scoring_inputs) -> predictions
. Predictions may be either deterministic or probabilistic, depending on what the evaluation_fn accepts. Probably PermutationImportance.sklearn_api.predict_model()
or PermutationImportance.sklearn_api.predict_proba_model()
- evaluation_fn –
a function which takes the deterministic or probabilistic model predictions and scores them against the true values. Must be of the form
(truths, predictions) -> some_value
Probably one of the metrics in PermutationImportance.metrics
or sklearn.metrics
- default_score – value to return if the model cannot be trained
- nbootstrap – number of times to perform scoring on each variable. Results over different bootstrap iterations are averaged. Defaults to None, which will not perform bootstrapping
- subsample – number of elements to sample (with replacement) per bootstrap round. If between 0 and 1, treated as a fraction of the total number of events (e.g. 0.5 means half the number of events). If not specified, subsampling will not be used and the entire data will be used (without replacement)
-
__weakref__
¶ list of weak references to the object (if defined)
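The train, predict, evaluate sequence that `model_scorer` wraps can be sketched without scikit-learn. Everything below is a hypothetical illustration: `SimpleModelScorer`, the majority-class "model", and the helper functions are not part of the package, and bootstrapping and subsampling are left out. What the sketch keeps from the description above is the three function signatures and the fallback to `default_score` when training returns `None`.

```python
from collections import Counter


class SimpleModelScorer:
    """Callable of the form (training_data, scoring_data) -> some_value."""

    def __init__(self, model, training_fn, prediction_fn, evaluation_fn,
                 default_score=0.0):
        self.model = model
        self.training_fn = training_fn
        self.prediction_fn = prediction_fn
        self.evaluation_fn = evaluation_fn
        self.default_score = default_score

    def __call__(self, training_data, scoring_data):
        training_inputs, training_outputs = training_data
        scoring_inputs, scoring_outputs = scoring_data
        trained = self.training_fn(self.model, training_inputs, training_outputs)
        if trained is None:  # training failed
            return self.default_score
        predictions = self.prediction_fn(trained, scoring_inputs)
        return self.evaluation_fn(scoring_outputs, predictions)


def train_majority(model, training_inputs, training_outputs):
    """Toy training_fn: remember the most common label, or fail on empty data."""
    if not training_outputs:
        return None
    return {"label": Counter(training_outputs).most_common(1)[0][0]}


def predict_majority(model, scoring_inputs):
    """Toy prediction_fn: predict the remembered label for every row."""
    return [model["label"]] * len(scoring_inputs)


def accuracy(truths, predictions):
    """Toy evaluation_fn of the form (truths, predictions) -> some_value."""
    return sum(t == p for t, p in zip(truths, predictions)) / len(truths)


scorer = SimpleModelScorer(None, train_majority, predict_majority, accuracy,
                           default_score=-1.0)
scoring = ([[3], [4]], [1, 0])
trained_score = scorer(([[0], [1], [2]], [1, 1, 0]), scoring)
failed_score = scorer(([], []), scoring)
```

Because the object is just a callable taking `(training_data, scoring_data)`, it can be passed wherever a `scoring_fn` is expected.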
-
-
PermutationImportance.sklearn_api.
score_untrained_sklearn_model
(model, evaluation_fn, nbootstrap=None, subsample=1, **kwargs)[source]¶ A convenience method which uses the default training and the deterministic prediction methods for scikit-learn to evaluate a model
Parameters: - model – a scikit-learn model
- evaluation_fn – a function which takes the deterministic or probabilistic model predictions and scores them against the true values. Must be of the form (truths, predictions) -> some_value. Probably one of the metrics in PermutationImportance.metrics or sklearn.metrics
- nbootstrap – number of times to perform scoring on each variable. Results over different bootstrap iterations are averaged. Defaults to None, which will not perform bootstrapping
- subsample – number of elements to sample (with replacement) per bootstrap round. If between 0 and 1, treated as a fraction of the total number of events (e.g. 0.5 means half the number of events). If not specified, subsampling will not be used and the entire data will be used (without replacement)
- kwargs – all other kwargs passed on to the evaluation_fn
Returns: a callable which accepts (training_data, scoring_data) and returns some value (probably a float or an array of floats)
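The nbootstrap and subsample parameters can be pictured with a small sketch of bootstrap scoring. This is an illustration under stated assumptions, not the library's code: resolve_subsample, bootstrap_scores, and accuracy are hypothetical helpers, and treating a value of exactly 1 as "the whole dataset" is an assumption drawn from the default subsample=1.

```python
import random


def resolve_subsample(subsample, n_events):
    """Turn the subsample argument into a number of rows to draw."""
    # Assumption: values in (0, 1] are fractions of n_events; larger values are counts.
    if 0 < subsample <= 1:
        return int(n_events * subsample)
    return int(subsample)


def bootstrap_scores(truths, predictions, evaluation_fn, nbootstrap, subsample=1, seed=0):
    rng = random.Random(seed)
    n_rows = resolve_subsample(subsample, len(truths))
    scores = []
    for _ in range(nbootstrap):
        # Sample row indices with replacement for this bootstrap round.
        rows = [rng.randrange(len(truths)) for _ in range(n_rows)]
        scores.append(
            evaluation_fn([truths[i] for i in rows], [predictions[i] for i in rows])
        )
    return scores


def accuracy(truths, predictions):
    return sum(t == p for t, p in zip(truths, predictions)) / len(truths)


truths = [0, 1, 1, 0, 1, 0, 1, 1]
predictions = [0, 1, 0, 0, 1, 1, 1, 1]
scores = bootstrap_scores(truths, predictions, accuracy, nbootstrap=5, subsample=0.5)
mean_score = sum(scores) / len(scores)  # average over bootstrap rounds
```

With subsample=0.5 and eight events, each of the five rounds scores four rows drawn with replacement.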
-
PermutationImportance.sklearn_api.
score_untrained_sklearn_model_with_probabilities
(model, evaluation_fn, nbootstrap=None, subsample=1, **kwargs)[source]¶ A convenience method which uses the default training and the probabilistic prediction methods for scikit-learn to evaluate a model
Parameters: - model – a scikit-learn model
- evaluation_fn – a function which takes the deterministic or probabilistic model predictions and scores them against the true values. Must be of the form (truths, predictions) -> some_value. Probably one of the metrics in PermutationImportance.metrics or sklearn.metrics
- nbootstrap – number of times to perform scoring on each variable. Results over different bootstrap iterations are averaged. Defaults to None, which will not perform bootstrapping
- subsample – number of elements to sample (with replacement) per bootstrap round. If between 0 and 1, treated as a fraction of the total number of events (e.g. 0.5 means half the number of events). If not specified, subsampling will not be used and the entire data will be used (without replacement)
- kwargs – all other kwargs passed on to the evaluation_fn
Returns: a callable which accepts (training_data, scoring_data) and returns some value (probably a float or an array of floats)
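The difference between deterministic and probabilistic prediction matters because the evaluation_fn must accept whatever the prediction step produces. The sketch below uses a hypothetical ThresholdClassifier that mimics scikit-learn's predict and predict_proba methods; the two metrics are simple illustrative implementations, not functions taken from PermutationImportance.metrics.

```python
class ThresholdClassifier:
    """Hypothetical stand-in exposing scikit-learn style predict / predict_proba."""

    def predict_proba(self, inputs):
        # One row per sample: [P(class 0), P(class 1)], using the first
        # feature clipped to [0, 1] as P(class 1).
        probs = [min(max(row[0], 0.0), 1.0) for row in inputs]
        return [[1.0 - p, p] for p in probs]

    def predict(self, inputs):
        # Hard labels obtained by thresholding P(class 1) at 0.5.
        return [1 if p1 >= 0.5 else 0 for _, p1 in self.predict_proba(inputs)]


def accuracy(truths, predictions):
    # Expects hard class labels (deterministic predictions).
    return sum(t == p for t, p in zip(truths, predictions)) / len(truths)


def brier_score(truths, probabilities):
    # Expects per-class probabilities; lower is better.
    return sum((p[1] - t) ** 2 for t, p in zip(truths, probabilities)) / len(truths)


model = ThresholdClassifier()
inputs = [[0.5], [0.0], [1.0], [0.5]]
truths = [1, 0, 1, 0]
deterministic_score = accuracy(truths, model.predict(inputs))
probabilistic_score = brier_score(truths, model.predict_proba(inputs))
```

Passing probabilities to accuracy, or labels to brier_score, would silently give meaningless scores, which is why the prediction method and the metric need to be chosen as a pair.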
-
PermutationImportance.sklearn_api.
score_trained_sklearn_model
(model, evaluation_fn, nbootstrap=None, subsample=1, **kwargs)[source]¶ A convenience method which does not retrain a scikit-learn model and uses deterministic prediction methods to evaluate the model
Parameters: - model – a scikit-learn model
- evaluation_fn – a function which takes the deterministic or probabilistic model predictions and scores them against the true values. Must be of the form (truths, predictions) -> some_value. Probably one of the metrics in PermutationImportance.metrics or sklearn.metrics
- nbootstrap – number of times to perform scoring on each variable. Results over different bootstrap iterations are averaged. Defaults to None, which will not perform bootstrapping
- subsample – number of elements to sample (with replacement) per bootstrap round. If between 0 and 1, treated as a fraction of the total number of events (e.g. 0.5 means half the number of events). If not specified, subsampling will not be used and the entire data will be used (without replacement)
- kwargs – all other kwargs passed on to the evaluation_fn
Returns: a callable which accepts (training_data, scoring_data) and returns some value (probably a float or an array of floats)
-
PermutationImportance.sklearn_api.
score_trained_sklearn_model_with_probabilities
(model, evaluation_fn, nbootstrap=None, subsample=1, **kwargs)[source]¶ A convenience method which does not retrain a scikit-learn model and uses probabilistic prediction methods to evaluate the model
Parameters: - model – a scikit-learn model
- evaluation_fn – a function which takes the deterministic or probabilistic model predictions and scores them against the true values. Must be of the form (truths, predictions) -> some_value. Probably one of the metrics in PermutationImportance.metrics or sklearn.metrics
- nbootstrap – number of times to perform scoring on each variable. Results over different bootstrap iterations are averaged. Defaults to None, which will not perform bootstrapping
- subsample – number of elements to sample (with replacement) per bootstrap round. If between 0 and 1, treated as a fraction of the total number of events (e.g. 0.5 means half the number of events). If not specified, subsampling will not be used and the entire data will be used (without replacement)
- kwargs – all other kwargs passed on to the evaluation_fn
Returns: a callable which accepts (training_data, scoring_data) and returns some value (probably a float or an array of floats)
-
PermutationImportance.sklearn_api.
train_model
(model, training_inputs, training_outputs)[source]¶ Trains a scikit-learn model and returns the trained model
-
PermutationImportance.sklearn_api.
get_model
(model, training_inputs, training_outputs)[source]¶ Just return the trained model
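The contrast between train_model and get_model can be sketched as follows. Both share the (model, training_inputs, training_outputs) form, so either can serve as a training_fn. The functions below are illustrative re-implementations under the assumption that training means calling the estimator's fit method; CountingModel is a hypothetical stand-in that records how often it is fitted.

```python
class CountingModel:
    """Hypothetical stand-in that records how many times it was fitted."""

    def __init__(self):
        self.fit_calls = 0

    def fit(self, inputs, outputs):
        self.fit_calls += 1
        return self


def train_model_sketch(model, training_inputs, training_outputs):
    # Refit the model on the data handed in, then return it.
    return model.fit(training_inputs, training_outputs)


def get_model_sketch(model, training_inputs, training_outputs):
    # Ignore the data; the model is assumed to be trained already.
    return model


model = CountingModel()
train_model_sketch(model, [[1.0]], [0])
calls_after_train = model.fit_calls
get_model_sketch(model, [[2.0]], [1])
calls_after_get = model.fit_calls
```

The fit counter increases only for the training variant, which is the behavior the trained-model convenience functions rely on to avoid retraining.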
Utils¶
Assorted helper functions for manipulating data or the results of a variable importance run
-
PermutationImportance.utils.
add_ranks_to_dict
(result, variable_names, scoring_strategy)[source]¶ Takes a dict of {var_index: score} and converts it to a dictionary of {var: (rank, score)}
Parameters: - result – a dict of {var_index: score}
- variable_names – a list of variable names
- scoring_strategy – a function to be used for determining optimal variables. Should be of the form ([floats]) -> index
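One way to picture the ranking is to apply the scoring_strategy repeatedly, each time removing the variable it selects. The sketch below is an assumption about how such a ranking could work, not the library's implementation; add_ranks_sketch and argmin are hypothetical names, and ranks are assumed to start at 0.

```python
def add_ranks_sketch(result, variable_names, scoring_strategy):
    remaining = dict(result)  # copy so the caller's dict is left untouched
    ranked = {}
    rank = 0
    while remaining:
        # Sort by variable index so the strategy sees a deterministic order.
        indices = sorted(remaining)
        scores = [remaining[i] for i in indices]
        # The strategy returns the position of the best score in the list.
        best = indices[scoring_strategy(scores)]
        ranked[variable_names[best]] = (rank, remaining.pop(best))
        rank += 1
    return ranked


def argmin(scores):
    # A ([floats]) -> index strategy that treats the lowest score as best.
    return min(range(len(scores)), key=scores.__getitem__)


ranks = add_ranks_sketch({0: 0.9, 1: 0.4, 2: 0.7}, ["temp", "wind", "rain"], argmin)
```

With argmin as the strategy, the variable with the lowest score receives rank 0; swapping in an argmax-style strategy would reverse the order.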
-
PermutationImportance.utils.
get_data_subset
(data, rows=None, columns=None)[source]¶ Returns a subset of the data corresponding to the desired rows and columns
Parameters: - data – either a pandas dataframe or a numpy array
- rows – a list of row indices
- columns – a list of column indices
Returns: data_subset (same type as data)
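For a NumPy array, selecting rows and columns independently can be sketched with np.ix_. The helper subset_array below is a hypothetical illustration, not the library's code; it assumes that None means "keep everything along that axis" and handles only two-dimensional arrays. For a pandas dataframe, positional indexing with DataFrame.iloc[rows, columns] plays the analogous role.

```python
import numpy as np


def subset_array(data, rows=None, columns=None):
    # Assumption: None means "keep everything along that axis".
    if rows is None:
        rows = np.arange(data.shape[0])
    if columns is None:
        columns = np.arange(data.shape[1])
    # np.ix_ builds an open mesh so rows and columns are selected independently.
    return data[np.ix_(rows, columns)]


data = np.arange(12).reshape(4, 3)
sub = subset_array(data, rows=[0, 2], columns=[1, 2])
only_rows = subset_array(data, rows=[1])
```

Indexing with data[rows, columns] directly would instead pair the two lists element by element, which is why the mesh form is used here.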