(Feature Influence)
PD captures the average response of a predictive model for a collection of instances when varying one of their features (Friedman 2001).
It communicates global (with respect to the entire explained model) feature influence.
| Property | Partial Dependence |
|---|---|
| relation | post-hoc |
| compatibility | model-agnostic |
| modelling | regression, crisp and probabilistic classification |
| scope | global (per data set; generalises to cohort) |
| target | model (set of predictions) |
| data | tabular |
| features | numerical and categorical |
| explanation | feature influence (visualisation) |
| caveats | feature correlation, unrealistic instances, heterogeneous model response |
\[ X_{\mathit{PD}} \subseteq \mathcal{X} \]
\[ V_i = \{ v_i^{\mathit{min}} , \ldots , v_i^{\mathit{max}} \} \]
\[ \mathit{PD}_i = \mathbb{E}_{X_{\setminus i}} \left[ f \left( X_{\setminus i} , x_i=v_i \right) \right] = \int f \left( X_{\setminus i} , x_i=v_i \right) \; d \mathbb{P} ( X_{\setminus i} ) \;\; \forall \; v_i \in V_i \]
\[ \mathit{PD}_i = \mathbb{E}_{X_{\setminus i}} \left[ f \left( X_{\setminus i} , x_i=V_i \right) \right] = \int f \left( X_{\setminus i} , x_i=V_i \right) \; d \mathbb{P} ( X_{\setminus i} ) \]
Based on the ICE notation (Goldstein et al. 2015)
\[ \left\{ \left( x_{S}^{(i)} , x_{C}^{(i)} \right) \right\}_{i=1}^N \]
\[ \hat{f}_S = \mathbb{E}_{X_{C}} \left[ \hat{f} \left( x_{S} , X_{C} \right) \right] = \int \hat{f} \left( x_{S} , X_{C} \right) \; d \mathbb{P} ( X_{C} ) \]
(Monte Carlo approximation)
\[ \mathit{PD}_i \approx \frac{1}{|X_{\mathit{PD}}|} \sum_{x \in X_{\mathit{PD}}} f \left( x_{\setminus i} , x_i=v_i \right) \]
\[ \hat{f}_S \approx \frac{1}{N} \sum_{i = 1}^N \hat{f} \left( x_{S} , x_{C}^{(i)} \right) \]
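The Monte Carlo estimator translates directly into code. Below is a minimal sketch, assuming a fitted model exposing a `predict` method and a NumPy array `X_pd` holding the designated (background) instances; these names, as well as the `partial_dependence_curve` helper itself, are illustrative.

```python
import numpy as np

def partial_dependence_curve(predict, X_pd, feature, values):
    """Monte Carlo estimate of the PD curve for a single (numerical) feature.

    predict -- callable mapping a 2-D array of instances to 1-D predictions
    X_pd    -- background instances used to average out the remaining features
    feature -- column index of the explained feature
    values  -- grid of values v_i at which the PD curve is evaluated
    """
    curve = []
    for value in values:
        X_mod = X_pd.copy()
        X_mod[:, feature] = value              # fix the explained feature to v_i
        curve.append(predict(X_mod).mean())    # average over the remaining features
    return np.asarray(curve)

# Example: evaluate PD of feature 2 on a 20-point grid spanning its range
# values = np.linspace(X_pd[:, 2].min(), X_pd[:, 2].max(), 20)
# pd_curve = partial_dependence_curve(model.predict, X_pd, feature=2, values=values)
```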
Centres the PD curve by anchoring it at a fixed point, usually the lower end of the explained feature's range. It is helpful when working with centred ICE.
\[ \mathbb{E}_{X_{\setminus i}} \left[ f \left( X_{\setminus i} , x_i=V_i \right) \right] - \mathbb{E}_{X_{\setminus i}} \left[ f \left( X_{\setminus i} , x_i=v_i^{\mathit{min}} \right) \right] \]
or
\[ \mathbb{E}_{X_{C}} \left[ \hat{f} \left( x_{S} , X_{C} \right) \right] - \mathbb{E}_{X_{C}} \left[ \hat{f} \left( x^{\star} , X_{C} \right) \right] \]
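Centring amounts to subtracting the curve's value at the anchor point; a minimal sketch operating on a PD curve such as the one produced by the hypothetical `partial_dependence_curve` helper above:

```python
def centre_pd(pd_curve, anchor=0):
    # Anchor the PD curve at a fixed grid point
    # (by default the lowest value of the explained feature)
    return pd_curve - pd_curve[anchor]
```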
The importance of a feature can be derived from a PD curve by assessing its flatness (Greenwell, Boehmke, and McCarthy 2018). A flat PD line indicates that the model is not overly sensitive to the values of the selected feature, hence this feature is not important for the model's decisions.
For example, for numerical features, it can be defined as the (sample) standard deviation of the PD values computed at each unique value of the explained feature around the average PD.
\[ I_{\mathit{PD}} (i) = \sqrt{ \frac{1}{|V_i| - 1} \sum_{v_i \in V_i} \left( \mathit{PD}_i - \underbrace{ \frac{1}{|V_i|} \sum_{v_i \in V_i} \mathit{PD}_i }_{\text{average PD}} \right)^2 } \]
For categorical features, it can be defined via the range rule, i.e., the range of the PD values divided by four, which provides a rough estimate of their standard deviation.
\[ I_{\mathit{PD}} (i) = \frac{ \max_{V_i} \; \mathit{PD}_i - \min_{V_i} \; \mathit{PD}_i }{ 4 } \]
Based on the ICE notation (Goldstein et al. 2015), where \(K\) is the number of unique values \(x_S^{(k)}\) of the explained feature \(x_S\)
\[ I_{\mathit{PD}} (x_S) = \sqrt{ \frac{1}{K - 1} \sum_{k=1}^K \left( \hat{f}_S(x^{(k)}_S) - \underbrace{ \frac{1}{K} \sum_{k=1}^K \hat{f}_S(x^{(k)}_S) }_{\text{average PD}} \right)^2 } \]
\[ I_{\mathit{PD}} (x_S) = \frac{ \max_{k} \; \hat{f}_S(x^{(k)}_S) - \min_{k} \; \hat{f}_S(x^{(k)}_S) }{ 4 } \]
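Both flatness-based importance measures are straightforward to compute from an estimated PD curve; a minimal sketch assuming a NumPy array of PD values (for example, the output of the hypothetical `partial_dependence_curve` helper above):

```python
import numpy as np

def pd_importance_numerical(pd_curve):
    # Sample standard deviation of the PD values around the average PD
    return np.std(pd_curve, ddof=1)

def pd_importance_categorical(pd_curve):
    # Range rule: (max - min) / 4, a rough estimate of the standard deviation
    return (np.max(pd_curve) - np.min(pd_curve)) / 4
```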
- Assumes feature independence, which is often unreasonable
- PD may not reflect the true behaviour of the model since it is based on the model's behaviour for unrealistic instances
- May be unreliable for certain values of the explained feature when these values are not uniformly distributed (abated by a rug plot)
- Limited to explaining two features at a time
- Does not capture the diversity (heterogeneity) of the model's behaviour for the individual instances used for PD calculation (abated by displaying the underlying ICE lines)
Zhao and Hastie (2021) noticed a similarity between the formulation of Partial Dependence and Pearl's back-door criterion (Pearl, Glymour, and Jewell 2016), allowing for a causal interpretation of PD under quite restrictive assumptions:
By intervening on the explained feature, we measure the change in the model's output, allowing us to analyse the causal relationship between the two.
Individual Conditional Expectation (ICE) is an instance-focused (local) "version" of Partial Dependence that communicates the influence of a specific feature value on the model's prediction by fixing the value of this feature for a single data point (Goldstein et al. 2015).
A marginal plot (M-plot) communicates the influence of a specific feature value – or similar values, i.e., an interval around the selected value – on the model's prediction by only considering relevant instances found in the designated data set. It is calculated as the average prediction of these instances.
Accumulated Local Effects (ALE) communicate the influence of a specific feature value on the model's prediction by quantifying the average (accumulated) difference between the predictions at the boundaries of a (small) fixed interval around the selected feature value (Apley and Zhu 2020). They are calculated by replacing the value of the explained feature with the interval boundaries for instances found in the designated data set whose value of this feature falls within the specified range.
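The accumulated-difference idea can be sketched in a few lines of NumPy. The snippet below is an illustrative, uncentred variant (full ALE implementations additionally centre the curve around its mean effect); the function and argument names are assumptions, not an established API.

```python
import numpy as np

def accumulated_local_effects(predict, X, feature, n_bins=10):
    """Uncentred sketch of the accumulated-difference computation described above."""
    # Interval boundaries for the explained (numerical) feature
    grid = np.quantile(X[:, feature], np.linspace(0, 1, n_bins + 1))
    effects = [0.0]
    for lower, upper in zip(grid[:-1], grid[1:]):
        # Instances whose value of the explained feature falls within the interval
        mask = (X[:, feature] >= lower) & (X[:, feature] <= upper)
        if not mask.any():
            effects.append(effects[-1])
            continue
        X_low, X_high = X[mask].copy(), X[mask].copy()
        X_low[:, feature] = lower    # replace with the lower interval boundary
        X_high[:, feature] = upper   # replace with the upper interval boundary
        # Accumulate the average prediction difference across the boundaries
        effects.append(effects[-1] + np.mean(predict(X_high) - predict(X_low)))
    return grid, np.asarray(effects)
```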
| Python | R |
|---|---|
| scikit-learn (>= 0.24.0) | iml |
| PyCEbox | ICEbox |
| PDPbox | pdp |
| InterpretML | DALEX |
| Skater | |
| alibi | |
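For instance, scikit-learn's `sklearn.inspection` module computes and plots PD, optionally overlaid with the underlying ICE lines. The plotting API has changed across releases (`plot_partial_dependence` was superseded by `PartialDependenceDisplay.from_estimator`), so the sketch below assumes a reasonably recent version (>= 1.0).

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = load_diabetes(return_X_y=True)
model = GradientBoostingRegressor().fit(X, y)

# Plot PD together with the underlying ICE lines (kind="both")
# for the "bmi" and "bp" features (columns 2 and 3)
PartialDependenceDisplay.from_estimator(model, X, features=[2, 3], kind="both")
plt.show()
```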