{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Recursive feature elimination with cross-validation\n\nA Recursive Feature Elimination (RFE) example with automatic tuning of the\nnumber of features selected with cross-validation.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Data generation\n\nWe build a classification task using 3 informative features. The introduction\nof 2 additional redundant (i.e. correlated) features has the effect that the\nselected features vary depending on the cross-validation fold. The remaining\nfeatures are non-informative as they are drawn at random.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import make_classification\n\nn_features = 15\nfeat_names = [f\"feature_{i}\" for i in range(15)]\n\nX, y = make_classification(\n    n_samples=500,\n    n_features=n_features,\n    n_informative=3,\n    n_redundant=2,\n    n_repeated=0,\n    n_classes=8,\n    n_clusters_per_class=1,\n    class_sep=0.8,\n    random_state=0,\n)"
      ]
    },
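    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a quick sanity check (a small addition to the original example), we can\nlook at the empirical correlations between features: the redundant features\nare generated as random linear combinations of the informative ones, so the\nstrongest pairwise correlation should stand out against the random noise\nfeatures. The exact value depends on the random draw. A minimal sketch:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n\n# Absolute correlation matrix of the feature columns of X.\ncorr = np.abs(np.corrcoef(X, rowvar=False))\nnp.fill_diagonal(corr, 0.0)  # ignore trivial self-correlations\n\n# Locate the most strongly correlated pair of features.\ni, j = np.unravel_index(np.argmax(corr), corr.shape)\nprint(f\"Strongest pair: {feat_names[i]} / {feat_names[j]} ({corr[i, j]:.2f})\")"
      ]
    },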
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Model training and selection\n\nWe create the RFE object and compute the cross-validated scores. The scoring\nstrategy \"accuracy\" optimizes the proportion of correctly classified samples.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.feature_selection import RFECV\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import StratifiedKFold\n\nmin_features_to_select = 1  # Minimum number of features to consider\nclf = LogisticRegression()\ncv = StratifiedKFold(5)\n\nrfecv = RFECV(\n    estimator=clf,\n    step=1,\n    cv=cv,\n    scoring=\"accuracy\",\n    min_features_to_select=min_features_to_select,\n    n_jobs=2,\n)\nrfecv.fit(X, y)\n\nprint(f\"Optimal number of features: {rfecv.n_features_}\")"
      ]
    },
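    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Besides `n_features_`, the fitted selector exposes `support_` (a boolean mask\nof the features kept by the final refit on the full dataset) and `ranking_`\n(selected features get rank 1; higher ranks were eliminated earlier). The\nsnippet below, a small addition to the example, prints both:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n\n# Names of the features kept by the final refit on the whole dataset.\nprint(f\"Selected features: {np.array(feat_names)[rfecv.support_]}\")\n\n# Rank 1 marks selected features; larger ranks were eliminated earlier.\nprint(f\"Feature ranking: {rfecv.ranking_}\")"
      ]
    },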
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In the present case, the model with 3 features (which corresponds to the true\ngenerative model) is found to be the most optimal.\n\n## Plot number of features VS. cross-validation scores\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\nimport pandas as pd\n\ndata = {\n    key: value\n    for key, value in rfecv.cv_results_.items()\n    if key in [\"n_features\", \"mean_test_score\", \"std_test_score\"]\n}\ncv_results = pd.DataFrame(data)\nplt.figure()\nplt.xlabel(\"Number of features selected\")\nplt.ylabel(\"Mean test accuracy\")\nplt.errorbar(\n    x=cv_results[\"n_features\"],\n    y=cv_results[\"mean_test_score\"],\n    yerr=cv_results[\"std_test_score\"],\n)\nplt.title(\"Recursive Feature Elimination \\nwith correlated features\")\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "From the plot above one can further notice a plateau of equivalent scores\n(similar mean value and overlapping errorbars) for 3 to 5 selected features.\nThis is the result of introducing correlated features. Indeed, the optimal\nmodel selected by the RFE can lie within this range, depending on the\ncross-validation technique. The test accuracy decreases above 5 selected\nfeatures, this is, keeping non-informative features leads to over-fitting and\nis therefore detrimental for the statistical performance of the models.\n\n"
      ]
    },
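    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To back the plateau observation with numbers, we can print the aggregated\nscores for 3 to 5 selected features directly (a small addition; `between` is\nthe standard pandas range filter):\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Mean and standard deviation of the test accuracy in the plateau region.\nprint(cv_results[cv_results[\"n_features\"].between(3, 5)])"
      ]
    },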
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n\nfor i in range(cv.n_splits):\n    mask = rfecv.cv_results_[f\"split{i}_support\"][\n        rfecv.n_features_ - 1\n    ]  # mask of features selected by the RFE\n    features_selected = np.ma.compressed(np.ma.masked_array(feat_names, mask=1 - mask))\n    print(f\"Features selected in fold {i}: {features_selected}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In the five folds, the selected features are consistent. This is good news,\nit means that the selection is stable across folds, and it confirms that\nthese features are the most informative ones.\n\n"
      ]
    }
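    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a final, optional step (an addition to the original example), the fitted\n`RFECV` can be used directly as a transformer to keep only the selected\ncolumns of `X`:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Keep only the columns selected by the cross-validated RFE.\nX_reduced = rfecv.transform(X)\nprint(f\"Reduced data shape: {X_reduced.shape}\")"
      ]
    }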
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.14"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}