{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Introducing the `set_output` API\n\n.. currentmodule:: sklearn\n\nThis example will demonstrate the `set_output` API to configure transformers to\noutput pandas DataFrames. `set_output` can be configured per estimator by calling\nthe `set_output` method or globally by setting `set_config(transform_output=\"pandas\")`.\nFor details, see\n[SLEP018](https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep018/proposal.html)_.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "First, we load the iris dataset as a DataFrame to demonstrate the `set_output` API.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\n\nX, y = load_iris(as_frame=True, return_X_y=True)\nX_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)\nX_train.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To configure an estimator such as :class:`preprocessing.StandardScaler` to return\nDataFrames, call `set_output`. This feature requires pandas to be installed.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.preprocessing import StandardScaler\n\nscaler = StandardScaler().set_output(transform=\"pandas\")\n\nscaler.fit(X_train)\nX_test_scaled = scaler.transform(X_test)\nX_test_scaled.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`set_output` can be called after `fit` to configure `transform` after the fact.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "scaler2 = StandardScaler()\n\nscaler2.fit(X_train)\nX_test_np = scaler2.transform(X_test)\nprint(f\"Default output type: {type(X_test_np).__name__}\")\n\nscaler2.set_output(transform=\"pandas\")\nX_test_df = scaler2.transform(X_test)\nprint(f\"Configured pandas output type: {type(X_test_df).__name__}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In a :class:`pipeline.Pipeline`, `set_output` configures all steps to output\nDataFrames.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.feature_selection import SelectPercentile\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.pipeline import make_pipeline\n\nclf = make_pipeline(\n    StandardScaler(), SelectPercentile(percentile=75), LogisticRegression()\n)\nclf.set_output(transform=\"pandas\")\nclf.fit(X_train, y_train)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Each transformer in the pipeline is configured to return DataFrames. This\nmeans that the final logistic regression step contains the feature names of the input.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "clf[-1].feature_names_in_"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>If one uses the method `set_params`, the transformer will be\n   replaced by a new one with the default output format.</p></div>\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "clf.set_params(standardscaler=StandardScaler())\nclf.fit(X_train, y_train)\nclf[-1].feature_names_in_"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To keep the intended behavior, use `set_output` on the new transformer\nbeforehand\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "scaler = StandardScaler().set_output(transform=\"pandas\")\nclf.set_params(standardscaler=scaler)\nclf.fit(X_train, y_train)\nclf[-1].feature_names_in_"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next we load the titanic dataset to demonstrate `set_output` with\n:class:`compose.ColumnTransformer` and heterogeneous data.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import fetch_openml\n\nX, y = fetch_openml(\"titanic\", version=1, as_frame=True, return_X_y=True)\nX_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The `set_output` API can be configured globally by using :func:`set_config` and\nsetting `transform_output` to `\"pandas\"`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn import set_config\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.preprocessing import OneHotEncoder, StandardScaler\n\nset_config(transform_output=\"pandas\")\n\nnum_pipe = make_pipeline(SimpleImputer(), StandardScaler())\nnum_cols = [\"age\", \"fare\"]\nct = ColumnTransformer(\n    (\n        (\"numerical\", num_pipe, num_cols),\n        (\n            \"categorical\",\n            OneHotEncoder(\n                sparse_output=False, drop=\"if_binary\", handle_unknown=\"ignore\"\n            ),\n            [\"embarked\", \"sex\", \"pclass\"],\n        ),\n    ),\n    verbose_feature_names_out=False,\n)\nclf = make_pipeline(ct, SelectPercentile(percentile=50), LogisticRegression())\nclf.fit(X_train, y_train)\nclf.score(X_test, y_test)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "With the global configuration, all transformers output DataFrames. This allows us to\neasily plot the logistic regression coefficients with the corresponding feature names.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n\nlog_reg = clf[-1]\ncoef = pd.Series(log_reg.coef_.ravel(), index=log_reg.feature_names_in_)\n_ = coef.sort_values().plot.barh()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In order to demonstrate the :func:`config_context` functionality below, let\nus first reset `transform_output` to its default value.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "set_config(transform_output=\"default\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "When configuring the output type with :func:`config_context` the\nconfiguration at the time when `transform` or `fit_transform` are\ncalled is what counts. Setting these only when you construct or fit\nthe transformer has no effect.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn import config_context\n\nscaler = StandardScaler()\nscaler.fit(X_train[num_cols])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "with config_context(transform_output=\"pandas\"):\n    # the output of transform will be a Pandas DataFrame\n    X_test_scaled = scaler.transform(X_test[num_cols])\nX_test_scaled.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "outside of the context manager, the output will be a NumPy array\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X_test_scaled = scaler.transform(X_test[num_cols])\nX_test_scaled[:5]"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.14"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}