{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Effect of transforming the targets in regression model\n\nIn this example, we give an overview of\n:class:`~sklearn.compose.TransformedTargetRegressor`. We use two examples\nto illustrate the benefit of transforming the targets before learning a linear\nregression model. The first example uses synthetic data while the second\nexample is based on the Ames housing data set.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Synthetic example\n\nA synthetic random regression dataset is generated. The targets ``y`` are\nmodified by:\n\n1. translating all targets such that all entries are\n   non-negative (by adding the absolute value of the lowest ``y``) and\n2. applying an exponential function to obtain non-linear\n   targets which cannot be fitted using a simple linear model.\n\nTherefore, a logarithmic (`np.log1p`) and an exponential function\n(`np.expm1`) will be used to transform the targets before training a linear\nregression model and using it for prediction.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n\nfrom sklearn.datasets import make_regression\n\nX, y = make_regression(n_samples=10_000, noise=100, random_state=0)\ny = np.expm1((y + abs(y.min())) / 200)\ny_trans = np.log1p(y)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Below we plot the probability density functions of the target\nbefore and after applying the logarithmic functions.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\n\nfrom sklearn.model_selection import train_test_split\n\nf, (ax0, ax1) = plt.subplots(1, 2)\n\nax0.hist(y, bins=100, density=True)\nax0.set_xlim([0, 2000])\nax0.set_ylabel(\"Probability\")\nax0.set_xlabel(\"Target\")\nax0.set_title(\"Target distribution\")\n\nax1.hist(y_trans, bins=100, density=True)\nax1.set_ylabel(\"Probability\")\nax1.set_xlabel(\"Target\")\nax1.set_title(\"Transformed target distribution\")\n\nf.suptitle(\"Synthetic data\", y=1.05)\nplt.tight_layout()\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "At first, a linear model will be applied on the original targets. Due to the\nnon-linearity, the model trained will not be precise during\nprediction. Subsequently, a logarithmic function is used to linearize the\ntargets, allowing better prediction even with a similar linear model as\nreported by the median absolute error (MedAE).\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.metrics import median_absolute_error, r2_score\n\n\ndef compute_score(y_true, y_pred):\n    return {\n        \"R2\": f\"{r2_score(y_true, y_pred):.3f}\",\n        \"MedAE\": f\"{median_absolute_error(y_true, y_pred):.3f}\",\n    }"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.compose import TransformedTargetRegressor\nfrom sklearn.linear_model import RidgeCV\nfrom sklearn.metrics import PredictionErrorDisplay\n\nf, (ax0, ax1) = plt.subplots(1, 2, sharey=True)\n\nridge_cv = RidgeCV().fit(X_train, y_train)\ny_pred_ridge = ridge_cv.predict(X_test)\n\nridge_cv_with_trans_target = TransformedTargetRegressor(\n    regressor=RidgeCV(), func=np.log1p, inverse_func=np.expm1\n).fit(X_train, y_train)\ny_pred_ridge_with_trans_target = ridge_cv_with_trans_target.predict(X_test)\n\nPredictionErrorDisplay.from_predictions(\n    y_test,\n    y_pred_ridge,\n    kind=\"actual_vs_predicted\",\n    ax=ax0,\n    scatter_kwargs={\"alpha\": 0.5},\n)\nPredictionErrorDisplay.from_predictions(\n    y_test,\n    y_pred_ridge_with_trans_target,\n    kind=\"actual_vs_predicted\",\n    ax=ax1,\n    scatter_kwargs={\"alpha\": 0.5},\n)\n\n# Add the score in the legend of each axis\nfor ax, y_pred in zip([ax0, ax1], [y_pred_ridge, y_pred_ridge_with_trans_target]):\n    for name, score in compute_score(y_test, y_pred).items():\n        ax.plot([], [], \" \", label=f\"{name}={score}\")\n    ax.legend(loc=\"upper left\")\n\nax0.set_title(\"Ridge regression \\n without target transformation\")\nax1.set_title(\"Ridge regression \\n with target transformation\")\nf.suptitle(\"Synthetic data\", y=1.05)\nplt.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Real-world data set\n\nIn a similar manner, the Ames housing data set is used to show the impact\nof transforming the targets before learning a model. In this example, the\ntarget to be predicted is the selling price of each house.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import fetch_openml\nfrom sklearn.preprocessing import quantile_transform\n\names = fetch_openml(name=\"house_prices\", as_frame=True)\n# Keep only numeric columns\nX = ames.data.select_dtypes(np.number)\n# Remove columns with NaN or Inf values\nX = X.drop(columns=[\"LotFrontage\", \"GarageYrBlt\", \"MasVnrArea\"])\n# Let the price be in k$\ny = ames.target / 1000\ny_trans = quantile_transform(\n    y.to_frame(), n_quantiles=900, output_distribution=\"normal\", copy=True\n).squeeze()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A :class:`~sklearn.preprocessing.QuantileTransformer` is used to normalize\nthe target distribution before applying a\n:class:`~sklearn.linear_model.RidgeCV` model.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "f, (ax0, ax1) = plt.subplots(1, 2)\n\nax0.hist(y, bins=100, density=True)\nax0.set_ylabel(\"Probability\")\nax0.set_xlabel(\"Target\")\nax0.set_title(\"Target distribution\")\n\nax1.hist(y_trans, bins=100, density=True)\nax1.set_ylabel(\"Probability\")\nax1.set_xlabel(\"Target\")\nax1.set_title(\"Transformed target distribution\")\n\nf.suptitle(\"Ames housing data: selling price\", y=1.05)\nplt.tight_layout()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The effect of the transformer is weaker than on the synthetic data. However,\nthe transformation results in an increase in $R^2$ and large decrease\nof the MedAE. The residual plot (predicted target - true target vs predicted\ntarget) without target transformation takes on a curved, 'reverse smile'\nshape due to residual values that vary depending on the value of predicted\ntarget. With target transformation, the shape is more linear indicating\nbetter model fit.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.preprocessing import QuantileTransformer\n\nf, (ax0, ax1) = plt.subplots(2, 2, sharey=\"row\", figsize=(6.5, 8))\n\nridge_cv = RidgeCV().fit(X_train, y_train)\ny_pred_ridge = ridge_cv.predict(X_test)\n\nridge_cv_with_trans_target = TransformedTargetRegressor(\n    regressor=RidgeCV(),\n    transformer=QuantileTransformer(n_quantiles=900, output_distribution=\"normal\"),\n).fit(X_train, y_train)\ny_pred_ridge_with_trans_target = ridge_cv_with_trans_target.predict(X_test)\n\n# plot the actual vs predicted values\nPredictionErrorDisplay.from_predictions(\n    y_test,\n    y_pred_ridge,\n    kind=\"actual_vs_predicted\",\n    ax=ax0[0],\n    scatter_kwargs={\"alpha\": 0.5},\n)\nPredictionErrorDisplay.from_predictions(\n    y_test,\n    y_pred_ridge_with_trans_target,\n    kind=\"actual_vs_predicted\",\n    ax=ax0[1],\n    scatter_kwargs={\"alpha\": 0.5},\n)\n\n# Add the score in the legend of each axis\nfor ax, y_pred in zip([ax0[0], ax0[1]], [y_pred_ridge, y_pred_ridge_with_trans_target]):\n    for name, score in compute_score(y_test, y_pred).items():\n        ax.plot([], [], \" \", label=f\"{name}={score}\")\n    ax.legend(loc=\"upper left\")\n\nax0[0].set_title(\"Ridge regression \\n without target transformation\")\nax0[1].set_title(\"Ridge regression \\n with target transformation\")\n\n# plot the residuals vs the predicted values\nPredictionErrorDisplay.from_predictions(\n    y_test,\n    y_pred_ridge,\n    kind=\"residual_vs_predicted\",\n    ax=ax1[0],\n    scatter_kwargs={\"alpha\": 0.5},\n)\nPredictionErrorDisplay.from_predictions(\n    y_test,\n    y_pred_ridge_with_trans_target,\n    kind=\"residual_vs_predicted\",\n    ax=ax1[1],\n    scatter_kwargs={\"alpha\": 0.5},\n)\nax1[0].set_title(\"Ridge regression \\n without target transformation\")\nax1[1].set_title(\"Ridge regression \\n with target transformation\")\n\nf.suptitle(\"Ames housing data: selling price\", y=1.05)\nplt.tight_layout()\nplt.show()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.14"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}