{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Robust vs Empirical covariance estimate\n\nThe usual covariance maximum likelihood estimate is very sensitive to the\npresence of outliers in the data set. In such a case, it would be better to\nuse a robust estimator of covariance to guarantee that the estimation is\nresistant to \"erroneous\" observations in the data set. [1]_, [2]_\n\n## Minimum Covariance Determinant Estimator\nThe Minimum Covariance Determinant estimator is a robust, high-breakdown point\n(i.e. it can be used to estimate the covariance matrix of highly contaminated\ndatasets, up to\n$\\frac{n_\\text{samples} - n_\\text{features}-1}{2}$ outliers) estimator of\ncovariance. The idea is to find\n$\\frac{n_\\text{samples} + n_\\text{features}+1}{2}$\nobservations whose empirical covariance has the smallest determinant, yielding\na \"pure\" subset of observations from which to compute standards estimates of\nlocation and covariance. After a correction step aiming at compensating the\nfact that the estimates were learned from only a portion of the initial data,\nwe end up with robust estimates of the data set location and covariance.\n\nThe Minimum Covariance Determinant estimator (MCD) has been introduced by\nP.J.Rousseuw in [3]_.\n\n## Evaluation\nIn this example, we compare the estimation errors that are made when using\nvarious types of location and covariance estimates on contaminated Gaussian\ndistributed data sets:\n\n- The mean and the empirical covariance of the full dataset, which break\n  down as soon as there are outliers in the data set\n- The robust MCD, that has a low error provided\n  $n_\\text{samples} > 5n_\\text{features}$\n- The mean and the empirical covariance of the observations that are known\n  to be good ones. This can be considered as a \"perfect\" MCD estimation,\n  so one can trust our implementation by comparing to this case.\n\n\n## References\n.. [1] Johanna Hardin, David M Rocke. The distribution of robust distances.\n    Journal of Computational and Graphical Statistics. December 1, 2005,\n    14(4): 928-946.\n.. [2] Zoubir A., Koivunen V., Chakhchoukh Y. and Muma M. (2012). Robust\n    estimation in signal processing: A tutorial-style treatment of\n    fundamental concepts. IEEE Signal Processing Magazine 29(4), 61-80.\n.. [3] P. J. Rousseeuw. Least median of squares regression. Journal of American\n    Statistical Ass., 79:871, 1984.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport matplotlib.font_manager\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn.covariance import EmpiricalCovariance, MinCovDet\n\n# example settings\nn_samples = 80\nn_features = 5\nrepeat = 10\n\nrange_n_outliers = np.concatenate(\n    (\n        np.linspace(0, n_samples / 8, 5),\n        np.linspace(n_samples / 8, n_samples / 2, 5)[1:-1],\n    )\n).astype(int)\n\n# definition of arrays to store results\nerr_loc_mcd = np.zeros((range_n_outliers.size, repeat))\nerr_cov_mcd = np.zeros((range_n_outliers.size, repeat))\nerr_loc_emp_full = np.zeros((range_n_outliers.size, repeat))\nerr_cov_emp_full = np.zeros((range_n_outliers.size, repeat))\nerr_loc_emp_pure = np.zeros((range_n_outliers.size, repeat))\nerr_cov_emp_pure = np.zeros((range_n_outliers.size, repeat))\n\n# computation\nfor i, n_outliers in enumerate(range_n_outliers):\n    for j in range(repeat):\n        rng = np.random.RandomState(i * j)\n\n        # generate data\n        X = rng.randn(n_samples, n_features)\n        # add some outliers\n        outliers_index = rng.permutation(n_samples)[:n_outliers]\n        outliers_offset = 10.0 * (\n            np.random.randint(2, size=(n_outliers, n_features)) - 0.5\n        )\n        X[outliers_index] += outliers_offset\n        inliers_mask = np.ones(n_samples).astype(bool)\n        inliers_mask[outliers_index] = False\n\n        # fit a Minimum Covariance Determinant (MCD) robust estimator to data\n        mcd = MinCovDet().fit(X)\n        # compare raw robust estimates with the true location and covariance\n        err_loc_mcd[i, j] = np.sum(mcd.location_**2)\n        err_cov_mcd[i, j] = mcd.error_norm(np.eye(n_features))\n\n        # compare estimators learned from the full data set with true\n        # parameters\n        err_loc_emp_full[i, j] = np.sum(X.mean(0) ** 2)\n        err_cov_emp_full[i, j] = (\n            EmpiricalCovariance().fit(X).error_norm(np.eye(n_features))\n        )\n\n        # compare with an empirical covariance learned from a pure data set\n        # (i.e. \"perfect\" mcd)\n        pure_X = X[inliers_mask]\n        pure_location = pure_X.mean(0)\n        pure_emp_cov = EmpiricalCovariance().fit(pure_X)\n        err_loc_emp_pure[i, j] = np.sum(pure_location**2)\n        err_cov_emp_pure[i, j] = pure_emp_cov.error_norm(np.eye(n_features))\n\n# Display results\nfont_prop = matplotlib.font_manager.FontProperties(size=11)\nplt.subplot(2, 1, 1)\nlw = 2\nplt.errorbar(\n    range_n_outliers,\n    err_loc_mcd.mean(1),\n    yerr=err_loc_mcd.std(1) / np.sqrt(repeat),\n    label=\"Robust location\",\n    lw=lw,\n    color=\"m\",\n)\nplt.errorbar(\n    range_n_outliers,\n    err_loc_emp_full.mean(1),\n    yerr=err_loc_emp_full.std(1) / np.sqrt(repeat),\n    label=\"Full data set mean\",\n    lw=lw,\n    color=\"green\",\n)\nplt.errorbar(\n    range_n_outliers,\n    err_loc_emp_pure.mean(1),\n    yerr=err_loc_emp_pure.std(1) / np.sqrt(repeat),\n    label=\"Pure data set mean\",\n    lw=lw,\n    color=\"black\",\n)\nplt.title(\"Influence of outliers on the location estimation\")\nplt.ylabel(r\"Error ($||\\mu - \\hat{\\mu}||_2^2$)\")\nplt.legend(loc=\"upper left\", prop=font_prop)\n\nplt.subplot(2, 1, 2)\nx_size = range_n_outliers.size\nplt.errorbar(\n    range_n_outliers,\n    err_cov_mcd.mean(1),\n    yerr=err_cov_mcd.std(1),\n    label=\"Robust covariance (mcd)\",\n    color=\"m\",\n)\nplt.errorbar(\n    range_n_outliers[: (x_size // 5 + 1)],\n    err_cov_emp_full.mean(1)[: (x_size // 5 + 1)],\n    yerr=err_cov_emp_full.std(1)[: (x_size // 5 + 1)],\n    label=\"Full data set empirical covariance\",\n    color=\"green\",\n)\nplt.plot(\n    range_n_outliers[(x_size // 5) : (x_size // 2 - 1)],\n    err_cov_emp_full.mean(1)[(x_size // 5) : (x_size // 2 - 1)],\n    color=\"green\",\n    ls=\"--\",\n)\nplt.errorbar(\n    range_n_outliers,\n    err_cov_emp_pure.mean(1),\n    yerr=err_cov_emp_pure.std(1),\n    label=\"Pure data set empirical covariance\",\n    color=\"black\",\n)\nplt.title(\"Influence of outliers on the covariance estimation\")\nplt.xlabel(\"Amount of contamination (%)\")\nplt.ylabel(\"RMSE\")\nplt.legend(loc=\"center\", prop=font_prop)\n\nplt.tight_layout()\nplt.show()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.14"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}