{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Semi-supervised Classification on a Text Dataset\n\nThis example demonstrates the effectiveness of semi-supervised learning\nfor text classification on :class:`TF-IDF\n<sklearn.feature_extraction.text.TfidfTransformer>` features when labeled data\nis scarce. To this end, we compare four approaches:\n\n1. Supervised learning using 100% of labels in the training set (best-case\n   scenario)\n\n   - Uses :class:`~sklearn.linear_model.SGDClassifier` with full supervision\n   - Represents the best possible performance when labeled data is abundant\n\n2. Supervised learning using 20% of labels in the training set (baseline)\n\n   - Same model as the best-case scenario but trained on a random 20% subset of\n     the labeled training data\n   - Shows the performance degradation of a fully supervised model due to\n     limited labeled data\n\n3. :class:`~sklearn.semi_supervised.SelfTrainingClassifier` (semi-supervised)\n\n   - Uses 20% labeled data + 80% unlabeled data for training\n   - Iteratively predicts labels for unlabeled data\n   - Demonstrates how self-training can improve performance\n\n4. :class:`~sklearn.semi_supervised.LabelSpreading` (semi-supervised)\n\n   - Uses 20% labeled data + 80% unlabeled data for training\n   - Propagates labels through the data manifold\n   - Shows how graph-based methods can leverage unlabeled data\n\nThe example uses the 20 newsgroups dataset, focusing on five categories.\nThe results demonstrate how semi-supervised methods can achieve better\nperformance than supervised learning with limited labeled data by\nexploiting unlabeled samples.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import fetch_20newsgroups\nfrom sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer\nfrom sklearn.linear_model import SGDClassifier\nfrom sklearn.metrics import f1_score\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier\n\n# Load the dataset, restricted to the first five categories\ndata = fetch_20newsgroups(\n    subset=\"train\",\n    categories=[\n        \"alt.atheism\",\n        \"comp.graphics\",\n        \"comp.os.ms-windows.misc\",\n        \"comp.sys.ibm.pc.hardware\",\n        \"comp.sys.mac.hardware\",\n    ],\n)\n\n# Parameters\nsgd_params = dict(alpha=1e-5, penalty=\"l2\", loss=\"log_loss\")\nvectorizer_params = dict(ngram_range=(1, 2), min_df=5, max_df=0.8)\n\n# Supervised Pipeline\npipeline = Pipeline(\n    [\n        (\"vect\", CountVectorizer(**vectorizer_params)),\n        (\"tfidf\", TfidfTransformer()),\n        (\"clf\", SGDClassifier(**sgd_params)),\n    ]\n)\n# SelfTraining Pipeline\nst_pipeline = Pipeline(\n    [\n        (\"vect\", CountVectorizer(**vectorizer_params)),\n        (\"tfidf\", TfidfTransformer()),\n        (\"clf\", SelfTrainingClassifier(SGDClassifier(**sgd_params))),\n    ]\n)\n# LabelSpreading Pipeline\nls_pipeline = Pipeline(\n    [\n        (\"vect\", CountVectorizer(**vectorizer_params)),\n        (\"tfidf\", TfidfTransformer()),\n        (\"clf\", LabelSpreading()),\n    ]\n)\n\n\ndef eval_and_get_f1(clf, X_train, y_train, X_test, y_test):\n    \"\"\"Fit the model, print summary statistics and return the micro-averaged F1.\"\"\"\n    print(f\"   Number of training samples: {len(X_train)}\")\n    print(f\"   Unlabeled samples in training set: {sum(1 for x in y_train if x == -1)}\")\n    clf.fit(X_train, y_train)\n    y_pred = clf.predict(X_test)\n    f1 = f1_score(y_test, y_pred, average=\"micro\")\n    print(f\"   Micro-averaged F1 score on test set: {f1:.3f}\")\n    print(\"\\n\")\n    return f1\n\n\nX, y = data.data, data.target\nX_train, X_test, y_train, y_test = train_test_split(X, y)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "1. Evaluate a supervised SGDClassifier using 100% of the (labeled) training set.\nThis represents the best-case performance when the model has full access to all\nlabeled examples.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "f1_scores = {}\nprint(\"1. Supervised SGDClassifier on 100% of the data:\")\nf1_scores[\"Supervised (100%)\"] = eval_and_get_f1(\n    pipeline, X_train, y_train, X_test, y_test\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "2. Evaluate a supervised SGDClassifier trained on only 20% of the training data.\nThis serves as a baseline to illustrate the performance drop caused by reducing\nthe number of labeled training samples.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n\nprint(\"2. Supervised SGDClassifier on 20% of the training data:\")\nrng = np.random.default_rng(42)\n# Boolean mask that selects roughly 20% of the training samples at random\ny_mask = rng.random(len(y_train)) < 0.2\n# X_20 and y_20 are the subset of the train dataset indicated by the mask\nX_20 = [x for x, m in zip(X_train, y_mask) if m]\ny_20 = y_train[y_mask]\nf1_scores[\"Supervised (20%)\"] = eval_and_get_f1(pipeline, X_20, y_20, X_test, y_test)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "3. Evaluate a semi-supervised SelfTrainingClassifier using 20% labeled and 80%\nunlabeled data.\nThe remaining 80% of the training labels are masked as unlabeled (-1),\nallowing the model to iteratively label and learn from them.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\n    \"3. SelfTrainingClassifier (semi-supervised) using 20% labeled \"\n    \"+ 80% unlabeled data:\"\n)\ny_train_semi = y_train.copy()\ny_train_semi[~y_mask] = -1\nf1_scores[\"SelfTraining\"] = eval_and_get_f1(\n    st_pipeline, X_train, y_train_semi, X_test, y_test\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "4. Evaluate a semi-supervised LabelSpreading model using 20% labeled and 80%\nunlabeled data.\nLike SelfTraining, the model infers labels for the unlabeled portion of the data\nto enhance performance.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\"4. LabelSpreading (semi-supervised) using 20% labeled + 80% unlabeled data:\")\nf1_scores[\"LabelSpreading\"] = eval_and_get_f1(\n    ls_pipeline, X_train, y_train_semi, X_test, y_test\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Plot results\nVisualize the performance of the different classification approaches using a\nbar chart of the micro-averaged :func:`~sklearn.metrics.f1_score`.\nMicro-averaging computes the metric globally by pooling the predictions of all\nclasses, so every sample carries the same weight. This gives a single overall\nmeasure of performance (equal to accuracy in a single-label multiclass problem\nsuch as this one) that can be compared directly across the different approaches.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\n\nplt.figure(figsize=(10, 6))\n\nmodels = list(f1_scores.keys())\nscores = list(f1_scores.values())\n\ncolors = [\"royalblue\", \"royalblue\", \"forestgreen\", \"royalblue\"]\nbars = plt.bar(models, scores, color=colors)\n\nplt.title(\"Comparison of Classification Approaches\")\nplt.ylabel(\"Micro-averaged F1 Score on test set\")\nplt.xticks()\n\nfor bar in bars:\n    height = bar.get_height()\n    plt.text(\n        bar.get_x() + bar.get_width() / 2.0,\n        height,\n        f\"{height:.2f}\",\n        ha=\"center\",\n        va=\"bottom\",\n    )\n\nplt.figtext(\n    0.5,\n    0.02,\n    \"SelfTraining classifier shows improved performance over \"\n    \"supervised learning with limited data\",\n    ha=\"center\",\n    va=\"bottom\",\n    fontsize=10,\n    style=\"italic\",\n)\n\nplt.tight_layout()\nplt.subplots_adjust(bottom=0.15)\nplt.show()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.14"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}