{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Column Transformer with Heterogeneous Data Sources\n\nDatasets can often contain components that require different feature\nextraction and processing pipelines. This scenario might occur when:\n\n1. your dataset consists of heterogeneous data types (e.g. raster images and\n   text captions),\n2. your dataset is stored in a :class:`pandas.DataFrame` and different columns\n   require different processing pipelines.\n\nThis example demonstrates how to use\n:class:`~sklearn.compose.ColumnTransformer` on a dataset containing\ndifferent types of features. The choice of features is not particularly\nhelpful, but serves to illustrate the technique.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport numpy as np\n\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.datasets import fetch_20newsgroups\nfrom sklearn.decomposition import PCA\nfrom sklearn.feature_extraction import DictVectorizer\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nfrom sklearn.metrics import classification_report\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import FunctionTransformer\nfrom sklearn.svm import LinearSVC"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 20 newsgroups dataset\n\nWe will use the `20 newsgroups dataset <20newsgroups_dataset>`, which\ncomprises posts from newsgroups on 20 topics. This dataset is split\ninto train and test subsets based on messages posted before and after\na specific date. We will only use posts from 2 categories to speed up running\ntime.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "categories = [\"sci.med\", \"sci.space\"]\nX_train, y_train = fetch_20newsgroups(\n    random_state=1,\n    subset=\"train\",\n    categories=categories,\n    remove=(\"footers\", \"quotes\"),\n    return_X_y=True,\n)\nX_test, y_test = fetch_20newsgroups(\n    random_state=1,\n    subset=\"test\",\n    categories=categories,\n    remove=(\"footers\", \"quotes\"),\n    return_X_y=True,\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Each feature comprises meta information about that post, such as the subject,\nand the body of the news post.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(X_train[0])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Creating transformers\n\nFirst, we would like a transformer that extracts the subject and\nbody of each post. Since this is a stateless transformation (does not\nrequire state information from training data), we can define a function that\nperforms the data transformation then use\n:class:`~sklearn.preprocessing.FunctionTransformer` to create a scikit-learn\ntransformer.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def subject_body_extractor(posts):\n    # construct object dtype array with two columns\n    # first column = 'subject' and second column = 'body'\n    features = np.empty(shape=(len(posts), 2), dtype=object)\n    for i, text in enumerate(posts):\n        # temporary variable `_` stores '\\n\\n'\n        headers, _, body = text.partition(\"\\n\\n\")\n        # store body text in second column\n        features[i, 1] = body\n\n        prefix = \"Subject:\"\n        sub = \"\"\n        # save text after 'Subject:' in first column\n        for line in headers.split(\"\\n\"):\n            if line.startswith(prefix):\n                sub = line[len(prefix) :]\n                break\n        features[i, 0] = sub\n\n    return features\n\n\nsubject_body_transformer = FunctionTransformer(subject_body_extractor)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We will also create a transformer that extracts the\nlength of the text and the number of sentences.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def text_stats(posts):\n    return [{\"length\": len(text), \"num_sentences\": text.count(\".\")} for text in posts]\n\n\ntext_stats_transformer = FunctionTransformer(text_stats)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Classification pipeline\n\nThe pipeline below extracts the subject and body from each post using\n``SubjectBodyExtractor``, producing a (n_samples, 2) array. This array is\nthen used to compute standard bag-of-words features for the subject and body\nas well as text length and number of sentences on the body, using\n``ColumnTransformer``. We combine them, with weights, then train a\nclassifier on the combined set of features.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "pipeline = Pipeline(\n    [\n        # Extract subject & body\n        (\"subjectbody\", subject_body_transformer),\n        # Use ColumnTransformer to combine the subject and body features\n        (\n            \"union\",\n            ColumnTransformer(\n                [\n                    # bag-of-words for subject (col 0)\n                    (\"subject\", TfidfVectorizer(min_df=50), 0),\n                    # bag-of-words with decomposition for body (col 1)\n                    (\n                        \"body_bow\",\n                        Pipeline(\n                            [\n                                (\"tfidf\", TfidfVectorizer()),\n                                (\"best\", PCA(n_components=50, svd_solver=\"arpack\")),\n                            ]\n                        ),\n                        1,\n                    ),\n                    # Pipeline for pulling text stats from post's body\n                    (\n                        \"body_stats\",\n                        Pipeline(\n                            [\n                                (\n                                    \"stats\",\n                                    text_stats_transformer,\n                                ),  # returns a list of dicts\n                                (\n                                    \"vect\",\n                                    DictVectorizer(),\n                                ),  # list of dicts -> feature matrix\n                            ]\n                        ),\n                        1,\n                    ),\n                ],\n                # weight above ColumnTransformer features\n                transformer_weights={\n                    \"subject\": 0.8,\n                    \"body_bow\": 0.5,\n                    \"body_stats\": 1.0,\n                },\n            ),\n        ),\n        # Use an SVC classifier on the combined features\n        (\"svc\", LinearSVC(dual=False)),\n    ],\n    verbose=True,\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, we fit our pipeline on the training data and use it to predict\ntopics for ``X_test``. Performance metrics of our pipeline are then printed.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "pipeline.fit(X_train, y_train)\ny_pred = pipeline.predict(X_test)\nprint(\"Classification report:\\n\\n{}\".format(classification_report(y_test, y_pred)))"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.14"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}