{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Gaussian processes on discrete data structures\n\nThis example illustrates the use of Gaussian processes for regression and\nclassification tasks on data that are not in fixed-length feature vector form.\nThis is achieved through the use of kernel functions that operates directly\non discrete structures such as variable-length sequences, trees, and graphs.\n\nSpecifically, here the input variables are some gene sequences stored as\nvariable-length strings consisting of letters 'A', 'T', 'C', and 'G',\nwhile the output variables are floating point numbers and True/False labels\nin the regression and classification tasks, respectively.\n\nA kernel between the gene sequences is defined using R-convolution [1]_ by\nintegrating a binary letter-wise kernel over all pairs of letters among a pair\nof strings.\n\nThis example will generate three figures.\n\nIn the first figure, we visualize the value of the kernel, i.e. the similarity\nof the sequences, using a colormap. Brighter color here indicates higher\nsimilarity.\n\nIn the second figure, we show some regression result on a dataset of 6\nsequences. Here we use the 1st, 2nd, 4th, and 5th sequences as the training set\nto make predictions on the 3rd and 6th sequences.\n\nIn the third figure, we demonstrate a classification model by training on 6\nsequences and make predictions on another 5 sequences. The ground truth here is\nsimply  whether there is at least one 'A' in the sequence. Here the model makes\nfour correct classifications and fails on one.\n\n.. [1] Haussler, D. (1999). Convolution kernels on discrete structures\n       (Vol. 646). Technical report, Department of Computer Science, University\n       of California at Santa Cruz.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n\nfrom sklearn.base import clone\nfrom sklearn.gaussian_process import GaussianProcessClassifier, GaussianProcessRegressor\nfrom sklearn.gaussian_process.kernels import GenericKernelMixin, Hyperparameter, Kernel\n\n\nclass SequenceKernel(GenericKernelMixin, Kernel):\n    \"\"\"\n    A minimal (but valid) convolutional kernel for sequences of variable\n    lengths.\"\"\"\n\n    def __init__(self, baseline_similarity=0.5, baseline_similarity_bounds=(1e-5, 1)):\n        self.baseline_similarity = baseline_similarity\n        self.baseline_similarity_bounds = baseline_similarity_bounds\n\n    @property\n    def hyperparameter_baseline_similarity(self):\n        return Hyperparameter(\n            \"baseline_similarity\", \"numeric\", self.baseline_similarity_bounds\n        )\n\n    def _f(self, s1, s2):\n        \"\"\"\n        kernel value between a pair of sequences\n        \"\"\"\n        return sum(\n            [1.0 if c1 == c2 else self.baseline_similarity for c1 in s1 for c2 in s2]\n        )\n\n    def _g(self, s1, s2):\n        \"\"\"\n        kernel derivative between a pair of sequences\n        \"\"\"\n        return sum([0.0 if c1 == c2 else 1.0 for c1 in s1 for c2 in s2])\n\n    def __call__(self, X, Y=None, eval_gradient=False):\n        if Y is None:\n            Y = X\n\n        if eval_gradient:\n            return (\n                np.array([[self._f(x, y) for y in Y] for x in X]),\n                np.array([[[self._g(x, y)] for y in Y] for x in X]),\n            )\n        else:\n            return np.array([[self._f(x, y) for y in Y] for x in X])\n\n    def diag(self, X):\n        return np.array([self._f(x, x) for x in X])\n\n    def is_stationary(self):\n        return False\n\n    def clone_with_theta(self, theta):\n        cloned = clone(self)\n        cloned.theta = theta\n        return cloned\n\n\nkernel = SequenceKernel()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Sequence similarity matrix under the kernel\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\n\nX = np.array([\"AGCT\", \"AGC\", \"AACT\", \"TAA\", \"AAA\", \"GAACA\"])\n\nK = kernel(X)\nD = kernel.diag(X)\n\nplt.figure(figsize=(8, 5))\nplt.imshow(np.diag(D**-0.5).dot(K).dot(np.diag(D**-0.5)))\nplt.xticks(np.arange(len(X)), X)\nplt.yticks(np.arange(len(X)), X)\nplt.title(\"Sequence similarity under the kernel\")\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Regression\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X = np.array([\"AGCT\", \"AGC\", \"AACT\", \"TAA\", \"AAA\", \"GAACA\"])\nY = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])\n\ntraining_idx = [0, 1, 3, 4]\ngp = GaussianProcessRegressor(kernel=kernel)\ngp.fit(X[training_idx], Y[training_idx])\n\nplt.figure(figsize=(8, 5))\nplt.bar(np.arange(len(X)), gp.predict(X), color=\"b\", label=\"prediction\")\nplt.bar(training_idx, Y[training_idx], width=0.2, color=\"r\", alpha=1, label=\"training\")\nplt.xticks(np.arange(len(X)), X)\nplt.title(\"Regression on sequences\")\nplt.legend()\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Classification\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X_train = np.array([\"AGCT\", \"CGA\", \"TAAC\", \"TCG\", \"CTTT\", \"TGCT\"])\n# whether there are 'A's in the sequence\nY_train = np.array([True, True, True, False, False, False])\n\ngp = GaussianProcessClassifier(kernel)\ngp.fit(X_train, Y_train)\n\nX_test = [\"AAA\", \"ATAG\", \"CTC\", \"CT\", \"C\"]\nY_test = [True, True, False, False, False]\n\nplt.figure(figsize=(8, 5))\nplt.scatter(\n    np.arange(len(X_train)),\n    [1.0 if c else -1.0 for c in Y_train],\n    s=100,\n    marker=\"o\",\n    edgecolor=\"none\",\n    facecolor=(1, 0.75, 0),\n    label=\"training\",\n)\nplt.scatter(\n    len(X_train) + np.arange(len(X_test)),\n    [1.0 if c else -1.0 for c in Y_test],\n    s=100,\n    marker=\"o\",\n    edgecolor=\"none\",\n    facecolor=\"r\",\n    label=\"truth\",\n)\nplt.scatter(\n    len(X_train) + np.arange(len(X_test)),\n    [1.0 if c else -1.0 for c in gp.predict(X_test)],\n    s=100,\n    marker=\"x\",\n    facecolor=\"b\",\n    linewidth=2,\n    label=\"prediction\",\n)\nplt.xticks(np.arange(len(X_train) + len(X_test)), np.concatenate((X_train, X_test)))\nplt.yticks([-1, 1], [False, True])\nplt.title(\"Classification on sequences\")\nplt.legend()\nplt.show()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.14"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}