{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# $$CatBoost\\ Tutorial$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Open in Colab](https://colab.research.google.com/github/catboost/tutorials/blob/master/python_tutorial.ipynb)\n", "\n", "In this tutorial we will explore some basic use cases of CatBoost, such as model training, cross-validation, and prediction, as well as some useful features like early stopping, snapshot support, feature importances, and parameter tuning.\n", " \n", "You can run this tutorial in the Google Colaboratory environment with a free CPU or GPU. Just click the link above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## $$Contents$$\n", "* [1. Data Preparation](#$$1.\\-Data\\-Preparation$$)\n", " * [1.1 Data Loading](#1.1-Data-Loading)\n", " * [1.2 Feature Preparation](#1.2-Feature-Preparation)\n", " * [1.3 Data Splitting](#1.3-Data-Splitting)\n", "* [2. CatBoost Basics](#$$2.\\-CatBoost\\-Basics$$)\n", " * [2.1 Model Training](#2.1-Model-Training)\n", " * [2.2 Model Cross-Validation](#2.2-Model-Cross-Validation)\n", " * [2.3 Model Applying](#2.3-Model-Applying)\n", "* [3. CatBoost Features](#$$3.\\-CatBoost\\-Features$$)\n", " * [3.1 Using the best model](#3.1-Using-the-best-model)\n", " * [3.2 Early Stopping](#3.2-Early-Stopping)\n", " * [3.3 Using Baseline](#3.3-Using-Baseline)\n", " * [3.4 Snapshot Support](#3.4-Snapshot-Support)\n", " * [3.5 User Defined Objective Function](#3.5-User-Defined-Objective-Function)\n", " * [3.6 User Defined Metric Function](#3.6-User-Defined-Metric-Function)\n", " * [3.7 Staged Predict](#3.7-Staged-Predict)\n", " * [3.8 Feature Importances](#3.8-Feature-Importances)\n", " * [3.9 Eval Metrics](#3.9-Eval-Metrics)\n", " * [3.10 Learning Processes Comparison](#3.10-Learning-Processes-Comparison)\n", " * [3.11 Model Saving](#3.11-Model-Saving)\n", "* [4. Parameters Tuning](#$$4.\\-Parameters\\-Tuning$$)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## $$1.\\ Data\\ Preparation$$\n", "### 1.1 CatBoost installation\n", "If you have not already installed CatBoost, you can do so by running the `!pip install catboost` command. \n", " \n", "You should also install the ipywidgets package and enable its notebook extension before launching Jupyter Notebook so that plots can be drawn." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install catboost\n", "!pip install scikit-learn\n", "!pip install ipywidgets\n", "!jupyter nbextension enable --py widgetsnbextension" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Data Loading\n", "The data for this tutorial can be obtained from [this page](https://www.kaggle.com/c/titanic/data) (you will need to register a Kaggle account or log in with Facebook or Google+), or you can load it with catboost.datasets, as in the code below." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |