From 25e9cab50f9144eafa31847c8e9ef12cfe8195f3 Mon Sep 17 00:00:00 2001 From: Nilanjan Sarkar <99826967+Nilanjan2223@users.noreply.github.com> Date: Thu, 15 May 2025 08:01:31 +0530 Subject: [PATCH] Created using Colab --- ml/cc/exercises/linear_regression_taxi.ipynb | 1080 ++++++++++++++++++ 1 file changed, 1080 insertions(+) create mode 100644 ml/cc/exercises/linear_regression_taxi.ipynb diff --git a/ml/cc/exercises/linear_regression_taxi.ipynb b/ml/cc/exercises/linear_regression_taxi.ipynb new file mode 100644 index 00000000..52e3a25e --- /dev/null +++ b/ml/cc/exercises/linear_regression_taxi.ipynb @@ -0,0 +1,1080 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "X53vZqc7PxCA" + }, + "outputs": [], + "source": [ + "#@title Copyright 2023 Google LLC. Double-click here for license information.\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mWCXBrPgQD0P" + }, + "source": [ + "# Colabs\n", + "\n", + "Machine Learning Crash Course uses Colaboratories (Colabs) for all programming exercises. Colab is Google's implementation of [Jupyter Notebook](https://jupyter.org/). 
For more information about Colabs and how to use them, go to [Welcome to Colaboratory](https://research.google.com/colaboratory).\n", + "\n", + "# Linear Regression\n", + "In this Colab you will use a real dataset to train a model to predict the fare of a taxi ride in Chicago, Illinois.\n", + "\n", + "## Learning Objectives\n", + "After completing this Colab, you'll be able to:\n", + "\n", + " * Read a .csv file into a [pandas](https://developers.google.com/machine-learning/glossary/#pandas) DataFrame.\n", + " * Explore a [dataset](https://developers.google.com/machine-learning/glossary/#data_set) with Python visualization libraries.\n", + " * Experiment with different [features](https://developers.google.com/machine-learning/glossary/#feature) to build a linear regression model.\n", + " * Tune the model's [hyperparameters](https://developers.google.com/machine-learning/glossary/#hyperparameter).\n", + " * Compare training runs using [root mean squared error](https://developers.google.com/machine-learning/glossary/#root-mean-squared-error-rmse) and [loss curves](https://developers.google.com/machine-learning/glossary/#loss-curve).\n", + "\n", + "## Dataset Description\n", + "The [dataset for this exercise](https://storage.mtls.cloud.google.com/mlcc-nextgen-internal/chicago_taxi_train.csv) is derived from the [City of Chicago Taxi Trips dataset](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew). The data for this exercise is a subset of the Taxi Trips data, and focuses on a two-day period in May of 2022." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bBJQc5TgRrFx" + }, + "source": [ + "# Part 1 - Setup Exercise\n", + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V9pkosc63-63" + }, + "source": [ + "## Load required modules\n", + "\n", + "This exercise depends on several Python libraries to help with data manipulation, machine learning tasks, and data visualization.\n", + "\n", + "**Instructions**\n", + "1. Run the **Install required libraries** code cell (below).\n", + "1. Run the **Load dependencies** code cell (below)." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "LihQB7ycKEnb", + "outputId": "81a01254-7fa5-423b-ac5d-9f4be477ab0d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Requirement already satisfied: keras~=3.8.0 in /usr/local/lib/python3.11/dist-packages (3.8.0)\n", + "Requirement already satisfied: matplotlib~=3.10.0 in /usr/local/lib/python3.11/dist-packages (3.10.0)\n", + "Requirement already satisfied: numpy~=2.0.0 in /usr/local/lib/python3.11/dist-packages (2.0.2)\n", + "Requirement already satisfied: pandas~=2.2.0 in /usr/local/lib/python3.11/dist-packages (2.2.2)\n", + "Requirement already satisfied: tensorflow~=2.18.0 in /usr/local/lib/python3.11/dist-packages (2.18.0)\n", + "Requirement already satisfied: absl-py in /usr/local/lib/python3.11/dist-packages (from keras~=3.8.0) (1.4.0)\n", + "Requirement already satisfied: rich in /usr/local/lib/python3.11/dist-packages (from keras~=3.8.0) (13.9.4)\n", + "Requirement already satisfied: namex in /usr/local/lib/python3.11/dist-packages (from keras~=3.8.0) (0.0.9)\n", + "Requirement already satisfied: h5py in /usr/local/lib/python3.11/dist-packages (from keras~=3.8.0) (3.13.0)\n", + "Requirement already satisfied: optree in /usr/local/lib/python3.11/dist-packages (from keras~=3.8.0) (0.15.0)\n", + 
"Requirement already satisfied: ml-dtypes in /usr/local/lib/python3.11/dist-packages (from keras~=3.8.0) (0.4.1)\n", + "Requirement already satisfied: packaging in /usr/local/lib/python3.11/dist-packages (from keras~=3.8.0) (24.2)\n", + "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib~=3.10.0) (1.3.2)\n", + "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.11/dist-packages (from matplotlib~=3.10.0) (0.12.1)\n", + "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib~=3.10.0) (4.57.0)\n", + "Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib~=3.10.0) (1.4.8)\n", + "Requirement already satisfied: pillow>=8 in /usr/local/lib/python3.11/dist-packages (from matplotlib~=3.10.0) (11.2.1)\n", + "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib~=3.10.0) (3.2.3)\n", + "Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.11/dist-packages (from matplotlib~=3.10.0) (2.9.0.post0)\n", + "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas~=2.2.0) (2025.2)\n", + "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas~=2.2.0) (2025.2)\n", + "Requirement already satisfied: astunparse>=1.6.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow~=2.18.0) (1.6.3)\n", + "Requirement already satisfied: flatbuffers>=24.3.25 in /usr/local/lib/python3.11/dist-packages (from tensorflow~=2.18.0) (25.2.10)\n", + "Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 in /usr/local/lib/python3.11/dist-packages (from tensorflow~=2.18.0) (0.6.0)\n", + "Requirement already satisfied: google-pasta>=0.1.1 in /usr/local/lib/python3.11/dist-packages (from tensorflow~=2.18.0) (0.2.0)\n", + "Requirement already 
satisfied: libclang>=13.0.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow~=2.18.0) (18.1.1)\n", + "Requirement already satisfied: opt-einsum>=2.3.2 in /usr/local/lib/python3.11/dist-packages (from tensorflow~=2.18.0) (3.4.0)\n", + "Requirement already satisfied: protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.3 in /usr/local/lib/python3.11/dist-packages (from tensorflow~=2.18.0) (5.29.4)\n", + "Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow~=2.18.0) (2.32.3)\n", + "Requirement already satisfied: setuptools in /usr/local/lib/python3.11/dist-packages (from tensorflow~=2.18.0) (75.2.0)\n", + "Requirement already satisfied: six>=1.12.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow~=2.18.0) (1.17.0)\n", + "Requirement already satisfied: termcolor>=1.1.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow~=2.18.0) (3.1.0)\n", + "Requirement already satisfied: typing-extensions>=3.6.6 in /usr/local/lib/python3.11/dist-packages (from tensorflow~=2.18.0) (4.13.2)\n", + "Requirement already satisfied: wrapt>=1.11.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow~=2.18.0) (1.17.2)\n", + "Requirement already satisfied: grpcio<2.0,>=1.24.3 in /usr/local/lib/python3.11/dist-packages (from tensorflow~=2.18.0) (1.71.0)\n", + "Requirement already satisfied: tensorboard<2.19,>=2.18 in /usr/local/lib/python3.11/dist-packages (from tensorflow~=2.18.0) (2.18.0)\n", + "Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in /usr/local/lib/python3.11/dist-packages (from tensorflow~=2.18.0) (0.37.1)\n", + "Requirement already satisfied: wheel<1.0,>=0.23.0 in /usr/local/lib/python3.11/dist-packages (from astunparse>=1.6.0->tensorflow~=2.18.0) (0.45.1)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2.21.0->tensorflow~=2.18.0) (3.4.2)\n", + "Requirement 
already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2.21.0->tensorflow~=2.18.0) (3.10)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2.21.0->tensorflow~=2.18.0) (2.4.0)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2.21.0->tensorflow~=2.18.0) (2025.4.26)\n", + "Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.11/dist-packages (from tensorboard<2.19,>=2.18->tensorflow~=2.18.0) (3.8)\n", + "Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /usr/local/lib/python3.11/dist-packages (from tensorboard<2.19,>=2.18->tensorflow~=2.18.0) (0.7.2)\n", + "Requirement already satisfied: werkzeug>=1.0.1 in /usr/local/lib/python3.11/dist-packages (from tensorboard<2.19,>=2.18->tensorflow~=2.18.0) (3.1.3)\n", + "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.11/dist-packages (from rich->keras~=3.8.0) (3.0.0)\n", + "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.11/dist-packages (from rich->keras~=3.8.0) (2.19.1)\n", + "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.11/dist-packages (from markdown-it-py>=2.2.0->rich->keras~=3.8.0) (0.1.2)\n", + "Requirement already satisfied: MarkupSafe>=2.1.1 in /usr/local/lib/python3.11/dist-packages (from werkzeug>=1.0.1->tensorboard<2.19,>=2.18->tensorflow~=2.18.0) (3.0.2)\n", + "\n", + "\n", + "All requirements successfully installed.\n" + ] + } + ], + "source": [ + "#@title Install required libraries\n", + "\n", + "!pip install keras~=3.8.0 \\\n", + " matplotlib~=3.10.0 \\\n", + " numpy~=2.0.0 \\\n", + " pandas~=2.2.0 \\\n", + " tensorflow~=2.18.0\n", + "\n", + "print('\\n\\nAll requirements successfully installed.')" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "wHBXW8ob16z3" + }, + 
"outputs": [], + "source": [ + "#@title Code - Load dependencies\n", + "\n", + "# general\n", + "import io\n", + "\n", + "# data\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "# machine learning\n", + "import keras\n", + "\n", + "# data visualization\n", + "import plotly.express as px\n", + "from plotly.subplots import make_subplots\n", + "import plotly.graph_objects as go\n", + "import seaborn as sns" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sgR4YRjj5T-b" + }, + "source": [ + "## Load the dataset\n", + "\n", + "\n", + "The following code cell loads the dataset and creates a pandas DataFrame.\n", + "\n", + "You can think of a DataFrame like a spreadsheet with rows and columns. The rows represent individual data examples, and the columns represent the attributes associated with each example." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "abmswn6USJjQ" + }, + "outputs": [], + "source": [ + "# @title\n", + "chicago_taxi_dataset = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/chicago_taxi_train.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iKE0s1hNQ4H9" + }, + "source": [ + "## Update the DataFrame\n", + "\n", + "The following code cell updates the DataFrame to use only specific columns from the dataset.\n", + "\n", + "Notice that the output shows just a sample of the dataset, but there should be enough information for you to identify the features associated with the dataset and to look at the actual data for a few examples." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "YuLz6IgGP2LE" + }, + "outputs": [], + "source": [ + "#@title Code - Read dataset\n", + "\n", + "# Update the DataFrame to use only specific columns.\n", + "training_df = chicago_taxi_dataset[['TRIP_MILES', 'TRIP_SECONDS', 'FARE', 'COMPANY', 'PAYMENT_TYPE', 'TIP_RATE']]\n", + "\n", + "print('Read dataset completed successfully.')\n", + "print('Total number of rows: {0}\\n\\n'.format(len(training_df.index)))\n", + "training_df.head(200)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RUL471vSR28O" + }, + "source": [ + "# Part 2 - Dataset Exploration\n", + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7mhqzPIS9nFv" + }, + "source": [ + "## View dataset statistics\n", + "\n", + "A large part of most machine learning projects is getting to know your data. In this step, you will use the ``DataFrame.describe`` method to view descriptive statistics about the dataset and answer some important questions about the data.\n", + "\n", + "**Instructions**\n", + "1. Run the **View dataset statistics** code cell.\n", + "1. Inspect the output and answer these questions:\n", + " * What is the maximum fare?\n", + " * What is the mean distance across all trips?\n", + " * How many cab companies are in the dataset?\n", + " * What is the most frequent payment type?\n", + " * Are any features missing data?\n", + "1. Run the **View answers to dataset statistics** code cell to check your answers.\n", + "\n", + "\n", + "You might be wondering why there are groups of `NaN` (not a number) values listed in the output. When working with data in Python, you may see this value if the result of a calculation cannot be computed or if there is missing information. 
For example, in the taxi dataset, `PAYMENT_TYPE` and `COMPANY` are non-numeric, categorical features; numeric statistics such as mean and max do not make sense for categorical features, so the output displays `NaN`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "pkuQNjgoAKYt" + }, + "outputs": [], + "source": [ + "#@title Code - View dataset statistics\n", + "\n", + "print('Total number of rows: {0}\\n\\n'.format(len(training_df.index)))\n", + "training_df.describe(include='all')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "VQ9R5o7CcFzY" + }, + "outputs": [], + "source": [ + "#@title Double-click or run to view answers about dataset statistics\n", + "\n", + "answer = '''\n", + "What is the maximum fare? \t\t\t\t Answer: $159.25\n", + "What is the mean distance across all trips? \t\tAnswer: 8.2895 miles\n", + "How many cab companies are in the dataset? \t\t Answer: 31\n", + "What is the most frequent payment type? \t\t Answer: Credit Card\n", + "Are any features missing data? \t\t\t\t Answer: No\n", + "'''\n", + "\n", + "# You should be able to find the answers to the questions about the dataset\n", + "# by inspecting the table output after running the DataFrame describe method.\n", + "#\n", + "# Run this code cell to verify your answers.\n", + "\n", + "# What is the maximum fare?\n", + "max_fare = training_df['FARE'].max()\n", + "print(\"What is the maximum fare? \\t\\t\\t\\tAnswer: ${fare:.2f}\".format(fare = max_fare))\n", + "\n", + "# What is the mean distance across all trips?\n", + "mean_distance = training_df['TRIP_MILES'].mean()\n", + "print(\"What is the mean distance across all trips? \\t\\tAnswer: {mean:.4f} miles\".format(mean = mean_distance))\n", + "\n", + "# How many cab companies are in the dataset?\n", + "num_unique_companies = training_df['COMPANY'].nunique()\n", + "print(\"How many cab companies are in the dataset? 
\\t\\tAnswer: {number}\".format(number = num_unique_companies))\n", + "\n", + "# What is the most frequent payment type?\n", + "most_freq_payment_type = training_df['PAYMENT_TYPE'].value_counts().idxmax()\n", + "print(\"What is the most frequent payment type? \\t\\tAnswer: {type}\".format(type = most_freq_payment_type))\n", + "\n", + "# Are any features missing data?\n", + "missing_values = training_df.isnull().sum().sum()\n", + "print(\"Are any features missing data? \\t\\t\\t\\tAnswer:\", \"No\" if missing_values == 0 else \"Yes\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-StQ4-wbBpIP" + }, + "source": [ + "## Generate a correlation matrix\n", + "\n", + "An important part of machine learning is determining which [features](https://developers.google.com/machine-learning/glossary/#feature) correlate with the [label](https://developers.google.com/machine-learning/glossary/#label). If you have ever taken a taxi ride before, your experience is probably telling you that the fare is typically associated with the distance traveled and the duration of the trip. But, is there a way for you to learn more about how well these features correlate to the fare (label)?\n", + "\n", + "In this step, you will use a **correlation matrix** to identify features whose values correlate well with the label. Correlation values have the following meanings:\n", + "\n", + " * **`1.0`**: perfect positive correlation; that is, when one attribute rises, the other attribute rises.\n", + " * **`-1.0`**: perfect negative correlation; that is, when one attribute rises, the other attribute falls.\n", + " * **`0.0`**: no correlation; the two columns [are not linearly related](https://en.wikipedia.org/wiki/Correlation_and_dependence#/media/File:Correlation_examples2.svg).\n", + "\n", + "In general, the higher the absolute value of a correlation value, the greater its predictive power.\n", + "\n", + "**Instructions**\n", + "\n", + "1. 
Inspect the code in the **View correlation matrix** code cell.\n", + "1. Run the **View correlation matrix** code cell and inspect the output.\n", + "1. **Check your understanding** by answering these questions:\n", + " * Which feature correlates most strongly to the label FARE?\n", + " * Which feature correlates least strongly to the label FARE?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "-1kFmfdFDVmv" + }, + "outputs": [], + "source": [ + "#@title Code - View correlation matrix\n", + "training_df.corr(numeric_only = True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "ExPq1h6wIzvR" + }, + "outputs": [], + "source": [ + "#@title Double-click to view answers about the correlation matrix\n", + "\n", + "# Which feature correlates most strongly to the label FARE?\n", + "# ---------------------------------------------------------\n", + "answer = '''\n", + "The feature with the strongest correlation to the FARE is TRIP_MILES.\n", + "As you might expect, TRIP_MILES looks like a good feature to start with to train\n", + "the model. Also, notice that the feature TRIP_SECONDS has a strong correlation\n", + "with fare too.\n", + "'''\n", + "print(answer)\n", + "\n", + "\n", + "# Which feature correlates least strongly to the label FARE?\n", + "# -----------------------------------------------------------\n", + "answer = '''The feature with the weakest correlation to the FARE is TIP_RATE.'''\n", + "print(answer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rqklIw96G7JA" + }, + "source": [ + "## Visualize relationships in dataset\n", + "\n", + "Sometimes it is helpful to visualize relationships between features in a dataset; one way to do this is with a pair plot. 
A **pair plot** generates a grid of pairwise plots to visualize the relationship of each feature with every other feature, all in one place.\n", + "\n", + "**Instructions**\n", + "1. Run the **View pair plot** code cell." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "ph0FE7ZxHY36" + }, + "outputs": [], + "source": [ + "#@title Code - View pair plot\n", + "sns.pairplot(training_df, x_vars=[\"FARE\", \"TRIP_MILES\", \"TRIP_SECONDS\"], y_vars=[\"FARE\", \"TRIP_MILES\", \"TRIP_SECONDS\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zrereRcYR9KG" + }, + "source": [ + "# Part 3 - Train Model\n", + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PfRhSs_RR2VI" + }, + "source": [ + "## Define functions to view model information\n", + "\n", + "To help visualize the results of each training run, you will generate two plots at the end of each experiment:\n", + "\n", + "* a scatter plot of the features vs. the label with a line showing the output of the trained model\n", + "* a loss curve\n", + "\n", + "For this exercise, the plotting functions are provided for you. You do not need to understand how these plotting functions work unless you are interested.\n", + "\n", + "**Instructions**\n", + "1. Run the **Define plotting functions** code cell." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "EE7nBxoMUtE9" + }, + "outputs": [], + "source": [ + "#@title Define plotting functions\n", + "\n", + "def make_plots(df, feature_names, label_name, model_output, sample_size=200):\n", + "\n", + " random_sample = df.sample(n=sample_size).copy()\n", + " random_sample = random_sample.reset_index(drop=True)\n", + " weights, bias, epochs, rmse = model_output\n", + "\n", + " is_2d_plot = len(feature_names) == 1\n", + " model_plot_type = \"scatter\" if is_2d_plot else \"surface\"\n", + " fig = make_subplots(rows=1, cols=2,\n", + " subplot_titles=(\"Loss Curve\", \"Model Plot\"),\n", + " specs=[[{\"type\": \"scatter\"}, {\"type\": model_plot_type}]])\n", + "\n", + " plot_data(random_sample, feature_names, label_name, fig)\n", + " plot_model(random_sample, feature_names, weights, bias, fig)\n", + " plot_loss_curve(epochs, rmse, fig)\n", + "\n", + " fig.show()\n", + " return\n", + "\n", + "def plot_loss_curve(epochs, rmse, fig):\n", + " curve = px.line(x=epochs, y=rmse)\n", + " curve.update_traces(line_color='#ff0000', line_width=3)\n", + "\n", + " fig.append_trace(curve.data[0], row=1, col=1)\n", + " fig.update_xaxes(title_text=\"Epoch\", row=1, col=1)\n", + " fig.update_yaxes(title_text=\"Root Mean Squared Error\", row=1, col=1, range=[rmse.min()*0.8, rmse.max()])\n", + "\n", + " return\n", + "\n", + "def plot_data(df, features, label, fig):\n", + " if len(features) == 1:\n", + " scatter = px.scatter(df, x=features[0], y=label)\n", + " else:\n", + " scatter = px.scatter_3d(df, x=features[0], y=features[1], z=label)\n", + "\n", + " fig.append_trace(scatter.data[0], row=1, col=2)\n", + " if len(features) == 1:\n", + " fig.update_xaxes(title_text=features[0], row=1, col=2)\n", + " fig.update_yaxes(title_text=label, row=1, col=2)\n", + " else:\n", + " fig.update_layout(scene1=dict(xaxis_title=features[0], yaxis_title=features[1], zaxis_title=label))\n", + "\n", + " 
return\n", + "\n", + "def 
plot_model(df, features, weights, bias, fig):\n", + " df['FARE_PREDICTED'] = bias[0]\n", + "\n", + " for index, feature in enumerate(features):\n", + " df['FARE_PREDICTED'] = df['FARE_PREDICTED'] + weights[index][0] * df[feature]\n", + "\n", + " if len(features) == 1:\n", + " model = px.line(df, x=features[0], y='FARE_PREDICTED')\n", + " model.update_traces(line_color='#ff0000', line_width=3)\n", + " else:\n", + " z_name, y_name = \"FARE_PREDICTED\", features[1]\n", + " z = [df[z_name].min(), (df[z_name].max() - df[z_name].min()) / 2, df[z_name].max()]\n", + " y = [df[y_name].min(), (df[y_name].max() - df[y_name].min()) / 2, df[y_name].max()]\n", + " x = []\n", + " for i in range(len(y)):\n", + " x.append((z[i] - weights[1][0] * y[i] - bias[0]) / weights[0][0])\n", + "\n", + " plane=pd.DataFrame({'x':x, 'y':y, 'z':[z] * 3})\n", + "\n", + " light_yellow = [[0, '#89CFF0'], [1, '#FFDB58']]\n", + " model = go.Figure(data=go.Surface(x=plane['x'], y=plane['y'], z=plane['z'],\n", + " colorscale=light_yellow))\n", + "\n", + " fig.add_trace(model.data[0], row=1, col=2)\n", + "\n", + " return\n", + "\n", + "def model_info(feature_names, label_name, model_output):\n", + " weights = model_output[0]\n", + " bias = model_output[1]\n", + "\n", + " nl = \"\\n\"\n", + " header = \"-\" * 80\n", + " banner = header + nl + \"|\" + \"MODEL INFO\".center(78) + \"|\" + nl + header\n", + "\n", + " info = \"\"\n", + " equation = label_name + \" = \"\n", + "\n", + " for index, feature in enumerate(feature_names):\n", + " info = info + \"Weight for feature[{}]: {:.3f}\\n\".format(feature, weights[index][0])\n", + " equation = equation + \"{:.3f} * {} + \".format(weights[index][0], feature)\n", + "\n", + " info = info + \"Bias: {:.3f}\\n\".format(bias[0])\n", + " equation = equation + \"{:.3f}\\n\".format(bias[0])\n", + "\n", + " return banner + nl + info + nl + equation\n", + "\n", + "print(\"SUCCESS: defining plotting functions complete.\")" + ] + }, + { + "cell_type": "markdown", + 
"metadata": { + "id": "iRluiQhNvTwc" + }, + "source": [ + "## Define functions to build and train a model\n", + "\n", + "The code you need to build and train your model is in the **Define ML functions** code cell. If you would like to explore this code, expand the code cell and take a look.\n", + "\n", + "**Instructions**\n", + "1. Run the **Define ML functions** code cell." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "W6a7dtcCob-n" + }, + "outputs": [], + "source": [ + "#@title Code - Define ML functions\n", + "\n", + "def build_model(my_learning_rate, num_features):\n", + " \"\"\"Create and compile a simple linear regression model.\"\"\"\n", + " # Describe the topology of the model.\n", + " # The topology of a simple linear regression model\n", + " # is a single node in a single layer.\n", + " inputs = keras.Input(shape=(num_features,))\n", + " outputs = keras.layers.Dense(units=1)(inputs)\n", + " model = keras.Model(inputs=inputs, outputs=outputs)\n", + "\n", + " # Compile the model topology into code that Keras can efficiently\n", + " # execute. 
Configure training to minimize the model's mean squared error.\n", + " model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=my_learning_rate),\n", + " loss=\"mean_squared_error\",\n", + " metrics=[keras.metrics.RootMeanSquaredError()])\n", + "\n", + " return model\n", + "\n", + "\n", + "def train_model(model, features, label, epochs, batch_size):\n", + " \"\"\"Train the model by feeding it data.\"\"\"\n", + "\n", + " # Feed the model the feature and the label.\n", + " # The model will train for the specified number of epochs.\n", + " history = model.fit(x=features,\n", + " y=label,\n", + " batch_size=batch_size,\n", + " epochs=epochs)\n", + "\n", + " # Gather the trained model's weight and bias.\n", + " trained_weight = model.get_weights()[0]\n", + " trained_bias = model.get_weights()[1]\n", + "\n", + " # The list of epochs is stored separately from the rest of history.\n", + " epochs = history.epoch\n", + "\n", + " # Isolate the error for each epoch.\n", + " hist = pd.DataFrame(history.history)\n", + "\n", + " # To track the progression of training, we're going to take a snapshot\n", + " # of the model's root mean squared error at each epoch.\n", + " rmse = hist[\"root_mean_squared_error\"]\n", + "\n", + " return trained_weight, trained_bias, epochs, rmse\n", + "\n", + "\n", + "def run_experiment(df, feature_names, label_name, learning_rate, epochs, batch_size):\n", + "\n", + " print('INFO: starting training experiment with features={} and label={}\\n'.format(feature_names, label_name))\n", + "\n", + " num_features = len(feature_names)\n", + "\n", + " features = df.loc[:, feature_names].values\n", + " label = df[label_name].values\n", + "\n", + " model = build_model(learning_rate, num_features)\n", + " model_output = train_model(model, features, label, epochs, batch_size)\n", + "\n", + " print('\\nSUCCESS: training experiment complete\\n')\n", + " print('{}'.format(model_info(feature_names, label_name, model_output)))\n", + " make_plots(df, 
feature_names, label_name, model_output)\n", + "\n", + " return model\n", + "\n", + "print(\"SUCCESS: defining linear regression functions complete.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m3DQCE2OpH4-" + }, + "source": [ + "## Train a model with one feature\n", + "\n", + "In this step, you will train a model to predict the cost of the fare using a **single feature**. Earlier, you saw that `TRIP_MILES` (distance) correlates most strongly with the ``FARE``, so let's start with `TRIP_MILES` as the feature for your first training run.\n", + "\n", + "**Instructions**\n", + "\n", + "1. Run the **Experiment 1** code cell to build your model with one feature.\n", + "1. Review the output from the training run.\n", + "1. **Check your understanding** by answering these questions:\n", + " * How many epochs did it take to converge on the final model?\n", + " * How well does the model fit the sample data?\n", + "\n", + "During training, you should see the root mean squared error (RMSE) in the output. The units for RMSE are the same as the units for the label (dollars). In other words, you can use the RMSE to gauge how far off, in dollars, the predicted fares typically are from the observed values." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "F_17Aum6IG1F" + }, + "outputs": [], + "source": [ + "#@title Code - Experiment 1\n", + "\n", + "# The following variables are the hyperparameters.\n", + "learning_rate = 0.001\n", + "epochs = 20\n", + "batch_size = 50\n", + "\n", + "# Specify the feature and the label.\n", + "features = ['TRIP_MILES']\n", + "label = 'FARE'\n", + "\n", + "model_1 = run_experiment(training_df, features, label, learning_rate, epochs, batch_size)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "y8Qnmb0wZ_pQ" + }, + "outputs": [], + "source": [ + "#@title Double-click to view answers for training model with one feature\n", + "\n", + "# How many epochs did it take to converge on the final model?\n", + "# -----------------------------------------------------------------------------\n", + "answer = \"\"\"\n", + "Use the loss curve to see where the loss begins to level off during training.\n", + "\n", + "With this set of hyperparameters:\n", + "\n", + " learning_rate = 0.001\n", + " epochs = 20\n", + " batch_size = 50\n", + "\n", + "it takes about 5 epochs for the training run to converge to the final model.\n", + "\"\"\"\n", + "print(answer)\n", + "\n", + "# How well does the model fit the sample data?\n", + "# -----------------------------------------------------------------------------\n", + "answer = '''\n", + "It appears from the model plot that the model fits the sample data fairly well.\n", + "'''\n", + "print(answer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MYmWW0a9p1ro" + }, + "source": [ + "## Experiment with hyperparameters\n", + "\n", + "It is common in machine learning to run multiple experiments to find the best set of hyperparameters to train your model. 
In this step, try varying the hyperparameters one by one with this set of experiments:\n", + "\n", + "* *Experiment 1:* **Increase** the learning rate to **``1``** (batch size at ``50``).\n", + "* *Experiment 2:* **Decrease** the learning rate to **``0.0001``** (batch size at ``50``).\n", + "* *Experiment 3:* **Increase** the batch size to **``500``** (learning rate at ``0.001``).\n", + "\n", + "**Instructions**\n", + "1. Update the hyperparameter values in the **Experiment 2** code cell according to the experiment.\n", + "2. Run the **Experiment 2** code cell.\n", + "3. After the training run, examine the output and note any differences you see in the loss curve or model output.\n", + "4. Repeat steps 1 - 3 for each hyperparameter experiment.\n", + "5. **Check your understanding** by answering these questions:\n", + " * How did raising the learning rate impact your ability to train the model?\n", + " * How did lowering the learning rate impact your ability to train the model?\n", + " * Did changing the batch size affect your training results?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "PdUXEm1xeWcK" + }, + "outputs": [], + "source": [ + "#@title Code - Experiment 2\n", + "\n", + "# The following variables are the hyperparameters.\n", + "# TODO - Adjust these hyperparameters to see how they impact a training run.\n", + "learning_rate = 0.001\n", + "epochs = 20\n", + "batch_size = 50\n", + "\n", + "# Specify the feature and the label.\n", + "features = ['TRIP_MILES']\n", + "label = 'FARE'\n", + "\n", + "model_1 = run_experiment(training_df, features, label, learning_rate, epochs, batch_size)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "Od7vJJpHiHYB" + }, + "outputs": [], + "source": [ + "#@title Double-click to view answers for hyperparameter experiments\n", + "\n", + "# How did raising the learning rate impact your ability to train 
the model?\n", + "# -----------------------------------------------------------------------------\n", + "answer = \"\"\"\n", + "When the learning rate is too high, the loss curve bounces around and does not\n", + "appear to be moving towards convergence with each iteration. Also, notice that\n", + "the predicted model does not fit the data very well. With a learning rate that\n", + "is too high, it is unlikely that you will be able to train a model with good\n", + "results.\n", + "\"\"\"\n", + "print(answer)\n", + "\n", + "# How did lowering the learning rate impact your ability to train the model?\n", + "# -----------------------------------------------------------------------------\n", + "answer = '''\n", + "When the learning rate is too small, it may take longer for the loss curve to\n", + "converge. With a small learning rate the loss curve decreases slowly, but does\n", + "not show a dramatic drop or leveling off. With a small learning rate you could\n", + "increase the number of epochs so that your model will eventually converge, but\n", + "it will take longer.\n", + "'''\n", + "print(answer)\n", + "\n", + "# Did changing the batch size affect your training results?\n", + "# -----------------------------------------------------------------------------\n", + "answer = '''\n", + "Increasing the batch size makes each epoch run faster, but as with the smaller\n", + "learning rate, the model does not converge with just 20 epochs. If you have\n", + "time, try increasing the number of epochs and eventually you should see the\n", + "model converge.\n", + "'''\n", + "print(answer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o27u0JRj_gJr" + }, + "source": [ + "## Train a model with two features\n", + "\n", + "The model you trained with the feature ``TRIP_MILES`` demonstrates fairly strong predictive power, but is it possible to do better? 
In this step, try training the model with two features, ``TRIP_MILES`` and ``TRIP_MINUTES``, to see if you can improve the model. You may recall that the original dataset does not include a feature ``TRIP_MINUTES``, but this feature can be easily derived from ``TRIP_SECONDS`` as shown in the code below.\n", + "\n", + "**Instructions**\n", + "1. Review the code in **Experiment 3** code cell.\n", + "1. Run the **Experiment 3** code cell.\n", + "1. Review the output from the training run and answer these questions:\n", + " * Does the model with two features produce better results than one using a single feature?\n", + " * Does it make a difference if you use ``TRIP_SECONDS`` instead of ``TRIP_MINUTES``?\n", + " * How close do you think the model comes to the ground truth fare calculation for Chicago Taxi Trips?\n", + "\n", + "\n", + "Notice that the scatter plot of the features vs. the label is a three-dimensional (3-D) plot. This representation allows you to visualize both features and the label all together. The two features (TRIP_MILES and TRIP_MINUTES) are on the x and y axes, and the label (FARE) is on the z axis. The plot shows individual examples in the dataset as circles, and the model as a surface (plane). With this 3-D model, if the trained model is good you would expect most of the examples to land on the plane surface. 
The 3-D plot is interactive so you can explore the data further by clicking or dragging the plot.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "Mg3gUYOoBAtd" + }, + "outputs": [], + "source": [ + "#@title Code - Experiment 3\n", + "\n", + "# The following variables are the hyperparameters.\n", + "learning_rate = 0.001\n", + "epochs = 20\n", + "batch_size = 50\n", + "\n", + "training_df.loc[:, 'TRIP_MINUTES'] = training_df['TRIP_SECONDS']/60\n", + "\n", + "features = ['TRIP_MILES', 'TRIP_MINUTES']\n", + "label = 'FARE'\n", + "\n", + "model_2 = run_experiment(training_df, features, label, learning_rate, epochs, batch_size)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uFkKK5t33xSX" + }, + "outputs": [], + "source": [ + "#@title Double-click to view answers for training with two features\n", + "\n", + "# Does the model with two features produce better results than one using a\n", + "# single feature?\n", + "# -----------------------------------------------------------------------------\n", + "answer = '''\n", + "To answer this question for your specific training runs, compare the RMSE for\n", + "each model. For example, if the RMSE for the model trained with one feature was\n", + "3.7457 and the RMSE for the model with two features is 3.4787, that means that\n", + "on average the model with two features makes predictions that are about $0.27\n", + "closer to the observed fare.\n", + "\n", + "'''\n", + "print(answer)\n", + "\n", + "# Does it make a difference if you use TRIP_SECONDS instead of TRIP_MINUTES?\n", + "# -----------------------------------------------------------------------------\n", + "answer = '''\n", + "When training a model with more than one feature, it is important that all\n", + "numeric values are roughly on the same scale. In this case, TRIP_SECONDS and\n", + "TRIP_MILES do not meet this criterion. 
The mean value for TRIP_MILES is 8.3 and\n", + "the mean for TRIP_SECONDS is 1,320; that is two orders of magnitude difference.\n", + "In contrast, the mean for TRIP_MINUTES is 22, which is more similar to the scale\n", + "of TRIP_MILES (8.3) than TRIP_SECONDS (1,320). Of course, this is not the\n", + "only way to scale values before training, but you will learn about that in\n", + "another module.\n", + "'''\n", + "print(answer)\n", + "\n", + "# How close do you think the model comes to the ground truth fare calculation for\n", + "# Chicago taxi trips?\n", + "# -----------------------------------------------------------------------------\n", + "answer = '''\n", + "In reality, Chicago taxi cabs use a documented formula to determine cab fares.\n", + "For a single passenger paying cash, the fare is calculated like this:\n", + "\n", + "FARE = 2.25 * TRIP_MILES + 0.12 * TRIP_MINUTES + 3.25\n", + "\n", + "Typically with machine learning problems you would not know the 'correct'\n", + "formula, but in this case you can use this knowledge to evaluate your model. Take a\n", + "look at your model output (the weights and bias) and determine how well it\n", + "matches the ground truth fare calculation. You should find that the model is\n", + "roughly close to this formula.\n", + "'''\n", + "print(answer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MzMfgxldSMGK" + }, + "source": [ + "# Part 4 - Validate Model\n", + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_yW7nVxlO1WY" + }, + "source": [ + "## Use the model to make predictions\n", + "\n", + "Now that you have a trained model, you can use the model to make predictions. In practice, you should make predictions on examples that are not used during training. However, for this exercise, you'll just work with a subset of the same training dataset. 
In another Colab exercise you will explore ways to make predictions on examples not used in training.\n", + "\n", + "**Instructions**\n", + "\n", + "1. Run the **Define functions to make predictions** code cell.\n", + "1. Run the **Make predictions** code cell.\n", + "1. Review the predictions in the output.\n", + "1. **Check your understanding** by answering these questions:\n", + " * How close is the predicted value to the label value? In other words, does your model accurately predict the fare for a taxi ride?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "XdNxv3j8PGnr" + }, + "outputs": [], + "source": [ + "#@title Code - Define functions to make predictions\n", + "def format_currency(x):\n", + " return \"${:.2f}\".format(x)\n", + "\n", + "def build_batch(df, batch_size):\n", + " batch = df.sample(n=batch_size).copy()\n", + " batch.set_index(np.arange(batch_size), inplace=True)\n", + " return batch\n", + "\n", + "def predict_fare(model, df, features, label, batch_size=50):\n", + " batch = build_batch(df, batch_size)\n", + " predicted_values = model.predict_on_batch(x=batch.loc[:, features].values)\n", + "\n", + " data = {\"PREDICTED_FARE\": [], \"OBSERVED_FARE\": [], \"L1_LOSS\": [],\n", + " features[0]: [], features[1]: []}\n", + " for i in range(batch_size):\n", + " predicted = predicted_values[i][0]\n", + " observed = batch.at[i, label]\n", + " data[\"PREDICTED_FARE\"].append(format_currency(predicted))\n", + " data[\"OBSERVED_FARE\"].append(format_currency(observed))\n", + " data[\"L1_LOSS\"].append(format_currency(abs(observed - predicted)))\n", + " data[features[0]].append(batch.at[i, features[0]])\n", + " data[features[1]].append(\"{:.2f}\".format(batch.at[i, features[1]]))\n", + "\n", + " output_df = pd.DataFrame(data)\n", + " return output_df\n", + "\n", + "def show_predictions(output):\n", + " header = \"-\" * 80\n", + " banner = header + \"\\n\" + \"|\" + \"PREDICTIONS\".center(78) + 
\"|\" + \"\\n\" + header\n", + " print(banner)\n", + " print(output)\n", + " return" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "PK3oO2kYV8m0" + }, + "outputs": [], + "source": [ + "#@title Code - Make predictions\n", + "\n", + "output = predict_fare(model_2, training_df, features, label)\n", + "show_predictions(output)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "6sjix7lXI7xT" + }, + "outputs": [], + "source": [ + "#@title Double-click to view answers for validate model\n", + "\n", + "# How close is the predicted value to the label value?\n", + "# -----------------------------------------------------------------------------\n", + "answer = '''\n", + "Based on a random sampling of examples, the model seems to do pretty well\n", + "predicting the fare for a taxi ride. Most of the predicted values do not vary\n", + "significantly from the observed value. You should be able to see this by looking\n", + "at the column L1_LOSS = |observed - predicted|.\n", + "'''\n", + "print(answer)" + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [ + "sgR4YRjj5T-b" + ], + "provenance": [], + "include_colab_link": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file