<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Christian Rápalo]]></title><description><![CDATA[Hello! I'm a computer science engineer, with a passion for machine learning.]]></description><link>https://hashnode.christianrapalo.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1631245842025/xRJ8yjuFt.png</url><title>Christian Rápalo</title><link>https://hashnode.christianrapalo.com</link></image><generator>RSS for Node</generator><lastBuildDate>Sun, 10 May 2026 10:10:28 GMT</lastBuildDate><atom:link href="https://hashnode.christianrapalo.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Preparing the data]]></title><description><![CDATA[We are going to create a web app that can classify digits from 0 to 9.To achieve this, we are going to train a machine learning model with the mnist dataset, which contains 70,000 black and white images of handwritten digits in total. 
Each image is 2...]]></description><link>https://hashnode.christianrapalo.com/preparing-the-data</link><guid isPermaLink="true">https://hashnode.christianrapalo.com/preparing-the-data</guid><category><![CDATA[Python]]></category><category><![CDATA[TensorFlow]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[hashnodebootcamp]]></category><dc:creator><![CDATA[Christian Rápalo]]></dc:creator><pubDate>Fri, 17 Sep 2021 05:44:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1631856714558/WETyNZ5TX.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We are going to create a web app that can classify digits from 0 to 9.<br />To achieve this, we will train a machine learning model on the MNIST dataset, which contains a total of 70,000 black and white images of handwritten digits. Each image is 28 x 28 pixels.<br />We will then build a web app where we can draw a digit and use our model to predict the number.</p>
<p>This is the first part of the series, where we are going to set up the environment, get the data, normalize it, and plot some numbers.</p>
<h1 id="tools-well-be-using">Tools we'll be using</h1>
<p>We will be using a set of tools and libraries that will make it easier to create machine learning models and plot the results.</p>
<p>These tools are: </p>
<ul>
<li>Google Colab</li>
<li>TensorFlow</li>
<li>NumPy</li>
<li>Matplotlib</li>
</ul>
<h1 id="setting-up-the-environment">Setting up the environment</h1>
<p>We will be using Google Colab, a Jupyter-like Python environment that lets you create notebooks and execute Python code in the cloud.</p>
<p>To get started with Google Colab, you only need a Google account; then go to https://colab.research.google.com/</p>
<p>Colab will greet you with this modal: </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1631819920201/p1NQ9zrU3.jpeg" alt="1. Modal_Colab_Google.jpg" /></p>
<p>To create a new notebook, click the “New Notebook” button.</p>
<p>This will create a new environment to execute your code.</p>
<h1 id="geting-the-data">Getting the data</h1>
<p>As mentioned before, we will be creating an MNIST classifier.<br />Luckily for us, MNIST is a popular dataset that has long been used to try out new machine learning models, so the TensorFlow team has bundled it into TensorFlow, which makes it easy to fetch.</p>
<p>First, we need to import all the libraries we will be using.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
<span class="hljs-keyword">from</span> tensorflow.keras.models <span class="hljs-keyword">import</span> Sequential
<span class="hljs-keyword">from</span> tensorflow.keras.layers <span class="hljs-keyword">import</span> Dense, Flatten, Dropout
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np 
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
</code></pre>
<p>As you can see, we imported several things; here is what each one is for:</p>
<p><strong>TensorFlow</strong>: the library that contains everything we need to create deep learning models. <code>Sequential</code>, <code>Dense</code>, <code>Flatten</code>, and <code>Dropout</code> are Keras building blocks we will use in the next part, when we build the model.</p>
<p><strong>NumPy</strong>: a numerical library that lets us execute mathematical operations on arrays and matrices.</p>
<p><strong>Matplotlib</strong>: a plotting library for making amazing charts.</p>
<p>After importing all the libraries, we need to fetch the dataset.<br />If you import the MNIST dataset from TensorFlow, it comes already divided into two sets for us: 60,000 images for the training set and 10,000 images for the test set.<br />Before building a model, it’s always good to make this division between training data and testing data, because it lets us verify whether our model was trained correctly, or whether it’s overfitted or underfitted. 
Sometimes you will see a third set, a validation set, but we will not use one here.</p>
<p>To load the two sets of data, we run: </p>
<pre><code class="lang-python">mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
</code></pre>
<p>Now that we have the data, we have to normalize the images.<br />A black and white image is a matrix of pixels with values from 0 to 255 (0 being black and 255 white). That is a wide range, and we want to scale it down without losing information about the image.<br />A popular way to do this is to divide each pixel by 255, which maps every value into the range 0 to 1: 0/255 equals 0 and 255/255 equals 1.<br />Normalizing the images reduces training time and usually improves accuracy.</p>
<p>To do this, we run:</p>
<pre><code class="lang-python">x_train, x_test = x_train / <span class="hljs-number">255</span>, x_test / <span class="hljs-number">255</span>
</code></pre>
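<p>As a quick check that the normalization worked, you can inspect the minimum and maximum pixel values after scaling. Here is a minimal sketch that uses a small synthetic image as a stand-in for <code>x_train</code>; in the notebook, you would run the same check on the real array:</p>
<pre><code class="lang-python">import numpy as np

# Synthetic stand-in for one MNIST image: pixel values from 0 to 255
image = np.array([[0, 128], [200, 255]], dtype=np.float64)

# Same scaling as above: every pixel lands in the range [0, 1]
normalized = image / 255

print(normalized.min(), normalized.max())  # 0.0 1.0
</code></pre>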
<h1 id="analyzing-the-data">Analyzing the data</h1>
<p>For other machine learning problems, you will probably need to analyze the data deeply and clean it up: for example, plotting a distribution chart to see whether every possible output is balanced, checking for null or empty values in the dataset, and so on.<br />But because MNIST is a popular dataset, it has already been analyzed and cleaned, so we don’t need to do much here.</p>
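<p>If you do want to check the class balance yourself, NumPy’s <code>unique</code> makes it easy. Here is a minimal sketch using a tiny synthetic label array as a stand-in for <code>y_train</code>; in the notebook, pass the real label array instead:</p>
<pre><code class="lang-python">import numpy as np

# Tiny stand-in for y_train; in the notebook, use the real label array
labels = np.array([0, 1, 1, 2, 2, 2, 9, 9])

# Count how many examples exist for each digit
digits, counts = np.unique(labels, return_counts=True)
for digit, count in zip(digits, counts):
    print(f'Digit {digit}: {count} images')
</code></pre>
<p>On the real <code>y_train</code>, each of the ten digits appears roughly (though not exactly) 6,000 times.</p>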
<p>What we will do is plot an image from the training set to see what the images look like, and print the dimensions of the data, because we will need them later.</p>
<p>To plot an image, we will use Matplotlib which we have already imported, and execute this:</p>
<pre><code class="lang-python">index = <span class="hljs-number">5</span>
plt.imshow(x_train[index])
plt.axis(<span class="hljs-string">'off'</span>)
plt.title(<span class="hljs-string">f'Real Value: <span class="hljs-subst">{y_train[index]}</span>'</span>)
</code></pre>
<p>The result we will be getting is something like this: </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1631824840112/SnyazfNtyK.png" alt="2. Wrong_Plot.png" /></p>
<p>As mentioned before, MNIST is a set of black and white handwritten images, yet the plot shows colors.<br />That’s because Matplotlib applies a default colormap; when plotting, we need to specify that our dataset is grayscale.</p>
<pre><code class="lang-python">index = <span class="hljs-number">5</span>
plt.imshow(x_train[index], cmap=<span class="hljs-string">'gray'</span>)
plt.axis(<span class="hljs-string">'off'</span>)
plt.title(<span class="hljs-string">f'Real Value: <span class="hljs-subst">{y_train[index]}</span>'</span>)
</code></pre>
<p>Now you can see the image is plotted correctly:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1631824915082/LKBLsRdhW.png" alt="3. Correct_Plot.png" /></p>
<p>You can change the displayed image by changing the index.</p>
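<p>To inspect several digits at once, one option (a suggestion, not part of the notebook above) is to draw a small grid with Matplotlib’s <code>subplots</code>. The sketch below uses random arrays as stand-ins for <code>x_train</code> and <code>y_train</code> so it runs on its own; in the notebook, use the real arrays:</p>
<pre><code class="lang-python">import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend; Colab picks a backend automatically
import matplotlib.pyplot as plt

# Random stand-ins shaped like the first ten MNIST images and labels
rng = np.random.default_rng(0)
x_train = rng.integers(0, 256, size=(10, 28, 28))
y_train = rng.integers(0, 10, size=10)

# Draw a 2 x 5 grid, one digit per cell
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for i, ax in enumerate(axes.flat):
    ax.imshow(x_train[i], cmap='gray')
    ax.axis('off')
    ax.set_title(f'Real Value: {y_train[i]}')
fig.tight_layout()
</code></pre>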
<p>Next, we need to know the dimensions of the dataset: how many images we have, and the size of those images (I have already told you the dimensions, but let’s imagine we don’t know them).</p>
<p>We can print this by doing:</p>
<pre><code class="lang-python">print(x_train.shape)
</code></pre>
<p>Result: <code>(60000, 28, 28)</code></p>
<p>This tells us that we have 60,000 training images, each 28 x 28 pixels.</p>
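<p>The shape tuple can also be unpacked into named variables, which comes in handy later when defining the model’s input size. A minimal sketch, using a zero array with the same shape <code>mnist.load_data()</code> returns as a stand-in for <code>x_train</code>:</p>
<pre><code class="lang-python">import numpy as np

# Zero array standing in for x_train, with the shape mnist.load_data() returns
x_train = np.zeros((60000, 28, 28))

# Unpack the shape: number of images, then image height and width
num_images, height, width = x_train.shape
print(num_images, height, width)  # 60000 28 28
</code></pre>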
<h1 id="congrats">Congrats!</h1>
<p>Hopefully you have learned a few things in this first part.<br />Knowing the basics will help you when you start working with custom datasets.<br />What I want you to remember is to always divide your data into two sets and to preprocess it before starting to build the model.<br />If your data is “garbage”, your model will also be “garbage”, so it’s important to spend quite some time preprocessing it.<br />As I said before, we didn’t do much preprocessing here, because MNIST was created for trying out new models without spending too much time working with the data; the only thing we did was normalize the images.</p>
<p>In the second part, we will build and train the model.</p>
]]></content:encoded></item></channel></rss>