.. -------------------------------------------------------------
.. 
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at
.. 
..   http://www.apache.org/licenses/LICENSE-2.0
.. 
.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.
.. 
.. ------------------------------------------------------------

Built-in Algorithms 
===================

Prerequisite: 

- :doc:`/getting_started/install`

This example walks through one of the built-in algorithms, logistic regression, and applies it to a dataset.
For simplicity, the dataset used is `MNIST <http://yann.lecun.com/exdb/mnist/>`_,
since it is widely known and well explored.

To skip the explanation, the full script is available at the bottom of this page.

Step 1: Get Dataset
-------------------

SystemDS provides a built-in helper for downloading and setting up the MNIST dataset.
To use it, run:

.. code-block:: python

    from systemds.examples.tutorials.mnist import DataManager
    d = DataManager()
    X = d.get_train_data()
    Y = d.get_train_labels()

Here, the DataManager contains the code for downloading the data and setting it up as NumPy arrays.

Step 2: Reshape & Format
------------------------

Data rarely comes in a format that perfectly fits an algorithm. To make this tutorial more
realistic, some preprocessing is required to make the input fit.

First, the training data, X, has multiple dimensions, resulting in a shape of (60000, 28, 28).
The dimensions correspond to the number of images, 60000, then the number of pixel rows, 28,
and finally the number of pixel columns, 28.

To use this data for logistic regression, we have to reduce the number of dimensions.
The algorithm requires the training data, X, to have two dimensions:
the first is the number of inputs, and the second is the number of features.

Therefore, to make the data fit the algorithm, we reshape the X dataset like so:

.. code-block:: python

    X = X.reshape((60000, 28*28))

This concatenates the rows of pixels of each image, producing a single feature vector of 28*28 = 784 values per image.
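
As a minimal sketch of what this reshape does, the following uses two tiny made-up "images" of 2 by 3 pixels
instead of the real 28 by 28 MNIST images. The array contents are illustrative only and are not taken from the dataset:

.. code-block:: python

    import numpy as np

    # Two hypothetical images of 2 rows by 3 columns (stand-ins for 28 x 28).
    images = np.arange(12).reshape((2, 2, 3))

    # Flatten every image into one feature vector, as done for X above.
    flat = images.reshape((2, 2 * 3))

    # The rows [0, 1, 2] and [3, 4, 5] of the first image are now laid end to end.
    first_vector = flat[0].tolist()
    print(first_vector)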

The Y dataset does not perfectly fit the logistic regression algorithm either. The labels
for this dataset are values ranging from 0 to 9, where each label corresponds to the digit shown in the image.
Unfortunately, the algorithm requires the labels to be distinct integers from 1 and upwards.

Therefore, we add 1 to each label so that the labels go from 1 to 10, like this:

.. code-block:: python

    Y = Y + 1

With these steps we are now ready to train a simple model.
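
Before training, it can help to check that both preprocessing steps had the intended effect.
The sketch below is one possible check, run here on small synthetic arrays rather than the downloaded data.
The helper name ``check_preprocessed`` is hypothetical and not part of SystemDS:

.. code-block:: python

    import numpy as np

    def check_preprocessed(X, Y, num_classes=10):
        """Hypothetical helper: fail if X or Y does not match the expected layout."""
        assert X.ndim == 2, "X must be (number of inputs, number of features)"
        assert X.shape[0] == Y.shape[0], "X and Y must have one entry per input"
        assert Y.min() >= 1 and Y.max() <= num_classes, "labels must lie in 1..num_classes"
        return True

    # Synthetic stand-ins for the MNIST arrays (not real data): 5 images of 28 x 28.
    X_demo = np.zeros((5, 28, 28)).reshape((5, 28 * 28))
    Y_demo = np.array([0, 3, 9, 1, 7]) + 1

    ok = check_preprocessed(X_demo, Y_demo)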

Step 3: Training
----------------

To start with, we set up a SystemDS context:

.. code-block:: python

    from systemds.context import SystemDSContext
    sds = SystemDSContext()

Then set up the data:

.. code-block:: python

    X_ds = sds.from_numpy(X)
    Y_ds = sds.from_numpy(Y)

To reduce the training time and verify that everything works, it is usually a good idea
to start by training on a smaller sample of the data:

.. code-block:: python

    sample_size = 1000
    X_ds = sds.from_numpy(X[:sample_size])
    Y_ds = sds.from_numpy(Y[:sample_size])

Now everything is ready for our algorithm:

.. code-block:: python

    from systemds.operator.algorithm import multiLogReg
    bias = multiLogReg(X_ds, Y_ds)

Note that nothing has been calculated yet. In SystemDS, computation only happens when you call compute:

.. code-block:: python

    bias_r = bias.compute()

bias is a matrix that, when matrix-multiplied with an instance, returns a distribution of values in which the highest value indicates the predicted class.
This is the matrix that can be saved and used for predicting labels later.
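
To make the idea concrete, here is a minimal NumPy sketch of predicting a class by matrix multiplication.
The weight matrix ``W`` is made up, and it assumes one column per class and no intercept term.
The matrix actually returned by multiLogReg may be laid out differently, so treat this as an illustration of the principle only:

.. code-block:: python

    import numpy as np

    # Hypothetical weights: 4 features (rows), 3 classes (columns).
    W = np.array([
        [1.0, 0.0, -1.0],
        [0.0, 2.0, 0.0],
        [-1.0, 0.0, 1.0],
        [0.5, 0.5, 0.5],
    ])

    # One instance, stored as a single row of feature values.
    instance = np.array([[0.0, 1.0, 0.0, 1.0]])

    # Matrix multiplication yields one score per class.
    scores = instance @ W

    # Softmax turns the scores into values that sum to 1.
    exp_scores = np.exp(scores - scores.max())
    probs = exp_scores / exp_scores.sum()

    # Labels start at 1 in this tutorial, so shift the zero-based column index.
    predicted = int(np.argmax(probs, axis=1)[0]) + 1
    print(predicted)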

Step 4: Validate
----------------

To see what accuracy the model achieves, we have to load the test dataset as well.

This can also be extracted from our built-in MNIST loader. To keep the tutorial short, the operations are combined:

.. code-block:: python

    Xt = sds.from_numpy(d.get_test_data().reshape((10000, 28*28)))
    Yt = sds.from_numpy(d.get_test_labels()) + 1

The above loads the test data, reshapes the X data in the same way as the training data, and adds 1 to the labels.

Finally, we verify the accuracy by calling:

.. code-block:: python

    from systemds.operator.algorithm import multiLogRegPredict
    [m, y_pred, acc] = multiLogRegPredict(Xt, bias, Yt).compute()
    print(acc)

There are three outputs from the multiLogRegPredict call:

- m is the mean probability of correctly classifying each label.
- y_pred contains the predictions made using the trained model, bias.
- acc is the accuracy achieved by the model.
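
As a rough sketch of how an accuracy value relates to the predictions, the following computes the percentage of matching labels in NumPy.
It assumes that accuracy is reported as a percentage. The function ``accuracy_percent`` and the labels are made up for illustration
and are not the SystemDS implementation:

.. code-block:: python

    import numpy as np

    def accuracy_percent(y_pred, y_true):
        """Percentage of predictions that equal the true label."""
        y_pred = np.asarray(y_pred).ravel()
        y_true = np.asarray(y_true).ravel()
        return float(np.mean(y_pred == y_true) * 100.0)

    # Made-up labels in the 1..10 range used by this tutorial.
    y_true_demo = [1, 2, 3, 10]
    y_pred_demo = [1, 2, 4, 10]

    # Three of the four predictions match the true labels.
    acc_demo = accuracy_percent(y_pred_demo, y_true_demo)
    print(acc_demo)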

When training on the subset of 1000 images from the training data, you can expect an accuracy of
about 85% in this example.

Step 5: Tuning
--------------

Now that we have a working baseline, we can start tuning parameters.

But first, it is valuable to know how large the difference in performance is between the training data and the test data.
This indicates whether we have exhausted the learning potential of the training data.

To see the accuracy on the training data, we use the predict function again, but with our training data:

.. code-block:: python

    [m, y_pred, acc] = multiLogRegPredict(X_ds, bias, Y_ds).compute()
    print(acc)

In this specific case, we achieve 100% accuracy on the training data, indicating that we have fit the training data
and have nothing more to learn from the data as it is now.
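
The comparison between training and test accuracy can be made explicit as a single number.
The sketch below uses the two accuracies quoted in this tutorial for the 1000-image sample.
The function name ``generalization_gap`` is hypothetical:

.. code-block:: python

    def generalization_gap(train_acc, test_acc):
        """Difference between training and test accuracy, in percentage points."""
        return train_acc - test_acc

    # Accuracies quoted in this tutorial for the 1000-image sample.
    gap = generalization_gap(100.0, 85.0)
    print(gap)

A large positive gap suggests that the model has fit the training sample much better than it generalizes,
which is one reason to try more training data.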

To improve further, we have to increase the amount of training data. Here, for example, we increase it
from our sample of 1k to the full training dataset of 60k. In this example, maxi is set to limit the number of iterations the algorithm takes,
again to reduce training time:

.. code-block:: python

    X_ds = sds.from_numpy(X)
    Y_ds = sds.from_numpy(Y)

    bias = multiLogReg(X_ds, Y_ds, maxi=30)

    [_, _, train_acc] = multiLogRegPredict(X_ds, bias, Y_ds).compute()
    [_, _, test_acc] = multiLogRegPredict(Xt, bias, Yt).compute()
    print(train_acc, "  ", test_acc)

With this change, the accuracy achieved changes from the previous value to 92%. This is still low for this dataset, as can be seen on the `MNIST <http://yann.lecun.com/exdb/mnist/>`_ page.
However, this is a basic implementation that can be replaced by a variety of algorithms and techniques.


Full Script
-----------

In the full script, some steps are combined to shorten the overall script.
One noteworthy change is that the + 1 is applied to the matrix after it has been handed to SystemDS,
which makes SystemDS responsible for adding 1 to each value.

.. code-block:: python

    from systemds.context import SystemDSContext
    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict
    from systemds.examples.tutorials.mnist import DataManager

    d = DataManager()

    with SystemDSContext() as sds:
        # Train Data
        X = sds.from_numpy(d.get_train_data().reshape((60000, 28*28)))
        Y = sds.from_numpy(d.get_train_labels()) + 1.0
        bias = multiLogReg(X, Y, maxi=30)
        # Test data
        Xt = sds.from_numpy(d.get_test_data().reshape((10000, 28*28)))
        Yt = sds.from_numpy(d.get_test_labels()) + 1.0
        [m, y_pred, acc] = multiLogRegPredict(Xt, bias, Yt).compute()

    print(acc)