Jaime López
Data Product Developer
Centereach, NY



The basic machine learning setting

Machine learning and artificial intelligence are generating a lot of hype in the news these days. OpenAI’s ChatGPT has reached a wide audience and become part of everyday conversation. People everywhere are asking what AI bots can do and what that means for our societies. At this moment, however, the truth is that machines can only do what they can learn, and learning is only possible by having a model that represents the desired relationships and by letting it see labeled examples in a structured and cumulative way. To make this assertion easier to appreciate, in this note I explain the machine learning setting in basic terms.

By structured and cumulative learning, I mean using a defined format for inputs and outputs and feeding the model training examples until a good approximation is reached. In essence, it is an optimization problem whose objective is for the machine to predict labels for unseen examples with minimal error. It is assumed that those unseen examples come from the same probability distribution as the training examples, or from a similar one.

Let's start by defining what a dataset is from a machine learning perspective. It is a list of labeled examples. A label is the desired output for an example. If an example is a picture of an object, its label can be the category that describes that object. Say we have pictures of dogs and cats; the label for each picture is then "dog" or "cat". More precisely, a dataset is a sequence of $m$ ordered pairs indexed by $j = 1, 2, \dots, m$, each consisting of an example $x_j$ and its label $y_j$.

$$ D = \{(x_j, y_j),\ j = 1, 2, \dots, m\}$$
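As a minimal sketch of this definition, the dataset can be written as a plain Python list of `(example, label)` pairs. The two numeric features (weight in kilograms and ear length in centimeters) and their values are made up for illustration; a real picture dataset would hold pixel arrays instead.

```python
# Hypothetical toy dataset: each example x_j is a tuple of two made-up
# features (weight in kg, ear length in cm); each label y_j is "dog" or "cat".
dataset = [
    ((30.0, 10.0), "dog"),
    ((4.0, 6.5), "cat"),
    ((25.0, 12.0), "dog"),
    ((3.5, 6.0), "cat"),
]

m = len(dataset)  # number of labeled examples

# Separate the ordered pairs into examples and labels.
examples = [x for x, _ in dataset]
labels = [y for _, y in dataset]
```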

The problem for machine learning is to build, empirically, models that represent a function mapping examples to labels. That function has to work satisfactorily on the training examples as well as on similar but unseen examples. By empirically, I mean going from particular cases to a generalization, that is, learning from examples. It is accepted that a definitive function cannot be found, so models are just good approximations. One model will perform better in a given context, while another will do better in a different one. That is what data scientists do: evaluate which models work better and look for new ones.

$$y_j \approx \phi(x_j|w), j = 1, 2, \dots, m$$

In the previous equation, $\phi$ is a model that approximates the function mapping each example $x_j$ to its label $y_j$. For the model $\phi$, learning means iteratively finding the parameters $w$ that produce a good approximation. In reality, we cannot be sure what the best approximation is: even if the model predicts every example in the dataset exactly, there will always be unseen examples for which we do not know whether it works. Therefore, it is an experimental process of improving models over time through trial and error.
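To make the notation $\phi(x_j|w)$ concrete, here is one possible model written as a Python function. The linear form, the name `phi`, and the parameter values are assumptions chosen for illustration; the setting itself does not prescribe any particular model family.

```python
def phi(x, w):
    """Hypothetical linear model: weighted sum of the features plus a bias.

    x is a sequence of features; w holds one weight per feature followed
    by a bias term as its last element.
    """
    *weights, bias = w
    return sum(wi * xi for wi, xi in zip(weights, x)) + bias


w = [2.0, -1.0, 0.5]  # two weights and a bias, picked by hand
x = (3.0, 4.0)        # one example with two features
y_hat = phi(x, w)     # 2.0 * 3.0 + (-1.0) * 4.0 + 0.5
```

Changing `w` changes the prediction for the same `x`, which is exactly what training exploits.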

To find values for the parameters $w$, a loss function $\ell$ is used. It compares the predicted label $\hat{y}_j = \phi(x_j|w)$ for an example $x_j$ with its real label $y_j$ and measures how far apart they are. Training therefore consists of finding values for $w$ that minimize the average loss $\mathcal{L}$. The training process generally starts with random values for $w$ and proceeds iteration by iteration, seeing examples, measuring losses, and adjusting $w$ until the average loss is close to its minimum.

$$\mathcal{L} = \frac{1}{m} \sum_{j = 1}^m \ell (\hat{y}_j, y_j)$$
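The training process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the squared-error loss, the one-feature linear model, plain gradient descent, the learning rate, and the synthetic data generated from $y = 2x + 1$ are all my choices, not part of the general setting.

```python
import random


def squared_loss(y_hat, y):
    """Loss for one example: squared difference between prediction and label."""
    return (y_hat - y) ** 2


def average_loss(w, data):
    """Average loss over the dataset for the model y_hat = w[0] * x + w[1]."""
    return sum(squared_loss(w[0] * x + w[1], y) for x, y in data) / len(data)


def train(data, lr=0.05, steps=2000, seed=0):
    """Gradient descent on the average squared loss, from random initial w."""
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    m = len(data)
    for _ in range(steps):
        # Gradient of the average loss with respect to each parameter.
        g0 = sum(2.0 * (w[0] * x + w[1] - y) * x for x, y in data) / m
        g1 = sum(2.0 * (w[0] * x + w[1] - y) for x, y in data) / m
        # Move the parameters a small step against the gradient.
        w[0] -= lr * g0
        w[1] -= lr * g1
    return w


# Synthetic, noise-free examples generated from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w = train(data)
final_loss = average_loss(w, data)
```

With these settings `w` should end up close to `[2.0, 1.0]`, the values used to generate the data, and `final_loss` close to zero.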

This is the basic machine learning setting: a dataset of labeled examples, a model that can represent the mapping function, a training process to find values for the model parameters, and an evaluation of losses on the training dataset and on examples unseen by the model (a test dataset).
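The last ingredient, evaluating on examples the model has not seen, can be sketched as follows. The 80/20 split, the synthetic data, the noise level, and the one-parameter model $\hat{y} = w x$ are assumptions for illustration. To keep the sketch short, $w$ is fitted with the closed-form least-squares solution rather than an iterative training loop.

```python
import random

rng = random.Random(42)

# Synthetic labeled examples: y = 3x plus a little Gaussian noise.
data = []
for _ in range(100):
    x = rng.uniform(0.0, 1.0)
    data.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))

# Shuffle, then hold out 20% of the examples as a test dataset.
rng.shuffle(data)
train_set, test_set = data[:80], data[80:]

# Fit y_hat = w * x on the training set only (least squares, closed form).
w = sum(x * y for x, y in train_set) / sum(x * x for x, _ in train_set)


def average_loss(w, dataset):
    """Average squared loss of the model y_hat = w * x over a dataset."""
    return sum((w * x - y) ** 2 for x, y in dataset) / len(dataset)


train_loss = average_loss(w, train_set)
test_loss = average_loss(w, test_set)  # loss on examples unseen during fitting
```

Comparing `train_loss` with `test_loss` gives a first indication of whether the model generalizes beyond the examples it was fitted on.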

Of course, many variations of the basic setting exist. There are problems in which the data is not labeled, and there are methods that assign labels based on the similarity between examples. The data can be simple, like a list of numbers, or complex, like pixels in images, words in texts, or frequencies in sounds. Models can be deterministic or generative, and some of them require enormous infrastructure and time for training.
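As an illustration of assigning labels by similarity when none are given, here is a minimal sketch of two-group clustering on one-dimensional data, in the spirit of k-means. The function name `two_means_1d`, the initialization from the smallest and largest values, and the sample values are hypothetical choices; this is one of many possible methods.

```python
def two_means_1d(values, iterations=10):
    """Group numbers into two clusters by distance to the nearest center."""
    # Start the two centers at the extremes of the data (a simple heuristic).
    centers = [min(values), max(values)]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    # The index of the nearest center serves as the assigned label.
    assigned = [
        0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1 for v in values
    ]
    return assigned, centers


values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]  # unlabeled examples
assigned, centers = two_means_1d(values)
```

The assigned labels are just group indices; what each group means is still up to a person to interpret.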

No doubt the immediate future will bring more impressive results from the community of machine learning and artificial intelligence practitioners. Knowing the basic setting from which these outcomes come helps us better understand their limitations and potential and, more importantly, how they can impact our lives.

Date: Feb. 23, 2023