Machine Learning Fundamentals
It is important to gain a solid understanding of Machine Learning and of how models are created and built. As implementation, administration and governance of AI accelerate, there will be more and more references to models: who created them, on what data they were trained and how they are deployed.
As an example, within its documentation on AI risks, the EU AI Act refers to General Purpose AI (GPAI) models and systems as follows:
GPAI model means an AI model, including when trained with a large amount of data using self-supervision at scale, that displays significant generality and is capable of competently performing a wide range of distinct tasks regardless of the way the model is placed on the market, and that can be integrated into a variety of downstream systems or applications. This does not cover AI models that are used for research, development and prototyping activities before they are released on the market.
GPAI system means an AI system which is based on a general-purpose AI model and has the capability to serve a variety of purposes, both for direct use and for integration into other AI systems. GPAI systems may be used as high-risk AI systems or integrated into them. GPAI system providers should cooperate with such high-risk AI system providers to enable the latter's compliance.
The Model:
Imagine a model as the end product of your machine learning project. It's a mathematical representation of the patterns found in the data you've analyzed. This model can be used to make predictions on new, unseen data. Think of it like a trained expert that can identify patterns and make informed guesses based on what it has learned.
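To make this concrete, here is a minimal sketch, assuming the scikit-learn library and its bundled iris dataset (both assumptions, not part of the text above): the fitted model object is the end product that is later reused to make predictions on new data.

```python
# Minimal sketch (scikit-learn assumed): the fitted "model" is the end product
# of the project and is what makes predictions on new, unseen data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)          # example data: flower measurements and species labels

model = LogisticRegression(max_iter=1000)  # the algorithm (see the next section)
model.fit(X, y)                            # learning: the model captures patterns in the data

new_sample = [[5.1, 3.5, 1.4, 0.2]]        # new, unseen measurements
print(model.predict(new_sample))           # the model makes an informed guess
```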
The Algorithm:
An algorithm acts like a blueprint for building the model. It provides a step-by-step process for the computer to follow as it analyzes the data and learns from it. Different algorithms are suited for different tasks, such as classification (sorting things into categories) or regression (predicting continuous values). As the algorithm processes the data, it adjusts the model's parameters to improve the model's accuracy.
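As an illustration of "different algorithms for different tasks", the sketch below (again assuming scikit-learn, with synthetic example data) pairs a classification algorithm with a category-prediction problem and a regression algorithm with a continuous-value problem; the dataset and algorithm choices are illustrative, not prescriptive.

```python
# Sketch: different algorithms suit different task types (scikit-learn assumed).
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: sorting things into categories.
X_cls, y_cls = make_classification(n_samples=200, n_features=5, random_state=0)
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict(X_cls[:3]))     # predicted class labels, e.g. [0 1 0]

# Regression: predicting continuous values.
X_reg, y_reg = make_regression(n_samples=200, n_features=5, random_state=0)
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict(X_reg[:3]))     # predicted numeric values
```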
Workflow Steps:
Data Collection: Gather the data you'll use to train the model.
Data Processing (cleaning and preparing the data): Ensure the data is usable for the algorithm by cleaning and formatting it.
Model Selection (based on the problem and data type): Choose the appropriate algorithm type (e.g., classification, regression) considering your project's goal and data characteristics. You typically don't write the algorithm from scratch for most machine learning tasks; there are established algorithms available for various purposes.
Model Training: Apply the chosen algorithm to the Training data to build the model. This is where the learning happens.
Model Evaluation: Test the model's performance on unseen Test data to assess its effectiveness and identify areas for improvement.
Model Refinement (tuning): Based on the evaluation, you can adjust the algorithm's parameters or the data pre-processing steps to optimize the model's performance. This is often an iterative process.
Prediction: Once satisfied with the model's performance, use it to make predictions on new, unseen data. A short end-to-end sketch of these steps follows below.
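The following is a minimal end-to-end sketch of the workflow steps above, assuming scikit-learn and its bundled iris dataset as a stand-in for your own data; the chosen algorithm, parameter grid and variable names are illustrative assumptions only.

```python
# End-to-end sketch of the workflow steps (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Data collection: a bundled example dataset stands in for data you would gather.
X, y = load_iris(return_X_y=True)

# Data processing: hold out a test set so evaluation uses unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model selection: this is a classification problem, so pick an established classifier.
pipeline = Pipeline([
    ("scale", StandardScaler()),        # data processing: put features on a common scale
    ("knn", KNeighborsClassifier()),
])

# Model training + refinement: grid search tunes parameters using only the training data.
search = GridSearchCV(pipeline, {"knn__n_neighbors": [3, 5, 7]}, cv=5)
search.fit(X_train, y_train)

# Model evaluation: measure performance on the unseen test data.
print("test accuracy:", accuracy_score(y_test, search.predict(X_test)))

# Prediction: use the tuned model on genuinely new data.
print(search.predict([[6.0, 2.9, 4.5, 1.5]]))
```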
Training Data Set:
Imagine it as the study material for your machine learning model.
This data set is used to teach the model how to identify patterns and make predictions.
The model is exposed to the training data set during the training phase.
It's typically the larger portion of your overall data collection.
The more data you have for training, the better the model can learn and generalize to unseen data.
Test Data Set:
Think of it as the final exam for your trained model.
This data set is used to evaluate the model's performance on unseen data.
The model is not exposed to the test data set during training.
Ideally, the test data set should come from the same probability distribution as the training data to ensure it reflects real-world scenarios.
By testing on unseen data, you can assess how well the model generalizes its learnings and avoids overfitting to the specific training data; see the short sketch after this list.
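The sketch below (scikit-learn and its iris dataset assumed, as before) makes the training/test distinction concrete: the model never sees the test set during training, and comparing training accuracy with test accuracy is a simple way to spot overfitting.

```python
# Sketch: train on one portion of the data, evaluate on a held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The training set is typically the larger portion; the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y   # stratify keeps class proportions similar
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A large gap between training and test accuracy suggests overfitting.
print("training accuracy:", model.score(X_train, y_train))
print("test accuracy:    ", model.score(X_test, y_test))
```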
In essence, the algorithm is the tool, and the model is the crafted product that allows you to make predictions on future data.