Amazon SageMaker



excerpt: "This about all about SageMaker and it's built in Algorthims and how to use them and how to tune them." date: 2025-01-29 category: "AWS ML"

Amazon SageMaker Built-in Algorithms: A Comprehensive Guide to Machine Learning at Scale

Amazon SageMaker stands as AWS's premier machine learning platform, offering a fully managed environment for building, training, and deploying ML models. One of its most powerful features is the extensive collection of built-in algorithms that cover a wide spectrum of machine learning tasks. This comprehensive guide explores each built-in algorithm, their use cases, hyperparameters, and best practices for implementation.

Whether you're a data scientist looking to quickly prototype models or an ML engineer building production systems, understanding SageMaker's built-in algorithms is crucial for leveraging the platform's full potential. Let's dive deep into each algorithm and learn how to harness their power effectively.

Understanding SageMaker Built-in Algorithms

SageMaker's built-in algorithms are pre-implemented, optimized machine learning models that run efficiently on AWS infrastructure. They eliminate the need to write custom code for common ML tasks while providing enterprise-grade performance and scalability.

Key Benefits:

  • No Code Required: Ready-to-use implementations for common ML tasks
  • Optimized Performance: Tuned for AWS infrastructure and distributed training
  • Cost-Effective: Pay only for the compute resources used
  • Scalable: Handle datasets from small to massive scale
  • Integrated: Seamless integration with other SageMaker features

Linear Learner: Foundation of Supervised Learning

Linear Learner is a supervised learning algorithm that can be used for both classification and regression tasks. It's a simple yet powerful algorithm that works well for high-dimensional, sparse data.

  • Key Features:
    • Linear Models: It supports both linear regression and binary/multiclass classification.
    • Automatic Tuning: It trains multiple models with different settings in parallel and keeps the one that performs best on the validation data.
    • Scalability: It can handle large datasets and is optimized for distributed training.
    • Built-in Loss Functions: It supports a variety of loss functions like logistic loss, hinge loss, and squared loss.
    • Regularization: It supports L1 and L2 regularization to prevent overfitting.
    • Real-time Predictions: It can be deployed as a real-time endpoint for making predictions.

Input Formats:

  • RecordIO
  • CSV
  • File or Pipe mode both supported

Usage Workflow:

  • Preprocessing: Training data should be normalized so all features are on the same scale; Linear Learner can perform this normalization automatically. Input data should also be shuffled and is processed in mini-batches.
  • Training: Linear Learner uses stochastic gradient descent (SGD) to optimize the model parameters. It automatically tunes the learning rate and other hyperparameters.
  • Validation: Linear Learner uses the validation data to monitor the model's performance and prevent overfitting. It can automatically stop training if the model's performance degrades.

Critical Hyperparameters:

  • balance_multiclass_weights: Balances class weights in multiclass problems
  • learning_rate: Controls the step size in gradient descent
  • mini_batch_size: Number of samples per mini-batch
  • use_bias: Whether to include a bias term
  • L1: L1 regularization strength
  • target_precision: Precision target for binary classification
  • target_recall: Recall target for binary classification

Recommended Instance Types:

  • Training: Single or Multi CPU instances (GPU doesn't provide significant benefits for linear models)
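The workflow above maps directly onto the SageMaker Python SDK. Below is a minimal sketch (assuming SageMaker Python SDK v2, a placeholder execution role, and placeholder S3 paths) that trains Linear Learner as a binary classifier with a few of the hyperparameters listed above:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder execution role

# Resolve the built-in Linear Learner container for the current region
container = image_uris.retrieve("linear-learner", session.boto_region_name)

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",  # CPU is usually sufficient for linear models
    output_path="s3://my-bucket/linear-learner/output",  # placeholder bucket
    sagemaker_session=session,
)

estimator.set_hyperparameters(
    predictor_type="binary_classifier",
    mini_batch_size=200,
    learning_rate=0.01,
    l1=0.0001,  # L1 regularization strength
)

# CSV input: label in the first column, features in the remaining columns
estimator.fit({
    "train": TrainingInput("s3://my-bucket/linear-learner/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/linear-learner/validation/", content_type="text/csv"),
})
```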

XGBoost: The Competition Winner

XGBoost is a popular supervised learning algorithm that implements gradient-boosted decision trees. It is widely used in machine learning competitions and is known for combining high accuracy with fast training on structured, tabular data.

  • Key Features:
    • Speed and Performance: Optimized gradient boosting that scales to large datasets while remaining highly accurate.
    • Regularization: It supports L1 and L2 regularization to prevent overfitting.
    • Customizable: It supports a variety of hyperparameters to customize the model.
    • Classification and Regression: It supports both classification and regression tasks.

Input Formats:

  • libsvm
  • CSV
  • RecordIO
  • Parquet

Usage Workflow:

  • Models are serialized/deserialized with pickle, so a trained model artifact can be loaded locally for inspection
  • Can be used as a built-in algorithm (container) or as a framework within SageMaker, where you bring your own training script

Critical Hyperparameters:

  • num_round: Number of boosting rounds
  • max_depth: Maximum depth of the tree
  • eta: Learning rate (can prevent overfitting)
  • gamma: Minimum loss reduction required for further partition
  • min_child_weight: Minimum sum of instance weight needed in a child
  • subsample: Subsample ratio of training instances
  • colsample_bytree: Subsample ratio of columns when constructing each tree
  • colsample_bylevel: Subsample ratio of columns for each level
  • lambda: L2 regularization term on weights
  • alpha: L1 regularization term on weights
  • scale_pos_weight: Controls balance of positive and negative weights

Recommended Instance Types:

  • Training: For memory-bound tasks, consider M5 instances. XGBoost 1.2+ supports GPU training on P2, P3, G4dn, G5 instances. Set tree_method to gpu_hist for faster training.
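As a sketch of how these hyperparameters are wired up in practice (placeholder role and S3 paths; the container version shown is an example you should pin to whatever your region supports):

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder

# Built-in XGBoost container (version pinned as an example)
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",  # memory-bound; for 1.2+ switch to GPU and set tree_method="gpu_hist"
    output_path="s3://my-bucket/xgboost/output",  # placeholder
    sagemaker_session=session,
)

estimator.set_hyperparameters(
    objective="binary:logistic",
    num_round=100,
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
)

estimator.fit({
    "train": TrainingInput("s3://my-bucket/xgboost/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/xgboost/validation/", content_type="text/csv"),
})
```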

Seq2Seq: Sequence-to-Sequence Learning

Seq2Seq is a supervised learning algorithm that's used for sequence-to-sequence tasks like machine translation, text summarization, and speech recognition. It's based on recurrent neural networks (RNNs) and is known for its ability to handle variable-length sequences.

  • Key Features:
    • Sequence-to-Sequence: It's designed for sequence-to-sequence tasks like machine translation and text summarization.
    • Recurrent Neural Networks: It's based on RNNs and can handle variable-length sequences.
    • Customizable: It supports a variety of hyperparameters to customize the model.
    • Bidirectional RNNs: It supports bidirectional RNNs to capture context from both directions.
    • Attention Mechanism: It supports attention mechanisms to focus on relevant parts of the input sequence.

Input Format:

  • RecordIO-Protobuf (tokens must be integers)
  • Start with tokenized text files, convert to protobuf using sample code
  • Must provide training data, validation data, and vocabulary files

Usage Workflow:

  • Preprocessing: Tokenize input sequences and convert to integer tensors. Pad sequences to same length. Convert to RecordIO-Protobuf format.
  • Training: Uses RNNs to learn relationship between input and output sequences. Uses teacher forcing to predict next token.
  • Validation: Monitors performance and prevents overfitting.

Critical Hyperparameters:

  • num_layers: Number of RNN layers
  • hidden_dim: Dimension of hidden state
  • dropout: Dropout rate
  • learning_rate: Learning rate
  • batch_size: Batch size
  • max_seq_len: Maximum sequence length
  • attention_type: Attention mechanism type
  • bidirectional: Use bidirectional RNNs
  • use_masking: Handle variable-length sequences

Recommended Instance Types:

  • Training: GPU instances only, on a single machine (it can use multiple GPUs on that machine, but not multiple machines)
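A minimal training sketch, assuming the SageMaker Python SDK v2, placeholder role and S3 paths, and the `seq2seq` algorithm identifier for `image_uris.retrieve` (verify the identifier and channel content types against the current algorithm docs):

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder

# "seq2seq" as the built-in algorithm identifier (assumption: confirm for your SDK version)
container = image_uris.retrieve("seq2seq", session.boto_region_name)

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,                # single training machine; it can still use all GPUs on that machine
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/seq2seq/output",  # placeholder
    sagemaker_session=session,
)

# Three channels: tokenized source/target pairs converted to RecordIO-Protobuf, plus the vocabulary files
estimator.fit({
    "train": TrainingInput("s3://my-bucket/seq2seq/train/", content_type="application/x-recordio-protobuf"),
    "validation": TrainingInput("s3://my-bucket/seq2seq/validation/", content_type="application/x-recordio-protobuf"),
    "vocab": "s3://my-bucket/seq2seq/vocab/",  # vocab.src.json / vocab.trg.json produced during tokenization
})
```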

DeepAR: Time Series Forecasting

DeepAR is a supervised learning algorithm for time series forecasting. It's based on autoregressive recurrent neural networks (RNNs) and can train a single model across many related time series, producing probabilistic forecasts rather than single point estimates.

  • Key Features:
    • Time Series Forecasting: It's designed for time series forecasting tasks like sales forecasting and demand forecasting.
    • Recurrent Neural Networks: It's based on RNNs and can handle variable-length sequences.
    • Customizable: It supports a variety of hyperparameters to customize the model.
    • Probabilistic Forecasting: It supports probabilistic forecasting to capture uncertainty in the predictions.
    • Autoregressive Model: It supports autoregressive models to capture the temporal dependencies in the time series.

Input Format:

  • JSON Lines format (optionally gzip-compressed) or Parquet
  • Each record must have a start timestamp ("start") and an array of target values ("target")
  • Records can include categorical features ("cat") and dynamic features ("dynamic_feat"), as shown in the sketch below
  • Provide a train channel; an optional test channel is used to compute accuracy metrics
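To make the format concrete, here is a small sketch (hypothetical series and file name) that writes DeepAR training data as JSON Lines, one time series per line:

```python
import json

# Each line is one time series: a start timestamp, the target values,
# and optional categorical ("cat") and dynamic ("dynamic_feat") features.
series = [
    {"start": "2024-01-01 00:00:00", "target": [12.0, 15.0, 14.0, 18.0], "cat": [0]},
    {"start": "2024-01-01 00:00:00", "target": [3.0, 4.0, 2.0, 5.0], "cat": [1]},
]

with open("train.json", "w") as f:  # upload this file (optionally gzipped) to the train channel in S3
    for ts in series:
        f.write(json.dumps(ts) + "\n")
```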

Usage Best Practices:

  • Always include entire time series for training, testing, and inference
  • Use entire dataset as training set, remove last time points for testing
  • Don't use very large values for prediction length (> 400)
  • Train on many time series when possible
  • Use large context length to capture long-term dependencies

Critical Hyperparameters:

  • context_length: Number of time points the model sees before making a prediction
  • prediction_length: Number of time points to forecast (fixed at training time)
  • num_layers: Number of RNN layers
  • num_cells: Number of cells per RNN layer
  • dropout_rate: Dropout rate
  • learning_rate: Learning rate
  • mini_batch_size: Batch size
  • epochs: Number of passes over the training data
  • time_freq: Granularity of the time series (e.g., hourly, daily)
  • likelihood: Likelihood (noise model) used for probabilistic forecasts
  • cardinality: Number of categories for each categorical feature
  • embedding_dimension: Dimension of the categorical embeddings

Recommended Instance Types:

  • Training: CPU or GPU instances (CPU is often sufficient; consider GPU only for larger models or large mini-batch sizes)

BlazingText: Fast Text Processing

BlazingText provides highly optimized implementations of the Word2Vec algorithm (unsupervised word embeddings) and a supervised text classification mode. It's known for its speed and scalability on large text corpora.

  • Key Features:
    • Word Embeddings: It can be used to generate word embeddings, which are dense vector representations of words.
    • Text Classification: It supports text classification tasks like sentiment analysis and topic classification.
    • Speed and Scalability: It's optimized for speed and can handle large datasets.
    • Customizable: It supports a variety of hyperparameters to customize the model.

Input Format:

  • Supervised (text classification) mode: one example per line, starting with __label__<label> followed by the tokenized, space-separated sentence (see the sketch below); augmented manifest is also supported
  • Word2Vec mode: a plain text file with one tokenized sentence per line
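A minimal sketch (hypothetical file name and toy sentences) of preparing a supervised-mode training file:

```python
# Each line: __label__<class> followed by a tokenized, lowercased sentence.
examples = [
    ("positive", "the movie was a delight from start to finish"),
    ("negative", "the plot dragged and the acting felt flat"),
]

with open("blazingtext.train", "w") as f:  # upload to the train channel in S3 as-is
    for label, sentence in examples:
        f.write(f"__label__{label} {sentence}\n")
```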

Usage Workflow:

  • Preprocessing: Tokenize the text into space-separated tokens (typically lowercased)
  • Training: Word2Vec mode learns word embeddings; supervised mode trains a fast text classifier on the labeled sentences
  • Validation: Monitors model performance and prevents overfitting

Critical Hyperparameters:

  • mode: Operation mode (cbow, skipgram, or batch_skipgram for Word2Vec; supervised for text classification)
  • vector_dim: Dimension of word embeddings
  • learning_rate: Learning rate
  • batch_size: Batch size

Recommended Instance Types:

  • Training: Single or multi-GPU instances
  • Inference: CPU instances

Object2Vec: General-Purpose Embeddings

Object2Vec is a general-purpose neural embedding algorithm that learns low-dimensional vector representations of arbitrary objects (sentences, users, items, graph nodes, and so on). It generalizes the idea behind Word2Vec: instead of learning from word co-occurrence, it learns from pairs of objects and their relationship labels, so related objects end up close together in the embedding space.

  • Key Features:
    • Object Embeddings: It generates dense vector representations of arbitrary objects, not just words.
    • Capture Relationships: It learns from labeled pairs, so the embeddings preserve the pairwise relationships seen during training.
    • Scalability: It's designed to handle large datasets and is optimized for GPU training.
    • Customizable: It supports configurable encoders for each input (e.g., average-pooled embeddings, CNNs, or bidirectional LSTMs) plus the usual training hyperparameters.

Input Format:

  • JSON Lines: each record contains a pair of tokenized inputs ("in0" and "in1") plus a relationship label or score ("label"); see the sketch below
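A minimal sketch (toy token IDs and a hypothetical file name) of the pairwise JSON Lines format:

```python
import json

# Each record is a pair of already-tokenized inputs plus the relationship label/score.
pairs = [
    {"label": 1, "in0": [6, 17, 606, 19], "in1": [16, 21, 13, 45]},
    {"label": 0, "in0": [22, 1016, 32], "in1": [22, 32, 13, 25]},
]

with open("object2vec.train.jsonl", "w") as f:  # upload to the train channel in S3
    for record in pairs:
        f.write(json.dumps(record) + "\n")
```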

Usage Workflow:

  • Preprocessing: Tokenize or encode each object as a sequence of integers and assemble labeled pairs
  • Training: A pair of encoders plus a comparator network are trained on the labeled pairs
  • Validation: Monitor the pair-prediction loss on held-out pairs

Critical Hyperparameters:

  • enc_dim: Dimension of the learned embeddings
  • enc0_network / enc1_network: Encoder architecture for each input (e.g., pooled embeddings, CNN, or bidirectional LSTM)
  • learning_rate: Learning rate
  • mini_batch_size: Batch size

Recommended Instance Types:

  • Training: Single or multi-GPU instances
  • Inference: CPU instances

Computer Vision Algorithms

Object Detection

Object Detection is a computer vision algorithm that's used for detecting and locating objects in images or videos. It's based on deep learning techniques and is widely used in applications like self-driving cars, surveillance systems, and image analysis.

  • Key Features:
    • Object Detection: It can detect and locate multiple objects in an image or video.
    • Bounding Boxes: It provides bounding boxes around the detected objects, along with their class labels.
    • Deep Learning: It's based on deep learning techniques like convolutional neural networks (CNNs).
    • Transfer Learning: It supports transfer learning, allowing pre-trained models to be fine-tuned on new datasets.

Input Format:

  • Images (PNG or JPEG) with bounding-box annotations, supplied as RecordIO files or as image files plus JSON annotation files (augmented manifest is also supported)

Usage Workflow:

  • Preprocessing: Resize and normalize images
  • Training: Train on labeled datasets with bounding boxes and class labels
  • Validation: Evaluate using precision, recall, and mAP metrics

Critical Hyperparameters:

  • batch_size: Batch size for training
  • learning_rate: Learning rate
  • optimizer: Optimization algorithm
  • backbone_model: Pre-trained CNN model

Recommended Instance Types:

  • Training: Single or multi-GPU instances
  • Inference: Single GPU or CPU instances
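Once trained and deployed (e.g., via estimator.deploy), the endpoint can be invoked with raw image bytes. A minimal sketch using the low-level runtime client, with a placeholder endpoint name and image file; the response is JSON containing, per detected object, a class index, a confidence score, and normalized bounding-box coordinates:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

with open("street_scene.jpg", "rb") as f:  # placeholder image
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName="object-detection-endpoint",  # placeholder: created via estimator.deploy(...)
    ContentType="application/x-image",         # raw image bytes
    Body=payload,
)

# Print the raw JSON predictions returned by the endpoint
print(response["Body"].read().decode())
```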

Image Classification

Image Classification is a computer vision algorithm that's used for classifying images into predefined categories or classes. It's based on deep learning techniques and is widely used in applications like image recognition, content moderation, and medical imaging.

  • Key Features:
    • Image Classification: It can classify images into predefined categories or classes.
    • Deep Learning: It's based on deep learning techniques like convolutional neural networks (CNNs).
    • Transfer Learning: It supports transfer learning, allowing pre-trained models to be fine-tuned on new datasets.
    • Scalability: It can handle large datasets and is optimized for distributed training.

Input Format:

  • Images in supported format (PNG, JPEG)

Usage Workflow:

  • Preprocessing: Resize and normalize images
  • Training: Train on labeled datasets with class labels
  • Validation: Evaluate using accuracy, precision, and recall

Critical Hyperparameters:

  • batch_size: Batch size for training
  • learning_rate: Learning rate
  • optimizer: Optimization algorithm
  • backbone_model: Pre-trained CNN model

Recommended Instance Types:

  • Training: Single or multi-GPU instances
  • Inference: Single GPU or CPU instances

Semantic Segmentation

Semantic Segmentation is a computer vision algorithm that's used for segmenting an image into different regions or objects, assigning a class label to each pixel. It's based on deep learning techniques and is widely used in applications like self-driving cars, medical imaging, and image analysis.

  • Key Features:
    • Pixel-level Segmentation: It can segment an image into different regions or objects, assigning a class label to each pixel.
    • Deep Learning: It's based on deep learning techniques like convolutional neural networks (CNNs) and encoder-decoder architectures.
    • Transfer Learning: It supports transfer learning, allowing pre-trained models to be fine-tuned on new datasets.
    • Scalability: It can handle large datasets and is optimized for distributed training.

Input Format:

  • Images in supported format (PNG, JPEG)

Usage Workflow:

  • Preprocessing: Resize and normalize images
  • Training: Train on labeled datasets with pixel-level annotations
  • Validation: Evaluate using mIoU and pixel accuracy

Critical Hyperparameters:

  • batch_size: Batch size for training
  • learning_rate: Learning rate
  • optimizer: Optimization algorithm
  • backbone_model: Pre-trained CNN model
  • encoder_decoder_architecture: Architecture type

Recommended Instance Types:

  • Training: Single or multi-GPU instances
  • Inference: Single GPU or CPU instances

Unsupervised Learning Algorithms

Random Cut Forest: Anomaly Detection

Random Cut Forest is an unsupervised learning algorithm that's used for anomaly detection and data outlier detection. It's based on the concept of decision trees and is known for its ability to handle high-dimensional data and its robustness to noise.

  • Key Features:
    • Anomaly Detection: It can be used to detect anomalies or outliers in data.
    • Unsupervised Learning: It does not require labeled data for training.
    • High-Dimensional Data: It can handle high-dimensional data efficiently.
    • Robustness: It's robust to noise and missing values in the data.

Input Format:

  • Tabular data in supported format (CSV, RecordIO)

Usage Workflow:

  • Preprocessing: Preprocess and normalize data if necessary
  • Training: Build ensemble of decision trees by random partitioning
  • Inference: Assign each new data point an anomaly score based on how much inserting it would change the structure (complexity) of the trees

Critical Hyperparameters:

  • num_trees: Number of trees in the ensemble
  • num_samples_per_tree: Number of random samples used to build each tree
  • feature_dim: Number of features in the input data

Recommended Instance Types:

  • Training: Single or multi-CPU instances
  • Inference: Single CPU instances
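A minimal sketch using the SDK's first-party RandomCutForest estimator (assuming SageMaker Python SDK v2, a placeholder role, and a toy NumPy feature matrix):

```python
import numpy as np
from sagemaker import RandomCutForest, Session

session = Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder

rcf = RandomCutForest(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    num_trees=100,
    num_samples_per_tree=256,
    sagemaker_session=session,
)

# Toy unlabeled data; RCF learns what "normal" looks like from it.
train = np.random.randn(10000, 4).astype("float32")

# record_set converts the array to RecordIO-Protobuf and stages it in S3.
rcf.fit(rcf.record_set(train))

# Deploy and score new points; higher scores indicate likely anomalies.
predictor = rcf.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```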

LDA: Topic Modeling

LDA is an unsupervised learning algorithm that's used for topic modeling, which involves discovering the underlying topics or themes in a collection of documents. It's based on a probabilistic model and is widely used in applications like text analysis, information retrieval, and content recommendation.

  • Key Features:
    • Topic Modeling: It can be used to discover the underlying topics or themes in a collection of documents.
    • Unsupervised Learning: It does not require labeled data for training.
    • Probabilistic Model: It's based on a probabilistic model that represents each document as a mixture of topics.
    • Scalability: It can handle large collections of documents.

Input Format:

  • Text data in supported format (CSV, RecordIO)

Usage Workflow:

  • Preprocessing: Preprocess, tokenize, and convert text data
  • Training: Learn topic distributions and word distributions
  • Inference: Map new documents to learned topics

Critical Hyperparameters:

  • num_topics: Number of topics to learn
  • alpha: Dirichlet prior for document-topic distribution
  • beta: Dirichlet prior for topic-word distribution
  • max_iterations: Maximum training iterations

Recommended Instance Types:

  • Training: Single CPU instance (LDA currently supports single-instance training only)
  • Inference: Single CPU instances

Instance-Based Algorithms

KNN: K-Nearest Neighbors

KNN is a supervised learning algorithm that's used for classification and regression tasks. It's a non-parametric algorithm that makes predictions based on the similarity or distance between the new data point and the existing data points in the training set.

  • Key Features:
    • Classification and Regression: It can be used for both classification and regression tasks.
    • Non-parametric: It does not make any assumptions about the underlying data distribution.
    • Similarity-based: Predictions are made based on the similarity or distance between data points.
    • Interpretability: The algorithm is relatively simple and interpretable.

Input Format:

  • Tabular data in supported format (CSV, RecordIO)

Usage Workflow:

  • Preprocessing: Preprocess and normalize data if necessary
  • Training: Sample the data, optionally reduce its dimensionality, and build an index for fast neighbor lookups
  • Inference: Find k nearest neighbors and make predictions

Critical Hyperparameters:

  • k: Number of nearest neighbors
  • predictor_type: classifier or regressor
  • sample_size: Number of data points sampled to build the index
  • index_metric: Distance metric used by the index (L2, INNER_PRODUCT, or COSINE)
  • dimension_reduction_target: Target dimension when dimension reduction is enabled

Recommended Instance Types:

  • Training: Single CPU instances
  • Inference: Single CPU instances

K-Means Clustering

K-Means Clustering is an unsupervised learning algorithm that's used for clustering or partitioning data into K distinct groups or clusters. It's based on the idea of minimizing the sum of squared distances between data points and their assigned cluster centroids.

  • Key Features:
    • Clustering: It can be used to group or cluster data points based on their similarities.
    • Unsupervised Learning: It does not require labeled data for training.
    • Iterative Optimization: It iteratively refines the cluster assignments and centroids to minimize the sum of squared distances.
    • Scalability: It can handle large datasets and is optimized for distributed training.

Input Format:

  • Tabular data in supported format (CSV, RecordIO)

Usage Workflow:

  • Preprocessing: Preprocess and normalize data if necessary
  • Training: Initialize centroids and iteratively assign points and update centroids
  • Inference: Assign new points to closest cluster

Critical Hyperparameters:

  • k: Number of clusters
  • init_method: Centroid initialization method (random or kmeans++)
  • max_iterations: Maximum iterations
  • tolerance: Convergence tolerance

Recommended Instance Types:

  • Training: Single or multi-CPU instances
  • Inference: Single CPU instances
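A minimal sketch using the SDK's first-party KMeans estimator (placeholder role, toy NumPy data):

```python
import numpy as np
from sagemaker import KMeans, Session

session = Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder

kmeans = KMeans(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    k=5,  # number of clusters
    sagemaker_session=session,
)

train = np.random.randn(5000, 8).astype("float32")  # toy feature matrix

# record_set handles conversion to RecordIO-Protobuf and upload to S3
kmeans.fit(kmeans.record_set(train))

# Deploy and assign a few points to their closest clusters
predictor = kmeans.deploy(initial_instance_count=1, instance_type="ml.m5.large")
result = predictor.predict(train[:3])  # each prediction includes closest_cluster and distance_to_cluster
```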

Specialized Algorithms

IP Insights

IP Insights is an unsupervised learning algorithm that learns the usage patterns of IPv4 addresses. It captures associations between entities (such as user names or account IDs) and the IP addresses they connect from, so that events like a login from an unusual IP address can be flagged as suspicious. It's useful for fraud detection, account-takeover detection, and security monitoring.

  • Key Features:
    • Anomalous Association Detection: It scores how unusual it is for a given entity to be paired with a given IP address.
    • Unsupervised Learning: It trains on historical (entity, IP address) event pairs without requiring labels.
    • Neural Embeddings: It learns vector representations of entities and IP addresses using a neural network trained with negative sampling.
    • Scalability: It can handle large volumes of event data and supports GPU training.

Input Format:

  • Headerless CSV with two columns: an entity identifier and an IPv4 address (see the sketch below)
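A minimal sketch (hypothetical entities, addresses, and file name) of the two-column CSV that IP Insights trains on:

```python
import csv

# Each row pairs an entity identifier with an IPv4 address it was observed using.
events = [
    ("user_alice", "192.0.2.10"),
    ("user_bob", "198.51.100.7"),
    ("user_alice", "192.0.2.11"),
]

with open("ipinsights.train.csv", "w", newline="") as f:  # upload to the train channel in S3
    writer = csv.writer(f)
    writer.writerows(events)  # no header row
```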

Usage Workflow:

  • Preprocessing: Extract (entity, IP address) pairs from access logs or event data
  • Training: Learn entity and IP address embeddings so that observed pairs score higher than randomly generated negative pairs
  • Inference: Score new (entity, IP address) events; an unusually low compatibility score indicates a potential anomaly

Critical Hyperparameters:

  • num_entity_vectors: Number of entity embedding vectors (hash size; typically set higher than the number of unique entities)
  • vector_dim: Dimension of the entity and IP address embeddings
  • epochs: Number of passes over the training data
  • learning_rate: Learning rate
  • batch_size: Batch size

Recommended Instance Types:

  • Training: Single or multi-GPU instances
  • Inference: Single GPU or CPU instances

PCA: Dimensionality Reduction

PCA is an unsupervised learning algorithm that is used for dimensionality reduction and feature extraction in machine learning. It is a technique that transforms high-dimensional data into a lower-dimensional space while retaining as much of the original data's variance as possible.

  • Key Features:
    • Dimensionality Reduction: It can reduce the number of features or dimensions in the data while preserving the most important information.
    • Feature Extraction: It can extract new, uncorrelated features from the original data.
    • Visualization: It can be used for visualizing high-dimensional data in a lower-dimensional space.
    • Noise Reduction: It can help reduce noise and redundancy in the data.

Input Format:

  • Tabular data in supported format (CSV, RecordIO)

Usage Workflow:

  • Preprocessing: Preprocess and normalize data if necessary
  • Training: Calculate principal components
  • Inference: Project new data onto principal components

Critical Hyperparameters:

  • num_components: Number of principal components to compute
  • algorithm_mode: regular or randomized (randomized scales better to very large datasets)
  • subtract_mean: Whether to center the data during training

Recommended Instance Types:

  • Training: Single CPU instances
  • Inference: Single CPU instances

Hyperparameter Tuning: Optimizing Model Performance

As you've seen, each algorithm exposes its own hyperparameters, and there is no universal best setting. You need to tune them against your own data to get the best model.

What is Hyperparameter Tuning?

Hyperparameter tuning, also known as hyperparameter optimization, is the process of finding the best set of hyperparameters for a machine learning model.

Why is Hyperparameter Tuning Important?

The performance of a machine learning model depends heavily on the choice of hyperparameters: poorly chosen values can mean the difference between a model that underfits, one that overfits, and one that generalizes well. SageMaker Automatic Model Tuning runs many training jobs with different hyperparameter combinations (Bayesian optimization by default) and keeps the best one according to an objective metric you define.

Automatic Model Tuning Best Practices

When tuning your machine learning model automatically, follow these best practices:

1. Optimize Selectively

  • Rule: Don't tweak too many settings at once.
  • Explanation: Adjust a few hyperparameters at a time to understand their impact better.

2. Restrict Parameter Ranges

  • Rule: Keep parameter ranges as narrow as possible.
  • Explanation: Focusing on specific values enhances the efficiency of the tuning process.

3. Logarithmic Scales

  • Rule: Use logarithmic scales when appropriate.
  • Explanation: Logarithmic scaling helps explore a wide range of values, especially for parameters with large value ranges.

4. Manage Concurrent Jobs

  • Rule: Limit the number of simultaneous training jobs.
  • Explanation: Running too many jobs concurrently may hinder the learning process. Opt for a balanced approach.

5. Ensure Metric Consistency

  • Rule: Confirm consistent reporting of objective metrics across multiple instances.
  • Explanation: Accurate metrics ensure reliable evaluation of model performance, especially when utilizing multiple computing instances.

Remember, these practices contribute to a more effective and efficient automatic model tuning process.
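These practices map directly onto the SDK's automatic model tuning API. The sketch below (placeholder role and S3 paths, reusing the built-in XGBoost estimator pattern from earlier) tunes only a few hyperparameters, keeps the ranges narrow, uses a logarithmic scale for the learning rate, and caps concurrency:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder

estimator = Estimator(
    image_uri=image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    output_path="s3://my-bucket/xgboost/output",  # placeholder
    sagemaker_session=session,
)
# eval_metric="auc" makes the algorithm emit the validation:auc metric used as the objective below
estimator.set_hyperparameters(objective="binary:logistic", eval_metric="auc", num_round=100)

# Tune only a few hyperparameters, keep ranges narrow, and use a log scale where appropriate
hyperparameter_ranges = {
    "eta": ContinuousParameter(0.01, 0.3, scaling_type="Logarithmic"),
    "max_depth": IntegerParameter(3, 8),
}

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,
    max_parallel_jobs=2,  # limit concurrent jobs so the Bayesian search can learn between rounds
)

tuner.fit({
    "train": TrainingInput("s3://my-bucket/xgboost/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/xgboost/validation/", content_type="text/csv"),
})
```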

SageMaker Integration with Big Data Tools

For comprehensive information about SageMaker's integration with Apache Spark and big data processing, refer to the following guides:

  1. Data Lakes
  2. Data Engineering
  3. Building Batch Data Analytics Solutions on AWS
Latest SageMaker Features and Capabilities

For information about the newest features and advanced capabilities in Amazon SageMaker, see:

  1. Practical Data Science: From Theory to Production

Conclusion: Choosing the Right Algorithm

Selecting the appropriate SageMaker built-in algorithm depends on several factors:

  • Problem Type: Classification, regression, clustering, or recommendation
  • Data Characteristics: Structured vs. unstructured, time series, text, images
  • Performance Requirements: Speed, accuracy, scalability
  • Resource Constraints: CPU vs. GPU, memory requirements
  • Interpretability Needs: Model explainability requirements

Decision Framework:

  1. Define your problem: Classification, regression, clustering, etc.
  2. Assess your data: Type, size, quality, features
  3. Consider requirements: Latency, throughput, accuracy
  4. Evaluate algorithms: Compare based on your criteria
  5. Prototype and iterate: Test with your data, tune hyperparameters
  6. Scale and deploy: Move to production with SageMaker endpoints

SageMaker's built-in algorithms provide a solid foundation for most machine learning tasks, allowing you to focus on solving business problems rather than implementing complex algorithms from scratch. As you gain experience, you'll develop intuition for which algorithms work best for different scenarios.

Remember, the key to successful machine learning is not just choosing the right algorithm, but also proper data preparation, feature engineering, and continuous model monitoring and improvement.

