Practical Data Science: From Theory to Production
Data science bridges the gap between raw data and actionable business insights. This comprehensive guide explores practical data science methodologies, machine learning workflows, and production deployment strategies. Whether you're a beginner or an experienced practitioner, understanding these concepts will help you build robust, scalable data science solutions.
The Data Science Process: A Systematic Approach
Data science is both an art and a science, requiring technical expertise, domain knowledge, and business acumen. The process involves transforming raw data into meaningful insights through a structured methodology.
Core Data Science Activities
Data Collection and Integration:
- Gathering data from multiple sources
- Ensuring data quality and consistency
- Building scalable data pipelines
Data Exploration and Understanding:
- Statistical analysis and visualization
- Identifying patterns and anomalies
- Feature engineering and selection
Model Development:
- Algorithm selection and implementation
- Hyperparameter tuning and optimization
- Cross-validation and performance evaluation
Production Deployment:
- Model serving and inference
- Monitoring and maintenance
- Continuous improvement and iteration
Machine Learning Use Cases and Applications
Machine learning powers many modern applications across industries. Understanding common use cases helps in selecting appropriate approaches and evaluating success metrics.
Classification Problems
Binary Classification: Predicting one of two possible outcomes
- Email Spam Detection: Classify emails as spam or legitimate
- Credit Risk Assessment: Determine loan approval likelihood
- Medical Diagnosis: Predict disease presence/absence
- Customer Churn Prediction: Identify customers likely to leave
Multiclass Classification: Predicting one of multiple categorical outcomes
- Image Classification: Categorize objects in images
- Document Classification: Sort documents by topic or type
- Sentiment Analysis: Determine emotional tone of text
- Product Categorization: Assign products to categories
Regression Problems
Continuous Value Prediction:
- House Price Prediction: Estimate property values
- Sales Forecasting: Predict future sales volumes
- Demand Planning: Forecast product demand
- Risk Scoring: Calculate probability scores
Clustering and Unsupervised Learning
Pattern Discovery:
- Customer Segmentation: Group customers by behavior
- Anomaly Detection: Identify unusual patterns
- Topic Modeling: Discover themes in text data
- Recommendation Systems: Group similar items/users
Specialized Applications
Time Series Forecasting:
- Stock price prediction
- Weather forecasting
- Capacity planning
- Trend analysis
Natural Language Processing:
- Text classification and sentiment analysis
- Machine translation
- Chatbots and conversational AI
- Document summarization
Machine Learning Model Benefits and Capabilities
ML models offer significant advantages over traditional rule-based systems, enabling automation, scalability, and continuous improvement.
Pattern Recognition at Scale
Complex Pattern Discovery: ML algorithms can identify intricate patterns in high-dimensional data that would be impossible for humans to detect manually.
# Example: Identifying complex customer behavior patterns
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Customer transaction data
data = pd.DataFrame({
    'purchase_frequency': [5, 2, 8, 1, 6, 3, 9, 4, 7, 2],
    'avg_order_value': [150, 75, 300, 50, 200, 100, 400, 125, 250, 80],
    'browsing_time': [120, 45, 180, 30, 150, 60, 210, 90, 170, 40]
})
# Standardize features
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
# Apply clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(scaled_data)
# Add cluster labels to original data
data['customer_segment'] = clusters
print(data.groupby('customer_segment').mean())
Minimizing Human Error and Bias
Consistent Decision Making: ML models apply the same criteria consistently, reducing subjective judgment and human error.
Bias Detection and Mitigation: Modern ML frameworks include tools for detecting and addressing algorithmic bias.
# Example: Bias detection in model predictions
from sklearn.metrics import confusion_matrix
import numpy as np
# Model predictions vs actual outcomes
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 1]) # Actual labels
y_pred = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 1]) # Model predictions
# Confusion matrix analysis
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
# Calculate error rates (comparing these rates across subgroups is the basis for bias assessment)
tn, fp, fn, tp = cm.ravel()
false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)
print(f"False Positive Rate: {false_positive_rate:.3f}")
print(f"False Negative Rate: {false_negative_rate:.3f}")
Continuous Learning and Adaptation
Model Improvement Over Time: ML models can be retrained with new data to maintain and improve performance.
Online Learning: Some algorithms support incremental learning without full retraining.
# Example: Online learning with SGDClassifier
from sklearn.linear_model import SGDClassifier
import numpy as np
# Initialize online learner
clf = SGDClassifier(loss='log_loss', random_state=42)  # 'log_loss' gives logistic regression
# Simulate streaming data
for batch in range(10):
    # Generate a batch of training data
    X_batch = np.random.randn(100, 4)
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    # Partial fit on the batch
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])
    # Evaluate current performance on this batch
    accuracy = clf.score(X_batch, y_batch)
    print(f"Batch {batch + 1} Accuracy: {accuracy:.3f}")
Handling Diverse Data Types
Multimodal Data Processing: ML can process various data types simultaneously (see the sketch after this list):
- Structured Data: Tabular data, databases
- Unstructured Data: Text, images, audio, video
- Time Series Data: Sequential data with temporal dependencies
- Graph Data: Network and relationship data
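As a minimal sketch of the structured case (hypothetical column names; scikit-learn's ColumnTransformer assumed), numeric and categorical columns can be routed through separate preprocessing steps and fed to a single model:
# Sketch: preprocessing mixed numeric and categorical columns in one pipeline
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
# Hypothetical mixed-type dataset
mixed_df = pd.DataFrame({
    'age': [25, 32, 47, 51, 23, 38],
    'income': [40000, 52000, 81000, 90000, 36000, 60000],
    'channel': ['web', 'store', 'web', 'app', 'app', 'store'],
    'churned': [0, 0, 1, 1, 0, 1]
})
X = mixed_df[['age', 'income', 'channel']]
y = mixed_df['churned']
# Route each column type through its own preprocessing
preprocessor = ColumnTransformer([
    ('numeric', StandardScaler(), ['age', 'income']),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['channel'])
])
# Chain preprocessing and a classifier into a single pipeline
model = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', LogisticRegression())
])
model.fit(X, y)
print(model.predict(X))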
Building Quality Datasets: Best Practices
Creating high-quality datasets is fundamental to successful ML projects. Poor data quality leads to poor model performance, regardless of algorithm sophistication.
Data Collection Strategies
Comprehensive Data Sources:
- Internal databases and transaction logs
- External APIs and third-party data providers
- User-generated content and social media
- IoT sensors and device telemetry
- Public datasets and research repositories
Data Volume Considerations:
- Minimum Viable Data: Enough samples for initial model training
- Statistical Significance: Sufficient data for reliable performance estimates
- Long-tail Phenomena: Adequate representation of rare events
- Temporal Coverage: Data spanning relevant time periods
Data Quality Assessment Framework
Completeness:
- Missing value analysis
- Null value patterns
- Data sparsity evaluation
# Data completeness analysis
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt
# Load dataset
df = pd.read_csv('customer_data.csv')
# Missing value summary
print("Missing Value Summary:")
print(df.isnull().sum())
# Visualize missing patterns
msno.matrix(df)
plt.title('Missing Value Patterns')
plt.show()
# Missing value percentages
missing_percentages = (df.isnull().sum() / len(df)) * 100
print("\nMissing Value Percentages:")
print(missing_percentages[missing_percentages > 0])
Accuracy:
- Cross-validation with known standards
- Outlier detection and validation
- Business rule compliance checking
Consistency:
- Format standardization
- Unit normalization
- Categorical value harmonization
Timeliness (a short sketch of the accuracy, consistency, and timeliness checks follows this list):
- Data freshness assessment
- Temporal consistency checks
- Staleness impact evaluation
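A minimal pandas sketch of these checks, assuming a hypothetical orders table with country_code, unit_price, and order_date columns:
# Sketch: simple accuracy, consistency, and timeliness checks
import pandas as pd
orders = pd.DataFrame({
    'country_code': ['us', 'US', 'de', 'FR'],
    'unit_price': [19.99, -5.00, 42.50, 10.00],
    'order_date': ['2024-01-05', '2024-02-10', '2023-06-01', '2024-03-15']
})
# Consistency: standardize categorical formats
orders['country_code'] = orders['country_code'].str.upper()
# Accuracy: flag rows that violate a business rule (prices must be positive)
rule_violations = orders[orders['unit_price'] <= 0]
print(f"Business rule violations: {len(rule_violations)}")
# Timeliness: measure staleness against a fixed reference date
orders['order_date'] = pd.to_datetime(orders['order_date'])
staleness_days = (pd.Timestamp('2024-04-01') - orders['order_date']).dt.days
print(f"Records older than 180 days: {(staleness_days > 180).sum()}")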
Data Preparation Workflow
Data Cleaning:
# Comprehensive data cleaning pipeline
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
def clean_dataset(df):
    """Clean and prepare dataset for ML modeling."""
    # Remove duplicate rows
    df = df.drop_duplicates()
    # Handle missing values by column type
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    categorical_cols = df.select_dtypes(include=['object']).columns
    # Numeric imputation
    numeric_imputer = SimpleImputer(strategy='median')
    df[numeric_cols] = numeric_imputer.fit_transform(df[numeric_cols])
    # Categorical imputation
    categorical_imputer = SimpleImputer(strategy='most_frequent')
    df[categorical_cols] = categorical_imputer.fit_transform(df[categorical_cols])
    # Outlier treatment (IQR method for numeric columns)
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Cap outliers at the IQR bounds
        df[col] = np.clip(df[col], lower_bound, upper_bound)
    return df
# Apply cleaning
cleaned_df = clean_dataset(df)
Feature Engineering:
# Feature engineering examples
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, KBinsDiscretizer
def engineer_features(df):
    """Create new features from existing data."""
    # Date-based features
    if 'transaction_date' in df.columns:
        df['transaction_date'] = pd.to_datetime(df['transaction_date'])
        df['transaction_day'] = df['transaction_date'].dt.day
        df['transaction_month'] = df['transaction_date'].dt.month
        df['transaction_year'] = df['transaction_date'].dt.year
        df['is_weekend'] = df['transaction_date'].dt.weekday >= 5
    # Interaction features
    if 'quantity' in df.columns and 'unit_price' in df.columns:
        df['total_value'] = df['quantity'] * df['unit_price']
    # Binning continuous variables
    if 'age' in df.columns:
        discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
        df['age_group'] = discretizer.fit_transform(df[['age']]).ravel()
    # Polynomial interaction features
    numeric_cols = ['age', 'income', 'total_value']
    existing_cols = [col for col in numeric_cols if col in df.columns]
    if len(existing_cols) >= 2:
        poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
        poly_features = poly.fit_transform(df[existing_cols])
        poly_feature_names = poly.get_feature_names_out(existing_cols)
        # The first len(existing_cols) columns are the original features; add only the interactions
        for i, name in enumerate(poly_feature_names[len(existing_cols):]):
            df[f'poly_{name}'] = poly_features[:, len(existing_cols) + i]
    return df
# Apply feature engineering
featured_df = engineer_features(cleaned_df)
Correlation Analysis and Feature Selection
Understanding relationships between variables is crucial for effective model building and feature selection.
Correlation Matrix Analysis
Pearson Correlation: Measures linear relationships between continuous variables.
# Correlation analysis
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Calculate correlation matrix (numeric columns only)
numeric_df = df.select_dtypes(include='number')
correlation_matrix = numeric_df.corr(method='pearson')
# Visualize correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
# Identify highly correlated feature pairs
high_corr_threshold = 0.8
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i + 1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > high_corr_threshold:
            high_corr_pairs.append((
                correlation_matrix.columns[i],
                correlation_matrix.columns[j],
                correlation_matrix.iloc[i, j]
            ))
print("Highly Correlated Feature Pairs:")
for pair in high_corr_pairs:
    print(f"{pair[0]} - {pair[1]}: {pair[2]:.3f}")
Spearman Rank Correlation: Measures monotonic relationships, useful for ordinal data.
# Spearman correlation for non-parametric relationships
spearman_corr = numeric_df.corr(method='spearman')
# Compare Pearson vs Spearman
comparison = pd.DataFrame({
    'Pearson': correlation_matrix.values.flatten(),
    'Spearman': spearman_corr.values.flatten()
})
# Plot comparison
plt.figure(figsize=(8, 6))
plt.scatter(comparison['Pearson'], comparison['Spearman'], alpha=0.5)
plt.xlabel('Pearson Correlation')
plt.ylabel('Spearman Correlation')
plt.title('Pearson vs Spearman Correlation Comparison')
plt.plot([-1, 1], [-1, 1], 'r--', alpha=0.7)
plt.grid(True, alpha=0.3)
plt.show()
Feature Selection Techniques
Filter Methods:
# Univariate feature selection
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
from sklearn.datasets import make_regression
# Generate sample data
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
feature_names = [f'feature_{i}' for i in range(20)]
# F-test based selection
selector_f = SelectKBest(score_func=f_regression, k=10)
X_selected_f = selector_f.fit_transform(X, y)
# Get selected feature indices and scores
selected_indices = selector_f.get_support(indices=True)
selected_scores = selector_f.scores_[selected_indices]
print("Top 10 features by F-test:")
for idx, score in zip(selected_indices, selected_scores):
    print(f"{feature_names[idx]}: {score:.3f}")
# Mutual information based selection
selector_mi = SelectKBest(score_func=mutual_info_regression, k=10)
X_selected_mi = selector_mi.fit_transform(X, y)
selected_indices_mi = selector_mi.get_support(indices=True)
selected_scores_mi = selector_mi.scores_[selected_indices_mi]
print("\nTop 10 features by Mutual Information:")
for idx, score in zip(selected_indices_mi, selected_scores_mi):
    print(f"{feature_names[idx]}: {score:.3f}")
Wrapper Methods:
# Recursive Feature Elimination
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
# RFE with linear regression
estimator = LinearRegression()
selector = RFE(estimator, n_features_to_select=10, step=1)
selector = selector.fit(X, y)
# Get selected features
selected_features = [feature_names[i] for i in range(len(feature_names)) if selector.support_[i]]
ranking = selector.ranking_
print("Selected features by RFE:")
for feature, rank in zip(feature_names, ranking):
    if rank == 1:
        print(f"✓ {feature}")
    else:
        print(f"✗ {feature} (rank: {rank})")
Decision Trees: Interpretable ML Models
Decision trees provide transparent, interpretable models that mimic human decision-making processes.
Decision Tree Fundamentals
Tree Structure:
- Root Node: Starting point with the most important feature
- Internal Nodes: Decision points based on feature values
- Leaf Nodes: Final predictions or classifications
Advantages:
- Easy to understand and interpret
- Handle both numerical and categorical data
- Require minimal data preprocessing
- Can capture non-linear relationships
# Decision tree implementation
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import make_classification
# Generate sample data
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=2, random_state=42)
# Train decision tree
dt_classifier = DecisionTreeClassifier(max_depth=4, random_state=42)
dt_classifier.fit(X, y)
# Visualize tree
plt.figure(figsize=(20, 10))
plot_tree(dt_classifier, feature_names=[f'X{i}' for i in range(10)],
          class_names=['Class 0', 'Class 1'], filled=True, rounded=True)
plt.title('Decision Tree Visualization')
plt.show()
# Feature importance
feature_importance = pd.DataFrame({
    'feature': [f'X{i}' for i in range(10)],
    'importance': dt_classifier.feature_importances_
}).sort_values('importance', ascending=False)
print("Feature Importance:")
print(feature_importance)
Decision Tree Hyperparameters
Tree Complexity Control (a tuning sketch follows these lists):
- max_depth: Maximum tree depth
- min_samples_split: Minimum samples required to split a node
- min_samples_leaf: Minimum samples required in a leaf node
- max_features: Number of features to consider for best split
Pruning Parameters:
- ccp_alpha: Cost complexity pruning parameter
- min_impurity_decrease: Minimum impurity decrease for splits
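A minimal tuning sketch over these parameters, reusing X and y from the example above (the grid values are illustrative, not recommendations):
# Sketch: grid search over complexity and pruning parameters
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
param_grid = {
    'max_depth': [3, 5, None],
    'min_samples_leaf': [1, 5, 20],
    'ccp_alpha': [0.0, 0.001, 0.01]
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
search.fit(X, y)
print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validation accuracy: {search.best_score_:.3f}")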
XGBoost: Gradient Boosting Powerhouse
XGBoost (Extreme Gradient Boosting) is a highly efficient, scalable implementation of the gradient boosting framework.
XGBoost Architecture
Core Algorithm:
- Ensemble of weak learners (typically decision trees)
- Sequential training with each tree correcting previous errors
- Gradient descent optimization in functional space
Key Innovations:
- Regularization to prevent overfitting
- Weighted quantile sketch for approximate tree learning
- Sparsity-aware split finding
- Parallel and distributed computing
# XGBoost implementation
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Generate sample data
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Convert to DMatrix (XGBoost's internal data structure)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
# Set parameters
params = {
    'objective': 'binary:logistic',
    'max_depth': 6,
    'eta': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'eval_metric': 'auc'
}
# Train model with early stopping on the evaluation sets
bst = xgb.train(params, dtrain, num_boost_round=100,
                evals=[(dtrain, 'train'), (dtest, 'test')],
                early_stopping_rounds=10, verbose_eval=False)
# Make predictions
y_pred_prob = bst.predict(dtest)
y_pred = (y_pred_prob > 0.5).astype(int)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Feature importance
xgb.plot_importance(bst, max_num_features=10)
plt.title('XGBoost Feature Importance')
plt.show()
XGBoost Hyperparameter Optimization
Core Parameters:
- eta (learning_rate): Step size shrinkage (0.01-0.3)
- max_depth: Maximum tree depth (3-10)
- subsample: Row subsampling ratio (0.5-1.0)
- colsample_bytree: Column subsampling ratio (0.5-1.0)
Regularization Parameters:
- alpha (reg_alpha): L1 regularization
- lambda (reg_lambda): L2 regularization
- gamma: Minimum loss reduction for split
# Hyperparameter tuning with grid search
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
# Parameter grid (learning_rate is the sklearn-API name for eta)
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}
# XGBoost classifier for sklearn API
xgb_clf = xgb.XGBClassifier(objective='binary:logistic', n_estimators=100, random_state=42)
# Grid search (reuses X_train/X_test from the previous example)
grid_search = GridSearchCV(xgb_clf, param_grid, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
# Evaluate the best model on the held-out test set
best_model = grid_search.best_estimator_
final_predictions = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, final_predictions)
print(f"Final test accuracy: {final_accuracy:.4f}")
Model Deployment Strategies with Amazon SageMaker
Inference Options
Real-Time Inference:
- Synchronous predictions with low latency
- Ideal for user-facing applications
- Auto-scaling based on traffic patterns
Batch Inference:
- Asynchronous processing of large datasets
- Cost-effective for periodic predictions
- Suitable for offline analysis and reporting
Asynchronous Inference:
- Queue-based processing for large payloads
- Near real-time results with high throughput
- Built-in retry mechanisms and error handling
Choosing Deployment Options
Decision Framework:
- Latency Requirements: Real-time (<100ms) vs batch processing
- Throughput Needs: Requests per second capacity
- Cost Optimization: Compute resource utilization
- Scalability: Traffic pattern variability
- Integration: Existing system compatibility
# SageMaker real-time endpoint deployment
import sagemaker
from sagemaker import get_execution_role
from sagemaker.model import Model
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = get_execution_role()
# Create model
model = Model(
    model_data='s3://my-bucket/models/xgb-model.tar.gz',
    image_uri='683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.5-1',
    role=role,
    sagemaker_session=sagemaker_session
)
# Deploy to endpoint
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    endpoint_name='customer-churn-endpoint',
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer()
)
# Make real-time predictions
test_data = [[5.1, 3.5, 1.4, 0.2], [6.2, 3.4, 5.4, 2.3]]  # Example features
predictions = predictor.predict(test_data)
print(f"Predictions: {predictions}")
# Clean up
predictor.delete_endpoint()
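For the batch inference option described earlier, a minimal sketch using SageMaker batch transform (assuming the same model object as above and hypothetical S3 paths):
# Sketch: batch inference with a SageMaker batch transform job
transformer = model.transformer(
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://my-bucket/batch-predictions/'
)
# Score a CSV dataset stored in S3, one record per line
transformer.transform(
    data='s3://my-bucket/batch-input/customers.csv',
    content_type='text/csv',
    split_type='Line'
)
transformer.wait()
print(f"Batch predictions written to: {transformer.output_path}")
Because the transform job spins up instances only for the duration of the run, this option avoids paying for an always-on endpoint when predictions are needed periodically.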
ML Lifecycle and MLOps Integration
End-to-End ML Workflow
Development Phase:
- Problem definition and data collection
- Exploratory data analysis and feature engineering
- Model development and validation
- Performance optimization
Production Phase:
- Model deployment and serving
- Monitoring and alerting
- Performance tracking and drift detection (see the sketch after this list)
- Continuous improvement and retraining
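Drift detection in particular can be sketched with a simple two-sample test that compares a feature's training-time distribution against recent production values (one approach among many; scipy assumed, data simulated):
# Sketch: detecting feature drift with a Kolmogorov-Smirnov test
import numpy as np
from scipy.stats import ks_2samp
# Simulated feature values from training time and from production traffic
training_feature = np.random.normal(loc=0.0, scale=1.0, size=5000)
production_feature = np.random.normal(loc=0.3, scale=1.2, size=5000)
statistic, p_value = ks_2samp(training_feature, production_feature)
print(f"KS statistic: {statistic:.3f}, p-value: {p_value:.4f}")
# A small p-value suggests the distributions differ, i.e. possible drift
if p_value < 0.01:
    print("Feature distribution has shifted; consider investigation or retraining")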
MLOps Methodology
People, Process, Technology Integration:
- People: Cross-functional teams with shared ownership
- Process: Standardized workflows and best practices
- Technology: Automated pipelines and tooling
Workflow and Roles:
- Data Scientists: Model development and experimentation
- ML Engineers: Pipeline development and deployment
- DevOps Engineers: Infrastructure and monitoring
- Business Stakeholders: Requirements and validation
SageMaker Studio: Unified ML Development Environment
Key Features and Capabilities
Integrated Development Environment:
- Jupyter notebooks with pre-configured kernels
- Built-in algorithm support and model deployment
- Experiment tracking and model registry
- Collaborative development and sharing
Architecture Overview:
- JupyterServer App: User interface and notebook management
- KernelGateway Apps: Code execution and compute resources
- Gateway Pattern: Secure communication between layers
# SageMaker Studio experiment tracking
# (uses the Run API from sagemaker.experiments, available in SageMaker Python SDK v2.123+)
from sagemaker.session import Session
from sagemaker.experiments import Run
# Initialize SageMaker session
sagemaker_session = Session()
# Create an experiment run for the customer churn model and log against it
with Run(
    experiment_name='customer-churn-experiment',
    run_name='xgboost-trial-1',
    sagemaker_session=sagemaker_session
) as run:
    # Log hyperparameters
    run.log_parameters({
        'max_depth': 6,
        'eta': 0.1,
        'objective': 'binary:logistic'
    })
    # Log metrics
    run.log_metric(name='train_auc', value=0.85)
    run.log_metric(name='validation_auc', value=0.82)
    run.log_metric(name='test_accuracy', value=0.78)
SageMaker Data Wrangler: No-Code Data Preparation
Key Capabilities
Data Ingestion:
- Multiple source connectors (S3, Athena, Redshift, etc.)
- Support for various data formats (CSV, JSON, Parquet)
- Schema inference and data profiling
Data Quality Analysis:
- Automatic anomaly detection
- Missing value analysis and imputation
- Statistical distribution analysis
- Data validation rules
Data Transformation:
- 300+ built-in transformations
- Custom transformation authoring
- Visual transformation pipeline builder
- Automated feature engineering
Model Impact Analysis:
- Quick model accuracy estimation
- Feature importance analysis
- Bias detection and mitigation
- Pre-deployment validation
Regression Model Evaluation Metrics
Error-Based Metrics
Mean Squared Error (MSE):
- Average of squared prediction errors
- Penalizes large errors more heavily
- Sensitive to outliers
Root Mean Squared Error (RMSE):
- Square root of MSE
- Interpretable in original units
- Commonly used in practice
# Regression metrics calculation
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
def evaluate_regression_model(y_true, y_pred):
    """Calculate comprehensive regression metrics."""
    # Error-based metrics
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    # Goodness-of-fit metrics
    r2 = r2_score(y_true, y_pred)
    # Adjusted R-squared
    n = len(y_true)  # Number of observations
    p = 1  # Number of features (adjust based on your model)
    adjusted_r2 = 1 - (1 - r2) * ((n - 1) / (n - p - 1))
    # Mean Absolute Percentage Error (assumes y_true contains no zeros)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    metrics = {
        'MSE': mse,
        'RMSE': rmse,
        'MAE': mae,
        'R²': r2,
        'Adjusted R²': adjusted_r2,
        'MAPE': mape
    }
    return metrics
# Example usage
y_true = np.array([100, 150, 200, 250, 300])
y_pred = np.array([95, 155, 195, 255, 295])
metrics = evaluate_regression_model(y_true, y_pred)
print("Regression Model Evaluation:")
for metric, value in metrics.items():
    print(f"{metric}: {value:.4f}")
Goodness-of-Fit Metrics
R-squared (R²):
- Proportion of variance explained by the model
- Typically ranges from 0 to 1 (higher is better); can be negative when the model fits worse than predicting the mean
- Can be misleading with overfitting
Adjusted R-squared:
- Penalizes addition of irrelevant features
- More reliable for model comparison
- Always lower than or equal to R² (see the sketch after this list)
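A small sketch of this penalty on synthetic data (illustrative only): fitting a linear model with and without added noise features shows that training R² never decreases when irrelevant columns are added, while adjusted R² discounts the spurious gain.
# Sketch: R² vs adjusted R² when irrelevant features are added
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
def adjusted_r2(r2, n, p):
    # n = number of observations, p = number of features
    return 1 - (1 - r2) * ((n - 1) / (n - p - 1))
rng = np.random.default_rng(42)
n = 200
X_informative = rng.normal(size=(n, 3))
y = X_informative @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)
# Add 10 pure-noise features that carry no signal
X_with_noise = np.hstack([X_informative, rng.normal(size=(n, 10))])
for name, X in [('informative only', X_informative), ('with noise features', X_with_noise)]:
    r2 = r2_score(y, LinearRegression().fit(X, y).predict(X))
    print(f"{name}: R² = {r2:.4f}, adjusted R² = {adjusted_r2(r2, n, X.shape[1]):.4f}")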
Practical Considerations
Metric Selection Guidelines:
- MSE/RMSE: When large errors are particularly problematic
- MAE: When all errors should be treated equally
- R²: For understanding explained variance
- MAPE: For percentage-based error interpretation
Business Context Integration:
- Align metrics with business objectives
- Consider cost-benefit trade-offs
- Validate metrics against domain expertise
Conclusion: Building Production-Ready Data Science Solutions
Practical data science requires balancing technical excellence with business value delivery. By following systematic approaches to data preparation, model development, and deployment, organizations can build robust ML solutions that drive real business impact.
Key success factors include:
- Quality Data Foundation: Disciplined data collection and preparation
- Rigorous Evaluation: Comprehensive model validation and testing
- Production Readiness: Scalable deployment and monitoring
- Continuous Improvement: Ongoing optimization and adaptation
The tools and techniques covered in this guide provide a solid foundation for tackling real-world data science challenges. Remember that successful ML implementation requires iteration, experimentation, and continuous learning.