Scikit-learn Integration with Atlas
This module provides seamless integration between scikit-learn regression models and the Atlas optimization framework.
Overview
The integration consists of three main components:
SklearnModelWrapper: Wraps scikit-learn models to work with Atlas’s xarray-based interface
SklearnModelFactory: Factory pattern for creating model wrappers
ModelConfigBuilder: Configuration builder for model setup
Installation
# Assuming Atlas is already installed
pip install scikit-learn xarray pandas numpy
pip install xgboost lightgbm # Optional, for advanced models
Quick Start
1. Loading an Existing Model
from atlas import OptimizerFactory
from atlas.models.sklearn_wrapper import SklearnModelWrapper
# Load pre-trained model
model_wrapper = SklearnModelWrapper(
model_path='path/to/model.pkl',
feature_names=['tv_spend', 'digital_spend', 'radio_spend'],
target_name='revenue',
scaler_path='path/to/scaler.pkl' # Optional
)
# Create optimizer
optimizer = OptimizerFactory.create('scipy', model=model_wrapper)
# Define budget data
import xarray as xr
budget = xr.Dataset({
'tv_spend': xr.DataArray([100000]),
'digital_spend': xr.DataArray([200000]),
'radio_spend': xr.DataArray([50000])
})
# Optimize
constraints = {
'total_budget': 350000,
'bounds': {
'tv_spend': (50000, 200000),
'digital_spend': (100000, 300000),
'radio_spend': (25000, 100000)
}
}
result = optimizer.optimize(budget, constraints)
print(f"Optimal allocation: {result.optimal_budget}")
print(f"Expected outcome: {result.optimal_value}")
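Before running an optimizer, it can help to confirm that the bounds and the total budget are mutually consistent. The sketch below is not part of Atlas: `check_budget_feasibility` is a hypothetical helper that assumes `total_budget` is an equality constraint on the sum of all channel spends, and it reuses the constraint dictionary layout from the example above.

```python
def check_budget_feasibility(constraints):
    """Return (feasible, message) for a constraints dict shaped like the Quick Start example.

    Hypothetical helper, not an Atlas API. Assumes the channels must sum
    exactly to constraints['total_budget'].
    """
    total = constraints['total_budget']
    bounds = constraints['bounds']
    # Each individual bound must be well-ordered
    for name, (low, high) in bounds.items():
        if low > high:
            return False, f"{name}: lower bound {low} exceeds upper bound {high}"
    # The total must lie between the sum of lower bounds and the sum of upper bounds
    min_total = sum(low for low, _ in bounds.values())
    max_total = sum(high for _, high in bounds.values())
    if total < min_total:
        return False, f"total_budget {total} is below the sum of lower bounds {min_total}"
    if total > max_total:
        return False, f"total_budget {total} is above the sum of upper bounds {max_total}"
    return True, "ok"


constraints = {
    'total_budget': 350000,
    'bounds': {
        'tv_spend': (50000, 200000),
        'digital_spend': (100000, 300000),
        'radio_spend': (25000, 100000),
    },
}
feasible, message = check_budget_feasibility(constraints)
print(feasible, message)
```

For the Quick Start numbers the lower bounds sum to 175000 and the upper bounds to 600000, so a total of 350000 is feasible.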
2. Using the Factory Pattern
from atlas.models.sklearn_factory import SklearnModelFactory
# Create from saved model
model_wrapper = SklearnModelFactory.create(
model_path='path/to/model.pkl',
feature_names=['feature1', 'feature2', 'feature3'],
scaler_path='path/to/scaler.pkl'
)
# Create new model
model_wrapper = SklearnModelFactory.create(
model_type='random_forest',
feature_names=['tv', 'digital', 'radio'],
model_params={
'n_estimators': 100,
'max_depth': 10,
'random_state': 42
}
)
3. Configuration-Based Setup
from atlas.models.sklearn_factory import ModelConfigBuilder
# Build configuration
config = (ModelConfigBuilder()
.model_path('models/trained_model.pkl')
.features(['tv', 'digital', 'radio', 'social'])
.target('conversions')
.scaler('models/scaler.pkl')
.dimensions(time_dim='week', channel_dim='channel')
.contribution_method('feature_importance')
.build()
)
# Save configuration
builder = ModelConfigBuilder()
# ... configure ...
builder.save('config.yaml')
# Load from configuration
model_wrapper = SklearnModelFactory.from_config('config.yaml')
Supported Models
The factory supports the following scikit-learn models out of the box:
Linear Models: LinearRegression, Ridge, Lasso, ElasticNet
Tree-Based: RandomForestRegressor, GradientBoostingRegressor, DecisionTreeRegressor
Other: SVR (Support Vector Regression)
External: XGBoost, LightGBM
Registering Custom Models
from sklearn.base import BaseEstimator
from atlas.models.sklearn_factory import SklearnModelFactory
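class CustomModel(BaseEstimator):
    def fit(self, X, y):
        # Implementation goes here; by scikit-learn convention, fit returns self
        return self

    def predict(self, X):
        # Implementation goes here; should return one prediction per row of X
        pass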
# Register the model
SklearnModelFactory.register_model('custom', CustomModel)
# Use it
model_wrapper = SklearnModelFactory.create(
model_type='custom',
feature_names=['f1', 'f2']
)
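The registration step above follows a common registry pattern: a class-level dictionary maps a string key to an estimator class, and `create` looks the key up and instantiates it. The following is a minimal stand-alone sketch of that idea. `MiniFactory` and `MeanModel` are hypothetical names, and Atlas's actual `SklearnModelFactory` may be implemented differently.

```python
class MiniFactory:
    """Toy registry mapping model_type strings to estimator classes (illustrative only)."""

    _registry = {}

    @classmethod
    def register_model(cls, name, model_class):
        # Require the minimal estimator interface before accepting the class
        for method in ('fit', 'predict'):
            if not callable(getattr(model_class, method, None)):
                raise TypeError(f"{model_class.__name__} must define {method}()")
        cls._registry[name] = model_class

    @classmethod
    def create(cls, model_type, model_params=None):
        if model_type not in cls._registry:
            known = ', '.join(sorted(cls._registry)) or 'none'
            raise KeyError(f"Unknown model_type '{model_type}' (registered: {known})")
        return cls._registry[model_type](**(model_params or {}))


class MeanModel:
    """Predicts the training-target mean for every row."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


MiniFactory.register_model('mean', MeanModel)
model = MiniFactory.create('mean').fit([[1], [2], [3]], [10, 20, 30])
print(model.predict([[4], [5]]))  # [20.0, 20.0]
```

Checking for `fit` and `predict` at registration time surfaces an incomplete custom class early, rather than in the middle of an optimization run.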
Feature Contributions
The wrapper automatically handles feature contribution calculations based on model type:
Linear models: Uses coefficients
Tree-based models: Uses feature importances
Other models: Uses permutation importance or equal distribution
# Get contributions
budget_data = xr.Dataset({...})
contributions = model_wrapper.contributions(budget_data)
# Get feature importance
importance = model_wrapper.get_feature_importance()
print(f"Feature importance: {importance}")
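The selection rules listed above can be approximated by inspecting the fitted attributes that scikit-learn estimators expose: linear models have `coef_`, and tree-based models have `feature_importances_`. A minimal sketch, assuming attribute-based detection is acceptable. `pick_contribution_method` is a hypothetical helper, and the returned strings mirror the `contribution_method` values used elsewhere in this document.

```python
def pick_contribution_method(model):
    """Guess a contribution method from a fitted estimator's attributes.

    Hypothetical helper: Atlas's wrapper may use different detection logic.
    """
    if hasattr(model, 'coef_'):
        return 'coef'
    if hasattr(model, 'feature_importances_'):
        return 'feature_importance'
    # Fallback for models exposing neither attribute
    return 'permutation'


# Stand-in objects that mimic fitted scikit-learn estimators
class FakeLinear:
    coef_ = [0.5, 1.2]


class FakeForest:
    feature_importances_ = [0.3, 0.7]


class FakeBlackBox:
    pass


for m in (FakeLinear(), FakeForest(), FakeBlackBox()):
    print(type(m).__name__, '->', pick_contribution_method(m))
```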
Working with Time Series
For time-based optimization:
import pandas as pd
# Create time series data
# 'MS' = month start; the 'M' alias is deprecated in pandas 2.2+
dates = pd.date_range('2024-01-01', periods=12, freq='MS')
budget = xr.Dataset({
'tv': xr.DataArray(
[100000] * 12,
dims=['time'],
coords={'time': dates}
),
'digital': xr.DataArray(
[200000] * 12,
dims=['time'],
coords={'time': dates}
)
})
# Wrapper with time dimension
model_wrapper = SklearnModelWrapper(
model_path='model.pkl',
feature_names=['tv', 'digital'],
time_dim='time'
)
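# Optimize for each time period, using an optimizer built on this wrapper
# (constraint bounds must be keyed by the same names: 'tv', 'digital')
optimizer = OptimizerFactory.create('scipy', model=model_wrapper)
for t in dates:
    period_budget = budget.sel(time=t)
    result = optimizer.optimize(period_budget, constraints)
    print(f"{t}: {result.optimal_budget}")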
Best Practices
1. Feature Scaling
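Scale features for scale-sensitive models (for example regularized linear models and SVR), and always pass the wrapper the same fitted scaler that was used during training:
import joblib
from sklearn.preprocessing import StandardScaler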
# During training
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model.fit(X_scaled, y)
# Save both
joblib.dump(model, 'model.pkl')
joblib.dump(scaler, 'scaler.pkl')
# Use in wrapper
model_wrapper = SklearnModelWrapper(
model_path='model.pkl',
scaler_path='scaler.pkl',
feature_names=feature_names
)
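To see why the scaler must match training, recall that standardization maps each feature to (x - mean) / std using statistics computed on the training set. The sketch below reproduces that arithmetic with NumPy only. The data and the helper names `fit_standardizer` and `apply_standardizer` are made up for illustration.

```python
import numpy as np


def fit_standardizer(X):
    """Compute per-column mean and standard deviation (population std, ddof=0)."""
    return X.mean(axis=0), X.std(axis=0)


def apply_standardizer(X, mean, std):
    """Scale X with previously fitted statistics; never refit on new data."""
    return (X - mean) / std


# Illustrative training spends: columns are [tv, digital]
X_train = np.array([[100.0, 200.0], [200.0, 400.0], [300.0, 600.0]])
mean, std = fit_standardizer(X_train)

# A candidate budget evaluated during optimization
candidate = np.array([[200.0, 600.0]])
print(apply_standardizer(candidate, mean, std))
```

The candidate is scaled with the training mean and standard deviation, not with statistics of its own. Refitting on a single candidate row would give a standard deviation of zero and meaningless inputs, which is why the scaler saved at training time is the one to load.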
2. Model Validation
Validate your model before optimization:
# Validate the input data against what the wrapper expects
model_wrapper.validate_input(budget_data)
# Test predictions
test_pred = model_wrapper.predict(budget_data)
print(f"Prediction shape: {test_pred.shape}")
print(f"Prediction range: [{test_pred.min():.2f}, {test_pred.max():.2f}]")
3. Contribution Methods
Choose the appropriate contribution method:
# For linear models
wrapper = SklearnModelWrapper(
model=linear_model,
contribution_method='coef'
)
# For tree-based models
wrapper = SklearnModelWrapper(
model=rf_model,
contribution_method='feature_importance'
)
# For black-box models
wrapper = SklearnModelWrapper(
model=complex_model,
contribution_method='permutation'
)
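For readers unfamiliar with the permutation approach, the idea is to shuffle one feature column at a time and measure how much the prediction error grows. A feature the model relies on produces a large increase, and an ignored feature produces none. The sketch below is a simplified NumPy illustration on made-up data, not Atlas's implementation. `permutation_importance_sketch` is a hypothetical name; scikit-learn ships a full version as `sklearn.inspection.permutation_importance`.

```python
import numpy as np


def permutation_importance_sketch(predict_fn, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((predict_fn(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the link between column j and the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            permuted = np.mean((predict_fn(X_perm) - y) ** 2)
            increases.append(permuted - baseline)
        importances[j] = np.mean(increases)
    return importances


def predict_fn(data):
    # Made-up model: strong dependence on column 0, weak on column 1, none on column 2
    return 5.0 * data[:, 0] + 1.0 * data[:, 1]


rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = predict_fn(X)

importances = permutation_importance_sketch(predict_fn, X, y)
print(importances.round(2))
```

Because the method calls the model repeatedly, it is slower than reading coefficients or built-in importances, which is a reason to reserve it for models that expose neither.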
Troubleshooting
Common Issues
Missing Features Error
# Ensure all required features are present
required = model_wrapper.feature_names
provided = list(budget_data.data_vars)
missing = set(required) - set(provided)
Dimension Mismatch
# Check dimensions match expectations
print(f"Data dimensions: {list(budget_data.dims)}")
print(f"Expected time dim: {model_wrapper.time_dim}")
Scaling Issues
# Ensure scaler matches training
# If predictions seem off, check if scaling was applied during training
Advanced Usage
Multi-Objective Optimization
# Wrap multiple models
revenue_model = SklearnModelWrapper(model_path='revenue.pkl', ...)
cost_model = SklearnModelWrapper(model_path='cost.pkl', ...)
# Use with multi-objective optimizer
from atlas import MultiObjectiveOptimizer
optimizer = MultiObjectiveOptimizer(
models={'revenue': revenue_model, 'cost': cost_model},
objectives={
'revenue': {'direction': 'maximize', 'weight': 0.7},
'cost': {'direction': 'minimize', 'weight': 0.3}
}
)
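One common way to combine objectives like these into a single score is weighted-sum scalarization: maximized objectives are added and minimized objectives are subtracted, each multiplied by its weight. This is an assumption about how such a configuration could be interpreted, not a description of how `MultiObjectiveOptimizer` works internally. `scalarize` is a hypothetical helper and the predictions are made up.

```python
def scalarize(predictions, objectives):
    """Combine per-objective predictions into one score to maximize.

    predictions: dict of objective name -> predicted value
    objectives:  dict of objective name -> {'direction': ..., 'weight': ...}
    """
    score = 0.0
    for name, spec in objectives.items():
        if spec['direction'] == 'maximize':
            sign = 1.0
        elif spec['direction'] == 'minimize':
            sign = -1.0
        else:
            raise ValueError(f"Unknown direction for {name}: {spec['direction']}")
        score += sign * spec['weight'] * predictions[name]
    return score


objectives = {
    'revenue': {'direction': 'maximize', 'weight': 0.7},
    'cost': {'direction': 'minimize', 'weight': 0.3},
}
# Made-up predictions for two candidate allocations
plan_a = {'revenue': 1000.0, 'cost': 400.0}
plan_b = {'revenue': 1100.0, 'cost': 700.0}
print(scalarize(plan_a, objectives), scalarize(plan_b, objectives))
```

With these numbers plan A scores higher than plan B even though plan B predicts more revenue, because its extra cost outweighs the gain. A weighted sum is only meaningful when the objectives are on comparable scales, so normalizing each model's output first is often worthwhile.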
Custom Preprocessing
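import numpy as np
import xarray as xr
from atlas.models.sklearn_wrapper import SklearnModelWrapper

class CustomSklearnWrapper(SklearnModelWrapper):
    def _xarray_to_features(self, x: xr.Dataset) -> np.ndarray:
        # Custom feature extraction
        features = super()._xarray_to_features(x)
        # Add a derived feature: the product of columns 0 and 1 (tv x digital).
        # The underlying model must have been trained with this extra column.
        tv_digital_interaction = features[:, 0] * features[:, 1]
        features = np.column_stack([features, tv_digital_interaction])
        return features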
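The effect of the override can be checked in isolation on a plain array. The sketch below applies the same `np.column_stack` step to made-up data with two columns (assumed here to be tv and digital) and shows that the result gains a third column. `add_interaction` is a hypothetical stand-alone function used only for this check.

```python
import numpy as np


def add_interaction(features):
    """Append the product of the first two columns as a new column."""
    interaction = features[:, 0] * features[:, 1]
    return np.column_stack([features, interaction])


# Two rows of illustrative [tv, digital] values
features = np.array([[1.0, 2.0], [3.0, 4.0]])
expanded = add_interaction(features)
print(expanded.shape)  # (2, 3)
print(expanded)
```

Since the feature matrix now has one more column than `feature_names` lists, the wrapped model needs to have been fitted on the same expanded layout.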
Performance Tips
Use appropriate model complexity - Simpler models often optimize faster
Cache predictions when doing multiple optimizations
Parallelize time series optimizations when possible
Profile your model to identify bottlenecks
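As an illustration of the caching tip, predictions can be memoized on the candidate allocation so that repeated evaluations of the same point skip the model call. This is a minimal sketch using only the standard library. `make_cached_predictor` is a hypothetical helper, `predict_fn` stands in for any function mapping a tuple of spends to a number, and rounding the key is an assumption that trades a little precision for more cache hits.

```python
from functools import lru_cache


def make_cached_predictor(predict_fn, decimals=2, maxsize=4096):
    """Wrap predict_fn(tuple_of_floats) -> float with an LRU cache."""

    @lru_cache(maxsize=maxsize)
    def _cached(key):
        return predict_fn(key)

    def predictor(values):
        # Tuples are hashable; rounding merges near-identical candidates
        key = tuple(round(float(v), decimals) for v in values)
        return _cached(key)

    # Expose cache statistics for inspection
    predictor.cache_info = _cached.cache_info
    return predictor


calls = []


def slow_model(values):
    # Stand-in for an expensive model.predict call
    calls.append(values)
    return sum(values)


predict = make_cached_predictor(slow_model)
predict([100.0, 200.0])
predict([100.0, 200.0])  # served from the cache
print(len(calls), predict.cache_info().hits)
```

A cache like this is only valid while the underlying model stays the same, so it should be rebuilt whenever the model or scaler is reloaded.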
Next Steps
Explore Atlas’s visualization tools for optimization results
Implement custom constraints for your business rules
Consider ensemble approaches for robust optimization
Integrate with your data pipeline for automated optimization