API Reference#
Top-level package for SpectoPrep. SpectroPrep: A comprehensive toolkit for spectroscopic data preprocessing and modeling.
This package provides tools for preprocessing spectroscopic data, pipeline optimization, and modeling using Ridge regression.
- class spectoprep.OptimizedRidgeCV(alphas=None, cv=5, scoring='neg_mean_squared_error', fit_intercept=True, normalize=False, gcv_mode=None, store_cv_values=False, groups=None)[source]#
Bases: BaseEstimator, RegressorMixin
Ridge regression with built-in cross-validation and optimization capabilities.
Parameters#
- alphas : array-like, default=np.logspace(-3, 3, 10)
Array of alpha values to try. A large array of values will slow down the computation.
- cv : int, cross-validation generator or an iterable, default=5
Determines the cross-validation splitting strategy.
- scoring : str or callable, default='neg_mean_squared_error'
A string or a scorer callable object / function with signature scorer(estimator, X, y).
- fit_intercept : bool, default=True
Whether to calculate the intercept for this model.
- normalize : bool, default=False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
- gcv_mode : {None, 'auto', 'svd', 'eigen'}, default=None
Flag indicating which strategy to use when performing Generalized Cross-Validation.
- store_cv_values : bool, default=False
Flag indicating if the cross-validation values corresponding to each alpha should be stored in the cv_values_ attribute.
- groups : array-like, default=None
Group labels for the samples. Only used if cv is a group-based cross-validation splitter.
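To make the roles of alphas, cv and scoring concrete, the following is a minimal standard-library sketch of choosing a regularization strength from a logarithmic grid by k-fold cross-validation on negative mean squared error. It is not the implementation of OptimizedRidgeCV: the helper names (logspace, ridge_slope, neg_mse_cv) are hypothetical, the model has a single feature and no intercept, and the interleaved fold assignment is an assumption made only for illustration.

```python
import random


def logspace(start, stop, num):
    """Return `num` values evenly spaced on a log10 scale (like np.logspace)."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]


def ridge_slope(x, y, alpha):
    """Closed-form ridge coefficient for one feature and no intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)


def neg_mse_cv(x, y, alpha, n_folds=5):
    """Mean negative MSE over `n_folds` interleaved folds (higher is better)."""
    scores = []
    for k in range(n_folds):
        train = [i for i in range(len(x)) if i % n_folds != k]
        test = [i for i in range(len(x)) if i % n_folds == k]
        w = ridge_slope([x[i] for i in train], [y[i] for i in train], alpha)
        mse = sum((y[i] - w * x[i]) ** 2 for i in test) / len(test)
        scores.append(-mse)
    return sum(scores) / n_folds


# Synthetic data: y is roughly 2 * x plus a little Gaussian noise.
rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(50)]
y = [2.0 * xi + rng.gauss(0, 0.1) for xi in x]

# Same grid as the documented default, np.logspace(-3, 3, 10).
alphas = logspace(-3, 3, 10)
best_alpha = max(alphas, key=lambda a: neg_mse_cv(x, y, a))
print(f"selected alpha: {best_alpha:.4g}")
```

Each alpha costs one fit per fold, so the work grows with the product of the grid size and the number of folds. This is why a large alphas array slows the search.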
- fit(X, y, sample_weight=None)[source]#
Fit Ridge regression model with cross-validation.
Parameters#
- X : array-like of shape (n_samples, n_features)
Training data.
- y : array-like of shape (n_samples,) or (n_samples, n_targets)
Target values.
- sample_weight : float or array-like of shape (n_samples,), default=None
Individual weights for each sample.
Returns#
- self : object
Returns self.
- get_cv_results()[source]#
Return cross-validation results.
Returns#
- cv_results : dict
Results from cross-validation.
- predict(X)[source]#
Predict using the Ridge model.
Parameters#
- X : array-like of shape (n_samples, n_features)
Samples.
Returns#
- y_pred : array-like of shape (n_samples,) or (n_samples, n_targets)
Returns predicted values.
- score(X, y, sample_weight=None)[source]#
Return the coefficient of determination R^2 of the prediction.
Parameters#
- X : array-like of shape (n_samples, n_features)
Test samples.
- y : array-like of shape (n_samples,) or (n_samples, n_targets)
True values for X.
- sample_weight : array-like of shape (n_samples,), default=None
Sample weights.
Returns#
- score : float
R^2 of self.predict(X) with respect to y.
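As a worked illustration of the returned quantity, the coefficient of determination is R^2 = 1 - SS_res / SS_tot, where SS_res is the (optionally weighted) sum of squared residuals and SS_tot is the sum of squared deviations of y from its (weighted) mean. The sketch below uses only the standard library. The function name r2 is hypothetical, and the weighting scheme follows the common convention rather than anything confirmed about this class.

```python
def r2(y_true, y_pred, sample_weight=None):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    w = sample_weight if sample_weight is not None else [1.0] * len(y_true)
    y_mean = sum(wi * yi for wi, yi in zip(w, y_true)) / sum(w)
    ss_res = sum(wi * (yi - pi) ** 2 for wi, yi, pi in zip(w, y_true, y_pred))
    ss_tot = sum(wi * (yi - y_mean) ** 2 for wi, yi in zip(w, y_true))
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2(y_true, y_true)                  # exact predictions
baseline = r2(y_true, [2.5] * 4)              # always predict the mean of y
worse = r2(y_true, [4.0, 3.0, 2.0, 1.0])      # worse than predicting the mean
print(perfect, baseline, worse)
```

A score of 1 means perfect prediction, 0 matches a constant model that always predicts the mean, and negative values indicate a model that performs worse than that constant baseline.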
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> OptimizedRidgeCV #
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters#
- sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the sample_weight parameter in fit.
Returns#
- self : object
The updated object.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> OptimizedRidgeCV #
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters#
- sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the sample_weight parameter in score.
Returns#
- self : object
The updated object.
- class spectoprep.PipelineOptimizer(X_train: ndarray[tuple[int, ...], dtype[_ScalarType_co]], y_train: ndarray[tuple[int, ...], dtype[_ScalarType_co]], preprocessing_steps: List[str] | None = None, X_test: ndarray[tuple[int, ...], dtype[_ScalarType_co]] | None = None, y_test: ndarray[tuple[int, ...], dtype[_ScalarType_co]] | None = None, cv_method: str = 'group_shuffle_split', n_splits: int = 3, test_size: float = 0.3, n_groups_out: int = 2, random_state: int = 42, groups: ndarray[tuple[int, ...], dtype[_ScalarType_co]] | None = None, max_pipeline_length: int = 5, n_jobs: int = -1, allowed_preprocess_combinations: int | List[int] | Tuple[int, ...] | None = [1, 2], log_level: str = 'INFO')[source]#
Bases: object
A class for optimizing machine learning pipelines using Bayesian optimization. It precomputes possible pipeline configurations and then searches over both the pipeline configuration (encoded as an index) and the hyperparameters.
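The idea of precomputing configurations and searching over an index can be sketched as follows. This is an illustration under stated assumptions, not the class's actual code: it assumes that allowed_preprocess_combinations lists the permitted numbers of preprocessing steps per pipeline, that configurations are unordered combinations (an order-sensitive variant would use itertools.permutations), and that a continuous optimizer suggestion is rounded and clipped to a valid index. The step names and both helper functions are hypothetical.

```python
from itertools import combinations


def enumerate_configs(steps, allowed_sizes=(1, 2)):
    """List every combination of `steps` whose size is in `allowed_sizes`."""
    configs = []
    for size in sorted(allowed_sizes):
        configs.extend(combinations(steps, size))
    return configs


def config_from_index(configs, raw_index):
    """Round a continuous suggestion and clip it to a valid list index."""
    idx = min(max(int(round(raw_index)), 0), len(configs) - 1)
    return configs[idx]


steps = ["snv", "detrend", "savgol"]  # hypothetical preprocessing step names
configs = enumerate_configs(steps)    # three single steps, then three pairs
chosen = config_from_index(configs, 3.6)

for i, cfg in enumerate(configs):
    print(i, cfg)
print("chosen:", chosen)
```

Encoding the configuration as one numeric dimension lets a continuous optimizer treat the discrete pipeline choice in the same search space as the numeric hyperparameters.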
- bayes_objective(**params) -> float [source]#
Objective function for Bayesian optimization.
- Args:
**params: Parameters to evaluate
- Returns:
float: Negative RMSE or penalty score on error
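The "negative RMSE or penalty" contract can be illustrated with a small standard-library sketch. The error is negated because the optimizer maximizes its objective, and a failed evaluation returns a fixed low score instead of raising, so one bad parameter set does not abort the search. The names safe_objective, evaluate and PENALTY, and the penalty value itself, are hypothetical and not taken from the package.

```python
import math

PENALTY = -1e6  # hypothetical score returned when an evaluation fails


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def safe_objective(evaluate, **params):
    """Return -RMSE for `params`, or PENALTY if the evaluation raises."""
    try:
        y_true, y_pred = evaluate(**params)
        return -rmse(y_true, y_pred)
    except Exception:
        return PENALTY


def evaluate(scale):
    """Toy stand-in for fitting a pipeline and predicting held-out data."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    y_true = [1.0, 2.0, 3.0]
    return y_true, [scale * v for v in y_true]


good = safe_objective(evaluate, scale=1.0)   # perfect predictions
off = safe_objective(evaluate, scale=2.0)    # predictions are twice too large
bad = safe_objective(evaluate, scale=-1.0)   # evaluation raises, penalty returned
print(good, off, bad)
```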
- bayesian_optimize(init_points: int = 10, n_iter: int = 50, acquisition_function: str = 'ei') -> Tuple[Dict, Pipeline] [source]#
Run Bayesian optimization to find the best pipeline configuration and hyperparameters.
- Args:
init_points: Number of random initial points
n_iter: Number of Bayesian optimization iterations
acquisition_function: Acquisition function for Bayesian optimization
- Returns:
- Tuple containing:
Dict of best parameters
Fitted Pipeline with best configuration
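Assuming the default 'ei' stands for expected improvement, its usual meaning in Bayesian optimization, the acquisition value of a candidate point can be sketched as below. Given a Gaussian posterior with mean mu and standard deviation sigma at the candidate, and the best score observed so far, the expected improvement for a maximization problem is (mu - best) * Phi(z) + sigma * phi(z) with z = (mu - best) / sigma. The function name and the xi exploration offset are illustrative assumptions, and how the package actually computes its acquisition function is not documented here.

```python
import math


def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected improvement over `best` for maximization, Gaussian posterior."""
    if sigma <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (mu - best - xi) * cdf + sigma * pdf


# A candidate predicted below the current best can still be worth trying
# when the surrogate model is uncertain about it.
print(expected_improvement(mu=0.0, sigma=1.0, best=1.0))
print(expected_improvement(mu=0.0, sigma=2.0, best=1.0))
```

Under this reading, init_points random evaluations seed the surrogate model, and each of the n_iter subsequent iterations evaluates the candidate with the highest acquisition value.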
- export_best_pipeline(file_path: str) -> None [source]#
Export the best pipeline configuration and hyperparameters to a file.
- Args:
file_path: Path to save the export file
- Raises:
AttributeError: If optimizer hasn’t been run yet
- get_all_tested_pipelines() -> List[Dict] [source]#
Get details of all tested pipeline configurations.
- Returns:
List of dictionaries with pipeline details
- get_best_pipeline_predictions(best_pipeline: Pipeline) -> Tuple[ndarray[tuple[int, ...], dtype[_ScalarType_co]], float, float] [source]#
Get predictions using the best pipeline.
- Args:
best_pipeline: Fitted pipeline object
- Returns:
- Tuple containing:
Predictions array
RMSE score
R² score
- class spectoprep.SpectroPrepPlotter[source]#
Bases: object
A class for creating high-quality plots for spectroscopy data.
This class provides various plotting functions specifically designed for spectroscopy data and pipeline optimization results.
- static plot_feature_importance(wavenumbers: ndarray, coefficients: ndarray, title: str = 'Feature Importance', xlabel: str = 'Wavenumber (cm$^{-1}$)', ylabel: str = 'Coefficient Value', figsize: Tuple[int, int] = (12, 6), color: str = 'purple', highlight_threshold: float | None = None, highlight_color: str = 'red', save_path: str | None = None)[source]#
Plot feature importance from model coefficients.
Parameters#
- wavenumbers : array-like
The x-axis values (wavenumbers).
- coefficients : array-like
Model coefficients corresponding to each wavenumber.
- title : str, default='Feature Importance'
Plot title.
- xlabel : str, default='Wavenumber (cm$^{-1}$)'
X-axis label.
- ylabel : str, default='Coefficient Value'
Y-axis label.
- figsize : tuple, default=(12, 6)
Figure size.
- color : str, default='purple'
Color of the line.
- highlight_threshold : float, optional
If provided, highlights coefficients with absolute values above this threshold.
- highlight_color : str, default='red'
Color for highlighted coefficients.
- save_path : str, optional
If provided, save the figure to this path.
Returns#
- fig : matplotlib.figure.Figure
The figure object.
- ax : matplotlib.axes.Axes
The axes object.
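The highlight_threshold rule, selecting coefficients whose absolute value exceeds the threshold, can be sketched without any plotting code. The data and the function name highlighted_wavenumbers are hypothetical. The sketch only shows which points the rule would pick out, not how the method draws them.

```python
def highlighted_wavenumbers(wavenumbers, coefficients, threshold):
    """Return the wavenumbers whose coefficient magnitude exceeds `threshold`."""
    return [wn for wn, c in zip(wavenumbers, coefficients) if abs(c) > threshold]


wavenumbers = [1000, 1002, 1004, 1006, 1008]   # made-up axis values
coefficients = [0.05, -0.8, 0.3, 1.2, -0.1]    # made-up model coefficients

selected = highlighted_wavenumbers(wavenumbers, coefficients, threshold=0.5)
print(selected)
```

Because the comparison uses the absolute value, strongly negative coefficients are highlighted as well as strongly positive ones.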
- static plot_optimization_progress(optimizer: PipelineOptimizer, figsize: Tuple[int, int] = (12, 6), title: str = 'Optimization Progress', save_path: str | None = None)[source]#
Plot optimization progress over iterations.
Parameters#
- optimizer : PipelineOptimizer
The fitted pipeline optimizer.
- figsize : tuple, default=(12, 6)
Figure size.
- title : str, default='Optimization Progress'
Plot title.
- save_path : str, optional
If provided, save the figure to this path.
Returns#
- fig : matplotlib.figure.Figure
The figure object.
- ax : matplotlib.axes.Axes
The axes object.
- static plot_optimization_results(optimizer: PipelineOptimizer, top_n: int = 5, figsize: Tuple[int, int] = (12, 8), title: str = 'Pipeline Optimization Results', save_path: str | None = None)[source]#
Plot optimization results from PipelineOptimizer.
Parameters#
- optimizer : PipelineOptimizer
The fitted pipeline optimizer.
- top_n : int, default=5
Number of top pipelines to display.
- figsize : tuple, default=(12, 8)
Figure size.
- title : str, default='Pipeline Optimization Results'
Plot title.
- save_path : str, optional
If provided, save the figure to this path.
Returns#
- fig : matplotlib.figure.Figure
The figure object.
- static plot_prediction_scatter(y_true: ndarray, y_pred: ndarray, title: str = 'Prediction Performance', xlabel: str = 'Measured', ylabel: str = 'Predicted', figsize: Tuple[int, int] = (10, 8), alpha: float = 0.7, color: str = 'blue', add_metrics: bool = True, save_path: str | None = None)[source]#
Create a scatter plot of predicted vs true values.
Parameters#
- y_true : array-like
True target values.
- y_pred : array-like
Predicted target values.
- title : str, default='Prediction Performance'
Plot title.
- xlabel : str, default='Measured'
X-axis label.
- ylabel : str, default='Predicted'
Y-axis label.
- figsize : tuple, default=(10, 8)
Figure size.
- alpha : float, default=0.7
Transparency of the points.
- color : str, default='blue'
Color of the scatter points.
- add_metrics : bool, default=True
Whether to add RMSE and R² metrics to the plot.
- save_path : str, optional
If provided, save the figure to this path.
Returns#
- fig : matplotlib.figure.Figure
The figure object.
- ax : matplotlib.axes.Axes
The axes object.
- static plot_preprocessing_comparison(wavenumbers: ndarray, original_spectra: ndarray, processed_spectra: Dict[str, ndarray], sample_indices: List[int] | None = None, figsize: Tuple[int, int] = (15, 10), title: str = 'Preprocessing Comparison', color_map: str = 'tab10', save_path: str | None = None)[source]#
Plot comparison of original and processed spectra.
Parameters#
- wavenumbers : array-like
The x-axis values (wavenumbers).
- original_spectra : array-like
The original spectra data of shape (n_samples, n_features).
- processed_spectra : dict
Dictionary mapping preprocessing method names to processed spectra.
- sample_indices : list of int, optional
Indices of samples to plot. If None, all samples are plotted.
- figsize : tuple, default=(15, 10)
Figure size.
- title : str, default='Preprocessing Comparison'
Main title for the figure.
- color_map : str, default='tab10'
Colormap for differentiating samples.
- save_path : str, optional
If provided, save the figure to this path.
Returns#
- fig : matplotlib.figure.Figure
The figure object.
- static plot_spectra(wavenumbers: ndarray, spectra: ndarray, labels: List[str] | None = None, title: str = 'Spectral Data', xlabel: str = 'Wavenumber (cm$^{-1}$)', ylabel: str = 'Absorbance', alpha: float = 0.7, figsize: Tuple[int, int] = (12, 6), color_map: str = 'viridis', legend_loc: str = 'best', grid: bool = True, save_path: str | None = None)[source]#
Plot spectral data.
Parameters#
- wavenumbers : array-like
The x-axis values (wavenumbers).
- spectra : array-like
The spectra data of shape (n_samples, n_features).
- labels : list of str, optional
Labels for each spectrum. If None, spectra are numbered.
- title : str, default='Spectral Data'
Plot title.
- xlabel : str, default='Wavenumber (cm$^{-1}$)'
X-axis label.
- ylabel : str, default='Absorbance'
Y-axis label.
- alpha : float, default=0.7
Transparency of the lines.
- figsize : tuple, default=(12, 6)
Figure size.
- color_map : str, default='viridis'
Colormap for the spectra.
- legend_loc : str, default='best'
Location of the legend.
- grid : bool, default=True
Whether to show grid.
- save_path : str, optional
If provided, save the figure to this path.
Returns#
- fig : matplotlib.figure.Figure
The figure object.
- ax : matplotlib.axes.Axes
The axes object.