API Reference

Top-level package for SpectoPrep, a comprehensive toolkit for spectroscopic data preprocessing and modeling.

This package provides tools for preprocessing spectroscopic data, optimizing preprocessing pipelines, and modeling with Ridge regression.

class spectoprep.OptimizedRidgeCV(alphas=None, cv=5, scoring='neg_mean_squared_error', fit_intercept=True, normalize=False, gcv_mode=None, store_cv_values=False, groups=None)

Bases: BaseEstimator, RegressorMixin

Ridge regression with built-in cross-validation and optimization capabilities.

Parameters

alphas : array-like, default=np.logspace(-3, 3, 10)

Array of alpha values to try. A large array of values will slow down the computation.

cv : int, cross-validation generator or an iterable, default=5

Determines the cross-validation splitting strategy.

scoring : str or callable, default='neg_mean_squared_error'

A string or a scorer callable object / function with signature scorer(estimator, X, y).

fit_intercept : bool, default=True

Whether to calculate the intercept for this model.

normalize : bool, default=False

This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.

gcv_mode : {None, 'auto', 'svd', 'eigen'}, default=None

Flag indicating which strategy to use when performing Generalized Cross-Validation.

store_cv_values : bool, default=False

Flag indicating if the cross-validation values corresponding to each alpha should be stored in the cv_values_ attribute.

groups : array-like, default=None

Group labels for the samples. Only used if cv is a group-based cross-validation splitter.
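The interaction of alphas, cv, scoring, and groups can be illustrated with a minimal pure-Python sketch. It is not the implementation used by OptimizedRidgeCV: it assumes a single feature, a closed-form ridge fit, and leave-one-group-out splitting, and the helper names (log_grid, fit_ridge_1d, neg_mse, leave_one_group_out, select_alpha) are hypothetical.

```python
from statistics import fmean


def log_grid(start_exp=-3.0, stop_exp=3.0, num=10):
    """Pure-Python counterpart of np.logspace(start_exp, stop_exp, num)."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10.0 ** (start_exp + i * step) for i in range(num)]


def fit_ridge_1d(x, y, alpha):
    """Closed-form ridge fit for one feature with an unpenalized intercept."""
    x_mean, y_mean = fmean(x), fmean(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / (sxx + alpha)  # alpha shrinks the slope toward zero
    return slope, y_mean - slope * x_mean


def neg_mse(x, y, slope, intercept):
    """Negated mean squared error, so that larger values are better."""
    return -fmean([(yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y)])


def leave_one_group_out(groups):
    """Yield (train_idx, test_idx) pairs, holding out one group at a time."""
    for held_out in sorted(set(groups)):
        test = [i for i, g in enumerate(groups) if g == held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield train, test


def select_alpha(x, y, groups, alphas):
    """Return (best_alpha, mean_scores) from group-wise cross-validation."""
    mean_scores = {}
    for alpha in alphas:
        fold_scores = []
        for train, test in leave_one_group_out(groups):
            slope, intercept = fit_ridge_1d(
                [x[i] for i in train], [y[i] for i in train], alpha
            )
            fold_scores.append(
                neg_mse([x[i] for i in test], [y[i] for i in test], slope, intercept)
            )
        mean_scores[alpha] = fmean(fold_scores)
    best_alpha = max(mean_scores, key=mean_scores.get)
    return best_alpha, mean_scores


# Toy data: samples from the same group never end up on both sides of a split.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9]
groups = ["a", "a", "b", "b", "c", "c"]
alphas = log_grid()
best_alpha, scores = select_alpha(x, y, groups, alphas)
print(best_alpha, scores[best_alpha])
```

Every alpha is refitted once per fold, which is why a large alphas array slows the computation.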

fit(X, y, sample_weight=None)

Fit Ridge regression model with cross-validation.

Parameters

X : array-like of shape (n_samples, n_features)

Training data.

y : array-like of shape (n_samples,) or (n_samples, n_targets)

Target values.

sample_weight : float or array-like of shape (n_samples,), default=None

Individual weights for each sample.

Returns

self : object

Returns self.

get_cv_results()

Return cross-validation results.

Returns

cv_results : dict

Results from cross-validation.

predict(X)

Predict using the Ridge model.

Parameters

X : array-like of shape (n_samples, n_features)

Samples.

Returns

y_pred : array-like of shape (n_samples,) or (n_samples, n_targets)

Returns predicted values.

score(X, y, sample_weight=None)

Return the coefficient of determination R^2 of the prediction.

Parameters

X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_targets)

True values for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns

score : float

R^2 of self.predict(X) with respect to y.
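For reference, the coefficient of determination can be computed by hand as 1 - SS_res / SS_tot. The sketch below is a pure-Python illustration with optional sample weights, not the code behind score; the function name r2_score_sketch is hypothetical, and it assumes the true values are not all identical (otherwise SS_tot is zero).

```python
def r2_score_sketch(y_true, y_pred, sample_weight=None):
    """Weighted coefficient of determination: 1 - SS_res / SS_tot."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    total_weight = sum(sample_weight)
    # Weighted mean of the true values is the baseline prediction.
    y_bar = sum(w * yt for w, yt in zip(sample_weight, y_true)) / total_weight
    ss_res = sum(
        w * (yt - yp) ** 2 for w, yt, yp in zip(sample_weight, y_true, y_pred)
    )
    ss_tot = sum(w * (yt - y_bar) ** 2 for w, yt in zip(sample_weight, y_true))
    return 1.0 - ss_res / ss_tot


print(r2_score_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect fit
print(r2_score_sketch([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # predicting the mean
```

A perfect prediction scores 1, always predicting the mean scores 0, and predictions worse than the mean give negative values.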

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> OptimizedRidgeCV

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). See the scikit-learn User Guide for how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in fit.

Returns

self : object

The updated object.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> OptimizedRidgeCV

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). See the scikit-learn User Guide for how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns

self : object

The updated object.

class spectoprep.PipelineOptimizer(X_train: ndarray, y_train: ndarray, preprocessing_steps: List[str] | None = None, X_test: ndarray | None = None, y_test: ndarray | None = None, cv_method: str = 'group_shuffle_split', n_splits: int = 3, test_size: float = 0.3, n_groups_out: int = 2, random_state: int = 42, groups: ndarray | None = None, max_pipeline_length: int = 5, n_jobs: int = -1, allowed_preprocess_combinations: int | List[int] | Tuple[int, ...] | None = [1, 2], log_level: str = 'INFO')

Bases: object

A class for optimizing machine learning pipelines using Bayesian optimization. It precomputes possible pipeline configurations and then searches over both the pipeline configuration (encoded as an index) and the hyperparameters.
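The idea of precomputing configurations and addressing them by index can be sketched as follows. This illustrates the concept only and is not how PipelineOptimizer is implemented: the step names, the helper names (enumerate_configs, decode_index), and the choice of ordered sequences (permutations) are all assumptions.

```python
from itertools import permutations


def enumerate_configs(steps, allowed_lengths=(1, 2)):
    """List every ordered sequence of distinct steps for each allowed length."""
    configs = []
    for length in allowed_lengths:
        configs.extend(permutations(steps, length))
    return configs


def decode_index(value, configs):
    """Map a continuous search value to a valid configuration.

    Optimizers that work on continuous spaces propose floats, so the value is
    truncated to an integer and clamped to the valid index range.
    """
    idx = min(max(int(value), 0), len(configs) - 1)
    return configs[idx]


steps = ["snv", "detrend", "savgol"]  # hypothetical preprocessing step names
configs = enumerate_configs(steps)
print(len(configs))                # 3 single steps + 6 ordered pairs
print(decode_index(4.7, configs))  # the configuration stored at index 4
```

Encoding the configuration as one numeric index lets a single optimizer explore the discrete pipeline choice alongside continuous hyperparameters.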

bayes_objective(**params) -> float

Objective function for Bayesian optimization.

Args:

**params: Parameters to evaluate

Returns:

float: Negative RMSE or penalty score on error
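An objective of this kind, returning the negative RMSE or a penalty when evaluation fails, might look like the sketch below. It is a hedged illustration rather than the library's code: the penalty value, the evaluate callback, and the names safe_objective and rmse are hypothetical.

```python
import math

PENALTY = -1e6  # hypothetical penalty; the value used by the library is not documented here


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def safe_objective(evaluate, **params):
    """Return -RMSE for the given parameters, or PENALTY if evaluation fails.

    `evaluate` is a hypothetical callback that builds and scores a pipeline
    and returns (y_true, y_pred).
    """
    try:
        score = -rmse(*evaluate(**params))
    except Exception:
        return PENALTY  # a failing configuration must not stop the search
    return score if math.isfinite(score) else PENALTY


def evaluate(scale=1.0):
    """Toy stand-in for fitting a pipeline: predictions are scaled targets."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    y_true = [1.0, 2.0, 3.0]
    return y_true, [scale * v for v in y_true]


print(safe_objective(evaluate, scale=2.0))   # negative RMSE
print(safe_objective(evaluate, scale=-1.0))  # penalty, because evaluate raises
```

Negating the RMSE turns error minimization into the maximization problem that Bayesian optimizers typically solve.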

bayesian_optimize(init_points: int = 10, n_iter: int = 50, acquisition_function: str = 'ei') -> Tuple[Dict, Pipeline]

Run Bayesian optimization to find the best pipeline configuration and hyperparameters.

Args:

init_points: Number of random initial points

n_iter: Number of Bayesian optimization iterations

acquisition_function: Acquisition function for Bayesian optimization

Returns:

Tuple containing:
  • Dict of best parameters

  • Fitted Pipeline with best configuration

export_best_pipeline(file_path: str) -> None

Export the best pipeline configuration and hyperparameters to a file.

Args:

file_path: Path to save the export file

Raises:

AttributeError: If the optimizer has not been run yet

get_all_tested_pipelines() -> List[Dict]

Get details of all tested pipeline configurations.

Returns:

List of dictionaries with pipeline details

get_best_pipeline_predictions(best_pipeline: Pipeline) -> Tuple[ndarray, float, float]

Get predictions using the best pipeline.

Args:

best_pipeline: Fitted pipeline object

Returns:

Tuple containing:
  • Predictions array

  • RMSE score

  • R² score

print_evaluated_pipelines() -> None

Print details for all evaluated pipelines from the Bayesian optimizer.

This method assumes that bayesian_optimize() has been run and that self.optimizer exists.

summarize_optimization() -> Dict

Generate a summary of the optimization results.

Returns:

Dictionary containing optimization summary metrics

class spectoprep.SpectroPrepPlotter

Bases: object

A class for creating high-quality plots for spectroscopy data.

This class provides various plotting functions specifically designed for spectroscopy data and pipeline optimization results.

static plot_feature_importance(wavenumbers: ndarray, coefficients: ndarray, title: str = 'Feature Importance', xlabel: str = 'Wavenumber (cm$^{-1}$)', ylabel: str = 'Coefficient Value', figsize: Tuple[int, int] = (12, 6), color: str = 'purple', highlight_threshold: float | None = None, highlight_color: str = 'red', save_path: str | None = None)

Plot feature importance from model coefficients.

Parameters

wavenumbers : array-like

The x-axis values (wavenumbers).

coefficients : array-like

Model coefficients corresponding to each wavenumber.

title : str, default='Feature Importance'

Plot title.

xlabel : str, default='Wavenumber (cm$^{-1}$)'

X-axis label.

ylabel : str, default='Coefficient Value'

Y-axis label.

figsize : tuple, default=(12, 6)

Figure size.

color : str, default='purple'

Color of the line.

highlight_threshold : float, optional

If provided, highlights coefficients with absolute values above this threshold.

highlight_color : str, default='red'

Color for highlighted coefficients.

save_path : str, optional

If provided, save the figure to this path.

Returns

fig : matplotlib.figure.Figure

The figure object.

ax : matplotlib.axes.Axes

The axes object.
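The effect of highlight_threshold can be previewed without plotting by selecting the coefficients whose absolute value exceeds the threshold. This is a small sketch of that selection rule only; the helper name highlighted_indices and the toy values are hypothetical, and the comparison is assumed to be strict.

```python
def highlighted_indices(coefficients, threshold):
    """Indices of coefficients whose absolute value is above the threshold."""
    return [i for i, c in enumerate(coefficients) if abs(c) > threshold]


wavenumbers = [1000, 1100, 1200, 1300]   # toy x-axis values in cm^-1
coefficients = [0.02, -0.35, 0.10, 0.41]  # toy model coefficients
idx = highlighted_indices(coefficients, threshold=0.3)
print([wavenumbers[i] for i in idx])
```

Using the absolute value means strongly negative coefficients are highlighted as well as strongly positive ones.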

static plot_optimization_progress(optimizer: PipelineOptimizer, figsize: Tuple[int, int] = (12, 6), title: str = 'Optimization Progress', save_path: str | None = None)

Plot optimization progress over iterations.

Parameters

optimizer : PipelineOptimizer

The fitted pipeline optimizer.

figsize : tuple, default=(12, 6)

Figure size.

title : str, default='Optimization Progress'

Plot title.

save_path : str, optional

If provided, save the figure to this path.

Returns

fig : matplotlib.figure.Figure

The figure object.

ax : matplotlib.axes.Axes

The axes object.

static plot_optimization_results(optimizer: PipelineOptimizer, top_n: int = 5, figsize: Tuple[int, int] = (12, 8), title: str = 'Pipeline Optimization Results', save_path: str | None = None)

Plot optimization results from PipelineOptimizer.

Parameters

optimizer : PipelineOptimizer

The fitted pipeline optimizer.

top_n : int, default=5

Number of top pipelines to display.

figsize : tuple, default=(12, 8)

Figure size.

title : str, default='Pipeline Optimization Results'

Plot title.

save_path : str, optional

If provided, save the figure to this path.

Returns

fig : matplotlib.figure.Figure

The figure object.

static plot_prediction_scatter(y_true: ndarray, y_pred: ndarray, title: str = 'Prediction Performance', xlabel: str = 'Measured', ylabel: str = 'Predicted', figsize: Tuple[int, int] = (10, 8), alpha: float = 0.7, color: str = 'blue', add_metrics: bool = True, save_path: str | None = None)

Create a scatter plot of predicted vs. true values.

Parameters

y_true : array-like

True target values.

y_pred : array-like

Predicted target values.

title : str, default='Prediction Performance'

Plot title.

xlabel : str, default='Measured'

X-axis label.

ylabel : str, default='Predicted'

Y-axis label.

figsize : tuple, default=(10, 8)

Figure size.

alpha : float, default=0.7

Transparency of the points.

color : str, default='blue'

Color of the scatter points.

add_metrics : bool, default=True

Whether to add RMSE and R² metrics to the plot.

save_path : str, optional

If provided, save the figure to this path.

Returns

fig : matplotlib.figure.Figure

The figure object.

ax : matplotlib.axes.Axes

The axes object.

static plot_preprocessing_comparison(wavenumbers: ndarray, original_spectra: ndarray, processed_spectra: Dict[str, ndarray], sample_indices: List[int] | None = None, figsize: Tuple[int, int] = (15, 10), title: str = 'Preprocessing Comparison', color_map: str = 'tab10', save_path: str | None = None)

Plot comparison of original and processed spectra.

Parameters

wavenumbers : array-like

The x-axis values (wavenumbers).

original_spectra : array-like

The original spectra data of shape (n_samples, n_features).

processed_spectra : dict

Dictionary mapping preprocessing method names to processed spectra.

sample_indices : list of int, optional

Indices of samples to plot. If None, all samples are plotted.

figsize : tuple, default=(15, 10)

Figure size.

title : str, default='Preprocessing Comparison'

Main title for the figure.

color_map : str, default='tab10'

Colormap for differentiating samples.

save_path : str, optional

If provided, save the figure to this path.

Returns

fig : matplotlib.figure.Figure

The figure object.

static plot_spectra(wavenumbers: ndarray, spectra: ndarray, labels: List[str] | None = None, title: str = 'Spectral Data', xlabel: str = 'Wavenumber (cm$^{-1}$)', ylabel: str = 'Absorbance', alpha: float = 0.7, figsize: Tuple[int, int] = (12, 6), color_map: str = 'viridis', legend_loc: str = 'best', grid: bool = True, save_path: str | None = None)

Plot spectral data.

Parameters

wavenumbers : array-like

The x-axis values (wavenumbers).

spectra : array-like

The spectra data of shape (n_samples, n_features).

labels : list of str, optional

Labels for each spectrum. If None, spectra are numbered.

title : str, default='Spectral Data'

Plot title.

xlabel : str, default='Wavenumber (cm$^{-1}$)'

X-axis label.

ylabel : str, default='Absorbance'

Y-axis label.

alpha : float, default=0.7

Transparency of the lines.

figsize : tuple, default=(12, 6)

Figure size.

color_map : str, default='viridis'

Colormap for the spectra.

legend_loc : str, default='best'

Location of the legend.

grid : bool, default=True

Whether to show grid.

save_path : str, optional

If provided, save the figure to this path.

Returns

fig : matplotlib.figure.Figure

The figure object.

ax : matplotlib.axes.Axes

The axes object.

static set_style(style='whitegrid', context='paper', font_scale=1.2)

Set the visual style for the plots.

Parameters

style : str, default='whitegrid'

The seaborn style.

context : str, default='paper'

The seaborn context.

font_scale : float, default=1.2

The font scale.

Modules