Version 1.5#
For a short description of the main highlights of the release, please refer to Release Highlights for scikit-learn 1.5.
Legend for changelogs
Major Feature something big that you couldn’t do before.
Feature something that you couldn’t do before.
Efficiency an existing feature now may not require as much computation or memory.
Enhancement a miscellaneous minor improvement.
Fix something that previously didn’t work as documented – or according to reasonable expectations – should now work.
API Change you will need to change your code to have the same effect in the future; or a feature will be removed in the future.
Version 1.5.0#
May 2024
Security#
Fix feature_extraction.text.CountVectorizer and feature_extraction.text.TfidfVectorizer no longer store discarded tokens from the training set in their stop_words_ attribute. This attribute would hold too frequent (above max_df) but also too rare tokens (below min_df). This fixes a potential security issue (data leak) if the discarded rare tokens hold sensitive information from the training set without the model developer's knowledge.
Note: users of those classes are encouraged to either retrain their pipelines with the new scikit-learn version or to manually clear the stop_words_ attribute from previously trained instances of those transformers. This attribute was designed only for model inspection purposes and has no impact on the behavior of the transformers. #28823 by Olivier Grisel.
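For users who prefer not to retrain, the following is a minimal sketch of manually clearing the attribute from a persisted vectorizer; the joblib persistence and the file name "vectorizer.joblib" are assumptions for illustration, not part of the fix.

```python
# Minimal sketch, assuming a joblib-persisted CountVectorizer or TfidfVectorizer.
import joblib

vectorizer = joblib.load("vectorizer.joblib")

# stop_words_ is only used for model inspection, so clearing it does not
# change how transform() behaves.
if hasattr(vectorizer, "stop_words_"):
    vectorizer.stop_words_ = None

joblib.dump(vectorizer, "vectorizer.joblib")
```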
Changed models#
Efficiency The subsampling in preprocessing.QuantileTransformer is now more efficient for dense arrays but the fitted quantiles and the results of transform may be slightly different than before (keeping the same statistical properties). #27344 by Xuefeng Xu.
Enhancement decomposition.PCA, decomposition.SparsePCA and decomposition.TruncatedSVD now set the sign of the components_ attribute based on the component values instead of using the transformed data as reference. This change is needed to be able to offer consistent component signs across all PCA solvers, including the new svd_solver="covariance_eigh" option introduced in this release.
Changes impacting many modules#
Fix Raise ValueError with an informative error message when passing 1D sparse arrays to methods that expect 2D sparse inputs. #28988 by Olivier Grisel.
API Change The name of the input of the inverse_transform method of estimators has been standardized to X. As a consequence, Xt is deprecated and will be removed in version 1.7 in the following estimators: cluster.FeatureAgglomeration, decomposition.MiniBatchNMF, decomposition.NMF, model_selection.GridSearchCV, model_selection.RandomizedSearchCV, pipeline.Pipeline and preprocessing.KBinsDiscretizer. #28756 by Will Dean.
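A minimal sketch of the renamed keyword, using preprocessing.KBinsDiscretizer as one of the affected estimators (the data below is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [1.0], [5.0], [10.0]])
kbd = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform").fit(X)
X_binned = kbd.transform(X)

# Preferred spelling from 1.5 on: the argument is named X.
X_back = kbd.inverse_transform(X_binned)

# Passing the old keyword still works in 1.5.x but warns and is removed in 1.7:
# kbd.inverse_transform(Xt=X_binned)
```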
Support for Array API#
Additional estimators and functions have been updated to include support for all Array API compliant inputs.
See Array API support (experimental) for more details.
Functions:
sklearn.metrics.r2_score now supports Array API compliant inputs (see the sketch at the end of this section). #27904 by Eric Lindgren, Franck Charras, Olivier Grisel and Tim Head.
Classes:
linear_model.Ridge now supports the Array API for the svd solver. See Array API support (experimental) for more details. #27800 by Franck Charras, Olivier Grisel and Tim Head.
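As a sketch of how the new support is used with r2_score, assuming PyTorch and the optional array-api-compat dependency are installed (any other Array API compliant library works the same way):

```python
import sklearn
import torch
from sklearn.metrics import r2_score

y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])
y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])

# Array API dispatch is opt-in; the result is returned in the input's
# array namespace (here, a torch tensor).
with sklearn.config_context(array_api_dispatch=True):
    score = r2_score(y_true, y_pred)
print(score)
```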
Support for building with Meson#
From scikit-learn 1.5 onwards, Meson is the main supported way to build scikit-learn, see Building from source for more details.
Unless we discover a major blocker, setuptools support will be dropped in scikit-learn 1.6. The 1.5.x releases will support building scikit-learn with setuptools.
Meson support for building scikit-learn was added in #28040 by Loïc Estève.
Metadata Routing#
The following models now support metadata routing in one or more of their methods. Refer to the Metadata Routing User Guide for more details.
Feature impute.IterativeImputer now supports metadata routing in its fit method. #28187 by Stefanie Senger.
Feature ensemble.BaggingClassifier and ensemble.BaggingRegressor now support metadata routing. The fit methods now accept **fit_params which are passed to the underlying estimators via their fit methods (see the sketch at the end of this section). #28432 by Adam Li and Benjamin Bossan.
Feature linear_model.RidgeCV and linear_model.RidgeClassifierCV now support metadata routing in their fit method and route metadata to the underlying model_selection.GridSearchCV object or the underlying scorer. #27560 by Omar Salman.
Feature GraphicalLassoCV now supports metadata routing in its fit method and routes metadata to the CV splitter. #27566 by Omar Salman.
Feature linear_model.RANSACRegressor now supports metadata routing in its fit, score and predict methods and routes metadata to its underlying estimator's fit, score and predict methods. #28261 by Stefanie Senger.
Feature ensemble.VotingClassifier and ensemble.VotingRegressor now support metadata routing and pass **fit_params to the underlying estimators via their fit methods. #27584 by Stefanie Senger.
Feature pipeline.FeatureUnion now supports metadata routing in its fit and fit_transform methods and routes metadata to the underlying transformers' fit and fit_transform. #28205 by Stefanie Senger.
Fix Fix an issue when resolving default routing requests set via class attributes. #28435 by Adrin Jalali.
Fix Fix an issue when set_{method}_request methods are used as unbound methods, which can happen if one tries to decorate them. #28651 by Adrin Jalali.
Fix Prevent a RecursionError when estimators with the default scoring param (None) route metadata. #28712 by Stefanie Senger.
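A minimal sketch of routing a fit parameter through ensemble.BaggingClassifier, assuming the behavior described above; the data, the choice of sample_weight as the routed metadata, and the inner LogisticRegression are illustrative assumptions. Metadata routing is opt-in and must be enabled explicitly.

```python
import numpy as np
import sklearn
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

sklearn.set_config(enable_metadata_routing=True)

rng = np.random.RandomState(0)
X = rng.normal(size=(20, 2))
y = (X[:, 0] > 0).astype(int)
sample_weight = np.ones(20)

# The inner estimator must explicitly request the metadata it wants routed.
inner = LogisticRegression().set_fit_request(sample_weight=True)
clf = BaggingClassifier(estimator=inner, n_estimators=3, random_state=0)

# Fit params are forwarded to the underlying estimators' fit methods.
clf.fit(X, y, sample_weight=sample_weight)

sklearn.set_config(enable_metadata_routing=False)
```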
Changelog#
sklearn.calibration#
Fix Fixed a regression in calibration.CalibratedClassifierCV where an error was wrongly raised with string targets. #28843 by Jérémie du Boisberranger.
sklearn.cluster#
Fix The cluster.MeanShift class now properly converges for constant data. #28951 by Akihiro Kuno.
Fix Create copy of precomputed sparse matrix within the fit method of OPTICS to avoid in-place modification of the sparse matrix. #28491 by Thanh Lam Dang.
Fix cluster.HDBSCAN now supports all metrics supported by sklearn.metrics.pairwise_distances when algorithm="brute" or "auto". #28664 by Manideep Yenugula.
sklearn.compose#
Feature A fitted compose.ColumnTransformer now implements __getitem__, which returns the fitted transformers by name (see the sketch at the end of this section). #27990 by Thomas Fan.
Enhancement compose.TransformedTargetRegressor now raises an error in fit if only inverse_func is provided without func (which would default to identity) being explicitly set as well. #28483 by Stefanie Senger.
Enhancement compose.ColumnTransformer can now expose the "remainder" columns in the fitted transformers_ attribute as column names or boolean masks, rather than column indices. #27657 by Jérôme Dockès.
Fix Fixed a bug in compose.ColumnTransformer with n_jobs > 1, where the intermediate selected columns were passed to the transformers as read-only arrays. #28822 by Jérémie du Boisberranger.
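A minimal sketch of the new __getitem__ access on a fitted ColumnTransformer (the data and transformer names below are made up for illustration):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = np.array([[1.0, 0], [2.0, 1], [3.0, 0]])
ct = ColumnTransformer(
    [("num", StandardScaler(), [0]), ("cat", OneHotEncoder(), [1])]
).fit(X)

# Indexing a fitted ColumnTransformer by name returns the fitted transformer.
scaler = ct["num"]
print(scaler.mean_)  # the StandardScaler fitted on column 0
```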
sklearn.cross_decomposition#
Fix The coef_ fitted attribute of cross_decomposition.PLSRegression now takes into account both the scale of X and Y when scale=True. Note that the previous predicted values were not affected by this bug. #28612 by Guillaume Lemaitre.
API Change Deprecates Y in favor of y in the methods fit, transform and inverse_transform of cross_decomposition.PLSRegression, cross_decomposition.PLSCanonical, cross_decomposition.CCA, and cross_decomposition.PLSSVD. Y will be removed in version 1.7. #28604 by David Leon.
sklearn.datasets#
Enhancement Adds optional arguments n_retries and delay to functions datasets.fetch_20newsgroups, datasets.fetch_20newsgroups_vectorized, datasets.fetch_california_housing, datasets.fetch_covtype, datasets.fetch_kddcup99, datasets.fetch_lfw_pairs, datasets.fetch_lfw_people, datasets.fetch_olivetti_faces, datasets.fetch_rcv1, and datasets.fetch_species_distributions. By default, the functions will retry up to 3 times in case of network failures. #28160 by Zhehao Liu and Filip Karlo Došilović.
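A minimal sketch of the new retry arguments, assuming network access is available (the specific retry count and delay are arbitrary choices for illustration):

```python
from sklearn.datasets import fetch_california_housing

# Retry a failed download up to 5 times, waiting 2 seconds between attempts.
housing = fetch_california_housing(n_retries=5, delay=2.0)
print(housing.data.shape)
```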
sklearn.decomposition#
Efficiency decomposition.PCA with svd_solver="full" now assigns a contiguous components_ attribute instead of a non-contiguous slice of the singular vectors. When n_components << n_features, this can save some memory and, more importantly, help speed up subsequent calls to the transform method by more than an order of magnitude by leveraging cache locality of BLAS GEMM on contiguous arrays. #27491 by Olivier Grisel.
Enhancement PCA now automatically selects the ARPACK solver for sparse inputs when svd_solver="auto" instead of raising an error. #28498 by Thanh Lam Dang.
Enhancement decomposition.PCA now supports a new solver option named svd_solver="covariance_eigh" which offers an order of magnitude speed-up and reduced memory usage for datasets with a large number of data points and a small number of features (say, n_samples >> 1000 > n_features). The svd_solver="auto" option has been updated to use the new solver automatically for such datasets. This solver also accepts sparse input data (see the sketch at the end of this section). #27491 by Olivier Grisel.
Fix decomposition.PCA fitted with svd_solver="arpack", whiten=True and a value for n_components that is larger than the rank of the training set no longer returns infinite values when transforming hold-out data. #27491 by Olivier Grisel.
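A minimal sketch of the new solver on tall-and-narrow data (the array shape below is made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(50_000, 20))  # many samples, few features

# Request the new solver explicitly; svd_solver="auto" now also selects it
# for data shaped like this.
pca = PCA(n_components=5, svd_solver="covariance_eigh").fit(X)
print(pca.explained_variance_ratio_)
```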
sklearn.dummy#
Enhancement dummy.DummyClassifier and dummy.DummyRegressor now have the n_features_in_ and feature_names_in_ attributes after fit. #27937 by Marco vd Boom.
sklearn.ensemble#
Efficiency Improves runtime of predict of ensemble.HistGradientBoostingClassifier by avoiding a call to predict_proba. #27844 by Christian Lorentzen.
Efficiency ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor are now a tiny bit faster by pre-sorting the data before finding the thresholds for binning. #28102 by Christian Lorentzen.
Fix Fixes a bug in ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor when monotonic_cst is specified for non-categorical features. #28925 by Xiao Yuan.
sklearn.feature_extraction#
Efficiency feature_extraction.text.TfidfTransformer is now faster and more memory-efficient by using a NumPy vector instead of a sparse matrix for storing the inverse document frequency. #18843 by Paolo Montesel.
Enhancement feature_extraction.text.TfidfTransformer now preserves the data type of the input matrix if it is np.float64 or np.float32. #28136 by Guillaume Lemaitre.
sklearn.feature_selection#
Enhancement feature_selection.mutual_info_regression and feature_selection.mutual_info_classif now support the n_jobs parameter. #28085 by Neto Menoci and Florin Andrei.
Enhancement The cv_results_ attribute of feature_selection.RFECV has a new key, n_features, containing an array with the number of features selected at each step. #28670 by Miguel Silva.
sklearn.impute#
Enhancement impute.SimpleImputer now supports custom strategies by passing a function in place of a strategy name. #28053 by Mark Elliot.
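A minimal sketch of a callable strategy, assuming the callable receives the observed (non-missing) values of each column and returns the fill value for that column; the percentile choice is arbitrary for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# Fill each column with the 75th percentile of its observed values.
imputer = SimpleImputer(strategy=lambda column: np.percentile(column, 75))
print(imputer.fit_transform(X))
```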
sklearn.inspection#
Fix inspection.DecisionBoundaryDisplay.from_estimator no longer warns about missing feature names when provided a polars.DataFrame. #28718 by Patrick Wang.
sklearn.linear_model#
Enhancement Solver "newton-cg" in linear_model.LogisticRegression and linear_model.LogisticRegressionCV now emits information when verbose is set to positive values. #27526 by Christian Lorentzen.
Fix linear_model.ElasticNet, linear_model.ElasticNetCV, linear_model.Lasso and linear_model.LassoCV now explicitly don't accept large sparse data formats. #27576 by Stefanie Senger.
Fix linear_model.RidgeCV and RidgeClassifierCV correctly pass sample_weight to the underlying scorer when cv is None. #27560 by Omar Salman.
Fix The n_nonzero_coefs_ attribute in linear_model.OrthogonalMatchingPursuit will now always be None when tol is set, as n_nonzero_coefs is ignored in this case. #28557 by Lucy Liu.
API Change linear_model.RidgeCV and linear_model.RidgeClassifierCV will now allow alpha=0 when cv != None, which is consistent with linear_model.Ridge and linear_model.RidgeClassifier. #28425 by Lucy Liu.
API Change Passing average=0 to disable averaging is deprecated in linear_model.PassiveAggressiveClassifier, linear_model.PassiveAggressiveRegressor, linear_model.SGDClassifier, linear_model.SGDRegressor and linear_model.SGDOneClassSVM. Pass average=False instead. #28582 by Jérémie du Boisberranger.
API Change Parameter multi_class was deprecated in linear_model.LogisticRegression and linear_model.LogisticRegressionCV. multi_class will be removed in 1.7, and internally, for 3 and more classes, it will always use multinomial. If you still want to use the one-vs-rest scheme, you can use OneVsRestClassifier(LogisticRegression(..)) (see the sketch at the end of this section). #28703 by Christian Lorentzen.
API Change store_cv_values and cv_values_ are deprecated in favor of store_cv_results and cv_results_ in linear_model.RidgeCV and linear_model.RidgeClassifierCV. #28915 by Lucy Liu.
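A minimal sketch of the one-vs-rest replacement for the deprecated multi_class parameter (the iris dataset and max_iter value are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# Instead of the deprecated LogisticRegression(multi_class="ovr"):
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:3]))
```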
sklearn.manifold#
API Change Deprecates n_iter in favor of max_iter in manifold.TSNE. n_iter will be removed in version 1.7. This makes manifold.TSNE consistent with the rest of the estimators. #28471 by Lucy Liu.
sklearn.metrics#
Feature metrics.pairwise_distances now supports computing pairwise distances for non-numeric arrays as well. This is supported through custom metrics only. #27456 by Venkatachalam N, Kshitij Mathur and Julian Libiseller-Egger.
Feature sklearn.metrics.check_scoring now returns a multi-metric scorer when scoring is a dict, set, tuple, or list. #28360 by Thomas Fan.
Feature metrics.d2_log_loss_score has been added which calculates the D^2 score for the log loss (see the sketch at the end of this section). #28351 by Omar Salman.
Efficiency Improve efficiency of functions brier_score_loss, calibration_curve, det_curve, precision_recall_curve, roc_curve when the pos_label argument is specified. Also improve efficiency of methods from_estimator and from_predictions in RocCurveDisplay, PrecisionRecallDisplay, DetCurveDisplay, CalibrationDisplay. #28051 by Pierre de Fréminville.
Fix metrics.classification_report now shows only accuracy and not micro-average when the input is a subset of labels. #28399 by Vineet Joshi.
Fix Fix OpenBLAS 0.3.26 dead-lock on Windows in pairwise distances computation. This is likely to affect neighbor-based algorithms. #28692 by Loïc Estève.
API Change metrics.precision_recall_curve deprecated the keyword argument probas_pred in favor of y_score. probas_pred will be removed in version 1.7. #28092 by Adam Li.
API Change metrics.brier_score_loss deprecated the keyword argument y_prob in favor of y_proba. y_prob will be removed in version 1.7. #28092 by Adam Li.
API Change For classifiers and classification metrics, labels encoded as bytes are deprecated and will raise an error in v1.7. #18555 by Kaushik Amar Das.
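A minimal sketch of the new d2_log_loss_score metric (the labels and predicted probabilities below are made up for illustration):

```python
from sklearn.metrics import d2_log_loss_score

y_true = [0, 1, 1, 0]
y_proba = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.3, 0.7],
    [0.8, 0.2],
]

# 1.0 would be a perfect model; 0.0 matches a constant model that predicts
# the empirical class frequencies.
print(d2_log_loss_score(y_true, y_proba))
```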
sklearn.mixture#
Fix The converged_ attribute of mixture.GaussianMixture and mixture.BayesianGaussianMixture now reflects the convergence status of the best fit, whereas it was previously True if any of the fits converged. #26837 by Krsto Proroković.
sklearn.model_selection#
Major Feature model_selection.TunedThresholdClassifierCV finds the decision threshold of a binary classifier that maximizes a classification metric through cross-validation (see the sketch at the end of this section). model_selection.FixedThresholdClassifier is an alternative when one wants to use a fixed decision threshold without any tuning scheme. #26120 by Guillaume Lemaitre.
Enhancement CV splitters that ignore the group parameter now raise a warning when groups are passed in to split. #28210 by Thomas Fan.
Enhancement The HTML diagram representation of GridSearchCV, RandomizedSearchCV, HalvingGridSearchCV, and HalvingRandomSearchCV will show the best estimator when refit=True. #28722 by Yao Xiao and Thomas Fan.
Fix The cv_results_ attribute of model_selection.GridSearchCV now returns masked arrays of the appropriate NumPy dtype, as opposed to always returning dtype object. #28352 by Marco Gorelli.
Fix model_selection.train_test_split works with Array API inputs. Previously indexing was not handled correctly, leading to exceptions when using strict implementations of the Array API like CuPy. #28407 by Tim Head.
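A minimal sketch of the new threshold tuning; the synthetic imbalanced dataset, the LogisticRegression base estimator and the balanced-accuracy objective are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Tune the decision threshold by cross-validation to maximize balanced accuracy.
tuned = TunedThresholdClassifierCV(
    LogisticRegression(max_iter=1000), scoring="balanced_accuracy", cv=5
).fit(X, y)
print(tuned.best_threshold_)
```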
sklearn.multioutput#
Enhancement chain_method parameter added to multioutput.ClassifierChain. #27700 by Lucy Liu.
sklearn.neighbors#
Fix Fixes neighbors.NeighborhoodComponentsAnalysis such that get_feature_names_out returns the correct number of feature names. #28306 by Brendan Lu.
sklearn.pipeline#
Feature pipeline.FeatureUnion can now use the verbose_feature_names_out attribute. If True, get_feature_names_out will prefix all feature names with the name of the transformer that generated that feature. If False, get_feature_names_out will not prefix any feature names and will error if feature names are not unique. #25991 by Jiawei Zhang.
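A minimal sketch of the unprefixed behavior; the transformers are chosen so that their output names stay unique (the data and transformer names below are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]])

# With verbose_feature_names_out=False the transformer-name prefix is dropped;
# the resulting names must then be unique across transformers.
union = FeatureUnion(
    [("scale", StandardScaler()), ("pca", PCA(n_components=1))],
    verbose_feature_names_out=False,
).fit(X)
print(union.get_feature_names_out())  # e.g. ['x0' 'x1' 'pca0']
```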
sklearn.preprocessing#
Enhancement preprocessing.QuantileTransformer and preprocessing.quantile_transform now support disabling subsampling explicitly. #27636 by Ralph Urlus.
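A minimal sketch, assuming subsample=None is the way to disable subsampling (the data shape and n_quantiles value are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(20_000, 1))

# Use all samples when estimating the quantiles instead of the default
# 10_000-sample subsample.
qt = QuantileTransformer(n_quantiles=1000, subsample=None).fit(X)
print(qt.quantiles_.shape)
```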
sklearn.tree#
Enhancement Plotting trees in matplotlib via tree.plot_tree now shows a "True/False" label to indicate which branch samples follow given the split condition. #28552 by Adam Li.
sklearn.utils#
Fix _safe_indexing now works correctly for polars DataFrame when axis=0 and supports indexing polars Series. #28521 by Yao Xiao.
API Change utils.IS_PYPY is deprecated and will be removed in version 1.7. #28768 by Jérémie du Boisberranger.
API Change utils.tosequence is deprecated and will be removed in version 1.7. #28763 by Jérémie du Boisberranger.
API Change utils.parallel_backend and utils.register_parallel_backend are deprecated and will be removed in version 1.7. Use joblib.parallel_backend and joblib.register_parallel_backend instead. #28847 by Jérémie du Boisberranger.
API Change Raise an informative warning message in type_of_target when the target is represented as bytes. For classifiers and classification metrics, labels encoded as bytes are deprecated and will raise an error in v1.7. #18555 by Kaushik Amar Das.
API Change utils.estimator_checks.check_estimator_sparse_data was split into two functions: utils.estimator_checks.check_estimator_sparse_matrix and utils.estimator_checks.check_estimator_sparse_array. #27576 by Stefanie Senger.
Code and documentation contributors
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 1.4, including:
101AlexMartin, Abdulaziz Aloqeely, Adam J. Stewart, Adam Li, Adarsh Wase, Adrin Jalali, Advik Sinha, Akash Srivastava, Akihiro Kuno, Alan Guedes, Alexis IMBERT, Ana Paula Gomes, Anderson Nelson, Andrei Dzis, Arnaud Capitaine, Arturo Amor, Aswathavicky, Bharat Raghunathan, Brendan Lu, Bruno, Cemlyn, Christian Lorentzen, Christian Veenhuis, Cindy Liang, Claudio Salvatore Arcidiacono, Connor Boyle, Conrad Stevens, crispinlogan, davidleon123, DerWeh, Dipan Banik, Duarte São José, DUONG, Eddie Bergman, Edoardo Abati, Egehan Gunduz, Emad Izadifar, Erich Schubert, Filip Karlo Došilović, Franck Charras, Gael Varoquaux, Gönül Aycı, Guillaume Lemaitre, Gyeongjae Choi, Harmanan Kohli, Hong Xiang Yue, Ian Faust, itsaphel, Ivan Wiryadi, Jack Bowyer, Javier Marin Tur, Jérémie du Boisberranger, Jérôme Dockès, Jiawei Zhang, Joel Nothman, Johanna Bayer, John Cant, John Hopfensperger, jpcars, jpienaar-tuks, Julian Libiseller-Egger, Julien Jerphanion, KanchiMoe, Kaushik Amar Das, keyber, Koustav Ghosh, kraktus, Krsto Proroković, ldwy4, LeoGrin, lihaitao, Linus Sommer, Loic Esteve, Lucy Liu, Lukas Geiger, manasimj, Manuel Labbé, Manuel Morales, Marco Edward Gorelli, Maren Westermann, Marija Vlajic, Mark Elliot, Mateusz Sokół, Mavs, Michael Higgins, Michael Mayer, miguelcsilva, Miki Watanabe, Mohammed Hamdy, myenugula, Nathan Goldbaum, Naziya Mahimkar, Neto, Olivier Grisel, Omar Salman, Patrick Wang, Pierre de Fréminville, Priyash Shah, Puneeth K, Rahil Parikh, raisadz, Raj Pulapakura, Ralf Gommers, Ralph Urlus, Randolf Scholz, Reshama Shaikh, Richard Barnes, Rodrigo Romero, Saad Mahmood, Salim Dohri, Sandip Dutta, SarahRemus, scikit-learn-bot, Shaharyar Choudhry, Shubham, sperret6, Stefanie Senger, Suha Siddiqui, Thanh Lam DANG, thebabush, Thomas J. Fan, Thomas Lazarus, Thomas Li, Tialo, Tim Head, Tuhin Sharma, VarunChaduvula, Vineet Joshi, virchan, Waël Boukhobza, Weyb, Will Dean, Xavier Beltran, Xiao Yuan, Xuefeng Xu, Yao Xiao