Changepoint detection#
The sdt.changepoint module provides algorithms for changepoint detection, i.e. for finding changepoints in a time series.
There are several algorithms available:
- PELT: a fast offline detection algorithm [Kill2012]. See the PELT section below for details.
- Offline Bayesian changepoint detection [Fear2006]. See the appropriate section for further information.
- Online Bayesian changepoint detection [Adam2007]. Details can be found below.
The segment_stats() function allows for calculating statistics for the segments identified via changepoint detection; for instance, the mean value, median, etc. can be computed per segment.
Further, there is plot_changepoints() for plotting time series together with the detected changepoints.
Examples
Create some data:
>>> numpy.random.seed(123)
>>> # Make some data with a changepoint at t = 10
>>> data = numpy.concatenate([numpy.random.normal(1.5, 0.1, 10),
... numpy.random.normal(0, 0.1, 10)])
Use PELT for changepoint detection:
>>> det = Pelt(cost="l2", min_size=1, jump=1)
>>> det.find_changepoints(data, 1)
array([10])
Simple Bayesian offline changepoint detection:
>>> det = BayesOffline("const", "gauss")
>>> det.find_changepoints(data, prob_threshold=0.4)
array([10])
Online changepoint detection can be used on data as it arrives.
>>> det = BayesOnline()
>>> while True:
...     x = wait_for_data()  # some routine that returns data once it arrives
...     det.update(x)
...     prob = det.get_probabilities(3)  # look three data points into past
...     if len(prob) >= 1 and np.any(prob[1:] > 0.8):
...         # There is an 80% chance that there was a changepoint
...         break
Of course, it can also be used on a whole dataset, similarly to offline detection.
>>> det = changepoint.BayesOnline()
>>> det.find_changepoints(data, 3, 0.5)
array([10])
All algorithms also work with multivariate data. In that case, data should be a 2D array with one dataset per column.
>>> # changepoint at t = 5
>>> data2 = numpy.concatenate([numpy.random.normal(0, 0.1, 5),
...                            numpy.random.normal(2.5, 0.1, 15)])
>>> # multivariate data, one set per column
>>> data_m = numpy.array([data, data2]).T
PELT example:
>>> det = changepoint.Pelt(cost="l2", min_size=1, jump=1)
>>> det.find_changepoints(data_m, 1)
array([ 5, 10])
When using BayesOffline, it is recommended to choose either the “ifm” or the “full_cov” model for multivariate data.
PELT#
Detector class#
- class sdt.changepoint.Pelt(cost='l2', min_size=2, jump=1, cost_params={}, engine='numba')[source]#
PELT changepoint detection
Implementation of the PELT algorithm [Kill2012]. It is compatible with the implementation from ruptures.
Examples
>>> det = Pelt(cost="l2", min_size=1, jump=1)
>>> # Make some data with a changepoint at t = 10
>>> data = np.concatenate([np.ones(10), np.zeros(10)])
>>> det.find_changepoints(data, 1)
array([10])
- Parameters:
cost (cost class or cost class instance or str, optional) – If “l1”, use CostL1; if “l2”, use CostL2. A cost class type or instance can also be passed directly. Defaults to “l2”.
min_size (int, optional) – Minimum length of segments between change points. Defaults to 2.
jump (int, optional) – Consider only every jump-th data point to speed up calculation. Defaults to 1.
cost_params (dict, optional) – Parameters to pass to the cost class’s __init__. Only relevant if cost is not a class instance. Defaults to {}.
engine ({"python", "numba"}, optional) – If “numba”, use the numba-accelerated implementation. Defaults to “numba”.
- find_changepoints(data, penalty, max_exp_cp=10)[source]#
Do the changepoint detection on data
- Parameters:
data (numpy.ndarray, shape(n, m)) – m datasets of n data points
penalty (float) – Penalty of creating a new changepoint
max_exp_cp (int, optional) – Expected maximum number of changepoints. Memory for at least this many changepoints will be pre-allocated. If there are more, a larger array will be allocated, but this incurs some overhead. Defaults to 10.
- Returns:
Array of changepoints
- Return type:
numpy.ndarray, shape(n)
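The penalty parameter controls the trade-off that PELT optimizes. As a rough, hedged sketch (an illustration of the objective, not the library's implementation), the algorithm minimizes the summed segment costs plus the penalty times the number of changepoints:

```python
import numpy as np

# Hedged sketch of the objective PELT minimizes, not the library's internals:
# total segment cost (here the L2 cost, var(y) * len(y)) plus `penalty` for
# each changepoint introduced.
def l2_cost(y):
    return np.var(y) * len(y)

def objective(data, changepoints, penalty):
    bounds = [0, *changepoints, len(data)]
    segments = [data[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
    return sum(l2_cost(s) for s in segments) + penalty * len(changepoints)

data = np.concatenate([np.ones(10), np.zeros(10)])
# Splitting at the true changepoint (zero within-segment variance plus
# penalty 1) beats not splitting (var = 0.25 over 20 points gives cost 5):
objective(data, [10], 1.0)  # 1.0
objective(data, [], 1.0)    # 5.0
```

A larger penalty suppresses changepoints; a smaller one admits more of them.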
Cost classes#
- class sdt.changepoint.CostL1[source]#
L1 norm cost
The cost is \(\sum_i |y_i - \operatorname{median}(y)|\).
- min_size#
Minimum size of a segment that works with this cost function
- Type:
int
- class sdt.changepoint.CostL2[source]#
L2 norm cost
The cost is \(\operatorname{var}(y) Δt\), where \(Δt\) is the duration of the segment.
- min_size#
Minimum size of a segment that works with this cost function
- Type:
int
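Both cost formulas can be evaluated directly with NumPy. The following sketch (an illustration of the formulas above, not the library's implementation) computes the two costs for a candidate segment:

```python
import numpy as np

# Hedged sketch of the cost formulas above, not the library's implementation.
def cost_l1(y):
    # sum_i |y_i - median(y)|
    return np.sum(np.abs(y - np.median(y)))

def cost_l2(y):
    # var(y) * Δt, where Δt is the segment length
    return np.var(y) * len(y)

segment = np.array([1.0, 1.0, 1.0, 5.0])
cost_l1(segment)  # 4.0: only the outlier contributes, with |5 - 1|
cost_l2(segment)  # 12.0: var = 3.0, times segment length 4
```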
There are also numba-jitclass-ed versions of the cost classes, named CostL1Numba and CostL2Numba.
Offline Bayesian changepoint detection#
Detector class#
- class sdt.changepoint.BayesOffline(prior='const', obs_likelihood='gauss', prior_params={}, obs_likelihood_params={}, numba_logsumexp=True, engine='numba')[source]#
Bayesian offline changepoint detector
This is an implementation of [Fear2006] based on the one from the bayesian_changepoint_detection python package.
- Parameters:
prior ({"const", "geometric", "neg_binomial"} or prior class, optional) – Prior probability. This can either be a string describing the prior or a type or instance of a class implementing the prior, such as ConstPrior, GeometricPrior, or NegBinomialPrior. If a string or a type is passed, a class instance will be created, passing prior_params to __init__. If “const”, use ConstPrior; if “geometric”, use GeometricPrior; if “neg_binomial”, use NegBinomialPrior. Defaults to “const”.
obs_likelihood ({"gauss", "ifm", "full_cov"} or likelihood class, optional) – Observation likelihood. This can either be a string describing the likelihood or a type or instance of a class implementing the likelihood, such as GaussianObsLikelihood, IfmObsLikelihood, or FullCovObsLikelihood. If a string or a type is passed, a class instance will be created, passing obs_likelihood_params to __init__. If “gauss”, use GaussianObsLikelihood; if “ifm”, use IfmObsLikelihood; if “full_cov”, use FullCovObsLikelihood. Defaults to “gauss”. For multivariate data, “ifm” or “full_cov” is recommended.
prior_params (dict, optional) – Parameters to prior’s __init__ if it needs to be constructed. Defaults to {}.
obs_likelihood_params (dict, optional) – Parameters to obs_likelihood’s __init__ if it needs to be constructed. Defaults to {}.
engine ({"python", "numba"}, optional) – If “numba”, use the numba-accelerated implementation. Defaults to “numba”.
numba_logsumexp (bool, optional) – If True, use the numba-accelerated logsumexp(), otherwise use scipy.special.logsumexp(). Defaults to True.
- find_changepoints(data, prob_threshold=None, full_output=False, truncate=-20)[source]#
Find changepoints in datasets
- Parameters:
data (array-like) – Data array
prob_threshold (float or None, optional) – If this is a float, local maxima in the changepoint probabilities are considered changepoints, if they are above the threshold. In that case, an array of changepoints is returned. If None, an array of probabilities is returned. Defaults to None.
full_output (bool, optional) – Whether to return only the probabilities for a changepoint as a function of time or the full information. Defaults to False, i.e. only probabilities.
truncate (float, optional) – Speed up calculations by truncating a sum if the summands provide negligible contributions. This parameter is the exponent of the threshold. Set to -inf to turn off. Defaults to -20.
- Returns:
cp_or_prob (numpy.ndarray) – Probabilities for a changepoint as a function of time (if prob_threshold=None) or the enumeration of changepoints (if prob_threshold is not None).
q (numpy.ndarray) – Q[t] is the log-likelihood of data[t, n]. Only returned if full_output=True and prob_threshold=None.
p (numpy.ndarray) – P[t, s] is the log-likelihood of a data sequence [t, s], given there is no changepoint between t and s. Only returned if full_output=True and prob_threshold=None.
pcp (numpy.ndarray) – Pcp[i, t] is the log-likelihood that the i-th changepoint is at time step t. To actually get the probability of a changepoint at time step t, sum the probabilities (which is prob). Only returned if full_output=True and prob_threshold=None.
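To illustrate how prob_threshold turns a probability curve into changepoint indices, the following hedged sketch (not the library's internal code) keeps the local maxima that exceed the threshold:

```python
import numpy as np

# Hedged illustration of the prob_threshold behavior described above:
# interior local maxima of the probability curve that exceed the threshold
# are reported as changepoints. Function name is illustrative only.
def maxima_above(prob, threshold):
    prob = np.asarray(prob)
    # interior local maxima: strictly larger than both neighbors
    interior = (prob[1:-1] > prob[:-2]) & (prob[1:-1] > prob[2:])
    idx = np.flatnonzero(interior) + 1
    return idx[prob[idx] > threshold]

prob = np.array([0.0, 0.1, 0.7, 0.2, 0.05, 0.5, 0.3])
maxima_above(prob, 0.4)  # indices 2 and 5
```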
Prior probability classes#
- class sdt.changepoint.ConstPrior[source]#
Class implementing a constant prior
\(P(t) = 1 / (\text{len}(data) + 1)\)
- class sdt.changepoint.GeometricPrior(p)[source]#
Class implementing a geometrically distributed prior
\(P(t) = p (1 - p)^{t - 1}\)
- Parameters:
p (float) – p in the formula above
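For illustration, the prior can be evaluated directly from the formula above (p = 0.1 is an arbitrary choice, not a library default):

```python
import numpy as np

# Evaluating the geometric prior P(t) = p * (1 - p)**(t - 1) for the first
# few segment lengths t, with p = 0.1 chosen for illustration.
p = 0.1
t = np.arange(1, 5)
prior = p * (1 - p) ** (t - 1)  # [0.1, 0.09, 0.081, 0.0729]
```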
- class sdt.changepoint.NegBinomialPrior(k, p)[source]#
Class implementing a neg-binomially distributed prior
\(P(t) = {{t - k}\choose{k - 1}} p^k (1 - p)^{t - k}\)
- Parameters:
k (int) – k in the formula above
p (float) – p in the formula above
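Evaluated directly from the formula above (the helper name is illustrative, not part of the library):

```python
from math import comb

# Hedged sketch evaluating the formula above:
# P(t) = C(t - k, k - 1) * p**k * (1 - p)**(t - k)
def neg_binomial_prior(t, k, p):
    return comb(t - k, k - 1) * p**k * (1 - p) ** (t - k)

neg_binomial_prior(3, 1, 0.5)  # comb(2, 0) * 0.5 * 0.25 = 0.125
```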
There are also numba-jitclass-ed versions of the first two classes, named ConstPriorNumba and GeometricPriorNumba.
Observation likelihood classes#
- class sdt.changepoint.GaussianObsLikelihood[source]#
Gaussian observation likelihood
- likelihood(t, s)#
Get likelihood
- Parameters:
t (int) – First time point to consider
s (int) – Last time point to consider
- Returns:
likelihood
- Return type:
float
- set_data(data)#
Set data for calculation of the likelihood
- Parameters:
data (numpy.ndarray, shape(n, m)) – m datasets of n data points
- class sdt.changepoint.IfmObsLikelihood[source]#
Independent features model from Xuan et al.
See Xuan Xiang, Kevin Murphy: “Modeling Changing Dependency Structure in Multivariate Time Series”, ICML (2007), pp. 1055–1062.
- likelihood(t, s)#
Get likelihood
- Parameters:
t (int) – First time point to consider
s (int) – Last time point to consider
- Returns:
likelihood
- Return type:
float
- set_data(data)#
Set data for calculation of the likelihood
- Parameters:
data (numpy.ndarray, shape(n, m)) – m datasets of n data points
- class sdt.changepoint.FullCovObsLikelihood[source]#
Full covariance model from Xuan et al.
See Xuan Xiang, Kevin Murphy: “Modeling Changing Dependency Structure in Multivariate Time Series”, ICML (2007), pp. 1055–1062.
- likelihood(t, s)#
Get likelihood
- Parameters:
t (int) – First time point to consider
s (int) – Last time point to consider
- Returns:
likelihood
- Return type:
float
- set_data(data)#
Set data for calculation of the likelihood
- Parameters:
data (numpy.ndarray, shape(n, m)) – m datasets of n data points
There are also numba-jitclass-ed versions of the classes, named GaussianObsLikelihoodNumba, IfmObsLikelihoodNumba, and FullCovObsLikelihoodNumba.
Online Bayesian changepoint detection#
Detector class#
- class sdt.changepoint.BayesOnline(hazard='const', obs_likelihood='student_t', hazard_params={'time_scale': 250.0}, obs_params={'alpha': 0.1, 'beta': 0.01, 'kappa': 1.0, 'mu': 0.0}, engine='numba')[source]#
Bayesian online changepoint detector
This is an implementation of [Adam2007] based on the one from the bayesian_changepoint_detection python package.
Since this is an online detector, it keeps state. One can call update() for each datapoint and then extract the changepoint probabilities from probabilities.
- Parameters:
hazard ("const" or callable, optional) – Hazard function. This has to take two parameters, the first being an array of run lengths, the second an array of parameters. See the hazard_params parameter for details. It has to return the hazards corresponding to the run lengths. If “const”, use constant_hazard(). Defaults to “const”.
obs_likelihood ("student_t" or type, optional) – Class implementing the observation likelihood. See StudentT for an example. If “student_t”, use StudentT. Defaults to “student_t”.
hazard_params (dict, optional) – Parameters to pass as second argument to the hazard function. Defaults to {"time_scale": 250.0}.
obs_params (dict, optional) – Parameters to pass to the obs_likelihood constructor. Defaults to {"alpha": 0.1, "beta": 0.01, "kappa": 1.0, "mu": 0.0}.
- reset()[source]#
Reset the detector
All previous data will be forgotten. Useful if one wants to start on a new dataset.
- update(x)[source]#
Add a data point and calculate changepoint probabilities
- Parameters:
x (number) – New data point
- find_changepoints(data, past=3, prob_threshold=None)[source]#
Analyze dataset
This resets the detector and calls update() on all data points.
- Parameters:
data (array-like) – Dataset
past (int, optional) – How many datapoints into the past to look. Larger values will increase robustness, but also latency, meaning that if past equals some number x, a changepoint within the last x data points cannot be detected. Defaults to 3.
prob_threshold (float or None, optional) – If this is a float, local maxima in the changepoint probabilities are considered changepoints, if they are above the threshold. In that case, an array of changepoints is returned. If None, an array of probabilities is returned. Defaults to None.
- Returns:
Probabilities for a changepoint as a function of time (if prob_threshold=None) or the enumeration of changepoints (if prob_threshold is not None). Note that while the algorithm sets the probability for a changepoint at index 0 to 100% (meaning that there is a changepoint before the start of the sequence), the returned probability array has the 0-th entry set to 0.
- Return type:
numpy.ndarray
- get_probabilities(past)[source]#
Get changepoint probabilities
To calculate the probabilities, look a number of data points (as given by the past parameter) into the past to increase robustness.
There is always 100% probability that there was a changepoint at the start of the signal due to how the algorithm is implemented; one should filter that out if necessary or use find_changepoints(), which does that for you.
- Parameters:
past (int) – How many datapoints into the past to look. Larger values will increase robustness, but also latency, meaning that if past equals some number x, a changepoint within the last x data points cannot be detected.
- Returns:
Changepoint probabilities as a function of time. The length of the array equals the number of datapoints - past.
- Return type:
numpy.ndarray
Hazard classes#
- class sdt.changepoint.ConstHazard(time_scale)[source]#
Class implementing a constant hazard function
- Parameters:
time_scale (float) – Time scale
There is also a numba-jitclass-ed version of the class, named ConstHazardNumba.
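Conceptually, a constant hazard assigns every run length the same probability of the run ending at the next time step. A minimal sketch, assuming the two-argument form described for BayesOnline's hazard functions (this is an illustration, not the library's constant_hazard()):

```python
import numpy as np

# Hedged sketch of a constant hazard: every run length has the same
# probability 1 / time_scale of ending at the next time step.
def constant_hazard(run_lengths, time_scale):
    return np.full(np.shape(run_lengths), 1.0 / time_scale)

constant_hazard(np.arange(4), 250.0)  # four entries, each 0.004
```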
Observation likelihood classes#
- class sdt.changepoint.StudentT(alpha, beta, kappa, mu)[source]#
Student T observation likelihood
- Parameters:
alpha (float) – Distribution parameter
beta (float) – Distribution parameter
kappa (float) – Distribution parameter
mu (float) – Distribution parameter
There is also a numba-jitclass-ed version of the class, named StudentTNumba.
Calculation of statistics for data segments#
- sdt.changepoint.segment_stats(data, changepoints, stat_funcs, mask=None, stat_margin=0, return_len='segments')[source]#
Calculate statistics for each segment found via changepoint detection
- Parameters:
data (ndarray) – Data for which changepoints are calculated. Either a 1D array or a 2D array of column-wise data.
changepoints (ndarray | Callable[[ndarray], ndarray]) – Sequence of changepoints or function that calculates changepoints from data
stat_funcs (Callable | Iterable[Callable]) – Apply these functions to data segments identified via changepoint detection.
numpy.mean()
andnumpy.median()
are prime candidates for this. The functions should accept an axis keyword argument like many numpy functions do.mask (ndarray | None) – Boolean array of the same length as data. Changepoint detection is only performed using data entries that have a corresponding True entry in the mask. None is equivalent to a mask whose entries are all True.
stat_margin (int) – Data at the changepoint may be influenced by the change. This parameter specifies the number of datapoints before and after a changepoint to ignore in the calculation of the statistics.
return_len (str) – If "segments", return arrays of length len(changepoints) + 1. If "data", return arrays of the same length as the data parameter.
- Returns:
Array of segment numbers, starting at zero and increasing by one after each detected changepoint; -1 if changepoint detection failed, e.g. due to NaNs in data. Also returns the corresponding 2D array in which each column holds the result of one of stat_funcs applied to each segment.
If return_len == "segments", the arrays contain a single entry for each segment in the trace. If return_len == "data", each entry in the returned arrays corresponds to one entry in data.
- Return type:
Tuple[ndarray, ndarray]
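To make the segment numbering and per-segment statistics concrete, here is a hedged pure-NumPy sketch of the idea (not the function's actual implementation):

```python
import numpy as np

# Hedged sketch of what segment_stats computes: label each data point with
# its segment number, then apply a statistic (here the mean) per segment.
# The variable names are illustrative, not the library's internals.
data = np.concatenate([np.full(5, 1.0), np.full(5, 3.0)])
changepoints = np.array([5])

seg = np.zeros(len(data), dtype=int)
for cp in changepoints:
    seg[cp:] += 1  # segment number increases after each changepoint

# return_len="segments" style: one entry per segment
means = np.array([data[seg == i].mean() for i in range(seg.max() + 1)])
```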
Plotting of changepoints#
- sdt.changepoint.plot_changepoints(data, changepoints, time=None, style='shade', segment_alpha=0.2, segment_colors=['#4286f4', '#f44174'], ax=None)[source]#
Plot time series with changepoints
- Parameters:
data (numpy.ndarray) – Data points in time series
changepoints (list-like of int) – Indices of changepoints
time (numpy.ndarray or None) – x axis values. If None (the default), use 1, 2, 3…
style ({"shade", "line"}, optional) – How to indicate changepoints. If “shade” (default), draw the background of the segments separated by changepoints in different colors. If “line”, draw vertical lines at the positions of the changepoints.
segment_alpha (float, optional) – Alpha value of the background color in case of style="shade". Defaults to 0.2.
segment_colors (list of str) – Sequence of background colors for use with style="shade". Defaults to ["#4286f4", "#f44174"], which are blue and pink.
ax (matplotlib.axes.Axes or None) – Axes object to plot on. If None (the default), get it by calling pyplot’s gca().
- Returns:
Axes object used for plotting.
- Return type:
matplotlib.axes.Axes
References
[Kill2012] Killick et al.: “Optimal Detection of Changepoints With a Linear Computational Cost”, Journal of the American Statistical Association, Informa UK Limited, 2012, 107, 1590–1598
[Fear2006] Fearnhead: “Exact and efficient Bayesian inference for multiple changepoint problems”, Statistics and Computing, 2006, 16, 203–213
[Adam2007] Adams, MacKay: “Bayesian Online Changepoint Detection”, arXiv:0710.3742, 2007