Changepoint detection#

The sdt.changepoint module provides algorithms for changepoint detection, i.e. for finding changepoints in a time series.

There are several algorithms available: PELT, offline Bayesian changepoint detection, and online Bayesian changepoint detection; each is described in its own section below.

The segment_stats() function allows for calculating statistics for segments identified via changepoint detection. For instance, the mean value, median, etc. can be computed per segment.

Further, there is plot_changepoints() for plotting time series together with the detected changepoints.

Examples

Create some data:

>>> import numpy
>>> from sdt import changepoint
>>> numpy.random.seed(123)
>>> # Make some data with a changepoint at t = 10
>>> data = numpy.concatenate([numpy.random.normal(1.5, 0.1, 10),
...                           numpy.random.normal(0, 0.1, 10)])

Use PELT for changepoint detection:

>>> det = changepoint.Pelt(cost="l2", min_size=1, jump=1)
>>> det.find_changepoints(data, 1)
array([10])

Simple bayesian offline changepoint detection:

>>> det = changepoint.BayesOffline("const", "gauss")
>>> det.find_changepoints(data, prob_threshold=0.4)
array([10])
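
If prob_threshold is left at None, the probability curve itself is returned and can be inspected or thresholded by hand. A minimal sketch (output omitted, as the exact values depend on the data; note that find_changepoints() additionally requires such points to be local maxima):

>>> prob = det.find_changepoints(data)  # no threshold: returns probabilities
>>> numpy.nonzero(prob > 0.4)[0]  # indices where the probability exceeds 0.4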

Online changepoint detection can be used on data as it arrives.

>>> det = changepoint.BayesOnline()
>>> while True:
...     x = wait_for_data()  # some routine that returns data once it arrives
...     det.update(x)
...     prob = det.get_probabilities(3)  # look three data points into the past
...     if len(prob) >= 1 and numpy.any(prob[1:] > 0.8):
...         # There is an 80% chance that there was a changepoint
...         break

Of course, it can also be used on a whole dataset, similarly to offline detection.

>>> det = changepoint.BayesOnline()
>>> det.find_changepoints(data, 3, 0.5)
array([10])

All algorithms also work with multivariate data. In that case, data should be a 2D array with one dataset per column.

>>> # changepoint at t = 5
>>> data2 = numpy.concatenate([numpy.random.normal(0, 0.1, 5),
...                            numpy.random.normal(2.5, 0.1, 15)])
>>> # multivariate data, one set per column
>>> data_m = numpy.array([data, data2]).T

PELT example:

>>> det = changepoint.Pelt(cost="l2", min_size=1, jump=1)
>>> det.find_changepoints(data_m, 1)
array([ 5, 10])

When using BayesOffline, it is recommended to choose either the “ifm” or the “full_cov” model for multivariate data.
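
For instance, a sketch using the full covariance model on the multivariate data from above (output omitted; both changepoints should be recovered):

>>> det = changepoint.BayesOffline("const", "full_cov")
>>> det.find_changepoints(data_m, prob_threshold=0.4)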

PELT#

Detector class#

class sdt.changepoint.Pelt(cost='l2', min_size=2, jump=1, cost_params={}, engine='numba')[source]#

PELT changepoint detection

Implementation of the PELT algorithm [Kill2012]. It is compatible with the implementation from ruptures.

Examples

>>> import numpy as np
>>> det = Pelt(cost="l2", min_size=1, jump=1)
>>> # Make some data with a changepoint at t = 10
>>> data = np.concatenate([np.ones(10), np.zeros(10)])
>>> det.find_changepoints(data, 1)
array([10])
Parameters:
  • cost (cost class or cost class instance or str, optional) – If “l1”, use CostL1; if “l2”, use CostL2. A cost class type or instance can also be passed directly. Defaults to “l2”.

  • min_size (int, optional) – Minimum length of segments between change points. Defaults to 2.

  • jump (int, optional) – Consider only every jump-th data point to speed up calculation. Defaults to 1.

  • cost_params (dict, optional) – Parameters to pass to the cost class’s __init__. Only relevant if cost is not a class instance. Defaults to {}.

  • engine ({"python", "numba"}, optional) – If “numba”, use the numba-accelerated implementation. Defaults to “numba”.

find_changepoints(data, penalty, max_exp_cp=10)[source]#

Do the changepoint detection on data

Parameters:
  • data (numpy.ndarray, shape(n, m)) – m datasets of n data points

  • penalty (float) – Penalty of creating a new changepoint

  • max_exp_cp (int, optional) – Expected maximum number of changepoints. Memory for at least this many changepoints will be pre-allocated. If there are more, a larger array will be allocated, but this incurs some overhead. Defaults to 10.

Returns:

Array of changepoints

Return type:

numpy.ndarray, shape(n)

Cost classes#

class sdt.changepoint.CostL1[source]#

L1 norm cost

The cost is \(\sum_i |y_i - \operatorname{median}(y)|\).
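
For intuition, this cost can be evaluated by hand with numpy; a sketch of the formula above, not the class's internal implementation (output omitted):

>>> import numpy as np
>>> seg = np.array([1.0, 1.2, 0.9, 5.0])  # hypothetical segment
>>> np.sum(np.abs(seg - np.median(seg)))  # sum of absolute deviations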

min_size#

Minimum size of a segment that works with this cost function

Type:

int

set_data(data)[source]#

Set data for cost function

Parameters:

data (numpy.ndarray, shape(n, m)) – m datasets of n data points

cost(t, s)[source]#

Calculate cost from time t to time s

Parameters:
  • t (int) – Start point

  • s (int) – End point

Returns:

Cost

Return type:

float

class sdt.changepoint.CostL2[source]#

L2 norm cost

The cost is \(\operatorname{var}(y) \Delta t\), where \(\Delta t\) is the duration of the segment.
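
Again, a by-hand sketch of the formula with numpy, not the class's internal implementation (output omitted):

>>> import numpy as np
>>> seg = np.array([1.0, 1.2, 0.9, 5.0])  # hypothetical segment
>>> np.var(seg) * len(seg)  # variance times segment duration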

min_size#

Minimum size of a segment that works with this cost function

Type:

int

set_data(data)[source]#

Set data for cost function

Parameters:

data (numpy.ndarray, shape(n, m)) – m datasets of n data points

cost(t, s)[source]#

Calculate cost from time t to time s

Parameters:
  • t (int) – Start point

  • s (int) – End point

Returns:

Cost

Return type:

float

There also exist numba-jitclass-ed versions of the cost classes, named CostL1Numba and CostL2Numba.
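
Instead of a string, a cost instance can be passed to Pelt directly; a sketch, assuming the jitclass-ed variants take no constructor arguments, like their pure-Python counterparts:

>>> det = Pelt(cost=CostL1Numba(), min_size=2, engine="numba")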

Offline Bayesian changepoint detection#

Detector class#

class sdt.changepoint.BayesOffline(prior='const', obs_likelihood='gauss', prior_params={}, obs_likelihood_params={}, numba_logsumexp=True, engine='numba')[source]#

Bayesian offline changepoint detector

This is an implementation of [Fear2006] based on the one from the bayesian_changepoint_detection python package.

Parameters:
  • prior ({"const", "geometric", "neg_binomial"} or prior class, optional) –

    Prior probability. This can either be a string describing the prior or a type or instance of a class implementing the prior, for example ConstPrior, GeometricPrior, or NegBinomialPrior.

    If a string or a type is passed, a class instance will be created, passing prior_params to __init__.

    If “const”, use ConstPrior. If “geometric”, use GeometricPrior. If “neg_binomial”, use NegBinomialPrior. Defaults to “const”.

  • obs_likelihood ({"gauss", "ifm", "full_cov"} or likelihood class, opt.) –

    Observation likelhood. This can either be a string describing the likelihood or a type or instance of a class implementing the likelihood, as for example GaussianObsLikelihood, IfmObsLikelihood, or FullCovObsLikelihood.

    If a string or a type ar passed, a class instance will be created passing obs_likelihood_params to __init__.

    If “gauss”, use GaussianObsLikelihood. If “ifm”, use IfmObsLikelihood. If “full_cov”, use FullCovObsLikelihood. Defaults to “gauss”.

    For multivariate data, “ifm” or “full_cov” is recommended.

  • prior_params (dict, optional) – Parameters to prior’s __init__ if it needs to be constructed. Defaults to {}.

  • obs_likelihood_params (dict, optional) – Parameters to obs_likelihood’s __init__ if it needs to be constructed. Defaults to {}.

  • engine ({"python", "numba"}, optional) – If “numba”, use the numba-accelerated implementation. Defaults to “numba”.

  • numba_logsumexp (bool, optional) – If True, use numba-accelerated logsumexp(), otherwise use scipy.special.logsumexp(). Defaults to True.

find_changepoints(data, prob_threshold=None, full_output=False, truncate=-20)[source]#

Find changepoints in datasets

Parameters:
  • data (array-like) – Data array

  • prob_threshold (float or None, optional) – If this is a float, local maxima in the changepoint probabilities that lie above the threshold are considered changepoints, and an array of changepoints is returned. If None, an array of probabilities is returned. Defaults to None.

  • full_output (bool, optional) – Whether to return only the probabilities for a changepoint as a function of time or the full information. Defaults to False, i.e. only probabilities.

  • truncate (float, optional) – Speed up calculations by truncating a sum if the summands provide negligible contributions. This parameter is the exponent of the threshold. Set to -inf to turn off. Defaults to -20.

Returns:

  • cp_or_prob (numpy.ndarray) – Probabilities for a changepoint as a function of time (if prob_threshold=None) or the enumeration of changepoints (if prob_threshold is not None).

  • q (numpy.ndarray) – Q[t] is the log-likelihood of data [t, n]. Only returned if full_output=True and prob_threshold=None.

  • p (numpy.ndarray) – P[t, s] is the log-likelihood of a data sequence [t, s], given there is no changepoint between t and s. Only returned if full_output=True and prob_threshold=None.

  • pcp (numpy.ndarray) – Pcp[i, t] is the log-likelihood that the i-th changepoint is at time step t. To actually get the probability of a changepoint at time step t, sum the probabilities (which yields prob). Only returned if full_output=True and prob_threshold=None.

Prior probability classes#

class sdt.changepoint.ConstPrior[source]#

Class implementing a constant prior

\(P(t) = 1 / (\text{len}(data) + 1)\)

set_data(data)[source]#

Set data for calculation of the prior

Parameters:

data (numpy.ndarray, shape(n, m)) – m datasets of n data points

prior(t)[source]#

Get prior at time t.

Parameters:

t (int) – Time point

Returns:

Prior probability for time point t

Return type:

float

class sdt.changepoint.GeometricPrior(p)[source]#

Class implementing a geometrically distributed prior

\(P(t) = p (1 - p)^{t - 1}\)

Parameters:

p (float) – p in the formula above

set_data(data)[source]#

Set data for calculation of the prior

Parameters:

data (numpy.ndarray, shape(n, m)) – m datasets of n data points

prior(t)[source]#

Get prior at time t.

Parameters:

t (int) – Time point

Returns:

Prior probability for time point t

Return type:

float

class sdt.changepoint.NegBinomialPrior(k, p)[source]#

Class implementing a neg-binomially distributed prior

\(P(t) = {{t - k}\choose{k - 1}} p^k (1 - p)^{t - k}\)

Parameters:
  • k (int) – k in the formula above

  • p (float) – p in the formula above

There also exist numba-jitclass-ed versions of the first two classes, named ConstPriorNumba and GeometricPriorNumba.
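
Likewise, a prior instance can be passed to BayesOffline instead of a string; a sketch using a geometric prior (engine="python" is chosen here on the assumption that the pure-Python prior class requires the pure-Python engine):

>>> prior = GeometricPrior(0.01)
>>> det = BayesOffline(prior, "gauss", engine="python")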

Observation likelihood classes#

class sdt.changepoint.GaussianObsLikelihood[source]#

Gaussian observation likelihood

likelihood(t, s)#

Get likelihood

Parameters:
  • t (int) – First time point to consider

  • s (int) – Last time point to consider

Returns:

likelihood

Return type:

float

set_data(data)#

Set data for calculation of the likelihood

Parameters:

data (numpy.ndarray, shape(n, m)) – m datasets of n data points

class sdt.changepoint.IfmObsLikelihood[source]#

Independent features model from Xuan et al.

See Xuan Xiang, Kevin Murphy: “Modeling Changing Dependency Structure in Multivariate Time Series”, ICML (2007), pp. 1055–1062.

likelihood(t, s)#

Get likelihood

Parameters:
  • t (int) – First time point to consider

  • s (int) – Last time point to consider

Returns:

likelihood

Return type:

float

set_data(data)#

Set data for calculation of the likelihood

Parameters:

data (numpy.ndarray, shape(n, m)) – m datasets of n data points

class sdt.changepoint.FullCovObsLikelihood[source]#

Full covariance model from Xuan et al.

See Xuan Xiang, Kevin Murphy: “Modeling Changing Dependency Structure in Multivariate Time Series”, ICML (2007), pp. 1055–1062.

likelihood(t, s)#

Get likelihood

Parameters:
  • t (int) – First time point to consider

  • s (int) – Last time point to consider

Returns:

likelihood

Return type:

float

set_data(data)#

Set data for calculation of the likelihood

Parameters:

data (numpy.ndarray, shape(n, m)) – m datasets of n data points

There also exist numba-jitclass-ed versions of the classes, named GaussianObsLikelihoodNumba, IfmObsLikelihoodNumba, and FullCovObsLikelihoodNumba.

Online Bayesian changepoint detection#

Detector class#

class sdt.changepoint.BayesOnline(hazard='const', obs_likelihood='student_t', hazard_params={'time_scale': 250.0}, obs_params={'alpha': 0.1, 'beta': 0.01, 'kappa': 1.0, 'mu': 0.0}, engine='numba')[source]#

Bayesian online changepoint detector

This is an implementation of [Adam2007] based on the one from the bayesian_changepoint_detection python package.

Since this is an online detector, it keeps state. One can call update() for each datapoint and then extract the changepoint probabilities from probabilities.

Parameters:
  • hazard_func ("const" or callable, optional) – Hazard function. This has to take two parameters, the first being an array of runlengths, the second an array of parameters. See the hazard_params parameter for details. It has to return the hazards corresponding to the runlengths. If “const”, use constant_hazard(). Defaults to “const”.

  • obs_likelihood ("student_t" or type) – Class implementing the observation likelihood. See StudentTPython for an example. If “student_t”, use StudentTPython. Defaults to “student_t”.

  • hazard_params (numpy.ndarray, optional) – Parameters to pass as second argument to the hazard function. Defaults to numpy.array([250]).

  • obs_params (list, optional) – Parameters to pass to the observation_likelihood constructor. Defaults to [0.1, 0.01, 1., 0.].

reset()[source]#

Reset the detector

All previous data will be forgotten. Useful if one wants to start on a new dataset.
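
A sketch of analyzing two independent traces with one detector (trace1 and trace2 are hypothetical 1D arrays):

>>> det = BayesOnline()
>>> for x in trace1:
...     det.update(x)
>>> prob1 = det.get_probabilities(3)
>>> det.reset()  # forget trace1 before processing trace2
>>> for x in trace2:
...     det.update(x)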

update(x)[source]#

Add a data point and calculate changepoint probabilities

Parameters:

x (number) – New data point

find_changepoints(data, past=3, prob_threshold=None)[source]#

Analyze dataset

This resets the detector and calls update() on all data points.

Parameters:
  • data (array-like) – Dataset

  • past (int, optional) – How many datapoints into the past to look. Larger values will increase robustness, but also latency, meaning that if past equals some number x, a changepoint within the last x data points cannot be detected. Defaults to 3.

  • prob_threshold (float or None, optional) – If this is a float, local maxima in the changepoint probabilities that lie above the threshold are considered changepoints, and an array of changepoints is returned. If None, an array of probabilities is returned. Defaults to None.

Returns:

Probabilities for a changepoint as a function of time (if prob_threshold=None) or the enumeration of changepoints (if prob_threshold is not None). Note that while the algorithm sets the probability for a changepoint at index 0 to 100% (meaning that there is a changepoint before the start of the sequence), the returned probability array has the 0-th entry set to 0.

Return type:

numpy.ndarray

get_probabilities(past)[source]#

Get changepoint probabilities

To calculate the probabilities, look a number of data points (as given by the past parameter) into the past to increase robustness.

There is always 100% probability that there was a changepoint at the start of the signal due to how the algorithm is implemented; one should filter that out if necessary or use find_changepoints(), which does that for you.

Parameters:

past (int) – How many datapoints into the past to look. Larger values will increase robustness, but also latency, meaning that if past equals some number x, a changepoint within the last x data points cannot be detected.

Returns:

Changepoint probabilities as a function of time. The length of the array equals the number of datapoints - past.

Return type:

numpy.ndarray

Hazard classes#

class sdt.changepoint.ConstHazard(time_scale)[source]#

Class implementing a constant hazard function

Parameters:

time_scale (float) – Time scale

hazard(run_lengths)[source]#

Calculate hazard

Parameters:

run_lengths (numpy.ndarray) – Run lengths

Returns:

Hazards for run lengths

Return type:

numpy.ndarray
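
Presumably, a constant hazard assigns 1 / time_scale to every run length; a sketch (output not verified):

>>> h = ConstHazard(250.0)
>>> h.hazard(numpy.arange(1, 5))  # expected: four entries of 1/250 = 0.004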

There also exists a numba-jitclass-ed version of the class, named ConstHazardNumba.

Observation likelihood classes#

class sdt.changepoint.StudentT(alpha, beta, kappa, mu)[source]#

Student T observation likelihood

Parameters:
  • alpha (float) – Distribution parameter

  • beta (float) – Distribution parameter

  • kappa (float) – Distribution parameter

  • mu (float) – Distribution parameter

reset()[source]#

Reset state

pdf(data)[source]#

Calculate probability density function (PDF)

Parameters:

data (array-like) – Data points for which to calculate the PDF

update_theta(data)[source]#

Update parameters for every possible run length

Parameters:

data (array-like) – Data points to use for update

There also exists a numba-jitclass-ed version of the class, named StudentTNumba.

Calculation of statistics for data segments#

sdt.changepoint.segment_stats(data, changepoints, stat_funcs, mask=None, stat_margin=0, return_len='segments')[source]#

Calculate statistics for each segment found via changepoint detection

Parameters:
  • data (ndarray) – Data for which changepoints are calculated. Either a 1D array or a 2D array of column-wise data.

  • changepoints (ndarray | Callable[[ndarray], ndarray]) – Sequence of changepoints or function that calculates changepoints from data

  • stat_funcs (Callable | Iterable[Callable]) – Apply these functions to data segments identified via changepoint detection. numpy.mean() and numpy.median() are prime candidates for this. The functions should accept an axis keyword argument like many numpy functions do.

  • mask (ndarray | None) – Boolean array of the same length as data. Changepoint detection is only performed using data entries that have a corresponding True entry in the mask. None is equivalent to a mask whose entries are all True.

  • stat_margin (int) – Data at the changepoint may be influenced by the change. This parameter specifies the number of datapoints before and after a changepoint to ignore in the calculation of the statistics.

  • return_len (str) – If "segments", return arrays of length len(changepoints) + 1. If "data", return arrays of the same length as the data parameter.

Returns:

Array of segment numbers, starting at zero and increasing by one after each detected changepoint; -1 if changepoint detection failed, e.g. due to NaNs in data. Also returns the corresponding 2D array in which each column holds the result of one of stat_funcs applied to each segment.

If return_len == "segments", the arrays contain a single entry for each segment in the trace. If return_len == "data", each entry in the returned arrays corresponds to one entry in data.

Return type:

Tuple[ndarray, ndarray]
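
A sketch of typical usage, combining PELT-based detection with per-segment means (changepoints is passed as a callable here; output omitted):

>>> det = Pelt(cost="l2", min_size=1, jump=1)
>>> seg, stat = segment_stats(data, lambda d: det.find_changepoints(d, 1),
...                           numpy.mean, return_len="data")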

Plotting of changepoints#

sdt.changepoint.plot_changepoints(data, changepoints, time=None, style='shade', segment_alpha=0.2, segment_colors=['#4286f4', '#f44174'], ax=None)[source]#

Plot time series with changepoints

Parameters:
  • data (numpy.ndarray) – Data points in time series

  • changepoints (list-like of int) – Indices of changepoints

  • time (numpy.ndarray or None) – x axis values. If None (the default), use 1, 2, 3…

  • style ({"shade", "line"}, optional) – How to indicate changepoints. If “shade” (default), draw the background of the segments separated by changepoints in different colors. If “lines”, draw vertical lines at the positions of th changepoints.

  • segment_alpha (float, optional) – Alpha value of the background color in case of style="shade". Defaults to 0.2.

  • segment_colors (list of str) – Sequence of background colors for use with style="shade". Defaults to ["#4286f4", "#f44174"], which are a blue and a pink tone.

  • ax (matplotlib.axes.Axes or None) – Axes object to plot on. If None (the default), get it by calling pyplot’s gca().

Returns:

Axes object used for plotting.

Return type:

matplotlib.axes.Axes
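
A sketch of plotting a time series together with changepoints detected via PELT (assuming matplotlib is available):

>>> import matplotlib.pyplot as plt
>>> cp = Pelt(min_size=1).find_changepoints(data, 1)
>>> ax = plot_changepoints(data, cp, style="shade")
>>> plt.show()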

References

[Kill2012]

Killick et al.: “Optimal Detection of Changepoints With a Linear Computational Cost”, Journal of the American Statistical Association, Informa UK Limited, 2012, 107, 1590–1598

[Fear2006]

Fearnhead, Paul: “Exact and efficient Bayesian inference for multiple changepoint problems”, Statistics and Computing 16.2 (2006), pp. 203–213

[Adam2007]

Adams and MacKay: “Bayesian Online Changepoint Detection”, arXiv:0710.3742