[UI-part3] Incremental learning from data streams

Lecture notes of Artificial Intelligence

Course: https://ucilnica.fri.uni-lj.si/course/view.php?id=81
Lecturer: Assoc. Prof. Dr. Zoran Bosnić
Language: English
Date: 2014-05-14

Course overview (part 3):

  • comparison of stationary learning vs. incremental learning; streams, design, limitations
  • data summarization: statistics, histogram, wavelets, condensing datasets
  • incremental learning models: incremental decision trees (VFDT, functional tree leaves), Hoeffding bound
  • concept drift/novelty detection, reacting to changes: extreme values, decision structure, frequency, distances
  • clustering from data streams: hierarchical, micro, and grid clustering
  • evaluation of streaming algorithms
  • matrix factorization with temporal dynamics (recommendation systems)

Lectures include hands-on exercises in the statistical package R. Homework problem for grading: a competition in modeling given streaming data. Project: to be discussed individually; take some data related to one's PhD work and model it incrementally.

Learning from data streams

Date: 2014-05-14

Introduction

Static learning:

  • fixed dataset
  • relational/attributional form
  • classification/regression prediction problems

Incremental learning:

  • data streams = continuously arriving, changing data
  • concept drift = the data distribution changes unpredictably over time
  • time series = measurements with timestamps; transformation into relational data is possible
  • sliding window = holds the most recent examples (a minimal sketch follows this list)
    • sequence-based window – e.g. the 20 last measurements
    • timestamp-based window:
      • linear – e.g. 20 samples spaced 15 min apart
      • non-linear – e.g. 20 samples with increasing spacing between them
      • adaptive – shrinks on demand (for concept drift)
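A minimal R sketch of a sequence-based sliding window; the simulated stream and the window size k = 20 are illustrative assumptions, not part of the notes.

```r
# Sequence-based sliding window: keep only the k most recent measurements
# and recompute a summary (here the mean) as new data arrives.
window_mean <- function(stream, k = 20) {
  window <- numeric(0)                 # holds the most recent examples
  means  <- numeric(length(stream))
  for (i in seq_along(stream)) {
    window <- c(window, stream[i])     # append the newest measurement
    if (length(window) > k)
      window <- window[-1]             # drop the oldest once the window is full
    means[i] <- mean(window)           # summary over the current window
  }
  means
}

set.seed(1)
stream <- cumsum(rnorm(200))           # simulated stream (assumption)
head(window_mean(stream, k = 20))
```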

Algorithm: ADWIN (ADaptive sliding WINdow)

  • test all possible splits of the window into two subwindows and check whether their means are equal (a sketch follows this list)
  • if the means differ significantly, drop the elements of the older subwindow
  • the comparison of the two subwindow means must take their stability into account (i.e. from how many elements each mean was computed)
  • a single parameter \(\delta\) (confidence level)
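The following R sketch illustrates the ADWIN idea under simplifying assumptions; the cut threshold used here is a Hoeffding-style approximation, not the exact bound from the original algorithm.

```r
# Simplified ADWIN step: try every split of the window into an "old" and a
# "recent" part and drop the old part when their means differ by more than a
# threshold that accounts for how many elements each mean was computed from.
adwin_step <- function(window, x, delta = 0.01) {
  window <- c(window, x)                           # add the new element
  n <- length(window)
  for (split in seq_len(n - 1)) {
    w0 <- window[1:split]                          # older subwindow
    w1 <- window[(split + 1):n]                    # recent subwindow
    m  <- 1 / (1 / length(w0) + 1 / length(w1))    # harmonic-mean size (stability)
    eps_cut <- sqrt(log(4 / delta) / (2 * m))      # approximate cut threshold
    if (abs(mean(w0) - mean(w1)) > eps_cut) {
      window <- w1                                 # change detected: keep recent part
      break
    }
  }
  window
}
```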

Data synopsis

Compress the incoming stream information into a compact synopsis so that processing remains feasible without storing all the data.

  • representative sample: reservoir sampling algorithm (see the sketch after this list)
  • histograms
  • sketches
  • discrete transforms and wavelets: Haar wavelets
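A short R sketch of reservoir sampling, which maintains a uniform random sample of fixed size k from a stream of unknown length; the stream and k are illustrative assumptions.

```r
# Reservoir sampling: each element of the stream ends up in the sample with
# equal probability, while only k items are ever stored.
reservoir_sample <- function(stream, k = 10) {
  reservoir <- stream[1:k]                 # fill the reservoir with the first k items
  for (i in (k + 1):length(stream)) {
    j <- sample.int(i, 1)                  # position 1..i, chosen uniformly
    if (j <= k)
      reservoir[j] <- stream[i]            # replace: new item kept with probability k/i
  }
  reservoir
}

set.seed(2)
reservoir_sample(rnorm(10000), k = 10)
```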

Predicting from data streams

Date: 2014-05-21

Options:

  • for time series: statistics, transformations, fitted coefficients
  • for relational/attributional data: adapt common learning algorithms, or use specialized incremental learning algorithms
  • transform the time series into attributional form (a sketch follows this list):
    • temporal attributes – periodic attributes based on assumptions about what makes sense (e.g. daily/weekly sin/cos functions)
    • historical attributes – e.g. measurements from a few hours earlier
    • real values – kept for manual checking
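A possible R sketch of this transformation, assuming an hourly series stored in a data frame with columns time and value; the chosen attributes (daily sin/cos and lags 1–3) are illustrative assumptions.

```r
# Turn a time series into attributional (tabular) form: temporal attributes
# encode the daily period, historical attributes are lagged values.
make_attributes <- function(ts_df, lags = 1:3) {
  hour <- as.numeric(format(ts_df$time, "%H"))
  out <- data.frame(
    sin_day = sin(2 * pi * hour / 24),     # temporal attribute: daily periodicity
    cos_day = cos(2 * pi * hour / 24),
    value   = ts_df$value                  # the real value, kept for checking
  )
  for (l in lags)                          # historical attributes: lagged measurements
    out[[paste0("lag", l)]] <- c(rep(NA, l), head(ts_df$value, -l))
  na.omit(out)
}

# usage with a simulated hourly series (assumption)
times <- seq(as.POSIXct("2014-05-01 00:00"), by = "hour", length.out = 72)
make_attributes(data.frame(time = times,
                           value = sin(2 * pi * (1:72) / 24) + rnorm(72, sd = 0.1)))
```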

Incremental models do not require all historical data to be available and can be built incrementally. Incremental learning slides a window through the data and updates the model with new examples as they arrive.

Algorithm: Very fast decision tree algorithms (VFDT)

  • only discrete attributes
  • tree is not binary
  • split a leaf node only if there is sufficient statistical evidence

Growing the tree:

  1. update statistics in the leaf
  2. evaluate the heuristic \(H()\) on all attributes and compute the difference between the two best attributes: \[ \Delta H = H(a_1) - H(a_2) \]
  3. compare \(\Delta H\) with the Hoeffding bound (Hoeffding, 1963) and split on \(a_1\) if \(\Delta H > \epsilon\) (a sketch of this decision follows): \[ \epsilon = \sqrt{\frac{R^2 \ln(2/\delta)}{2n}} \] where \(R\) is the range of \(H\) and \(n\) is the number of examples observed in the leaf
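A short R sketch of the split decision; the heuristic values H_best and H_second are assumed to be computed elsewhere from the leaf statistics, and R and delta are illustrative defaults.

```r
# Hoeffding bound as given in the notes: epsilon shrinks as more examples n
# reach the leaf, so eventually even small gain differences justify a split.
hoeffding_bound <- function(R, delta, n) sqrt(R^2 * log(2 / delta) / (2 * n))

should_split <- function(H_best, H_second, n, R = 1, delta = 1e-4) {
  dH  <- H_best - H_second            # difference between the two best attributes
  eps <- hoeffding_bound(R, delta, n) # statistical evidence required at this n
  dH > eps                            # TRUE -> enough evidence to split the leaf
}

should_split(H_best = 0.42, H_second = 0.30, n = 500)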

Concept drift detection

The distribution of examples, or the concept being modeled, can change over time, e.g. as a consequence of a hidden variable or process. We must be careful to distinguish genuine drift from temporary noise.

Categorizations based on:

  • data management:
    • full memory (e.g. weighting recent examples more heavily)
    • partial memory (fixed or adaptive window)
  • detection methods:
    • monitor performance indicators
    • compare time-windows
  • adaptation capability:
    • blind methods (retrain at regular intervals, regardless of whether change has occurred)
    • informed methods

Tests:

  • Page-Hinkley test (PH test): accumulate the difference between observed values and their running average; raise an alarm when the cumulative deviation exceeds an allowed threshold (a sketch follows this list)
  • Statistical process control: compute probability of error \(p_i\) and its standard deviation \(s_i\), system switches between states according to bounds (“in control”, “warning”, “out of control”)
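A short R sketch of the Page-Hinkley test; the parameter values delta (allowed magnitude of change) and lambda (alarm threshold) are illustrative assumptions.

```r
# Page-Hinkley test for detecting an increase in a monitored signal
# (e.g. the error of a model on a stream).
page_hinkley <- function(x, delta = 0.005, lambda = 50) {
  x_mean <- 0; m_t <- 0; M_t <- 0
  alarms <- integer(0)
  for (t in seq_along(x)) {
    x_mean <- x_mean + (x[t] - x_mean) / t         # running average
    m_t <- m_t + (x[t] - x_mean - delta)           # cumulative deviation from the mean
    M_t <- min(M_t, m_t)                           # smallest deviation seen so far
    if (m_t - M_t > lambda) alarms <- c(alarms, t) # alarm: deviation grew too large
  }
  alarms
}

set.seed(3)
signal <- c(rnorm(500, mean = 0), rnorm(500, mean = 1))  # simulated mean shift at t = 501
head(page_hinkley(signal))
```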

Clustering from data streams

Date: 2014-05-28

  • online clustering
  • evaluation of incremental algorithms

In online clustering the task is to maintain a continuously consistent, good clustering using a small amount of memory and time (grouping objects into groups; unsupervised learning).

Approaches:

  • hierarchical clustering: build micro clusters summarized by cluster features (CF), maintain statistics for each one (not storing the examples), later combine them into macro clusters
  • partitioning clustering (k-means)
  • density-based clustering
  • grid-based clustering
  • model-based clustering

Algorithm: BIRCH

A hierarchical clustering approach; the acronym stands for Balanced Iterative Reducing and Clustering using Hierarchies (Zhang et al., 1996).

  • build a tree
  • every node holds cluster feature (CF) entries (at most \(B\) of them) and descendant subtrees
  • each node summarizes the statistics of descending nodes
  • parameters: \(B\) - branch factor, \(T\) - maximum absorption distance
  • when a new example arrives, the closest CF absorbs it if it lies within radius \(T\); otherwise a new CF entry is added, and if there is no room in the node, the parent node is split (a sketch of the CF bookkeeping follows)
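A short R sketch of the cluster feature bookkeeping, assuming a CF is stored as the triple (N, LS, SS) of count, linear sum, and squared sum; the tree structure and node splitting are omitted.

```r
# A cluster feature (N, LS, SS) is enough to compute the centroid and radius
# incrementally, without storing the examples themselves.
cf_new    <- function(x) list(N = 1, LS = x, SS = x^2)
cf_add    <- function(cf, x) list(N = cf$N + 1, LS = cf$LS + x, SS = cf$SS + x^2)
cf_center <- function(cf) cf$LS / cf$N
cf_radius <- function(cf) sqrt(sum(cf$SS / cf$N - (cf$LS / cf$N)^2))

# Absorb a new example into the closest CF if it stays within radius T,
# otherwise start a new CF (T is the maximum absorption distance).
absorb_or_new <- function(cfs, x, T = 0.5) {
  if (length(cfs) == 0) return(list(cf_new(x)))
  d <- sapply(cfs, function(cf) sqrt(sum((cf_center(cf) - x)^2)))
  i <- which.min(d)
  candidate <- cf_add(cfs[[i]], x)
  if (cf_radius(candidate) <= T) cfs[[i]] <- candidate   # closest CF absorbs x
  else cfs[[length(cfs) + 1]] <- cf_new(x)               # otherwise add a new CF entry
  cfs
}
```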

An extension is CluStream (Aggarwal et al., 2003), which maintains just \(q\) micro clusters and discards the oldest ones.

Evaluation of streaming algorithms

Performance measures: MSE, classification accuracy (CA), …

The standard approach in batch learning is to split the data into a training set (possibly further divided into training and validation sets) and a test set.

Issues in streaming data:

  • dataset not fixed
  • concept drifts
  • decision models change over time

Approaches (both influenced by the order of examples):

  • holdout and independent test set
  • predictive sequential (prequential) evaluation

The solution is to use a prequential estimate with a forgetting mechanism (either a time window or fading factors).

The Q-statistic is used to compare two algorithms: it is the logarithm of the ratio of their accumulated losses, which is negative when the first algorithm (the one with lower accumulated loss) is better than the second (see the sketch after the formula).

\[ Q_i(A,B) = \log\left(\frac{S^A_i}{S^B_i}\right) \]
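
A short R sketch of prequential loss accumulation with a fading factor and the Q statistic; the fading factor alpha and the simulated per-example losses are illustrative assumptions.

```r
# Accumulate losses prequentially with a fading factor: older losses are
# gradually forgotten, so the estimate tracks the current concept.
prequential_loss <- function(losses, alpha = 0.995) {
  S <- numeric(length(losses))
  s <- 0
  for (i in seq_along(losses)) {
    s <- alpha * s + losses[i]      # faded accumulated loss up to example i
    S[i] <- s
  }
  S
}

# Q statistic: log ratio of the accumulated losses of algorithms A and B.
q_statistic <- function(loss_A, loss_B, alpha = 0.995) {
  log(prequential_loss(loss_A, alpha) / prequential_loss(loss_B, alpha))
}

set.seed(4)
loss_A <- runif(1000, 0, 0.4)       # simulated losses: A is better on average
loss_B <- runif(1000, 0, 0.6)
tail(q_statistic(loss_A, loss_B))   # negative values favor algorithm A
```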