[UI-part3] Incremental learning from data streams

Lecture notes of Artificial Intelligence

Course: https://ucilnica.fri.uni-lj.si/course/view.php?id=81
Lecturer: Assoc. Prof. Dr. Zoran Bosnić
Language: English
Date: 2014-05-14

Course overview (part 3):

  • comparison of stationary learning vs. incremental learning; streams, design, limitations
  • data summarization: statistics, histogram, wavelets, condensing datasets
  • incremental learning models: incremental decision trees (VFDT, functional tree leaves), Hoeffding bound
  • concept drift/novelty detection, reacting to changes: extreme values, decision structure, frequency, distances
  • clustering from data streams: hierarchical, micro, and grid clustering
  • evaluation of streaming algorithms
  • matrix factorization with temporal dynamics (recommendation systems)

Lectures include hands-on exercises in the statistical package R. Homework problem for grading: a competition in modeling given streaming data. Project: to be discussed individually; take some data related to one's PhD work and model it incrementally.

Learning from data streams

Date: 2014-05-14

Introduction

Static learning:

  • fixed dataset
  • relational/attributional form
  • classification/regression prediction problems

Incremental learning:

  • data streams = continuously arriving, changing data
  • concept drift = the data distribution changes unpredictably over time
  • time series = measurements with timestamps; transformation into relational data is possible
  • sliding window = holds the most recent examples (a minimal sketch follows this list)
    • sequence-based window – e.g. the 20 last measurements
    • timestamp-based window:
      • linear – e.g. 20 samples spaced 15 min apart
      • non-linear – e.g. 20 samples with increasing spacing between them
      • adaptive – shrinks on demand (for concept drift)
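A minimal R sketch of a sequence-based sliding window; the simulated stream and the window size k = 20 are illustrative assumptions, not part of the notes.

```r
# Sequence-based sliding window: keep only the k most recent measurements
# and recompute a summary (here the mean) as new data arrives.
window_mean <- function(stream, k = 20) {
  window <- numeric(0)                 # holds the most recent examples
  means  <- numeric(length(stream))
  for (i in seq_along(stream)) {
    window <- c(window, stream[i])     # append the newest measurement
    if (length(window) > k)
      window <- window[-1]             # drop the oldest once the window is full
    means[i] <- mean(window)           # summary over the current window
  }
  means
}

set.seed(1)
stream <- cumsum(rnorm(200))           # simulated stream (assumption)
head(window_mean(stream, k = 20))
```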

Algorithm: ADWIN (ADaptive sliding WINdow)

  • test all possible splits of the window into two subwindows and check whether their means are equal (a sketch follows this list)
  • if the means differ significantly, drop the elements of the older subwindow
  • the comparison of the two subwindow means must take their stability into account (i.e. from how many elements each mean was computed)
  • a single parameter \(\delta\) (confidence level)
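The following R sketch illustrates the ADWIN idea under simplifying assumptions; the cut threshold used here is a Hoeffding-style approximation, not the exact bound from the original algorithm.

```r
# Simplified ADWIN step: try every split of the window into an "old" and a
# "recent" part and drop the old part when their means differ by more than a
# threshold that accounts for how many elements each mean was computed from.
adwin_step <- function(window, x, delta = 0.01) {
  window <- c(window, x)                           # add the new element
  n <- length(window)
  for (split in seq_len(n - 1)) {
    w0 <- window[1:split]                          # older subwindow
    w1 <- window[(split + 1):n]                    # recent subwindow
    m  <- 1 / (1 / length(w0) + 1 / length(w1))    # harmonic-mean size (stability)
    eps_cut <- sqrt(log(4 / delta) / (2 * m))      # approximate cut threshold
    if (abs(mean(w0) - mean(w1)) > eps_cut) {
      window <- w1                                 # change detected: keep recent part
      break
    }
  }
  window
}
```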

Data synopsis

Compress the incoming stream information into a compact synopsis so that processing remains feasible without storing all the data.

  • representative sample: reservoir sampling algorithm (see the sketch after this list)
  • histograms
  • sketches
  • discrete transforms and wavelets: Haar wavelets
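A short R sketch of reservoir sampling, which maintains a uniform random sample of fixed size k from a stream of unknown length; the stream and k are illustrative assumptions.

```r
# Reservoir sampling: each element of the stream ends up in the sample with
# equal probability, while only k items are ever stored.
reservoir_sample <- function(stream, k = 10) {
  reservoir <- stream[1:k]                 # fill the reservoir with the first k items
  for (i in (k + 1):length(stream)) {
    j <- sample.int(i, 1)                  # position 1..i, chosen uniformly
    if (j <= k)
      reservoir[j] <- stream[i]            # replace: new item kept with probability k/i
  }
  reservoir
}

set.seed(2)
reservoir_sample(rnorm(10000), k = 10)
```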

Predicting from data streams

Date: 2014-05-21

Options:

  • for time series: statistics, transformations, fitted coefficients
  • for relational/attributional data: adapt common learning algorithms, or use specialized incremental learning algorithms
  • transform the time series into attributional form (a sketch follows this list):
    • temporal attributes – periodic attributes based on assumptions about what makes sense (e.g. daily/weekly sin/cos functions)
    • historical attributes – e.g. measurements from a few hours earlier
    • real values – kept for manual checking
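A possible R sketch of this transformation, assuming an hourly series stored in a data frame with columns time and value; the chosen attributes (daily sin/cos and lags 1–3) are illustrative assumptions.

```r
# Turn a time series into attributional (tabular) form: temporal attributes
# encode the daily period, historical attributes are lagged values.
make_attributes <- function(ts_df, lags = 1:3) {
  hour <- as.numeric(format(ts_df$time, "%H"))
  out <- data.frame(
    sin_day = sin(2 * pi * hour / 24),     # temporal attribute: daily periodicity
    cos_day = cos(2 * pi * hour / 24),
    value   = ts_df$value                  # the real value, kept for checking
  )
  for (l in lags)                          # historical attributes: lagged measurements
    out[[paste0("lag", l)]] <- c(rep(NA, l), head(ts_df$value, -l))
  na.omit(out)
}

# usage with a simulated hourly series (assumption)
times <- seq(as.POSIXct("2014-05-01 00:00"), by = "hour", length.out = 72)
make_attributes(data.frame(time = times,
                           value = sin(2 * pi * (1:72) / 24) + rnorm(72, sd = 0.1)))
```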

Incremental models do not require all historical data to be available and can be built incrementally. Incremental learning slides a window through the data and updates the model with new examples as they arrive.

Algorithm: Very fast decision tree algorithms (VFDT)

  • only discrete attributes
  • tree is not binary
  • split a leaf node only if there is sufficient statistical evidence

Growing the tree:

  1. update statistics in the leaf
  2. evaluate the heuristic \(H()\) on all attributes and compute the difference between the two best attributes: \[ \Delta H = H(a_1) - H(a_2) \]
  3. compare \(\Delta H\) with the Hoeffding bound (Hoeffding, 1963) and split on \(a_1\) if \(\Delta H > \epsilon\) (a sketch of this decision follows): \[ \epsilon = \sqrt{\frac{R^2 \ln(2/\delta)}{2n}} \] where \(R\) is the range of \(H\) and \(n\) is the number of examples observed in the leaf
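A short R sketch of the split decision; the heuristic values H_best and H_second are assumed to be computed elsewhere from the leaf statistics, and R and delta are illustrative defaults.

```r
# Hoeffding bound as given in the notes: epsilon shrinks as more examples n
# reach the leaf, so eventually even small gain differences justify a split.
hoeffding_bound <- function(R, delta, n) sqrt(R^2 * log(2 / delta) / (2 * n))

should_split <- function(H_best, H_second, n, R = 1, delta = 1e-4) {
  dH  <- H_best - H_second            # difference between the two best attributes
  eps <- hoeffding_bound(R, delta, n) # statistical evidence required at this n
  dH > eps                            # TRUE -> enough evidence to split the leaf
}

should_split(H_best = 0.42, H_second = 0.30, n = 500)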

Concept drift detection

The distribution of examples, or the concept being modeled, can change over time, e.g. as a consequence of a hidden variable or process. We must be careful to distinguish genuine drift from temporary noise.

Categorizations based on:

  • data management:
    • full memory (e.g. weighting recent examples more heavily)
    • partial memory (fixed or adaptive window)
  • detection methods:
    • monitor performance indicators
    • compare time-windows
  • adaptation capability:
    • blind methods (retrain at regular intervals, regardless of whether change has occurred)
    • informed methods

Tests:

  • Page-Hinkley test (PH test): accumulate the difference between observed values and their running average; raise an alarm when the cumulative deviation exceeds an allowed threshold (a sketch follows this list)
  • Statistical process control: compute probability of error \(p_i\) and its standard deviation \(s_i\), system switches between states according to bounds (“in control”, “warning”, “out of control”)
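A short R sketch of the Page-Hinkley test; the parameter values delta (allowed magnitude of change) and lambda (alarm threshold) are illustrative assumptions.

```r
# Page-Hinkley test for detecting an increase in a monitored signal
# (e.g. the error of a model on a stream).
page_hinkley <- function(x, delta = 0.005, lambda = 50) {
  x_mean <- 0; m_t <- 0; M_t <- 0
  alarms <- integer(0)
  for (t in seq_along(x)) {
    x_mean <- x_mean + (x[t] - x_mean) / t         # running average
    m_t <- m_t + (x[t] - x_mean - delta)           # cumulative deviation from the mean
    M_t <- min(M_t, m_t)                           # smallest deviation seen so far
    if (m_t - M_t > lambda) alarms <- c(alarms, t) # alarm: deviation grew too large
  }
  alarms
}

set.seed(3)
signal <- c(rnorm(500, mean = 0), rnorm(500, mean = 1))  # simulated mean shift at t = 501
head(page_hinkley(signal))
```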

Clustering from data streams

Date: 2014-05-28

  • online clustering
  • evaluation of incremental algorithms

In online clustering the task is to maintain a continuously consistent, good clustering using a small amount of memory and time (grouping objects into groups; unsupervised learning).

Approaches:

  • hierarchical clustering: build micro clusters summarized by cluster features (CF), maintain statistics for each one (not storing the examples), later combine them into macro clusters
  • partitioning clustering (k-means)
  • density-based clustering
  • grid-based clustering
  • model-based clustering

Algorithm: BIRCH

A hierarchical clustering approach; the acronym stands for Balanced Iterative Reducing and Clustering using Hierarchies (Zhang et al., 1996).

  • build a tree
  • every node holds cluster feature (CF) entries (at most \(B\) of them) and descendant subtrees
  • each node summarizes the statistics of descending nodes
  • parameters: \(B\) - branch factor, \(T\) - maximum absorption distance
  • when a new example arrives, the closest CF absorbs it if it lies within radius \(T\); otherwise a new CF entry is added, and if there is no room in the node, the parent node is split (a sketch of the CF bookkeeping follows)
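A short R sketch of the cluster feature bookkeeping, assuming a CF is stored as the triple (N, LS, SS) of count, linear sum, and squared sum; the tree structure and node splitting are omitted.

```r
# A cluster feature (N, LS, SS) is enough to compute the centroid and radius
# incrementally, without storing the examples themselves.
cf_new    <- function(x) list(N = 1, LS = x, SS = x^2)
cf_add    <- function(cf, x) list(N = cf$N + 1, LS = cf$LS + x, SS = cf$SS + x^2)
cf_center <- function(cf) cf$LS / cf$N
cf_radius <- function(cf) sqrt(sum(cf$SS / cf$N - (cf$LS / cf$N)^2))

# Absorb a new example into the closest CF if it stays within radius T,
# otherwise start a new CF (T is the maximum absorption distance).
absorb_or_new <- function(cfs, x, T = 0.5) {
  if (length(cfs) == 0) return(list(cf_new(x)))
  d <- sapply(cfs, function(cf) sqrt(sum((cf_center(cf) - x)^2)))
  i <- which.min(d)
  candidate <- cf_add(cfs[[i]], x)
  if (cf_radius(candidate) <= T) cfs[[i]] <- candidate   # closest CF absorbs x
  else cfs[[length(cfs) + 1]] <- cf_new(x)               # otherwise add a new CF entry
  cfs
}
```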

An extension is CluStream (Aggarwal et al., 2003), which maintains just \(q\) micro clusters and discards the oldest ones.

Evaluation of streaming algorithms

Performance measures: MSE, classification accuracy (CA), …

The standard approach in batch learning is to split the data into a training set (possibly further divided into training and validation sets) and a test set.

Issues in streaming data:

  • dataset not fixed
  • concept drifts
  • decision models change over time

Approaches (both influenced by the order of examples):

  • holdout and independent test set
  • predictive sequential (prequential) evaluation

The solution is to use a prequential estimate with a forgetting mechanism (either a time window or fading factors).

The Q-statistic is used to compare two algorithms: it is the logarithm of the ratio of their accumulated losses, which is negative when the first algorithm (the one with lower accumulated loss) is better than the second (see the sketch after the formula).

\[ Q_i(A,B) = \log\left(\frac{S^A_i}{S^B_i}\right) \]
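
A short R sketch of prequential loss accumulation with a fading factor and the Q statistic; the fading factor alpha and the simulated per-example losses are illustrative assumptions.

```r
# Accumulate losses prequentially with a fading factor: older losses are
# gradually forgotten, so the estimate tracks the current concept.
prequential_loss <- function(losses, alpha = 0.995) {
  S <- numeric(length(losses))
  s <- 0
  for (i in seq_along(losses)) {
    s <- alpha * s + losses[i]      # faded accumulated loss up to example i
    S[i] <- s
  }
  S
}

# Q statistic: log ratio of the accumulated losses of algorithms A and B.
q_statistic <- function(loss_A, loss_B, alpha = 0.995) {
  log(prequential_loss(loss_A, alpha) / prequential_loss(loss_B, alpha))
}

set.seed(4)
loss_A <- runif(1000, 0, 0.4)       # simulated losses: A is better on average
loss_B <- runif(1000, 0, 0.6)
tail(q_statistic(loss_A, loss_B))   # negative values favor algorithm A
```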