**Course**: https://ucilnica.fri.uni-lj.si/course/view.php?id=81
**Lecturer**: Assoc. Prof. Dr. Zoran Bosnić
**Language**: English
**Date**: 2014-05-14

Course overview (part 3):

- comparison of stationary learning vs. incremental learning; streams, design, limitations
- data summarization: statistics, histogram, wavelets, condensing datasets
- incremental learning models: incremental decision trees (VFDT, functional tree leaves), Hoeffding bound
- concept drift/novelty detection, reacting to changes: extreme values, decision structure, frequency, distances
- clustering from data streams: hierarchical, micro, grid clustering
- evaluation of streaming algorithms
- matrix factorization with temporal dynamics (recommendation systems)

Lectures include hands-on exercises in the statistical package R. Graded homework: a competition in modeling given streaming data. Project: to be discussed individually; take some data related to the PhD and model it incrementally.

## Learning from data streams

**Date**: 2014-05-14

### Introduction

Static learning:

- fixed dataset
- relational/attributional form
- classification/regression prediction problems

Incremental learning:

- *data streams* = changing data
- *concept drift* = unpredictable changes in the data distribution
- *time series* = measurements with timestamps; transformation into relational form is possible
- *sliding window* = holds the most recent examples (see the sketch after this list)
  - sequence-based window – the 20 most recent measurements
  - timestamp-based window:
    - linear – 20 samples with 15 min between them
    - non-linear – 20 samples with increasing spacing between them
    - adaptive – shrinks on demand (for concept drift)
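
A minimal R sketch of a sequence-based sliding window (the function name and default size are illustrative, not from the lecture):

```r
# A minimal sketch of a sequence-based sliding window: keep only the
# k most recent measurements as new ones arrive.
slide <- function(window, x, k = 20) {
  tail(c(window, x), k)  # append the new value, keep the last k
}

# Example: feed a stream of 100 values one by one
w <- numeric(0)
for (v in rnorm(100)) w <- slide(w, v)
length(w)  # 20
```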

### Algorithm: ADWIN (ADaptive sliding WINdow)

- split the window into two subwindows and check whether their averages are the same (test all possible split points)
- drop the oldest elements if they are not
- the comparison of the means of the two subwindows must take their stability into account (i.e. from how many elements each mean was computed)
- a single parameter \(\delta\) (confidence)
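
A simplified R sketch of this idea; the real ADWIN compresses the window into exponential buckets for efficiency, whereas here every split point is tested explicitly (function name is illustrative):

```r
# A simplified sketch of ADWIN: drop the oldest element while any split
# yields significantly different subwindow means.
adwin_step <- function(window, x, delta = 0.01) {
  window <- c(window, x)
  repeat {
    n <- length(window)
    cut_found <- FALSE
    if (n >= 2) {
      for (i in 1:(n - 1)) {
        n0 <- i; n1 <- n - i
        m <- 1 / (1 / n0 + 1 / n1)                    # harmonic mean of subwindow sizes
        eps_cut <- sqrt(log(4 * n / delta) / (2 * m)) # significance threshold
        if (abs(mean(window[1:i]) - mean(window[(i + 1):n])) > eps_cut) {
          window <- window[-1]  # drop the oldest element and re-test
          cut_found <- TRUE
          break
        }
      }
    }
    if (!cut_found) break
  }
  window
}
```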

### Data synopsis

Compress incoming information from the stream to make processing feasible instead of storing all the data.

- representative sample: reservoir sampling algorithm (see the R sketch after this list)
- histograms
- sketches
- discrete transforms and wavelets: Haar wavelets
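
A minimal R sketch of reservoir sampling, which maintains a uniform random sample of fixed size \(k\) from a stream of unknown length (function name is illustrative):

```r
# A minimal sketch of reservoir sampling: after processing i elements,
# each of them is in the reservoir with equal probability k/i.
reservoir_sample <- function(stream, k) {
  reservoir <- stream[seq_len(min(k, length(stream)))]  # fill with the first k elements
  if (length(stream) > k) {
    for (i in (k + 1):length(stream)) {
      j <- sample.int(i, 1)                  # uniform position in 1..i
      if (j <= k) reservoir[j] <- stream[i]  # replace with probability k/i
    }
  }
  reservoir
}

# Example: a uniform sample of 5 values from a stream of 1000 measurements
set.seed(42)
reservoir_sample(rnorm(1000), 5)
```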

## Predicting from data streams

**Date**: 2014-05-21

Options:

- *for time series*: statistics, transformations, fitted coefficients
- *for relational/attributional data*: adapt common learning algorithms, or use specialized incremental learning algorithms
- *transform time series into attributional form* (see the sketch after this list):
  - temporal attributes – periodic attributes based on assumptions about what makes sense (e.g. daily/weekly sin/cos functions)
  - historical attributes – e.g. measurements from some hours before
  - real values – for manual checking
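
A sketch of this transformation in R, assuming hourly POSIXct timestamps (function, column names, and lag choices are illustrative):

```r
# A sketch of turning a time series into attributional form: periodic
# daily features (sin/cos of the hour) plus lagged historical attributes.
make_attributes <- function(values, timestamps, lags = c(1, 2, 24)) {
  hour <- as.numeric(format(timestamps, "%H")) +
          as.numeric(format(timestamps, "%M")) / 60
  d <- data.frame(
    value   = values,
    day_sin = sin(2 * pi * hour / 24),  # temporal attribute: daily period
    day_cos = cos(2 * pi * hour / 24)
  )
  for (l in lags) {                     # historical attributes: values l steps back
    d[[paste0("lag_", l)]] <- c(rep(NA, l), head(values, -l))
  }
  d
}

# Example: hourly measurements over two weeks
ts <- seq(as.POSIXct("2014-05-01"), by = "hour", length.out = 336)
head(make_attributes(rnorm(336), ts))
```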

Incremental models do not require all historical data to be available and can be built incrementally. Incremental learning slides a window through the data and updates the model with new examples as they arrive.

### Algorithm: Very Fast Decision Tree (VFDT)

- only discrete attributes
- tree is not binary
- split a leaf node only if there is sufficient statistical evidence

Growing the tree:

- update statistics in the leaf
- evaluate the heuristic \(H()\) on all attributes and compute the difference between the two best attributes: \[ \Delta H = H(a_1) - H(a_2) \]
- split if \(\Delta H > \epsilon\), where \(\epsilon\) is the Hoeffding bound (Hoeffding, 1963), \(R\) is the range of \(H\), and \(n\) is the number of examples seen in the leaf: \[ \epsilon = \sqrt{\frac{R^2 \ln(2 / \delta)}{2n}} \]
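
A minimal R sketch of the split decision (function names and parameter values are illustrative; for information gain, \(R = \log_2\) of the number of classes):

```r
# A minimal sketch of the VFDT split test: split a leaf when the gain
# difference between the two best attributes exceeds the Hoeffding bound.
hoeffding_bound <- function(R, delta, n) {
  sqrt(R^2 * log(2 / delta) / (2 * n))
}

should_split <- function(H_best, H_second, R, delta, n) {
  (H_best - H_second) > hoeffding_bound(R, delta, n)  # sufficient evidence?
}

# Example: information gain on a two-class problem (R = log2(2) = 1)
should_split(H_best = 0.52, H_second = 0.45, R = 1, delta = 1e-4, n = 2000)
```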

## Concept drift detection

The distribution of examples or the concept being modeled can change over time, e.g. as a consequence of a hidden variable or process. Care must be taken to differentiate genuine drift from temporary noise.

Categorizations based on:

- *data management*:
  - full memory (increase the weights of recent examples)
  - partial memory (fixed or adaptive window)
- *detection methods*:
  - monitor performance indicators
  - compare time-windows
- *adaptation capability*:
  - blind methods (restart at a regular interval)
  - informed methods

Tests:

- Page-Hinkley test (PH test): accumulate the deviation of each observation from the running mean and raise an alarm when the accumulated deviation grows larger than allowed (see the R sketch after this list)
- Statistical process control (SPC): compute the probability of error \(p_i\) and its standard deviation \(s_i\); the system switches between states according to bounds (“in control”, “warning”, “out of control”)
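
A minimal R sketch of the Page-Hinkley test for detecting an increase in the mean (function name and parameter values are illustrative; \(\delta\) tolerates small fluctuations, \(\lambda\) sets the alarm level):

```r
# A minimal sketch of the Page-Hinkley test: track the cumulative
# deviation from the running mean and alarm when it exceeds its
# running minimum by more than lambda.
page_hinkley <- function(x, delta = 0.005, lambda = 50) {
  mean_t <- 0; m_t <- 0; M_t <- Inf
  alarms <- integer(0)
  for (t in seq_along(x)) {
    mean_t <- mean_t + (x[t] - mean_t) / t          # running mean
    m_t <- m_t + x[t] - mean_t - delta              # cumulative deviation
    M_t <- min(M_t, m_t)                            # smallest value so far
    if (m_t - M_t > lambda) alarms <- c(alarms, t)  # change detected
  }
  alarms
}

# Example: the mean of the stream jumps after 500 observations
set.seed(1)
head(page_hinkley(c(rnorm(500, 0), rnorm(500, 2))))
```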

## Clustering from data streams

**Date**: 2014-05-28

- online clustering
- evaluating incremental algorithms

In **online clustering** the task is to maintain a consistently good clustering of the sequence observed so far, using small amounts of memory and time (grouping objects into clusters; unsupervised learning).

Approaches:

- *hierarchical clustering*: build micro-clusters described by cluster features (CF), maintain statistics for each one (without storing examples), later combine them into macro-clusters
- *partitioning clustering* (k-means)
- *density-based clustering*
- *grid-based clustering*
- *model-based clustering*

### Algorithm: BIRCH

A hierarchical clustering approach: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang et al., 1996).

- build a tree
- every node holds at most \(B\) cluster features (CF entries) and pointers to descendant subtrees
- each node summarizes the statistics of its descendant nodes
- parameters: \(B\) – branching factor, \(T\) – maximum absorption distance
- when a new example arrives, the closest CF absorbs it if the example lies within radius \(T\); otherwise a new CF entry is added, and if there is no room for it, the parent node is split (see the CF sketch after this list)
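
A minimal R sketch of a cluster feature as the triple \((N, LS, SS)\) – the number of points, their linear sum, and their sum of squares – which is enough to compute a centroid and radius and to absorb new points in constant time (function names are illustrative):

```r
# A minimal sketch of a BIRCH cluster feature: the triple (N, LS, SS)
# summarizes a set of points without storing them and is additive.
cf_new <- function(x) list(N = 1, LS = x, SS = x^2)

cf_add <- function(cf, x) {  # absorb a new example in constant time
  list(N = cf$N + 1, LS = cf$LS + x, SS = cf$SS + x^2)
}

cf_centroid <- function(cf) cf$LS / cf$N

cf_radius <- function(cf) {  # root mean squared distance to the centroid
  sqrt(max(sum(cf$SS / cf$N - cf_centroid(cf)^2), 0))
}

# Example: absorb a point only if the resulting radius stays below T
T_abs <- 0.5
cf <- cf_add(cf_new(c(1, 2)), c(1.2, 1.8))
cf_radius(cf_add(cf, c(1.1, 2.1))) < T_abs
```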

An extension is CluStream (Aggarwal et al., 2003), which maintains only \(q\) micro-clusters and discards the oldest ones.

## Evaluation of streaming algorithms

Performance measures: MSE, classification accuracy (CA), …

The approach in **batch learning** is to split the data into a training set (itself divided into training and validation sets) and a test set.

Issues in streaming data:

- dataset not fixed
- concept drifts
- decision models change over time

Approaches influenced by order of examples:

- *holdout and independent test set*
- *predictive sequential (prequential) evaluation*

The solution is to use a *prequential estimate with a forgetting mechanism* (using either a sliding time window or fading factors), as sketched below.
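
A minimal R sketch of the prequential error with fading factors (function name is illustrative; with \(\alpha = 1\) it reduces to the plain prequential estimate):

```r
# A minimal sketch of the prequential error with fading factor alpha:
# each example is tested before it is used for training, and older losses
# are exponentially down-weighted so the estimate can follow drift.
prequential_error <- function(losses, alpha = 0.99) {
  S <- 0; B <- 0
  err <- numeric(length(losses))
  for (i in seq_along(losses)) {
    S <- losses[i] + alpha * S  # faded sum of losses
    B <- 1 + alpha * B          # faded example count
    err[i] <- S / B
  }
  err
}

# Example: 0/1 losses of a classifier that improves over time
set.seed(7)
losses <- rbinom(1000, 1, prob = seq(0.5, 0.1, length.out = 1000))
tail(prequential_error(losses))
```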

The Q-estimate is used to compare two algorithms. Taking the logarithm of the ratio of accumulated losses gives a signed statistic: negative values indicate that the first algorithm (with the lower accumulated loss) is better than the second.

\[ Q_i(A,B) = \log\left(\frac{S^A_i}{S^B_i}\right) \]
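
A short R sketch computing \(Q_i\) over a stream from the per-example losses of two algorithms (function and argument names are illustrative):

```r
# Q-estimate from per-example losses: negative values indicate that
# algorithm A has accumulated less loss than algorithm B so far.
q_estimate <- function(loss_A, loss_B) {
  log(cumsum(loss_A) / cumsum(loss_B))
}

# Example: A is better on average, so Q_i drifts below zero
set.seed(3)
tail(q_estimate(runif(200, 0, 0.5), runif(200, 0.2, 0.7)))
```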