Data analysis module outline

March 29, 2019

For Warwick University module 910, Data analysis.

2. math back ground

why frequency/rank for heavy tailed?

The frequency/rank plot because the CCDF is a monotonous function. Moreover, unlike the PDF function, the CCDF displays a significantly lower variability in the tail.

independence: event and random variables
- HH and TT
  - not independent because the mere occurrence of one would tell us for sure that the other can’t happen.
Conditional Probability
Heavy-Tailed Distributions roughly non-negligible probabilities at large values far to the right.
- CDF+CCDF=1
- Real Histograms
- log-log plot is also a great trick to visualize data spanning a wide range of data.
- Pareto Distribution
- Good (Enough) Method: Frequency/Rank Plot the sharp drop in variance in the tail is the fact hat the CCDF is a monotonous function
- the Q-Q curve is non-decreasing

3. statistics

quantile-quantile plot
- Allows comparison of attributes of different cardinality
- Plot points corresponding to f-quantile of each attribute
- If close to line y=x, evidence for same distribution
PMCC = (E(XY)-EXEY)/($\sigma x \sigma y$) = cov(xy)/($\sigma x \sigma y$)
- pmcc between -1,1
- Measures linear dependence in terms of simple quantities

1-w2{10} 20190522113900.png-w10 > table, correlated. < table, no correlation.

3.1 handle problems

numeric data, ranges vastly different

range scaling, to 0,1.
statistical scaling, (x-u)/a

ordered data: replace each point with its position in the ordering
mixed data: encode each dimension into the range 0,1. use Lp distance
handle missing values

drop the whole record
fill in missing values manually
as unknown, ensure can handle
no ideal solution

Noise, hard to tell. statistical tests
Outliers: extreme values

finding in numerica data
- sanity check, mean ,max, std dev
- visual, plot , far from rest
- rule based, set limits
- data based, declare if > 6
categoric data
- visual, look at frequency statistics, low frequency

dealing outliers: delete, clip(change to max/min), treat as missing

discretization: features have too many values

binning, place data into bands
use existing hierarchies in data

random sampleing, stratifies, take samples from subsets
feature selection

principal components analysis
greedy attribute selection

4.regression

regression 4
- Assuming unbiased independent noise with bounded variance
- PMCC , Product-Moment Correlation Coefficient determines the quality of the linear regression between X and Y
- R2 = Cov2(X,Y) /(Var(X)Var(Y)) = PMCC^2
  - measure the coefficient of determination, how much the model explains the variance in Y
  - 1, good fit, 0 ,weak fit to the model
- Two perfectly correlated attributes can’t be used in regression
  - Violates requirement of no linearly dependent columns
  - Regularization can fix this: ensures matrix is non-singular

dealing with categoric attributes
- numerically encode categoric explanatory variables
- simple case: 0/1, include this new variable in the regression
- general categoric attributes, create a binary variable for each possibility
Regularization in regression: To handle the danger of overfitting the model of the data, include the parameters in the optimization

6. classification

measures
- precision : tp/(tp+fp)
- recall: tp/(tp+fn)
- F1 2precision*recall/(precision+recall)
- ROC: recall-y, fp-x(fp/(fp+tn))
k-fold cross validation
- Divide data into k equal pieces (folds)
- use k-1 for training, evaluate accuracy on 1 fold, repeat
Entropy H= -plog(p),
- information gain=H-Hsplit
- gain ratio = (Hx-Hsplit)/Hsplit
kappa
- higher is better
- compares accuracy to random guessing
- Po as the observed agreement between two labelers
- Pe is the agreement if both are labeling at random
- K = (Po-Pe)/(1-Pe)
pruning
- prepruning:add additional stopping conditions when building
- Postpruning: build full tree, then decide which branches to prune
SVM
- assume data is linearly separable
- seeks the plane that maximizes the margin between the classes

7. clustering

k-center Furthest point for minimize the max distance between pairs in same cluster
- 2-approximation
  - asuume the furthest point to all center>2opt
  - then the distances between all centers are also>2opt
  - we have k+1 points with distances>2opt between every pair
  - since each point has a center with distance < opt
  - there exists a pair of points with same center, the distance is at most 2opt
  - contradiction!
k-medians minimize the average distance from each point to its closest centre
k means Lloyd minimize the sum of squared distance
- assign each point to its cloest center
- compute new centroid of each cluster, until no change
Hierarchical agglomerative HAC
- merge closest pair of clusters
- single, complete, average link
  - single: d(C1,C2) = min d(c1 in C1, c2 in C2)
  - complete: d(C1,C2) = max d(c1 in C1, c2 in C2) max nodes between each cluster, then find the min
  - avg
- Dc-Tc >= Db - Tb
DBSCAN based on local density of data points
- Epsilon, the radius to search for neighbours
- MinPts, min start points to make one cluster
- density reachable .->.
- density connected .<-.->.

8. recommendation

recommendation 8
- neighbourhood-based collaborative filtering
- item-based collaborative filtering
- matrix factoraization-singular value decomposition
  - solving the optimization
    - gradient descent
    - least squares
  - adding biases

social networks 9
- concept
  - in-degree, out degree.
  - distance
  - diameter
- link prediction on graphs
  - simplest: common neighbours will link
  - weighting common neighbours
- finding important nodes
  - graph centrality
    - eccentrality
    - betweenness centrality
- pageRank
  - eigenvector formulation
    - r=Mr
  - power iteration method
  - random walk interpretation
- classification in social networks
clustering coefficient
- measures the fraction of triangles
- (number of triangles)/(number of pairs of common neighbours)

page-rank

here (1,2)(1,3)

key shortcoming of simplified version?

The problem is the periodic behavior, meaning that PageRank cannot identify the most important web-page. One solution is to create a self-loop (e.g., (1, 1)), in which case the iterative procedure from (a) would be guaranteed to converge.

main objective? how achieved?

PageRank computes the importance of web-pages. This is achieved by leveraging the transition matrix.

2. math back ground

3. statistics

3.1 handle problems

4.regression

6. classification

7. clustering

8. recommendation

9. social network analysis

page-rank