For Warwick University module 910, Data analysis.
2. math background
- why frequency/rank for heavy tailed?
The frequency/rank plot works because the CCDF is a monotonic function. Moreover, unlike the PDF, the CCDF shows significantly lower variability in the tail.
- independence: events and random variables
- HH and TT
- not independent because the mere occurrence of one would tell us for sure that the other can’t happen.
- Conditional Probability
- Heavy-Tailed Distributions
roughly: non-negligible probability at values far out in the right tail.
- CDF+CCDF=1
- Real Histograms
- a log-log plot is also a great trick for visualizing data spanning a wide range of values.
- Pareto Distribution
- Good (Enough) Method: frequency/rank plot. The sharp drop in variability in the tail comes from the fact that the CCDF is a monotonic function (see the sketch below).
- the Q-Q curve is non-decreasing
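A minimal sketch of the frequency/rank (empirical CCDF) plot on log-log axes; the Pareto sample here is synthetic, purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic heavy-tailed sample (Pareto), just for illustration.
rng = np.random.default_rng(0)
x = rng.pareto(a=1.5, size=10_000) + 1.0

# Frequency/rank (empirical CCDF): sort descending; rank i means a
# fraction i/n of the sample is >= that value.
x_sorted = np.sort(x)[::-1]
rank = np.arange(1, len(x_sorted) + 1)
ccdf = rank / len(x_sorted)

plt.loglog(x_sorted, ccdf, marker='.', linestyle='none')
plt.xlabel('value')
plt.ylabel('CCDF (fraction of points >= value)')
plt.title('Frequency/rank plot on log-log axes')
plt.show()
```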
3. statistics
- quantile-quantile plot
- Allows comparison of attributes of different cardinality
- Plot points corresponding to f-quantile of each attribute
- If close to line y=x, evidence for same distribution
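A small quantile-quantile sketch on two hypothetical samples of different sizes:

```python
import numpy as np
import matplotlib.pyplot as plt

# Two attributes with different numbers of records (made-up samples).
a = np.random.default_rng(0).normal(10, 2, size=300)
b = np.random.default_rng(1).normal(10, 2, size=120)

# Plot the f-quantile of one attribute against the f-quantile of the other.
f = np.linspace(0.01, 0.99, 50)
plt.plot(np.quantile(a, f), np.quantile(b, f), 'o')
plt.axline((10, 10), slope=1)   # points near the line y = x suggest the same distribution
plt.xlabel('quantiles of a')
plt.ylabel('quantiles of b')
plt.show()
```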
- PMCC $= \frac{E(XY)-E(X)E(Y)}{\sigma_X \sigma_Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y}$
- PMCC lies between -1 and 1
- measures linear dependence in terms of simple quantities (see the sketch below)
- compare |PMCC| against a table of critical values: above the table value suggests correlation; below suggests no correlation.
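A small sketch of PMCC computed directly from the formula above, on made-up x/y values; numpy's built-in `corrcoef` is only a cross-check:

```python
import numpy as np

def pmcc(x, y):
    """Product-Moment Correlation Coefficient: cov(X,Y) / (sigma_X * sigma_Y)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean(x * y) - np.mean(x) * np.mean(y)   # E[XY] - E[X]E[Y]
    return cov / (np.std(x) * np.std(y))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + np.array([0.1, -0.2, 0.0, 0.3, -0.1])   # roughly linear in x
print(pmcc(x, y))                 # close to 1
print(np.corrcoef(x, y)[0, 1])    # numpy's built-in agrees
```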
3.1 handling problems in data
- numeric data, ranges vastly different
- range scaling, to [0, 1]
- statistical scaling, $(x-\mu)/\sigma$ (see the sketch below)
- ordered data: replace each point with its position in the ordering
- mixed data: encode each dimension into the range 0,1. use Lp distance
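A quick sketch of both scalings on two hypothetical attributes with very different ranges:

```python
import numpy as np

def range_scale(x):
    """Range scaling: map values into [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def statistical_scale(x):
    """Statistical (z-score) scaling: (x - mean) / std."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

heights_cm = np.array([150, 160, 170, 180, 190])
incomes = np.array([20_000, 35_000, 50_000, 80_000, 1_000_000])
print(range_scale(heights_cm))      # both attributes become comparable
print(statistical_scale(incomes))
```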
- handle missing values
- drop the whole record
- fill in missing values manually
- treat as a special "unknown" value, and ensure downstream methods can handle it
- no ideal solution
- noise: hard to tell apart from the signal; use statistical tests
- Outliers: extreme values
- finding outliers in numeric data
- sanity check: mean, max, std dev
- visual: plot the data; outliers sit far from the rest
- rule based: set limits on allowed values
- data based: declare an outlier if more than (say) 6 standard deviations from the mean (sketch below)
- categoric data
- visual: look at frequency statistics; low-frequency values are candidate outliers
dealing with outliers: delete, clip (change to the max/min allowed value), or treat as missing
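A rough sketch of the data-based rule and of clipping, on made-up values; the threshold `k` (number of standard deviations) is an illustrative choice:

```python
import numpy as np

def flag_outliers(x, k=6.0):
    """Data-based rule: flag points more than k standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = np.abs(x - x.mean()) / x.std()
    return z > k

def clip_outliers(x, lo, hi):
    """Rule-based handling: clip values to the allowed limits [lo, hi]."""
    return np.clip(np.asarray(x, dtype=float), lo, hi)

rng = np.random.default_rng(1)
x = np.append(rng.normal(loc=50, scale=2, size=1000), 500.0)  # one extreme value
print(np.flatnonzero(flag_outliers(x)))   # index of the extreme point
print(clip_outliers(x, 40, 60)[-1])       # 60.0: clipped to the upper limit
```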
- discretization: features have too many values
- binning, place data into bands
- use existing hierarchies in data
- random sampling; stratified sampling: take samples from each subset (stratum)
- feature selection
- principal components analysis (see the sketch below)
- greedy attribute selection
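A minimal PCA sketch via the SVD of the mean-centred data matrix (one common way to implement it), on synthetic data:

```python
import numpy as np

def pca(X, n_components):
    """Principal components analysis via SVD of the mean-centred data matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # centre each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]           # directions of largest variance
    explained = (S ** 2) / (len(X) - 1)      # variance along each component
    return Xc @ components.T, components, explained[:n_components]

rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200), rng.normal(size=200)])
scores, comps, var = pca(X, n_components=2)
print(scores.shape, var)   # (200, 2); most of the variance is on the first component
```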
4. regression
- Assuming unbiased independent noise with bounded variance
- PMCC (Product-Moment Correlation Coefficient) determines the quality of the linear regression between X and Y
- $R^2 = \mathrm{Cov}^2(X,Y)/(\mathrm{Var}(X)\,\mathrm{Var}(Y)) = \mathrm{PMCC}^2$
- the coefficient of determination: measures how much of the variance in Y the model explains
- near 1: good fit; near 0: weak fit
- Two perfectly correlated attributes can’t be used in regression
- Violates requirement of no linearly dependent columns
- Regularization can fix this: ensures matrix is non-singular
dealing with categoric attributes
- numerically encode categoric explanatory variables
- simple case: 0/1, include this new variable in the regression
- general categoric attributes, create a binary variable for each possibility
Regularization in regression: to handle the danger of overfitting the model to the data, include a penalty on the parameters in the optimization (sketch below)
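A sketch of ridge-style regularized least squares with a one-hot encoded categoric attribute; the data and the penalty weight `lam` are made up for illustration. Note that the intercept column plus a full set of one-hot columns is linearly dependent, which is exactly what the regularization term handles:

```python
import numpy as np

def ridge_regression(X, y, lam=1.0):
    """Least squares with an L2 penalty on the parameters.
    The lam * I term makes X^T X + lam*I non-singular even with
    linearly dependent (e.g. perfectly correlated) columns."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Hypothetical data: one numeric attribute plus a categoric attribute
# {blue, green, red} encoded as three binary (one-hot) columns.
numeric = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
colour = ["red", "green", "blue", "red", "green", "blue"]
categories = sorted(set(colour))
one_hot = np.array([[1.0 if c == cat else 0.0 for cat in categories] for c in colour])

X = np.column_stack([np.ones(6), numeric, one_hot])   # intercept + features
y = 3 * numeric + np.array([0.0, 1.0, -1.0, 0.0, 1.0, -1.0])
print(ridge_regression(X, y, lam=0.1))
```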
6. classification
measures
- precision: TP/(TP+FP)
- recall: TP/(TP+FN)
- F1 = 2·precision·recall/(precision+recall)
- ROC curve: recall (true positive rate) on the y-axis, false positive rate FP/(FP+TN) on the x-axis
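The measures above as a small helper, evaluated on a hypothetical confusion matrix:

```python
def precision_recall_f1(tp, fp, fn, tn):
    """Classification measures from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate (y-axis of ROC)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)               # false positive rate (x-axis of ROC)
    return precision, recall, f1, fpr

# Hypothetical confusion matrix: 40 TP, 10 FP, 20 FN, 30 TN.
print(precision_recall_f1(tp=40, fp=10, fn=20, tn=30))
# approximately (0.8, 0.67, 0.73, 0.25)
```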
k-fold cross validation
- Divide data into k equal pieces (folds)
- use k-1 folds for training, evaluate accuracy on the remaining fold; repeat for each fold
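A minimal k-fold cross validation sketch; the `majority_class` "classifier" is only a placeholder so the example runs end to end:

```python
import numpy as np

def k_fold_accuracy(X, y, train_and_predict, k=5, seed=0):
    """k-fold cross validation: split into k folds, train on k-1, test on the rest."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        preds = train_and_predict(X[train], y[train], X[test])
        accuracies.append(np.mean(preds == y[test]))
    return np.mean(accuracies)

# Placeholder classifier: predict the majority class of the training fold.
def majority_class(X_train, y_train, X_test):
    values, counts = np.unique(y_train, return_counts=True)
    return np.full(len(X_test), values[np.argmax(counts)])

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 12 + [1] * 8)
print(k_fold_accuracy(X, y, majority_class, k=5))
```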
Entropy $H = -\sum_i p_i \log p_i$
- information gain $= H - H_{\mathrm{split}}$
- gain ratio $= (H - H_{\mathrm{split}})/H_{\mathrm{split}}$
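A small sketch of entropy and information gain for a hypothetical two-way split:

```python
import numpy as np

def entropy(labels):
    """H = -sum_i p_i log2 p_i over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, groups):
    """Gain = H(parent) - weighted average entropy of the split groups."""
    n = len(labels)
    h_split = sum(len(g) / n * entropy(labels[g]) for g in groups)
    return entropy(labels) - h_split

y = np.array(['yes', 'yes', 'no', 'no', 'yes', 'no'])
# Hypothetical split of the six records into two groups by some attribute.
groups = [np.array([0, 1, 4]), np.array([2, 3, 5])]
print(entropy(y), information_gain(y, groups))   # 1.0 and 1.0: a perfect split
```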
kappa
- higher is better
- compares accuracy to random guessing
- Po as the observed agreement between two labelers
- Pe is the agreement if both are labeling at random
- $K = (P_o - P_e)/(1 - P_e)$
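A sketch of kappa computed from the formula above, on made-up labels from two labelers:

```python
import numpy as np

def cohens_kappa(labels_a, labels_b):
    """kappa = (Po - Pe) / (1 - Pe): agreement corrected for chance."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    po = np.mean(a == b)                      # observed agreement
    classes = np.union1d(a, b)
    pe = sum(np.mean(a == c) * np.mean(b == c) for c in classes)  # chance agreement
    return (po - pe) / (1 - pe)

truth     = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
predicted = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 0])
print(cohens_kappa(truth, predicted))   # 0.6: better than chance, below perfect
```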
pruning
- prepruning: add additional stopping conditions while building the tree
- Postpruning: build full tree, then decide which branches to prune
SVM
- assume data is linearly separable
- seeks the plane that maximizes the margin between the classes
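A sketch using scikit-learn's `SVC` (assuming scikit-learn is available), with a linear kernel and a large C to approximate the hard-margin case, on a tiny linearly separable toy set:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set: class 0 near the origin, class 1 shifted.
X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
              [3.0, 3.0], [4.0, 3.5], [3.5, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Linear kernel with a large C approximates the hard-margin SVM:
# it seeks the separating plane that maximizes the margin.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.coef_, clf.intercept_)       # the separating plane w.x + b = 0
print(clf.support_vectors_)            # the points that define the margin
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))   # [0 1]
```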
7. clustering
- k-center, furthest-point (greedy) algorithm: minimize the max distance between pairs in the same cluster (see the sketch after the proof)
- 2-approximation
- assume, for contradiction, that the furthest remaining point is at distance > 2·OPT from all chosen centres
- then the distances between all chosen centres are also > 2·OPT (greedy always picks the currently furthest point)
- so we have k+1 points with distance > 2·OPT between every pair
- but in the optimal solution every point is within distance OPT of some optimal centre
- so by pigeonhole two of these k+1 points share an optimal centre, and by the triangle inequality their distance is at most 2·OPT
- contradiction!
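A sketch of the furthest-point greedy algorithm on synthetic blobs:

```python
import numpy as np

def k_center_greedy(X, k):
    """Furthest-point (greedy) algorithm for k-center: repeatedly add as a new
    centre the point that is currently furthest from all chosen centres.
    This is a 2-approximation to the optimal k-center cost."""
    X = np.asarray(X, dtype=float)
    centers = [0]                                   # start from an arbitrary point
    dist = np.linalg.norm(X - X[0], axis=1)         # distance to closest centre so far
    for _ in range(k - 1):
        new = int(np.argmax(dist))                  # currently furthest point
        centers.append(new)
        dist = np.minimum(dist, np.linalg.norm(X - X[new], axis=1))
    return centers, dist.max()                      # centres and the k-center cost

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in [(0, 0), (5, 0), (0, 5)]])
centers, cost = k_center_greedy(X, k=3)
print(centers, round(cost, 2))   # one centre lands in each of the three blobs
```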
- k-medians minimize the average distance from each point to its closest centre
- k-means (Lloyd's algorithm): minimize the sum of squared distances to the closest centre
- assign each point to its closest centre
- compute the new centroid of each cluster; repeat until no change
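A minimal sketch of Lloyd's algorithm on synthetic data:

```python
import numpy as np

def lloyd_kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: assign points to the closest centre, recompute centroids,
    repeat until the assignment stops changing."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centres
    labels = None
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)                    # assignment step
        if labels is not None and np.array_equal(new_labels, labels):
            break                                            # no change: converged
        labels = new_labels
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # update step
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in [(0, 0), (6, 6)]])
labels, centers = lloyd_kmeans(X, k=2)
print(centers)   # centroids close to (0, 0) and (6, 6)
```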
- Hierarchical agglomerative HAC
- merge closest pair of clusters
- single, complete, average link
- single: d(C1,C2) = min d(c1 in C1, c2 in C2)
- complete: d(C1,C2) = max d(c1 in C1, c2 in C2); compute the max-distance pair between each pair of clusters, then merge the clusters with the smallest such distance
- average: mean pairwise distance between points in C1 and C2 (see the sketch below)
- Dc-Tc >= Db - Tb
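A sketch using SciPy's hierarchical clustering (assuming SciPy is available) to compare the three linkage rules on synthetic blobs:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in [(0, 0), (4, 4)]])

# Agglomerative clustering: repeatedly merge the closest pair of clusters.
# 'single', 'complete' and 'average' choose how inter-cluster distance is defined.
for method in ["single", "complete", "average"]:
    Z = linkage(X, method=method)                     # the full merge history (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 clusters
    print(method, np.bincount(labels)[1:])            # 20 points in each cluster
```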
- DBSCAN based on local density of data points
- Epsilon, the radius to search for neighbours
- MinPts, the minimum number of neighbours within Epsilon for a point to count as a core point
- density reachable (.->.): q can be reached from p through a chain of points, each within Epsilon of a core point
- density connected (.<-.->.): p and q are both density reachable from some common point
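A sketch using scikit-learn's `DBSCAN` (assuming scikit-learn is available) on two dense synthetic regions plus sparse background points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense_a = rng.normal((0, 0), 0.2, size=(40, 2))      # two dense regions
dense_b = rng.normal((5, 5), 0.2, size=(40, 2))
noise = rng.uniform(-2, 7, size=(10, 2))             # sparse background points
X = np.vstack([dense_a, dense_b, noise])

# eps is the neighbourhood radius, min_samples the minimum number of
# neighbours needed for a core point.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(np.unique(labels))   # typically cluster ids 0 and 1, plus -1 for noise points
```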
8. recommendation
- neighbourhood-based collaborative filtering
- item-based collaborative filtering
- matrix factorization - singular value decomposition
- solving the optimization
- gradient descent
- least squares
- adding biases
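A rough sketch of matrix factorization fitted by gradient descent with an L2 penalty, on a hypothetical user-item rating matrix; the factor dimension `k`, learning rate and penalty are illustrative choices:

```python
import numpy as np

def factorize(ratings, k=2, steps=2000, lr=0.01, reg=0.1, seed=0):
    """Matrix factorization by gradient descent: learn user factors P and item
    factors Q so that P @ Q.T approximates the observed ratings, with an L2
    penalty (reg) on the factors to limit overfitting."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    P = 0.1 * rng.normal(size=(n_users, k))
    Q = 0.1 * rng.normal(size=(n_items, k))
    observed = [(u, i) for u in range(n_users) for i in range(n_items)
                if not np.isnan(ratings[u, i])]
    for _ in range(steps):
        for u, i in observed:
            err = ratings[u, i] - P[u] @ Q[i]          # prediction error
            P[u] += lr * (err * Q[i] - reg * P[u])     # gradient steps
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q

# Hypothetical user-item rating matrix; NaN marks missing ratings to predict.
R = np.array([[5, 4, np.nan, 1],
              [4, np.nan, 1, 1],
              [1, 1, 5, np.nan],
              [np.nan, 1, 4, 5]], dtype=float)
P, Q = factorize(R)
print(np.round(P @ Q.T, 1))   # reconstructed matrix fills in the missing entries
```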
9. social network analysis
- concept
- in-degree, out-degree
- distance
- diameter
- link prediction on graphs
- simplest: common neighbours will link
- weighting common neighbours
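A small sketch of common-neighbours link prediction on a hypothetical adjacency matrix:

```python
import numpy as np

def common_neighbours_scores(adj):
    """Score each non-adjacent pair by its number of common neighbours:
    the pairs with the highest scores are predicted to link next."""
    adj = np.asarray(adj)
    common = adj @ adj                    # (i, j) entry = number of common neighbours
    scores = []
    n = len(adj)
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i, j] == 0:            # only score pairs not already linked
                scores.append(((i, j), int(common[i, j])))
    return sorted(scores, key=lambda s: -s[1])

# Hypothetical small friendship graph as an adjacency matrix.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]])
print(common_neighbours_scores(A))   # pairs (0, 3) and (2, 3) each share one friend (node 1)
```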
- finding important nodes
- graph centrality
- eccentricity
- betweenness centrality
- PageRank
- eigenvector formulation
- r=Mr
- power iteration method (see the sketch at the end of this section)
- random walk interpretation
- classification in social networks
- concept
clustering coefficient
- measures the fraction of triangles
- (number of triangles through a node)/(number of pairs of neighbours of that node)
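A sketch of the local clustering coefficient from the definition above, on a hypothetical graph:

```python
import numpy as np

def local_clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbours that are themselves linked,
    i.e. triangles through v / pairs of neighbours of v."""
    adj = np.asarray(adj)
    neighbours = np.flatnonzero(adj[v])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = adj[np.ix_(neighbours, neighbours)].sum() / 2   # edges among neighbours
    return links / (k * (k - 1) / 2)

# Hypothetical graph: a triangle 0-1-2 with an extra edge 2-3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
print(local_clustering_coefficient(A, 2))   # 1 triangle / 3 pairs = 0.333...
```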
PageRank
example graph: edges (1,2) and (1,3)
- key shortcoming of simplified version?
The problem is the periodic behavior, meaning that PageRank cannot identify the most important web-page. One solution is to create a self-loop (e.g., (1, 1)), in which case the iterative procedure from (a) would be guaranteed to converge.
- main objective? how achieved?
PageRank computes the importance of web-pages. This is achieved by leveraging the transition matrix of the random walk on the link graph: the PageRank vector is its leading eigenvector (r = Mr), found by power iteration.
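A sketch of PageRank by power iteration on the toy graph above, assuming the edges (1,2) and (1,3) are undirected; under that assumption the walk is periodic and the iterates oscillate, while adding the self-loop (1,1) makes the iteration converge:

```python
import numpy as np

def transition_matrix(edges, n):
    """Column-stochastic transition matrix of the random walk on an undirected graph."""
    M = np.zeros((n, n))
    for a, b in edges:
        M[b - 1, a - 1] = 1.0   # the walk can move a -> b ...
        M[a - 1, b - 1] = 1.0   # ... and b -> a (undirected edge)
    return M / M.sum(axis=0)    # normalise each column to sum to 1

def power_iteration(M, steps=50):
    """PageRank by power iteration: repeatedly apply r <- M r."""
    r = np.full(len(M), 1.0 / len(M))
    for _ in range(steps):
        r = M @ r
    return r

# Star graph with edges (1,2) and (1,3): the walk is periodic, so successive
# iterates oscillate between two vectors instead of converging.
M = transition_matrix([(1, 2), (1, 3)], n=3)
print(power_iteration(M, steps=50), power_iteration(M, steps=51))

# Adding a self-loop (1,1) breaks the periodicity and the iteration converges.
M_fixed = transition_matrix([(1, 1), (1, 2), (1, 3)], n=3)
print(power_iteration(M_fixed, steps=50), power_iteration(M_fixed, steps=51))
```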