machine-learning-mindmap: 2 Data Processing

Feature Selection (Daniel Martinez)

Correlation

Selected features should be uncorrelated with each other and highly correlated with the target variable we're trying to predict.

Covariance

A measure of how much two random variables change together. Math: dot(de_mean(x), de_mean(y)) / (n - 1)
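The formula above can be sketched directly in Python; `de_mean` and `dot` are the helper functions the pseudocode assumes:

```python
def de_mean(xs):
    # Shift values so their mean is zero
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]

def dot(xs, ys):
    # Sum of element-wise products
    return sum(x * y for x, y in zip(xs, ys))

def covariance(xs, ys):
    # Sample covariance: dot(de_mean(x), de_mean(y)) / (n - 1)
    n = len(xs)
    return dot(de_mean(xs), de_mean(ys)) / (n - 1)
```

With perfectly linearly related inputs such as x = [1, 2, 3, 4] and y = 2x, the covariance is positive (10/3 here); independent variables give a covariance near zero.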

Dimensionality Reduction

Principal Component Analysis (PCA)

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.

Plot the variance per feature and select the features with the largest variance.
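The procedure described above (center the data, diagonalize the covariance matrix, order components by variance) can be sketched with numpy; `pca` is an illustrative helper, not a library function:

```python
import numpy as np

def pca(X, k):
    # Center the data so each feature has zero mean
    Xc = X - X.mean(axis=0)
    # Covariance matrix of the features (features along columns)
    cov = np.cov(Xc, rowvar=False)
    # eigh handles symmetric matrices; eigenvalues come back ascending
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Reorder so the first component has the largest variance
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]
    explained_variance = eigvals[order[:k]]
    # Project the centered data onto the top-k components
    return Xc @ components, explained_variance
```

On strongly correlated data, the first principal component should account for nearly all of the variance.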

Singular Value Decomposition (SVD)

SVD is a factorization of a real or complex matrix. It is the generalization of the eigendecomposition of a positive semidefinite normal matrix (for example, a symmetric matrix with positive eigenvalues) to any m×n matrix via an extension of the polar decomposition. It has many useful applications in signal processing and statistics.
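A minimal numpy sketch of the factorization: the thin SVD of a matrix A yields U, the singular values s (in descending order), and Vt, whose product reconstructs A exactly:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the factors back together recovers A
A_reconstructed = U @ np.diag(s) @ Vt
```

Truncating s to its largest entries gives the best low-rank approximation of A, which is how SVD is used for dimensionality reduction.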

Importance

Filter Methods

Filter-type methods select features based only on general metrics, such as correlation with the variable to predict, and suppress the least interesting variables. The remaining variables become part of a classification or regression model used to classify or predict data. These methods are computationally efficient and robust to overfitting.

Correlation

Linear Discriminant Analysis

ANOVA: Analysis of Variance

Chi-Square

Wrapper Methods

Wrapper methods evaluate subsets of variables, which, unlike filter approaches, allows them to detect possible interactions between variables. Their two main disadvantages are: an increased risk of overfitting when the number of observations is insufficient, and significant computation time when the number of variables is large.

Forward Selection

Backward Elimination

Recursive Feature Elimination

Genetic Algorithms

Embedded Methods

Embedded methods try to combine the advantages of both previous methods. A learning algorithm takes advantage of its own variable selection process and performs feature selection and classification simultaneously.

Lasso regression performs L1 regularization, which adds a penalty equal to the absolute value of the magnitude of the coefficients.

Ridge regression performs L2 regularization, which adds a penalty equal to the square of the magnitude of the coefficients.
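The L1 penalty has no closed-form solution, but the L2 (ridge) penalty does, which makes it easy to sketch. A minimal numpy illustration (`ridge_fit` is a hypothetical helper, not a library API):

```python
import numpy as np

def ridge_fit(X, y, alpha):
    # Minimizes ||y - Xw||^2 + alpha * ||w||^2 via the closed form
    # w = (X^T X + alpha * I)^{-1} X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)
```

With alpha = 0 this reduces to ordinary least squares; increasing alpha shrinks the coefficient magnitudes toward zero, which is the embedded selection effect described above.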

Feature Encoding

Machine Learning algorithms perform Linear Algebra on Matrices, which means all features must be numeric. Encoding helps us do this.

Label Encoding

One Hot Encoding
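Both encodings can be sketched in a few lines of plain Python; the function names are illustrative, and real pipelines typically use a library encoder instead:

```python
def label_encode(values):
    # Assign each distinct category an integer (sorted for determinism)
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]

def one_hot_encode(values):
    # One column per category; exactly one 1 per row
    categories = sorted(set(values))
    index = {v: i for i, v in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]
```

Label encoding imposes an artificial ordering on the categories, so one-hot encoding is usually preferred for nominal features.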

Feature Normalisation or Scaling

Methods

Rescaling

The simplest method is rescaling the range of features so that it lies in [0, 1] or [−1, 1].

Standardization

Feature standardization transforms the values of each feature to have zero mean (by subtracting the mean) and unit variance (by dividing by the standard deviation).

Scaling to unit length

Scales the components of a feature vector so that the complete vector has length one.
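The three scaling methods above can be sketched in plain Python (illustrative helpers, assuming the feature is a list of numbers):

```python
import math

def rescale(xs, lo=0.0, hi=1.0):
    # Min-max rescaling into the range [lo, hi]
    mn, mx = min(xs), max(xs)
    return [lo + (x - mn) * (hi - lo) / (mx - mn) for x in xs]

def standardize(xs):
    # Zero mean, unit variance (population standard deviation)
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / std for x in xs]

def unit_length(xs):
    # Divide by the Euclidean norm so the vector has length one
    norm = math.sqrt(sum(x * x for x in xs))
    return [x / norm for x in xs]
```

Note that rescaling and standardization operate per feature across samples, while unit-length scaling operates per sample across features.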

Dataset Construction

Training Dataset

A set of examples used for learning

To fit the parameters of the classifier. In the Multilayer Perceptron, for instance, we would use the training set to find the “optimal” weights when using back-propagation.

Test Dataset

A set of examples used only to assess the performance of a fully-trained classifier

In the Multilayer Perceptron case, we would use the test set to estimate the error rate after we have chosen the final model (MLP size and actual weights). After assessing the final model on the test set, YOU MUST NOT tune the model any further.

Validation Dataset

A set of examples used to tune the parameters of a classifier

In the Multilayer Perceptron case, we would use the validation set to find the “optimal” number of hidden units or determine a stopping point for the back-propagation algorithm.

Cross Validation

One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
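The partitioning step can be sketched in plain Python; `k_fold_indices` is an illustrative helper that returns (train, validation) index pairs, one per round:

```python
def k_fold_indices(n, k):
    # Partition indices 0..n-1 into k contiguous folds; any remainder
    # is spread one extra element per fold from the front.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    # Each fold serves once as the validation set; the rest is training
    splits = []
    for i in range(k):
        val = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        splits.append((train, val))
    return splits
```

In practice the indices are shuffled before partitioning; the model is fit on each training split, scored on the matching validation split, and the k scores are averaged.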

Data Types

Nominal - is for mutually exclusive, but not ordered, categories.

Ordinal - is one where the order matters but not the difference between values.

Interval - is a measurement where the difference between two values is meaningful.

Ratio - has all the properties of an interval variable, and also has a clear definition of 0.0.

Data Exploration

Variable Identification

Identify Predictor (Input) and Target (output) variables.

Next, identify the data type and category of the variables.

Univariate Analysis

Continuous Features

Mean, Median, Mode, Min, Max, Range, Quartile, IQR, Variance, Standard Deviation, Skewness, Histogram, Box Plot

Categorical Features

Frequency, Histogram

Bi-variate Analysis

Finds out the relationship between two variables.

Scatter Plot

Correlation Plot - Heatmap

Two-way table

Stacked Column Chart

Chi-Square Test

Z-Test/ T-Test

ANOVA

Feature Cleaning

Missing values

Special values

Outliers

Obvious inconsistencies

Feature Imputation

Hot-Deck

After sorting the dataset, the technique finds the first missing value and uses the cell value immediately prior to the missing data to impute it.

Cold-Deck

Selects donors from another dataset to complete missing data.

Mean-substitution

Another imputation technique involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable.
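A minimal sketch of mean substitution in plain Python, assuming missing values are represented as `None`:

```python
def mean_substitute(values):
    # Replace missing entries (None) with the mean of the observed values
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]
```

The sample mean is preserved, but the variance is understated because every imputed value sits exactly at the mean.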

Regression

A regression model is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where that variable is missing

Some Libraries...

Feature Engineering

Decompose

Converting 2014-09-20T20:45:40Z into categorical attributes like hour_of_the_day, part_of_day, etc.
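A sketch of that decomposition using the standard library; the derived feature names and the six-hour `part_of_day` buckets are illustrative choices:

```python
from datetime import datetime

def decompose_timestamp(ts):
    # Parse an ISO-8601 UTC timestamp like "2014-09-20T20:45:40Z"
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    hour = dt.hour
    # Bucket the 24 hours into four six-hour parts of the day
    part_of_day = ("night", "morning", "afternoon", "evening")[hour // 6]
    return {
        "hour_of_the_day": hour,
        "part_of_day": part_of_day,
        "day_of_week": dt.strftime("%A"),
    }
```

Each derived attribute can then be encoded like any other categorical feature.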

Discretization

Continuous Features

Typically data is discretized into partitions of K equal lengths/width (equal intervals) or K% of the total data (equal frequencies).
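Both strategies can be sketched in plain Python (illustrative helpers returning a bin index per value):

```python
def equal_width_bins(xs, k):
    # k intervals of equal width spanning [min, max]
    mn, mx = min(xs), max(xs)
    width = (mx - mn) / k
    # Clamp the maximum value into the last bin
    return [min(int((x - mn) / width), k - 1) for x in xs]

def equal_frequency_bins(xs, k):
    # Rank-based binning: roughly len(xs) / k values per bin
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    bins = [0] * len(xs)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(xs), k - 1)
    return bins
```

Equal-width bins are sensitive to outliers (one extreme value can leave most bins empty), while equal-frequency bins guarantee balanced counts.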

Categorical Features

Values for categorical features may be combined, particularly when there are few samples for some categories.

Reframe Numerical Quantities

Changing from grams to kilograms, and losing some detail, may be both desirable and computationally efficient.

Crossing

Creating new features as a combination of existing features, such as multiplying numerical features or combining categorical variables. This is a great way to add domain knowledge to the dataset.
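A minimal sketch of both kinds of crosses, assuming a row is a plain dict; `cross_features` and its naming scheme are illustrative:

```python
def cross_features(row, numeric_pairs=(), categorical_pairs=()):
    crossed = {}
    # Numeric cross: multiply the two feature values
    for a, b in numeric_pairs:
        crossed[f"{a}_x_{b}"] = row[a] * row[b]
    # Categorical cross: concatenate the two values into one token,
    # which can then be one-hot encoded like any other category
    for a, b in categorical_pairs:
        crossed[f"{a}__{b}"] = f"{row[a]}_{row[b]}"
    return crossed
```

For example, crossing width with height yields an area-like feature a linear model could not derive on its own.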