# machine-learning-mindmap: 2. Data Processing (Daniel Martinez)

## Feature Selection

### Correlation

#### Features should be uncorrelated with each other and highly correlated with the target we’re trying to predict.

#### Covariance

#### A measure of how much two random variables change together. Math: dot(de_mean(x), de_mean(y)) / (n - 1)
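
A minimal sketch of that formula in plain Python, assuming simple lists of numbers; the helper names `de_mean`, `dot`, and `covariance` are illustrative, not from the source.

```python
def de_mean(xs):
    """Shift the values so that their mean is zero."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]

def dot(xs, ys):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(xs, ys))

def covariance(xs, ys):
    """Sample covariance: dot(de_mean(x), de_mean(y)) / (n - 1)."""
    n = len(xs)
    return dot(de_mean(xs), de_mean(ys)) / (n - 1)

# Toy example: taller people tend to be heavier, so the covariance is positive.
heights = [160, 170, 180, 190]
weights = [55, 65, 80, 90]
print(covariance(heights, weights))  # 200.0
```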

### Dimensionality Reduction

#### Principal Component Analysis (PCA)

#### Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.

#### Plot the variance per feature and select the features with the largest variance.
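
A short PCA sketch, assuming scikit-learn is available; the toy data below is made up so that two of the three features are strongly correlated, and `explained_variance_ratio_` shows how much variance each component captures.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 100 samples, 3 features, two of which are nearly collinear.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
data = np.hstack([base,
                  2 * base + rng.normal(scale=0.1, size=(100, 1)),
                  rng.normal(size=(100, 1))])

pca = PCA(n_components=3)
components = pca.fit_transform(data)

# The first component should account for most of the variance.
print(pca.explained_variance_ratio_)
```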

#### Singular Value Decomposition (SVD)

#### SVD is a factorization of a real or complex matrix. It is the generalization of the eigendecomposition of a positive semidefinite normal matrix (for example, a symmetric matrix with positive eigenvalues) to any m×n matrix via an extension of the polar decomposition. It has many useful applications in signal processing and statistics.
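
A small numpy sketch, assuming numpy is installed; it factorises a 2×3 matrix and reconstructs it from the factors.

```python
import numpy as np

a = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# Thin SVD: a = u @ diag(s) @ vt
u, s, vt = np.linalg.svd(a, full_matrices=False)

# Reconstructing the matrix from its factors recovers the original.
reconstructed = u @ np.diag(s) @ vt
print(np.allclose(a, reconstructed))  # True
```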

### Importance

#### Filter Methods

#### Filter methods select features based only on general metrics, such as their correlation with the variable to predict. They suppress the least interesting variables; the remaining variables become part of the classification or regression model used to classify or predict the data. These methods are particularly efficient in computation time and robust to overfitting. A chi-square filter sketch follows the list of examples below.

#### Correlation

#### Linear Discriminant Analysis

#### ANOVA: Analysis of Variance

#### Chi-Square
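
A sketch of a chi-square filter, assuming scikit-learn; each feature is scored independently against the target and only the top k are kept (the Iris dataset and k=2 are illustrative choices).

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

x, y = load_iris(return_X_y=True)

# Score every feature against the target and keep the two best.
selector = SelectKBest(score_func=chi2, k=2)
x_selected = selector.fit_transform(x, y)

print(selector.scores_)  # one chi-square score per feature
print(x_selected.shape)  # (150, 2)
```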

#### Wrapper Methods

#### Wrapper methods evaluate subsets of variables, which, unlike filter approaches, makes it possible to detect interactions between variables. Their two main disadvantages are an increased risk of overfitting when the number of observations is insufficient, and a significant computation time when the number of variables is large. A recursive feature elimination sketch follows the list of examples below.

#### Forward Selection

#### Backward Elimination

#### Recursive Feature Elimination

#### Genetic Algorithms
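
A recursive feature elimination sketch, assuming scikit-learn; a logistic regression is refit repeatedly and the weakest feature is dropped each round (the synthetic dataset and the choice of three kept features are illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 3 of which carry signal.
x, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Refit the model repeatedly, discarding the weakest feature each round.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(x, y)

print(rfe.support_)  # boolean mask of the selected features
print(rfe.ranking_)  # rank 1 marks a selected feature
```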

#### Embedded Methods

#### Embedded methods try to combine the advantages of both previous methods. A learning algorithm takes advantage of its own variable selection process and performs feature selection and classification simultaneously.

#### Lasso regression performs L1 regularization, which adds a penalty equal to the sum of the absolute values of the coefficients.

#### Ridge regression performs L2 regularization, which adds a penalty equal to the sum of the squared coefficients.
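
A Lasso sketch, assuming scikit-learn; the L1 penalty drives some coefficients exactly to zero, so fitting the model doubles as an embedded feature selector (the synthetic data and `alpha=1.0` are illustrative).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression data: 10 features, only 4 carry signal.
x, y = make_regression(n_samples=200, n_features=10,
                       n_informative=4, noise=1.0, random_state=0)

# The L1 penalty (strength alpha) shrinks uninformative coefficients to zero.
lasso = Lasso(alpha=1.0)
lasso.fit(x, y)

print(lasso.coef_)
print("kept features:", np.flatnonzero(lasso.coef_))
```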

## Feature Encoding

### Machine Learning algorithms perform Linear Algebra on Matrices, which means all features must be numeric. Encoding helps us do this.

### Label Encoding

### One Hot Encoding
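
A sketch of both encodings, assuming pandas; label encoding maps each category to an integer code, while one hot encoding gives every category its own 0/1 column (the `colour` column is illustrative).

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# Label encoding: each category becomes an integer code.
df["colour_label"] = df["colour"].astype("category").cat.codes

# One hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["colour"], prefix="colour")

print(df)
print(one_hot)
```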


## Feature Normalisation or Scaling

### Since the range of values of raw data varies widely, some machine learning algorithms will not work properly without normalisation; features with larger scales would otherwise dominate those with smaller ones.

### Methods

#### Rescaling

#### The simplest method is rescaling the range of features so that each feature lies in [0, 1] or [−1, 1].

#### Standardization

#### Feature standardization makes the values of each feature in the data have zero-mean (when subtracting the mean in the numerator) and unit-variance.

#### Scaling to unit length

#### To scale the components of a feature vector such that the complete vector has length one.
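
A numpy sketch of the three methods above, assuming a small numeric feature matrix (the toy values are illustrative).

```python
import numpy as np

x = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Rescaling: map each feature (column) into [0, 1].
rescaled = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Standardization: zero mean and unit variance per feature.
standardized = (x - x.mean(axis=0)) / x.std(axis=0)

# Scaling to unit length: each feature vector (row) gets Euclidean norm 1.
unit_length = x / np.linalg.norm(x, axis=1, keepdims=True)

print(rescaled, standardized, unit_length, sep="\n\n")
```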

## Dataset Construction

### Training Dataset

#### A set of examples used for learning

#### To fit the parameters of the classifier in the Multilayer Perceptron, for instance, we would use the training set to find the “optimal” weights when using back-propagation.

### Test Dataset

#### A set of examples used only to assess the performance of a fully-trained classifier

#### In the Multilayer Perceptron case, we would use the test set to estimate the error rate after we have chosen the final model (MLP size and actual weights). After assessing the final model on the test set, YOU MUST NOT tune the model any further.

### Validation Dataset

#### A set of examples used to tune the parameters of a classifier

#### In the Multilayer Perceptron case, we would use the validation set to find the “optimal” number of hidden units or determine a stopping point for the back-propagation algorithm
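
A split sketch covering the three datasets above, assuming scikit-learn; the data is cut in two passes to get roughly 60% training, 20% validation, and 20% test (the ratios and the Iris dataset are illustrative).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

# First cut: 80% for training + validation, 20% held out as the test set.
x_trainval, x_test, y_trainval, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

# Second cut: 0.25 of the remaining 80% becomes the validation set.
x_train, x_val, y_train, y_val = train_test_split(
    x_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(x_train), len(x_val), len(x_test))  # 90 30 30
```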

### Cross Validation

#### One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
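
A k-fold sketch, assuming scikit-learn; the data is split into five folds and the model is trained and scored five times, each time validating on a different held-out fold (the model and dataset are illustrative).

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

x, y = load_iris(return_X_y=True)

# Five rounds: train on four folds, validate on the remaining one.
scores = cross_val_score(LogisticRegression(max_iter=1000), x, y, cv=5)

print(scores)         # one accuracy score per fold
print(scores.mean())  # results averaged over the rounds
```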

## Data Types

### Nominal - is for mutually exclusive, but not ordered, categories.

### Ordinal - is one where the order matters but not the difference between values.

### Interval - is a measurement where the difference between two values is meaningful.

### Ratio - has all the properties of an interval variable, and also has a clear definition of 0.0.

## Data Exploration

### Variable Identification

#### Identify Predictor (Input) and Target (Output) variables. Next, identify the data type and category of the variables.

### Univariate Analysis

#### Continuous Features

#### Mean, Median, Mode, Min, Max, Range, Quartile, IQR, Variance, Standard Deviation, Skewness, Histogram, Box Plot

#### Categorical Features

#### Frequency, Histogram
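
A pandas sketch of both cases, assuming a small illustrative DataFrame; `describe` and `skew` cover the continuous summary statistics, and `value_counts` gives the categorical frequencies.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 31, 52, 46, 29],               # continuous feature
    "city": ["NY", "LA", "NY", "SF", "LA", "NY"],  # categorical feature
})

# Continuous: mean, std, min, quartiles and max in one call.
print(df["age"].describe())
print("skewness:", df["age"].skew())

# Categorical: frequency of each category.
print(df["city"].value_counts())
```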

### Bi-variate Analysis

#### Finds out the relationship between two variables (a sketch of two of the techniques below follows the list).

#### Scatter Plot

#### Correlation Plot - Heatmap

#### Two-way table

#### Stacked Column Chart

#### Chi-Square Test

#### Z-Test / T-Test

#### ANOVA
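
A sketch of two of these techniques, assuming pandas and scipy; a correlation matrix relates two continuous variables, and a chi-square test is run on a two-way table of two categorical variables (the toy data is illustrative).

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "height": [160, 170, 180, 190, 175, 165],
    "weight": [55, 65, 80, 90, 75, 60],
    "gender": ["f", "m", "m", "m", "m", "f"],
    "smoker": ["no", "no", "yes", "yes", "no", "no"],
})

# Continuous vs continuous: correlation matrix (the basis of a heatmap).
print(df[["height", "weight"]].corr())

# Categorical vs categorical: two-way table plus a chi-square test.
table = pd.crosstab(df["gender"], df["smoker"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print("p-value:", p_value)
```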

## Feature Cleaning

### Missing values

### Special values

### Outliers

### Obvious inconsistencies

## Feature Imputation

### Hot-Deck

#### Replaces a missing value with a value drawn from a similar record in the same dataset. In the “last observation carried forward” variant, the records are sorted and the cell value immediately prior to the missing one is used to impute it.

### Cold-Deck

#### Selects donors from another dataset to complete missing data.

### Mean-substitution

#### Another imputation technique involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable.
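
A mean-substitution sketch, assuming scikit-learn's SimpleImputer (the single toy column is illustrative).

```python
import numpy as np
from sklearn.impute import SimpleImputer

x = np.array([[1.0], [2.0], [np.nan], [4.0]])

# Replace each missing value with the mean of the observed values,
# which leaves the column mean unchanged.
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(x))  # the nan becomes (1 + 2 + 4) / 3
```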

### Regression

#### A regression model is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where that variable is missing

### Some Libraries...

## Feature Engineering

### Decompose

#### Converting 2014-09-20T20:45:40Z into categorical attributes like hour_of_the_day, part_of_day, etc.
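
A pandas sketch of decomposing that timestamp; the `part_of_day` binning and the extra `day_of_week` attribute are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"timestamp": ["2014-09-20T20:45:40Z"]})
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Pull simple categorical attributes out of the datetime.
df["hour_of_the_day"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.day_name()
df["part_of_day"] = pd.cut(df["timestamp"].dt.hour,
                           bins=[0, 6, 12, 18, 24], right=False,
                           labels=["night", "morning", "afternoon", "evening"])

print(df)
```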

### Discretization

#### Continuous Features

#### Typically, data is discretized into K partitions of equal length/width (equal intervals) or into partitions each containing K% of the total data (equal frequencies).
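
A sketch of both approaches, assuming pandas; `cut` produces equal-width intervals and `qcut` produces equal-frequency ones (K=4 and the toy ages are illustrative).

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 27, 30, 35, 41, 48, 55, 64])

# Equal-width: 4 bins of the same length.
equal_width = pd.cut(ages, bins=4)

# Equal-frequency: 4 bins, each holding roughly 25% of the data.
equal_freq = pd.qcut(ages, q=4)

print(equal_width.value_counts())
print(equal_freq.value_counts())
```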

#### Categorical Features

#### Values for categorical features may be combined, particularly when there are few samples for some categories.

### Reframe Numerical Quantities

#### Changing from grams to kilograms, and losing some detail, might be both desirable and more efficient for calculation.

### Crossing

#### Creating new features as a combination of existing features. Could be multiplying numerical features, or combining categorical variables. This is a great way to add domain expertise knowledge to the dataset.
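
A small crossing sketch, assuming pandas; the columns and the two crossed features are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "width": [2.0, 3.0, 4.0],
    "height": [1.0, 2.0, 3.0],
    "city": ["NY", "LA", "NY"],
    "device": ["ios", "android", "android"],
})

# Numerical cross: multiply two features into a new one.
df["area"] = df["width"] * df["height"]

# Categorical cross: combine two variables into a single feature.
df["city_x_device"] = df["city"] + "_" + df["device"]

print(df)
```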