2 Data Processing
(Daniel Martinez)
Feature Selection
Correlation
Features should be uncorrelated with each other and highly correlated with the target we're trying to predict.
Covariance
A measure of how much two random variables change together. Math: dot(de_mean(x), de_mean(y)) / (n - 1)
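A minimal sketch in Python of the formula above, assuming NumPy; the helper name `covariance` and the sample values are illustrative only:

```python
import numpy as np

def covariance(x, y):
    """Sample covariance: dot(de_mean(x), de_mean(y)) / (n - 1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # De-mean each variable, then take the dot product of the deviations
    return np.dot(x - x.mean(), y - y.mean()) / (n - 1)

print(covariance([1, 2, 3, 4], [2, 4, 6, 8]))  # 3.333..., matches np.cov(x, y)[0, 1]
```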
Dimensionality Reduction
Principal Component Analysis (PCA)
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
Plot the variance per feature and select the features with the largest variance.
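A minimal sketch of this selection with scikit-learn; the synthetic data and the 95% variance threshold are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 samples, 5 features
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]    # make two features strongly correlated

pca = PCA().fit(X)
print(pca.explained_variance_ratio_)  # variance captured by each component

# Keep the first k components that together explain, say, 95% of the variance
k = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.95)) + 1
X_reduced = PCA(n_components=k).fit_transform(X)
print(X_reduced.shape)
```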
Singular Value Decomposition (SVD)
SVD is a factorization of a real or complex matrix. It is the generalization of the eigendecomposition of a positive semidefinite normal matrix (for example, a symmetric matrix with positive eigenvalues) to any m×n matrix via an extension of the polar decomposition. It has many useful applications in signal processing and statistics.
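A short sketch, assuming NumPy, showing the factorization and a low-rank approximation; the matrix is arbitrary:

```python
import numpy as np

A = np.arange(12, dtype=float).reshape(3, 4)      # any m x n matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(s) @ Vt

assert np.allclose(A, U @ np.diag(s) @ Vt)        # exact reconstruction

# Rank-1 approximation: keep only the largest singular value
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])
```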
Importance
Filter Methods
Filter-type methods select features based only on general metrics, such as the correlation with the variable to predict. Filter methods suppress the least interesting variables; the remaining variables then become part of a classification or regression model used to classify or predict data. These methods are particularly efficient in computation time and robust to overfitting (see the sketch after the list below).
Correlation
Linear Discriminant Analysis
ANOVA: Analysis of Variance
Chi-Square
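A minimal sketch of two of these filters using scikit-learn's SelectKBest; the Iris data and k=2 are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, f_classif

X, y = load_iris(return_X_y=True)

# ANOVA F-test: keep the 2 features most related to the target
X_anova = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Chi-square test (requires non-negative feature values)
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)

print(X_anova.shape, X_chi2.shape)  # (150, 2) (150, 2)
```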
Wrapper Methods
Wrapper methods evaluate subsets of variables, which, unlike filter approaches, allows them to detect possible interactions between variables. Their two main disadvantages are an increased risk of overfitting when the number of observations is insufficient, and significant computation time when the number of variables is large (see the sketch after the list below).
Forward Selection
Backward Elimination
Recursive Feature Elimination
Genetic Algorithms
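A sketch of one wrapper method, Recursive Feature Elimination, using scikit-learn; the dataset, estimator and the choice of 5 features are assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Repeatedly fit the model and drop the weakest feature until 5 remain
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger = eliminated earlier
```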
Embedded Methods
Embedded methods try to combine the advantages of both previous methods. A learning algorithm takes advantage of its own variable selection process and performs feature selection and classification simultaneously.
Lasso regression performs L1 regularization, which adds a penalty proportional to the sum of the absolute values of the coefficients.
Ridge regression performs L2 regularization, which adds a penalty proportional to the sum of the squares of the coefficients.
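A sketch contrasting the two penalties with scikit-learn; the diabetes dataset and alpha=1.0 are illustrative assumptions:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)  # penalties assume comparable feature scales

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 drives some coefficients exactly to zero (implicit feature selection);
# L2 only shrinks them towards zero
print("non-zero lasso coefficients:", (lasso.coef_ != 0).sum(), "of", X.shape[1])
print("non-zero ridge coefficients:", (ridge.coef_ != 0).sum(), "of", X.shape[1])
```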
Feature Encoding
Machine Learning algorithms perform Linear Algebra on Matrices, which means all features must be numeric. Encoding converts categorical features into numbers so that this is possible.
Label Encoding
One Hot Encoding
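A minimal sketch of both encodings; the toy `colour` column is an assumption (LabelEncoder is designed for targets, so it appears here purely for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# Label encoding: one integer per category (implies an order that may not exist)
df["colour_label"] = LabelEncoder().fit_transform(df["colour"])

# One-hot encoding: one binary column per category
onehot = pd.get_dummies(df["colour"], prefix="colour")
print(df.join(onehot))
```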
Feature Normalisation or Scaling
Since the range of values of raw data varies widely, some machine learning algorithms (for example, gradient-descent-based and distance-based methods) will not work properly without scaling.
Methods
Rescaling
The simplest method is rescaling the features so that each one lies in the range [0, 1] or [−1, 1].
Standardization
Feature standardization makes the values of each feature in the data have zero mean (by subtracting the mean) and unit variance (by dividing by the standard deviation).
Scaling to unit length
To scale the components of a feature vector such that the complete vector has length one.
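A sketch of all three methods using scikit-learn; the tiny matrix is illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_rescaled = MinMaxScaler().fit_transform(X)     # rescaling: each feature in [0, 1]
X_standard = StandardScaler().fit_transform(X)   # standardization: zero mean, unit variance
X_unit = Normalizer(norm="l2").fit_transform(X)  # each sample (row) scaled to length one
```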
Dataset Construction
Training Dataset
A set of examples used for learning
To fit the parameters of the classifier in the Multilayer Perceptron, for instance, we would use the training set to find the “optimal” weights when using back-propagation.
Test Dataset
A set of examples used only to assess the performance of a fully-trained classifier
In the Multilayer Perceptron case, we would use the test set to estimate the error rate after we have chosen the final model (MLP size and actual weights). After assessing the final model on the test set, YOU MUST NOT tune the model any further.
Validation Dataset
A set of examples used to tune the parameters of a classifier
In the Multilayer Perceptron case, we would use the validation set to find the “optimal” number of hidden units or determine a stopping point for the back-propagation algorithm
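A minimal sketch of building all three sets with scikit-learn; the 60/20/20 proportions are an assumption:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off 40%, then split that half-and-half into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Fit on X_train, tune hyperparameters against X_val, report once on X_test
```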
Cross Validation
One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
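A sketch of 5-fold cross-validation with scikit-learn; the dataset and estimator are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Five rounds, each validating on a different fold and training on the rest
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())  # per-round accuracy, then the average
```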
Data Types
Nominal - is for mutually exclusive, but not ordered, categories.
Ordinal - is one where the order matters but not the difference between values.
Interval - is a measurement where the difference between two values is meaningful.
Ratio - has all the properties of an interval variable, and also has a clear definition of 0.0.
Data Exploration
Variable Identification
Identify Predictor (Input) and Target (Output) variables.
Next, identify the data type and category of the variables.
Univariate Analysis
Continuous Features
Mean, Median, Mode, Min, Max, Range, Quartile, IQR, Variance, Standard Deviation, Skewness, Histogram, Box Plot
Categorical Features
Frequency, Histogram
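A sketch of both cases with pandas; the toy DataFrame is an assumption:

```python
import pandas as pd

df = pd.DataFrame({"age": [23, 35, 31, 52, 46, 29],
                   "city": ["NY", "SF", "NY", "LA", "NY", "SF"]})

# Continuous feature: count, mean, std, min, quartiles and max in one call
print(df["age"].describe())
print(df["age"].skew())        # skewness

# Categorical feature: frequency table
print(df["city"].value_counts())
# df["age"].hist() and df.boxplot(column="age") draw the histogram / box plot
```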
Bi-variate Analysis
Finds out the relationship between two variables.
Scatter Plot
Correlation Plot - Heatmap
Two-way table
Shows the counts (and count percentages) of observations for each combination of two categorical variables.
Stacked Column Chart
Chi-Square Test
Used to derive the statistical significance of the relationship between two categorical variables.
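A sketch using SciPy on a two-way table; the toy data is an assumption:

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({"gender": ["M", "F", "M", "F", "M", "F", "M", "F"],
                   "bought": ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]})

table = pd.crosstab(df["gender"], df["bought"])  # two-way table of counts
chi2, p, dof, expected = chi2_contingency(table)
print(p)  # a small p-value suggests the two variables are related
```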
Z-Test/ T-Test
ANOVA
Feature Cleaning
Missing values
One may choose to either omit elements from a dataset that contain missing values, or to impute a value (see Feature Imputation below).
Special values
Numeric variables may contain formalised special values such as ±Inf, NA and NaN. Calculations involving special values often result in special values, so they need to be handled or cleaned.
Outliers
They should be detected, but not necessarily removed; their inclusion in the analysis is a statistical decision.
Obvious inconsistencies
Values that cannot be true, e.g. a person's age cannot be negative and an under-aged person cannot possess a driver's licence.
Feature Imputation
Hot-Deck
Imputes a missing value from a randomly selected similar record; a simple variant finds each missing value and uses the cell value immediately prior to it (the last observation is carried forward).
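A sketch of the carry-forward variant with pandas; the series is illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, np.nan, 7.0, np.nan, np.nan, 2.0])
print(s.ffill())  # each NaN takes the last observed value before it
```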
Cold-Deck
Selects donors from another dataset to complete missing data.
Mean-substitution
Another imputation technique involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable.
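A sketch with scikit-learn's SimpleImputer; the matrix is illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each NaN with its column mean; the sample mean per column is unchanged
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(X_imputed)  # [[1. 2.] [4. 3.] [7. 2.5]]
```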
Regression
A regression model is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where that variable is missing.
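One way to do this is scikit-learn's (still experimental) IterativeImputer, which regresses each feature with missing values on the others; the data here is illustrative:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, np.nan],
              [4.0, 8.0]])

# The missing value is predicted from the regression of column 1 on column 0
X_imputed = IterativeImputer(random_state=0).fit_transform(X)
print(X_imputed)  # the NaN is filled with roughly 6, following the linear trend
```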
Some Libraries...
Feature Engineering
Decompose
Converting 2014-09-20T20:45:40Z into categorical attributes like hour_of_the_day, part_of_day, etc.
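A sketch with pandas of exactly this decomposition; the `part_of_day` bucket edges are an assumption:

```python
import pandas as pd

ts = pd.to_datetime(pd.Series(["2014-09-20T20:45:40Z", "2014-09-21T08:10:00Z"]))

df = pd.DataFrame({"hour_of_the_day": ts.dt.hour,
                   "day_of_week": ts.dt.dayofweek,
                   "month": ts.dt.month})

# Bucket the hour into a coarse categorical part_of_day
df["part_of_day"] = pd.cut(ts.dt.hour, bins=[0, 6, 12, 18, 24], right=False,
                           labels=["night", "morning", "afternoon", "evening"])
print(df)
```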
Discretization
Continuous Features
Typically data is discretized into partitions of K equal lengths/width (equal intervals) or K% of the total data (equal frequencies).
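A sketch of both schemes with pandas; the ages and K=4 are illustrative:

```python
import pandas as pd

ages = pd.Series([5, 23, 35, 31, 52, 46, 29, 71, 18, 64])

# Equal-width: 4 intervals of the same length
print(pd.cut(ages, bins=4))

# Equal-frequency: each bin holds roughly 25% of the samples
print(pd.qcut(ages, q=4))
```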
Categorical Features
Values for categorical features may be combined, particularly when there are few samples for some categories.
Reframe Numerical Quantities
Changing from grams to kilograms, deliberately losing detail, might be both desirable and more efficient for calculation.
Crossing
Creating new features as combinations of existing features, for example by multiplying numerical features together or concatenating categorical variables. This is a great way to add domain knowledge to the dataset.
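A sketch of both kinds of cross with pandas; the columns are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"width": [2.0, 3.0, 5.0],
                   "height": [4.0, 1.0, 2.0],
                   "city": ["NY", "SF", "NY"],
                   "device": ["mobile", "desktop", "mobile"]})

# Numerical cross: the product may carry more signal than either feature alone
df["area"] = df["width"] * df["height"]

# Categorical cross: one combined category per (city, device) pair
df["city_x_device"] = df["city"] + "_" + df["device"]
print(df)
```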