2 Data Processing
(Daniel Martinez)
Feature Selection
Correlation
Features should be uncorrelated with each other and highly correlated with the target we're trying to predict.
Covariance
A measure of how much two random variables change together. Math: dot(de_mean(x), de_mean(y)) / (n - 1)
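A minimal sketch in Python of the formula above, assuming NumPy; the helper name `covariance` and the sample values are illustrative only:

```python
import numpy as np

def covariance(x, y):
    """Sample covariance: dot(de_mean(x), de_mean(y)) / (n - 1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # De-mean each variable, then take the dot product of the deviations
    return np.dot(x - x.mean(), y - y.mean()) / (n - 1)

print(covariance([1, 2, 3, 4], [2, 4, 6, 8]))  # 3.333..., matches np.cov(x, y)[0, 1]
```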
Dimensionality Reduction
Principal Component Analysis (PCA)
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
Plot the variance per feature and select the features with the largest variance.
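A minimal sketch of this selection with scikit-learn; the synthetic data and the 95% variance threshold are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 samples, 5 features
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]    # make two features strongly correlated

pca = PCA().fit(X)
print(pca.explained_variance_ratio_)  # variance captured by each component

# Keep the first k components that together explain, say, 95% of the variance
k = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.95)) + 1
X_reduced = PCA(n_components=k).fit_transform(X)
print(X_reduced.shape)
```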
Singular Value Decomposition (SVD)
SVD is a factorization of a real or complex matrix. It is the generalization of the eigendecomposition of a positive semidefinite normal matrix (for example, a symmetric matrix with positive eigenvalues) to any m×n matrix via an extension of the polar decomposition. It has many useful applications in signal processing and statistics.
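A short sketch, assuming NumPy, showing the factorization and a low-rank approximation; the matrix is arbitrary:

```python
import numpy as np

A = np.arange(12, dtype=float).reshape(3, 4)      # any m x n matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(s) @ Vt

assert np.allclose(A, U @ np.diag(s) @ Vt)        # exact reconstruction

# Rank-1 approximation: keep only the largest singular value
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])
```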
Importance
Filter Methods
Filter-type methods select features based only on general metrics, such as the correlation with the variable to predict. Filter methods suppress the least interesting variables; the remaining variables then become part of a classification or regression model used to classify or predict data. These methods are particularly efficient in computation time and robust to overfitting (see the sketch after the list below).
Correlation
Linear Discriminant Analysis
ANOVA: Analysis of Variance
Chi-Square
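A minimal sketch of two of these filters using scikit-learn's SelectKBest; the Iris data and k=2 are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, f_classif

X, y = load_iris(return_X_y=True)

# ANOVA F-test: keep the 2 features most related to the target
X_anova = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Chi-square test (requires non-negative feature values)
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)

print(X_anova.shape, X_chi2.shape)  # (150, 2) (150, 2)
```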
Wrapper Methods
Wrapper methods evaluate subsets of variables, which, unlike filter approaches, allows them to detect possible interactions between variables. Their two main disadvantages are an increased risk of overfitting when the number of observations is insufficient, and significant computation time when the number of variables is large (see the sketch after the list below).
Forward Selection
Backward Elimination
Recursive Feature Elimination
Genetic Algorithms
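A sketch of one wrapper method, Recursive Feature Elimination, using scikit-learn; the dataset, estimator and the choice of 5 features are assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Repeatedly fit the model and drop the weakest feature until 5 remain
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger = eliminated earlier
```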
Embedded Methods
Embedded methods try to combine the advantages of both previous methods. A learning algorithm takes advantage of its own variable selection process and performs feature selection and classification simultaneously.
Lasso regression performs L1 regularization, which adds a penalty proportional to the sum of the absolute values of the coefficients.
Ridge regression performs L2 regularization, which adds a penalty proportional to the sum of the squares of the coefficients.
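A sketch contrasting the two penalties with scikit-learn; the diabetes dataset and alpha=1.0 are illustrative assumptions:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)  # penalties assume comparable feature scales

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 drives some coefficients exactly to zero (implicit feature selection);
# L2 only shrinks them towards zero
print("non-zero lasso coefficients:", (lasso.coef_ != 0).sum(), "of", X.shape[1])
print("non-zero ridge coefficients:", (ridge.coef_ != 0).sum(), "of", X.shape[1])
```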
Feature Encoding
Machine Learning algorithms perform Linear Algebra on Matrices, which means all features must be numeric. Encoding converts categorical features into numbers so that this is possible.
Label Encoding
One Hot Encoding
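A minimal sketch of both encodings; the toy `colour` column is an assumption (LabelEncoder is designed for targets, so it appears here purely for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# Label encoding: one integer per category (implies an order that may not exist)
df["colour_label"] = LabelEncoder().fit_transform(df["colour"])

# One-hot encoding: one binary column per category
onehot = pd.get_dummies(df["colour"], prefix="colour")
print(df.join(onehot))
```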
Feature Normalisation or Scaling
Since the range of values of raw data varies widely, some machine learning algorithms (for example, gradient-descent-based and distance-based methods) will not work properly without scaling.
Methods
Rescaling
The simplest method is rescaling the features so that each one lies in the range [0, 1] or [−1, 1].
Standardization
Feature standardization makes the values of each feature in the data have zero mean (by subtracting the mean) and unit variance (by dividing by the standard deviation).
Scaling to unit length
To scale the components of a feature vector such that the complete vector has length one.
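A sketch of all three methods using scikit-learn; the tiny matrix is illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_rescaled = MinMaxScaler().fit_transform(X)     # rescaling: each feature in [0, 1]
X_standard = StandardScaler().fit_transform(X)   # standardization: zero mean, unit variance
X_unit = Normalizer(norm="l2").fit_transform(X)  # each sample (row) scaled to length one
```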
Dataset Construction
Training Dataset
A set of examples used for learning
To fit the parameters of the classifier in the Multilayer Perceptron, for instance, we would use the training set to find the “optimal” weights when using back-propagation.
Test Dataset
A set of examples used only to assess the performance of a fully-trained classifier
In the Multilayer Perceptron case, we would use the test set to estimate the error rate after we have chosen the final model (MLP size and actual weights). After assessing the final model on the test set, YOU MUST NOT tune the model any further.
Validation Dataset
A set of examples used to tune the parameters of a classifier
In the Multilayer Perceptron case, we would use the validation set to find the “optimal” number of hidden units or determine a stopping point for the back-propagation algorithm
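A minimal sketch of building all three sets with scikit-learn; the 60/20/20 proportions are an assumption:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off 40%, then split that half-and-half into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Fit on X_train, tune hyperparameters against X_val, report once on X_test
```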
Cross Validation
One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
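A sketch of 5-fold cross-validation with scikit-learn; the dataset and estimator are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Five rounds, each validating on a different fold and training on the rest
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())  # per-round accuracy, then the average
```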
Data Types
Nominal - is for mutually exclusive, but not ordered, categories.
Ordinal - is one where the order matters but not the difference between values.
Interval - is a measurement where the difference between two values is meaningful.
Ratio - has all the properties of an interval variable, and also has a clear definition of 0.0.
Data Exploration
Variable Identification
Identify Predictor (Input) and Target (Output) variables.
Next, identify the data type and category of the variables.
Univariate Analysis
Continuous Features
Mean, Median, Mode, Min, Max, Range, Quartile, IQR, Variance, Standard Deviation, Skewness, Histogram, Box Plot
Categorical Features
Frequency, Histogram
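A sketch of both cases with pandas; the toy DataFrame is an assumption:

```python
import pandas as pd

df = pd.DataFrame({"age": [23, 35, 31, 52, 46, 29],
                   "city": ["NY", "SF", "NY", "LA", "NY", "SF"]})

# Continuous feature: count, mean, std, min, quartiles and max in one call
print(df["age"].describe())
print(df["age"].skew())        # skewness

# Categorical feature: frequency table
print(df["city"].value_counts())
# df["age"].hist() and df.boxplot(column="age") draw the histogram / box plot
```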
Bi-variate Analysis
Finds out the relationship between two variables.
Scatter Plot
Correlation Plot - Heatmap
Two-way table
Shows the counts (and count percentages) of observations for each combination of two categorical variables.
Stacked Column Chart
Chi-Square Test
Used to derive the statistical significance of the relationship between two categorical variables.
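A sketch using SciPy on a two-way table; the toy data is an assumption:

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({"gender": ["M", "F", "M", "F", "M", "F", "M", "F"],
                   "bought": ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]})

table = pd.crosstab(df["gender"], df["bought"])  # two-way table of counts
chi2, p, dof, expected = chi2_contingency(table)
print(p)  # a small p-value suggests the two variables are related
```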
Z-Test/ T-Test
ANOVA
Feature Cleaning
Missing values
One may choose to either omit elements from a dataset that contain missing values, or to impute a value (see Feature Imputation below).
Special values
Numeric variables may contain formalised special values such as ±Inf, NA and NaN. Calculations involving special values often result in special values, so they need to be handled or cleaned.
Outliers
They should be detected, but not necessarily removed; their inclusion in the analysis is a statistical decision.
Obvious inconsistencies
Values that cannot be true, e.g. a person's age cannot be negative and an under-aged person cannot possess a driver's licence.
Feature Imputation
Hot-Deck
Imputes a missing value from a randomly selected similar record; a simple variant finds each missing value and uses the cell value immediately prior to it (the last observation is carried forward).
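A sketch of the carry-forward variant with pandas; the series is illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, np.nan, 7.0, np.nan, np.nan, 2.0])
print(s.ffill())  # each NaN takes the last observed value before it
```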
Cold-Deck
Selects donors from another dataset to complete missing data.
Mean-substitution
Another imputation technique involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable.
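A sketch with scikit-learn's SimpleImputer; the matrix is illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each NaN with its column mean; the sample mean per column is unchanged
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(X_imputed)  # [[1. 2.] [4. 3.] [7. 2.5]]
```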
Regression
A regression model is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where that variable is missing.
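One way to do this is scikit-learn's (still experimental) IterativeImputer, which regresses each feature with missing values on the others; the data here is illustrative:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, np.nan],
              [4.0, 8.0]])

# The missing value is predicted from the regression of column 1 on column 0
X_imputed = IterativeImputer(random_state=0).fit_transform(X)
print(X_imputed)  # the NaN is filled with roughly 6, following the linear trend
```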
Some Libraries...
Feature Engineering
Decompose
Converting 2014-09-20T20:45:40Z into categorical attributes like hour_of_the_day, part_of_day, etc.
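A sketch with pandas of exactly this decomposition; the `part_of_day` bucket edges are an assumption:

```python
import pandas as pd

ts = pd.to_datetime(pd.Series(["2014-09-20T20:45:40Z", "2014-09-21T08:10:00Z"]))

df = pd.DataFrame({"hour_of_the_day": ts.dt.hour,
                   "day_of_week": ts.dt.dayofweek,
                   "month": ts.dt.month})

# Bucket the hour into a coarse categorical part_of_day
df["part_of_day"] = pd.cut(ts.dt.hour, bins=[0, 6, 12, 18, 24], right=False,
                           labels=["night", "morning", "afternoon", "evening"])
print(df)
```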
Discretization
Continuous Features
Typically data is discretized into partitions of K equal lengths/width (equal intervals) or K% of the total data (equal frequencies).
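A sketch of both schemes with pandas; the ages and K=4 are illustrative:

```python
import pandas as pd

ages = pd.Series([5, 23, 35, 31, 52, 46, 29, 71, 18, 64])

# Equal-width: 4 intervals of the same length
print(pd.cut(ages, bins=4))

# Equal-frequency: each bin holds roughly 25% of the samples
print(pd.qcut(ages, q=4))
```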
Categorical Features
Values for categorical features may be combined, particularly when there are few samples for some categories.
Reframe Numerical Quantities
Changing from grams to kilograms, deliberately losing detail, might be both desirable and more efficient for calculation.
Crossing
Creating new features as combinations of existing features, for example by multiplying numerical features together or concatenating categorical variables. This is a great way to add domain knowledge to the dataset.
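A sketch of both kinds of cross with pandas; the columns are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"width": [2.0, 3.0, 5.0],
                   "height": [4.0, 1.0, 2.0],
                   "city": ["NY", "SF", "NY"],
                   "device": ["mobile", "desktop", "mobile"]})

# Numerical cross: the product may carry more signal than either feature alone
df["area"] = df["width"] * df["height"]

# Categorical cross: one combined category per (city, device) pair
df["city_x_device"] = df["city"] + "_" + df["device"]
print(df)
```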