Foundation

lipu yeh
3 min read · Aug 13, 2020

1. Data Preparation in a Machine Learning Project

Each predictive modeling project with machine learning is different, but there are common steps performed on each project.
Data preparation means transforming the raw data so that the unknown underlying structure of the problem is best exposed to the learning algorithms.
The steps before and after data preparation in a project inform which data preparation methods to apply, or at least explore.

1.1 Tutorial Overview

1.2 Applied Machine Learning Process
The sequence of steps in a project goes by many names, such as the applied machine learning process, the data science process, and so on. Here the steps are simplified to four:

Step 1: Define Problem.
Step 2: Prepare Data.
Step 3: Evaluate Models.
Step 4: Finalize Model.

1.3 What Is Data Preparation
* Machine learning algorithms require data to be numbers.
* Some machine learning algorithms impose requirements on the data.
* Statistical noise and errors in the data may need to be corrected.
* Complex nonlinear relationships may be teased out of the data.

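In short: the data has to be numeric, some algorithms require a specific format, statistical noise and errors need to be removed, and nonlinear relationships in the data need to be uncovered.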

This process goes by many names: data preparation, data wrangling, data cleaning, data pre-processing, feature engineering, ...
It covers five main tasks: Data Cleaning, Feature Selection, Data Transforms, Feature Engineering, and Dimensionality Reduction.

1.4 How to Choose Data Preparation Techniques
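There is no single best method; the right one usually depends on the model you choose and on the relationship between the features and the outcome.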
* Gather data from the problem domain.
* Discuss the project with subject matter experts.
* Select those variables to be used as inputs and outputs for a predictive model.
* Review the data that has been collected.
* Summarize the collected data using statistical methods.
* Visualize the collected data using plots and charts.
So the choice is worked out through the following steps:
* Select a performance metric for evaluating model predictive skill.
* Select a model evaluation procedure.
* Select algorithms to evaluate.
* Tune algorithm hyperparameters.
* Combine predictive models into ensembles.
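
The first three evaluation steps above (pick a metric, pick an evaluation procedure, pick algorithms) can be sketched roughly as follows. This is only an illustration and not taken from the book: the synthetic dataset, the choice of accuracy, 5-fold cross-validation, and the two candidate models are all assumptions made here, and scikit-learn is assumed to be installed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic dataset standing in for the prepared data.
X, y = make_classification(n_samples=300, n_features=10, random_state=3)

# Performance metric (assumed here: classification accuracy).
metric = "accuracy"

# Model evaluation procedure (assumed here: stratified 5-fold cross-validation).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Algorithms to evaluate (two arbitrary candidates).
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=1),
}

# Mean cross-validated score per algorithm.
results = {
    name: cross_val_score(model, X, y, scoring=metric, cv=cv).mean()
    for name, model in models.items()
}
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Hyperparameter tuning and ensembling would build on the same metric and cross-validation objects.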

1.5 Further Reading

1.6 Summary

2. Why Data Preparation is So Important

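This chapter is a bit wordy, so I am skipping it for now.
For example: 80% of the time is spent on data preparation, without good preparation there are no good results, ...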

3. Tour of Data Preparation Techniques

* Techniques such as data cleaning can identify and fix errors in data like missing values.
* Data transforms can change the scale, type, and probability distribution of variables in the dataset.
* Techniques such as feature selection and dimensionality reduction can reduce the number of input variables.

Data Cleaning

Remove duplicate rows and features
Handle outliers
Handle missing values
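
A minimal sketch of the three cleaning tasks above, assuming pandas and NumPy are available. The toy table, the column names, the 1.5 x IQR clipping rule, and median imputation are illustrative choices made here, not methods prescribed by the book.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one duplicated row, one missing age, one extreme income.
df = pd.DataFrame({
    "age": [25.0, 32.0, 32.0, 41.0, np.nan, 38.0],
    "income": [40000.0, 52000.0, 52000.0, 61000.0, 48000.0, 1000000.0],
})
df["income_copy"] = df["income"]  # a duplicated feature (column)

# 1. Remove duplicate features (columns), then duplicate rows.
df = df.loc[:, ~df.T.duplicated()]
df = df.drop_duplicates().reset_index(drop=True)

# 2. Handle outliers: clip values lying more than 1.5 * IQR beyond the quartiles.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income"] = df["income"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# 3. Handle missing values: fill with the column median.
df["age"] = df["age"].fillna(df["age"].median())

print(df)
```

Clipping keeps the row while limiting its influence; dropping the row instead is an equally reasonable choice depending on the problem.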

Feature Selection

Unsupervised (?)
Supervised
* Dimensionality reduction
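
To make the unsupervised/supervised split concrete, here is a small sketch assuming scikit-learn. It reads "unsupervised" as a method that ignores the target and "supervised" as one that scores features against it; that reading, the synthetic data, and the two selectors (`VarianceThreshold`, `SelectKBest` with `f_classif`) are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Hypothetical synthetic data: 10 features, 3 of them informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=1
)
# Append a constant column that carries no information.
X = np.hstack([X, np.ones((X.shape[0], 1))])

# Unsupervised: looks only at X and drops zero-variance features.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Supervised: keeps the k features with the highest ANOVA F-score against y.
selector = SelectKBest(score_func=f_classif, k=3)
X_best = selector.fit_transform(X_var, y)

print(X.shape, X_var.shape, X_best.shape)
```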

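Data Transforms

Numerical
Categorical (a finite set of values)
Reference:
* https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02 (covers more of the details and includes a flowchart for choosing an encoding method)

Data types
Transform methods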
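
As a rough illustration of matching the transform to the data type, the sketch below scales a numerical column and one-hot encodes a categorical one. It assumes pandas and scikit-learn; the column names and the particular transforms (`MinMaxScaler`, `OneHotEncoder`) are picked here as examples only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical data with one numerical and one categorical column.
df = pd.DataFrame({
    "height_cm": [150.0, 165.0, 180.0],
    "color": ["red", "green", "red"],
})

transformer = ColumnTransformer(
    [
        ("num", MinMaxScaler(), ["height_cm"]),                      # numerical: rescale to [0, 1]
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),  # categorical: one column per value
    ],
    sparse_threshold=0,  # always return a dense array
)

X = transformer.fit_transform(df)
print(X)
```

The output has one scaled column followed by one indicator column per distinct color.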

Dimensionality Reduction

4. Data Preparation Without Data Leakage

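If information related to the outcome accidentally gets into the training dataset while preparing the data, the trained model ends up inaccurate (its results can look either better or worse than they should).

Test data leaking into Train
Future data leaking into Past
* Leakage caused by differences between data from different time periods
e.g. predicting whether a person will use a mobile phone or a landline, but using survey data from a time when mobile phones did not exist
* Leakage caused by applying a transform to the whole dataset (fix: fit the transform on the training set only, then apply it to both the train and test data)
e.g. after normalizing on the full dataset, the model knows the maximum value is 1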
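
The fix for whole-dataset leakage can be sketched as below, assuming NumPy and scikit-learn. The random data and the 70/30 split are placeholders; the point is only that the scaler's minimum and maximum come from the training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 100 rows, 2 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Split first, before any statistics are computed from the data.
X_train, X_test = train_test_split(X, test_size=0.3, random_state=1)

# Fit the scaler on the training rows only ...
scaler = MinMaxScaler().fit(X_train)

# ... then apply the same fitted transform to both splits.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The training data spans exactly [0, 1]; the test data may fall slightly outside.
print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
print(X_test_scaled.min(axis=0), X_test_scaled.max(axis=0))
```

Calling `MinMaxScaler().fit(X)` on the full array instead would let the test rows influence the scaling of the training data, which is the leak described above.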

With k-fold cross-validation, use a pipeline to avoid data leakage.
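
A minimal sketch of that idea, assuming scikit-learn: wrapping the transform and the model in a `Pipeline` means the scaler is re-fit on the training folds only, inside every cross-validation split. The synthetic dataset, the scaler, and the logistic regression model are stand-ins chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical synthetic classification data.
X, y = make_classification(n_samples=300, n_features=10, random_state=7)

# Data preparation and model bundled together as one estimator.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("model", LogisticRegression()),
])

# For each fold, the whole pipeline is fit on the training part only,
# so the held-out part never influences the scaler.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv)

print(scores.mean(), scores.std())
```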
