Data preparation: Scaling

Scaling is a very important operation in machine learning. Improper scaling creates bias in the model. We don't want some data columns to have a much larger value range than others. For example, if we have people's age and salary, the salary of a person is going to be much greater than the age, so without scaling the salary column dominates and creates bias in the model.
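To see why this matters, here is a small sketch for the case of a model that compares rows by distance. The three people and their numbers are made up for illustration, and the choice of Euclidean distance is an assumption, not something every model uses.

```python
import math

# Hypothetical people described as (age in years, salary per year).
a = (25, 50_000)
b = (60, 50_500)   # 35 years older than a, nearly the same salary
c = (26, 52_000)   # almost the same age as a, slightly higher salary

# Euclidean distance on the raw, unscaled values.
dist_ab = math.dist(a, b)
dist_ac = math.dist(a, c)

print(f"a-b: {dist_ab:.1f}")
print(f"a-c: {dist_ac:.1f}")
```

On the raw values, b comes out closer to a (about 501) than c does (about 2000), even though b is 35 years older. The salary column decides almost everything and the age column is practically ignored.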

There are several methods used for scaling. The most common ones are:

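1. Z score

Taking the z score means rescaling the values so that the column has mean = 0 and standard deviation = 1. This is done by subtracting the column mean from each value and dividing by the column's standard deviation.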
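A minimal sketch of z score scaling using only the Python standard library. The function name `z_score_scale` and the sample ages and salaries are hypothetical, and using the population standard deviation is an assumption (some tools use the sample version instead).

```python
from statistics import mean, pstdev

def z_score_scale(values):
    """Return values rescaled to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        # A constant column has no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical sample data.
ages = [25, 32, 47, 51, 38]
salaries = [30000, 52000, 81000, 95000, 61000]

scaled_ages = z_score_scale(ages)
scaled_salaries = z_score_scale(salaries)

print(scaled_ages)
print(scaled_salaries)
```

After scaling, both columns are on the same footing, so neither one dominates just because of its units.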

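2. Min max

If the distribution is far from normal, the min max method is used. It rescales the values into a fixed range, usually 0 to 1, using the column's minimum and maximum.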
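A minimal sketch of min max scaling in plain Python. The function name `min_max_scale`, its `new_min`/`new_max` parameters, and the salary numbers are hypothetical, and the default target range of 0 to 1 is an assumption.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range; map everything to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]

# Hypothetical sample data.
salaries = [30000, 52000, 81000, 95000, 61000]
scaled = min_max_scale(salaries)

print(scaled)
```

The smallest salary becomes 0, the largest becomes 1, and everything else falls in between in the same proportions as before.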

Remember to deal with missing values, errors, and outliers before scaling, because they create bias.
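Here is a rough illustration of why outliers should be handled first. The salary figures and the single wrong entry of 5,000,000 are made up, and the helper `min_max_scale` is a hypothetical min max implementation written just for this example.

```python
def min_max_scale(values):
    """Rescale values linearly to the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical salaries, then the same list with one data-entry error added.
salaries = [30000, 52000, 61000, 81000, 95000]
with_outlier = salaries + [5_000_000]

clean = min_max_scale(salaries)
skewed = min_max_scale(with_outlier)

print(clean)
print(skewed)
```

In `skewed`, the single bad value takes the top of the range and every genuine salary is squeezed below 0.02, so the real differences between people nearly disappear. Cleaning the column first avoids this.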


