Errors and outliers are values in a dataset that lie far from the mean of the column they belong to; in other words, they are very small or very large compared with the other values in that column. Suppose we are looking at a price column and see a value that is ten times bigger than the others: it could be an error or a genuine outlier. To tell which one it is, we need to apply some domain knowledge (field knowledge). Outliers are mainly caused by variability in measurement or by experimental errors, but sometimes they are useful and help us understand the data better. For example, if we find an outlier in the price column of a cars dataset, we can check that car's other features; if it turns out to be a luxury car, we know the value is a useful outlier and not just an error.
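To make the car price example concrete, here is a minimal sketch in plain Python. The car records, the `segment` field and the cutoff of 2 standard deviations are all made-up assumptions for illustration, not a rule from this post; pick a cutoff that suits your own data.

```python
from statistics import mean, stdev

# Hypothetical cars data: nine ordinary cars and one very expensive one.
cars = [
    {"model": "A", "segment": "compact", "price": 18000},
    {"model": "B", "segment": "compact", "price": 21000},
    {"model": "C", "segment": "compact", "price": 19500},
    {"model": "D", "segment": "sedan", "price": 22000},
    {"model": "E", "segment": "sedan", "price": 20500},
    {"model": "F", "segment": "compact", "price": 17500},
    {"model": "G", "segment": "sedan", "price": 23000},
    {"model": "H", "segment": "compact", "price": 19000},
    {"model": "I", "segment": "sedan", "price": 21500},
    {"model": "J", "segment": "luxury", "price": 250000},
]

prices = [car["price"] for car in cars]
avg = mean(prices)
spread = stdev(prices)  # sample standard deviation of the price column

# Assumption: "far from the mean" means more than 2 standard deviations away.
THRESHOLD = 2.0

flagged = [car for car in cars if abs(car["price"] - avg) > THRESHOLD * spread]

# Show the other features of each flagged car so domain knowledge can decide
# whether it is an error or a useful outlier.
for car in flagged:
    print(car["model"], car["price"], car["segment"])
```

Here only car J is flagged, and its `segment` value says it is a luxury car, so the high price is a useful outlier and should not be thrown away as an error.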
Dealing with errors/outliers---
1. Detect errors/outliers -- use data exploration to identify errors/outliers
2. Methods to identify errors and outliers--
- use statistics (e.g. value counts if the column is categorical)
- use visualization/plots (scatter plots, histograms, etc.)
3. Treatment strategies--
Now let's say you have hunted down some errors and outliers; it's time to take care of them.
- Limit to min-max range -- set a minimum and a maximum for the column and limit its values to that range
- Use the same methods used for missing values (to check those-- https://www.blogger.com/blog/post/edit/950003373928384423/6745611066831558657)
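
The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `rare_categories` and `limit_to_range`, the `fuel` and `ages` columns, and the 0 to 30 range are all hypothetical. If you work with pandas, `Series.value_counts()` and `Series.clip()` do the same jobs.

```python
from collections import Counter


def rare_categories(values, min_count=2):
    """Return categories that appear fewer than min_count times (possible typos)."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n < min_count}


def limit_to_range(values, lower, upper):
    """Pull every value into [lower, upper]; values already inside stay unchanged."""
    return [min(max(v, lower), upper) for v in values]


# Detect: value counts on a categorical column expose a misspelled category.
fuel = ["petrol", "diesel", "petrol", "petorl", "diesel", "petrol"]
suspect = rare_categories(fuel)
print(suspect)

# Treat: limit a numeric column to an assumed valid range of 0 to 30 years.
ages = [3, 5, -1, 7, 120, 4]
cleaned = limit_to_range(ages, lower=0, upper=30)
print(cleaned)
```

A rare category is only a hint, not proof of an error, so check it with domain knowledge before fixing it. Likewise, limiting to a range keeps the row but changes the value, so choose the range from what you know is possible in your field.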