Duplicate values not only increase the size of a dataset but also introduce bias when training a model. Duplicated cases are overweighted, which biases the machine learning model. Suppose there is customer data in which one customer shows up 100 times and the others show up only once. This will definitely confuse the model and lead to wrong predictions.
Steps to remove duplicate values---
1.Explore the data and identify duplicate values (use exploratory analysis) [to learn about exploring data, check out other posts on the blog].
Identify duplicate cases using ---
-Unique id:-- if we are lucky, each entity has been given a unique id. Then it is a bit easier to remove duplicates.
-By value:-- identify duplicates using some value such as last name or address. Keep in mind that there could be two or more people with the same last name or address.
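Here is a minimal sketch of both ways of identifying duplicates, assuming the data is in a pandas DataFrame. The table and its column names (customer_id, last_name, address) are made up for illustration.

```python
import pandas as pd

# Hypothetical customer data; the column names are assumptions for this example.
customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 104],
    "last_name": ["Shah", "Patel", "Patel", "Patel", "Khan"],
    "address": ["12 Oak St", "7 Elm Rd", "7 Elm Rd", "99 Pine Ave", "3 Lake Dr"],
})

# Unique id: flag every row whose customer_id appears more than once.
# keep=False marks all copies, not just the later ones.
dup_by_id = customers[customers.duplicated(subset="customer_id", keep=False)]

# By value: last name alone is too loose, a different Patel gets flagged too.
dup_by_name = customers[customers.duplicated(subset="last_name", keep=False)]

# Combining several columns narrows the match down.
dup_by_name_address = customers[
    customers.duplicated(subset=["last_name", "address"], keep=False)
]

print(len(dup_by_id), len(dup_by_name), len(dup_by_name_address))
```

Matching on last name alone flags three rows here, while the id and the name plus address combination both flag only the two rows that really are the same customer.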
2.Removal strategy ---
-Keep most recent (or oldest):-- if you are keeping a record of when a customer last visited the bank, keep the row with the most recent date. If you are keeping a record of when the account was created, keep the row with the oldest date.
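-Keep first:-- keep the first occurrence of each duplicate, in the order the rows appear.
-Keep last:-- keep the last occurrence of each duplicate, in the order the rows appear.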
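The removal strategies above can be sketched with pandas drop_duplicates. This is only an illustration: the visits table and its columns (customer_id, visit_date) are made up, and your own data may need a different key.

```python
import pandas as pd

# Hypothetical bank-visit data; the column names are assumptions for this example.
visits = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "visit_date": pd.to_datetime(
        ["2023-01-05", "2023-03-10", "2023-02-01", "2023-01-15", "2023-04-20"]
    ),
})

# Keep first / keep last: decided purely by the order of the rows.
first = visits.drop_duplicates(subset="customer_id", keep="first")
last = visits.drop_duplicates(subset="customer_id", keep="last")

# Keep most recent: sort by date first, then keep the last row per customer.
most_recent = (
    visits.sort_values("visit_date")
    .drop_duplicates(subset="customer_id", keep="last")
)

# Keep oldest: same sort, but keep the first row per customer.
oldest = (
    visits.sort_values("visit_date")
    .drop_duplicates(subset="customer_id", keep="first")
)

print(most_recent.sort_values("customer_id"))
```

Notice that keep last and keep most recent are not the same thing. For customer 2 the last row in the table is the January visit, but the most recent visit is the February one, so sort by date before dropping whenever the date matters.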