
Data preparation: Removing duplicates

Duplicate values not only increase the size of a dataset but also introduce bias when training a model. Duplicated cases are overweighted, and that extra weight biases the trained machine learning model. Suppose there is customer data in which one customer shows up 100 times and the others show up only once. This will definitely skew the model and can lead to wrong predictions.
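To make the overweighting concrete, here is a minimal sketch using pandas. The data is made up for illustration: the column names `customer_id` and `balance` and all the numbers are assumptions, not taken from a real dataset. It compares a simple statistic (the mean balance) before and after dropping exact duplicate rows.

```python
import pandas as pd

# Hypothetical customer data: customer 1 appears 100 times,
# customers 2 to 5 appear once each.
rows = [{"customer_id": 1, "balance": 10_000}] * 100
rows += [{"customer_id": cid, "balance": 1_000} for cid in (2, 3, 4, 5)]
df = pd.DataFrame(rows)

# Fraction of all rows that belong to the single repeated customer.
share_of_customer_1 = (df["customer_id"] == 1).mean()

# The same statistic computed with and without the duplicate rows.
mean_with_duplicates = df["balance"].mean()
mean_deduplicated = df.drop_duplicates()["balance"].mean()

print(f"Rows from customer 1: {share_of_customer_1:.1%}")
print(f"Mean balance with duplicates: {mean_with_duplicates:.2f}")
print(f"Mean balance after dropping duplicates: {mean_deduplicated:.2f}")
```

In this toy data one customer accounts for almost all rows, so anything learned from the raw table mostly describes that one customer.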

Steps to remove duplicate values:
  1. Explore the data and identify duplicate values (use exploratory analysis; to learn about exploring data, check out other posts on the blog).
        Identify duplicate cases using:
            - Unique ID: if we are lucky, each entity has been given a unique ID, which makes it fairly easy to remove duplicates.
            - By value: identify duplicates using some value such as last name or address. Keep in mind that there could be two or more people with the same last name or address.
  2. Removal strategy:
            - Keep most recent (or oldest): if you are recording when a customer last visited the bank, keep the most recent date. If you are recording when the account was created, keep the oldest date.
            - Keep first
            - Keep last
There is no magic formula that tells you one strategy is best and another is worst. You have to think through which strategy will work for you after analysing the data.
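The steps above can be sketched with pandas. This is only an illustration under assumptions: the table, the column names (`customer_id`, `last_name`, `last_visit`) and the values are hypothetical, and pandas is one possible tool rather than the only way to do this. It uses `DataFrame.duplicated` to identify duplicates and `DataFrame.drop_duplicates` to apply each removal strategy.

```python
import pandas as pd

# Hypothetical bank-customer records; all names and values are made up.
df = pd.DataFrame(
    {
        "customer_id": [101, 102, 101, 103, 102],
        "last_name": ["Patil", "Shah", "Patil", "Patil", "Shah"],
        "last_visit": pd.to_datetime(
            ["2021-01-05", "2021-02-10", "2021-03-15", "2021-01-20", "2021-02-01"]
        ),
    }
)

# Step 1, by unique ID: keep=False marks every row of a repeated ID,
# not only the later copies, so all candidates can be inspected.
dup_by_id = df[df.duplicated(subset="customer_id", keep=False)]

# Step 1, by value: customer 103 shares a last name with customer 101
# but is a different person, so matching on last_name alone over-flags.
dup_by_name = df[df.duplicated(subset="last_name", keep=False)]

# Step 2, keep first / keep last: based purely on row order.
keep_first = df.drop_duplicates(subset="customer_id", keep="first")
keep_last = df.drop_duplicates(subset="customer_id", keep="last")

# Step 2, keep most recent / oldest: sort by date first, then keep
# the last (most recent) or first (oldest) row for each ID.
by_date = df.sort_values("last_visit")
most_recent = by_date.drop_duplicates(subset="customer_id", keep="last")
oldest = by_date.drop_duplicates(subset="customer_id", keep="first")

print(dup_by_id)
print(dup_by_name)
print(most_recent)
```

Note that in this sample "keep last" and "keep most recent" disagree for customer 102: the last row in the table is not the latest visit. Row order only matches recency if the data happens to be sorted by date, which is one reason to analyse the data before picking a strategy.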
