In a previous post I explained that duplicate values create bias while training a model, and walked through how to deal with them step by step. If you have not read that post, make sure you read it (link--https://www.blogger.com/blog/post/edit/950003373928384423/4720144211787135428), as this post is a practical implementation of the theory explained there. Or just continue reading, as I will explain each step while showing how it actually works in practice.
Steps---
1. Finding duplicate values through data exploration ---
Before we start finding duplicates, we first need to import the dataset and set the column names. Once the data is imported, we can start exploring it using dtypes, describe, and other methods.
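Here is a minimal sketch of that setup. The file name and column names below are placeholders I am assuming for illustration, not the actual dataset from the post:

```python
import pandas as pd

# Placeholder column names and file path -- substitute your own dataset.
cols = ["customer_id", "name", "purchase_amount"]
df = pd.read_csv("customers.csv", names=cols)

# Basic exploration before hunting for duplicates
print(df.dtypes)      # data type of each column
print(df.describe())  # summary statistics for numeric columns
print(df.head())      # first few rows
```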
See the image below and observe carefully: two shapes are printed, the first being the shape of the entire data frame and the second being the shape of the unique customer_id values. What we find is that the data frame has 12 more observations than there are unique values in the customer_id column.
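In code, that comparison looks roughly like the sketch below (the printed numbers are only illustrative):

```python
# Compare the shape of the full DataFrame with the number of
# unique customer_id values; the gap is the number of duplicate rows.
print(df.shape)                          # e.g. (1012, 3)
print(df["customer_id"].unique().shape)  # e.g. (1000,)

n_duplicates = df.shape[0] - df["customer_id"].nunique()
print(f"Duplicate rows by customer_id: {n_duplicates}")
```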
2. Now that we have found there are 12 duplicates, we just need to remove them, and pandas makes this very easy. We simply apply the drop_duplicates method and we are done. Remember, the strategy we are using is keeping the first value and dropping all the others.
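A short sketch of that step, again assuming the customer_id column from above identifies a unique customer:

```python
# Drop duplicate customer_id rows, keeping the first occurrence of each.
df_unique = df.drop_duplicates(subset="customer_id", keep="first")

# Verify: the row count should now match the number of unique customer_ids.
print(df_unique.shape)
print(df_unique["customer_id"].nunique())
```

If duplicates were defined by entire rows rather than one column, you could call drop_duplicates without the subset argument.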