
Posts

Showing posts with the label Exploratory data analysis

Data preparation : Dealing with missing values

Missing values are probably the most common headache you will have as a machine learning engineer. They are what derail the whole algorithm and make the model give weird results (predictions).

Treating missing values:

1. Use exploration to detect the missing values. Detecting them is crucial because a lot of machine learning models fail because of missing values.
2. Find out how the missing values are coded. Missing values could be coded in the data in one or more of the following formats:
    - NULL
    - a string or number, e.g. -9999, 0, "NA", "?", etc.
3. Treatment strategy:
    - If a column has a lot of missing values, it is better to get rid of that column.
    - Remove rows: if very few rows have missing values, remove those rows.
    - Forward or backward fill: sometimes it is just better to use a fill, which works by filling in the value of the nearest neighbou...
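
The three steps above can be sketched with pandas. This is only a minimal illustration: the toy data, the column names, the choice of sentinel codes, and the 50% threshold for dropping a column are all assumptions made for the example. Note that 0 is left alone here, because whether 0 means "missing" depends on the column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and sentinel codes are assumptions.
df = pd.DataFrame({
    "age": [25, -9999, 31, 40, 28],
    "city": ["Pune", "?", "Delhi", "NA", "Mumbai"],
    "notes": [None, None, None, "ok", None],
})

# Step 2: turn the sentinel codes into real missing values (NaN).
clean = df.replace([-9999, "?", "NA"], np.nan)

# Step 1: count the missing values in each column.
missing_per_column = clean.isna().sum()
print(missing_per_column)

# Step 3a: keep only columns where at most half of the values are missing.
mostly_present = clean.loc[:, clean.isna().mean() <= 0.5]

# Step 3b: drop the few rows that still contain a missing value.
rows_dropped = mostly_present.dropna()

# Step 3c: or forward fill, copying the previous row's value downward.
forward_filled = mostly_present.ffill()
```

Dropping rows and forward filling are alternatives, so in practice you would pick one of `rows_dropped` or `forward_filled` depending on how much data you can afford to lose.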

Data preparation : Removing duplicates

Duplicate values not only increase the size of the dataset but also create bias while training a model. Duplicate cases are overweighted and so bias the training of the machine learning model. Suppose there is customer data in which one customer shows up 100 times and the others show up only once; this will definitely confuse the model and result in wrong predictions.

Steps to remove duplicate values:

Explore the data and identify duplicate values (use exploratory analysis) [to learn about exploring data, check out the other posts on the blog].

Identify duplicate cases using:
    - Unique ID: if we are lucky, each entity has been given a unique ID, and then it is a bit easier to remove duplicates. ...
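
When a unique ID is available, finding and removing duplicates could look like the sketch below. The `customers` table and its `customer_id` and `spend` columns are made up for illustration, and keeping the first occurrence is just one possible choice.

```python
import pandas as pd

# Hypothetical customer data; customer 101 appears three times.
customers = pd.DataFrame({
    "customer_id": [101, 102, 101, 103, 101],
    "spend": [250, 90, 250, 40, 250],
})

# Mark every repeat of a customer_id after its first occurrence.
duplicate_mask = customers.duplicated(subset="customer_id", keep="first")
n_duplicates = int(duplicate_mask.sum())
print(f"duplicate rows found: {n_duplicates}")

# Keep only the first row for each customer_id.
deduplicated = customers.drop_duplicates(subset="customer_id", keep="first")
print(deduplicated)
```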

Learn Data preparation steps for machine learning just in 5 min..!!!

Machine learning is a hot topic nowadays and will remain one in the future. Data preparation is key when it comes to machine learning. With good data preparation, a simple ML model can give very good and satisfying results, but if the data is not well prepared, even a sophisticated, high-quality ML algorithm (one with good prediction precision) is going to fail: your model will just take garbage input and give garbage output. Keep in mind that data preparation often makes more difference than the algorithm itself. The ultimate goal of data preparation is to ensure that the machine learning model works in an optimal way.

Steps of data preparation:

- Explore to understand the problems in the data. Many methods are used to explore data; head(), tail() and describe(), along with the shape attribute, are just a few examples, and we can also use plots for this purpose.
- Remove duplicates
- Treat missing values
- Treat errors and outliers
- Treat null values
- Scale the features
- Split the dataset

Now understand that we don't just do these...
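
A few of the steps listed above can be sketched in pandas as follows. The toy dataframe, the column names, standardization as the scaling method, and the 80/20 split are all assumptions chosen for the example, not a fixed recipe.

```python
import pandas as pd

# Hypothetical toy data; the last row repeats the second one.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0, 160.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0, 90.0, 60.0],
    "label": [0, 0, 1, 1, 1, 0],
})

# Explore: first rows, last rows, dimensions, summary statistics.
print(df.head(3))
print(df.tail(3))
n_rows, n_cols = df.shape  # shape is an attribute, so no parentheses
print(df.describe())

# Remove exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# Scale the features: subtract the mean, divide by the standard deviation.
features = df[["height_cm", "weight_kg"]]
scaled = (features - features.mean()) / features.std()

# Split the dataset: 80% of the rows for training, the rest for testing.
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)
```

The sketch scales before splitting only to follow the order of the list; a common variation is to compute the mean and standard deviation from the training rows alone and reuse them on the test rows.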

What are frequency tables for categorical values in a dataframe?

Frequency tables show how frequently the various categories of a categorical variable occur in the data, how many different categories there are, and which categories those are. They are also useful in classification for finding the frequency of each category of the label variable (column). This helps us separate helpful categories from not-so-helpful ones: if some category of a categorical variable occurs just once or twice, it is not going to be helpful from a statistical point of view.

Let's see how to make frequency tables. First I downloaded the auto_prices dataset, then I took out some categorical columns and created a list of those columns. This list, along with the dataset, is passed to the count unique function. That function simply loops through each column in the list, counts the number of times each unique value occurs in the column, and finally prints the result. The above code gives the following frequency table: Examining classe...
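
A function like the one described could be sketched as below. The name `count_unique`, the small dataframe standing in for the auto_prices dataset, and its column names and values are assumptions made for this illustration; the counting itself relies on pandas `value_counts()`.

```python
import pandas as pd

# Hypothetical stand-in for the auto_prices dataset.
auto_prices = pd.DataFrame({
    "fuel_type": ["gas", "gas", "diesel", "gas"],
    "body_style": ["sedan", "hatchback", "sedan", "sedan"],
    "price": [13495, 16500, 18920, 7099],
})


def count_unique(df, cols):
    """Print and return a frequency table for each column in cols."""
    tables = {}
    for col in cols:
        # value_counts() counts how often each unique value occurs.
        tables[col] = df[col].value_counts()
        print(f"\nFrequency table for column {col}")
        print(tables[col])
    return tables


# List of categorical columns to examine.
cat_cols = ["fuel_type", "body_style"]
tables = count_unique(auto_prices, cat_cols)
```

Returning the tables as well as printing them makes it easy to filter out rare categories afterwards, for example those that occur only once or twice.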