
Posts

Showing posts from May, 2020

Feature Engineering is easy ..!!

Features are the inputs to a machine learning algorithm, and the raw features we are given are usually not the best ones; they are just whatever happens to be in the data. The goal of feature engineering is to convert the given features into much more predictive ones, so that they can predict the label more precisely. Feature engineering can make a simple algorithm give good results; at the same time, if you apply the best algorithm but do not perform feature engineering well, you are going to get poor results. Feature engineering is a broad subject; some people dedicate their entire careers to it. There are some steps in feature engineering that we need to follow, and usually repeat, to get the job done.

Steps:
1. Explore and understand data relationships
2. Transform features
3. Compute new features from others by applying some maths to them
4. Visualize to check the results
5. Test with an ML model
6. Repeat the above steps as needed

Transforming features. Why transform a feature?
1. To improve the distribution ...
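As a minimal sketch of that first reason, here is one common transformation for improving a skewed distribution: a log transform. The DataFrame and its price column are hypothetical, not from the post.

    import numpy as np
    import pandas as pd

    # Hypothetical example: a right-skewed 'price' column.
    df = pd.DataFrame({'price': [5000, 7500, 10000, 12000, 45000, 150000]})

    # A log transform compresses large values, often pulling a
    # right-skewed distribution closer to normal.
    df['log_price'] = np.log(df['price'])

    print(df)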

Data preparation: Dealing with duplicates in data (practical example)

In a previous post I said that duplicate values create bias while training a model, and I explained how to deal with duplicate values step by step. If you have not read that post, make sure you read it (link: https://www.blogger.com/blog/post/edit/950003373928384423/4720144211787135428), as this post is going to be a practical implementation of the theory explained there. Or just continue reading, as I shall explain while showing how it actually works in practice.

Steps:
1. Finding duplicate values through data exploration. Before we start finding duplicates, we need to import the dataset and set the columns. Once the data is imported we can start exploring it; use dtypes, describe and other methods to explore the data. See the image below and observe carefully: you will find that we have printed two shapes, the first one being the shape of the entire data frame and the other being the shape of the unique customer_id values. What we found is that there are 12 more observations in ...
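A minimal sketch of that shape comparison, assuming the data lives in a hypothetical customers.csv with a customer_id column:

    import pandas as pd

    # The file name and column name are assumptions for illustration.
    df = pd.read_csv('customers.csv')

    print(df.shape)                          # shape of the entire data frame
    print(df['customer_id'].unique().shape)  # shape of unique customer_id values

    # If the two row counts differ, some customers appear more than once.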

Data preparation: Scaling

Scaling is a very important operation for machine learning; improper scaling creates bias in the model. What we don't want is a data column with a very large value range. For example, if we have people's age and salary, the salary of a person is going to be much greater than the age, and hence it will create bias in the model.

There are several methods used for scaling. The most common ones are:
1. Z score: taking the z score usually means normalizing the values to mean = 0, standard deviation = 1.
2. Min-max: if the distribution is far from normal, then the min-max method is used.

Remember to deal with missing values, errors and outliers before scaling, because they create bias.
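A minimal sketch of both methods using scikit-learn; the age/salary numbers are made up for illustration:

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    # Hypothetical age/salary data with very different value ranges.
    X = np.array([[25, 30000],
                  [35, 80000],
                  [45, 120000]], dtype=float)

    # Z score scaling: each column ends up with mean 0, std dev 1.
    z_scaled = StandardScaler().fit_transform(X)

    # Min-max scaling: each column squeezed into [0, 1].
    mm_scaled = MinMaxScaler().fit_transform(X)

    print(z_scaled)
    print(mm_scaled)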

Data preparation: Treating errors and outliers

Errors and outliers are those values in a dataset which are far away from the mean of the column in which they exist, or, we could say, very small or very large compared to the other values in that column. Suppose we are observing a price column and we see a value that is ten times bigger than the other values; it could be an error or an outlier. To identify whether it is an error or an outlier, we need to apply some domain knowledge, or, we could say, field knowledge. Outliers are mainly caused by variability in measurement or by experimental errors, but sometimes they can be useful and help us understand the data better if we apply domain knowledge. For example, say we find an outlier in the price column of a cars dataset; we check its other features and find that the car is a luxury car. Then we understand it was an outlier, but a useful one, and not just an error.

Dealing with errors/outliers:
1. Detect errors/outliers: use data exploration to identify errors/outliers.
2. Methods to identify errors and outliers ...
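The excerpt cuts off before the post's own identification methods, so as a hedged illustration here is one common method (not necessarily the one the post uses): the IQR rule, with a hypothetical price column.

    import pandas as pd

    # Hypothetical data: one price roughly ten times the others.
    df = pd.DataFrame({'price': [9000, 11000, 10500, 9800, 105000]})

    # IQR rule: flag values far outside the middle 50% of the data.
    q1, q3 = df['price'].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (df['price'] < q1 - 1.5 * iqr) | (df['price'] > q3 + 1.5 * iqr)

    print(df[mask])  # rows flagged as potential errors/outliers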

Data preparation: Dealing with missing values

Missing values are probably the most common headache you are going to have as a machine learning engineer. Missing values are the ones that screw up the whole algorithm and make the model give weird results (predictions).

Treating missing values:
1. Use exploration to detect the missing values. Detecting missing values is crucial, because a lot of machine learning models fail because of missing values.
2. Find how the missing values are coded. Missing values could be coded in the data in one or more of the following formats:
   - NULL
   - a string or number, e.g. -9999, 0, "NA", "?", etc.
3. Treatment strategy:
   - If some column has a lot of missing values, then it is better to get rid of that column.
   - Remove rows: if very few rows have missing values, then remove those rows.
   - Forward or backward fill: sometimes it is better to use a fill, which works by filling a null cell with the value of its nearest neighbours. It is useful when the data is in some order, say in order of ...
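A minimal sketch of these treatment strategies in pandas; the file name, column names and placeholder codes are assumptions:

    import numpy as np
    import pandas as pd

    df = pd.read_csv('data.csv')

    # First, recode placeholder strings/numbers as real NaN values.
    df = df.replace(['?', 'NA', -9999], np.nan)

    print(df.isnull().sum())  # count missing values per column

    df = df.drop(columns=['mostly_empty'])  # drop a column with too many gaps
    df = df.dropna(subset=['label'])        # drop the few rows missing the label
    df['price'] = df['price'].ffill()       # forward fill an ordered column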

Data preparation: Removing duplicates

Duplicate values not only increase the size of the dataset but also create bias while training a model. Duplicate cases are over-weighted, and thus create bias when training a machine learning model. Suppose there is customer data and one customer showed up 100 times while the others showed up only once; this will definitely confuse the model, resulting in wrong predictions.

Steps to remove duplicate values:
1. Explore the data and identify duplicate values (use exploratory analysis) [to learn about exploring data, check out other posts on the blog]. Identify duplicate cases using:
   - Unique id: if we are lucky enough, each entity has been given a unique id; then it is a bit easier to remove duplicates.
   - By value: identify duplicates using some value such as last name or address; keep in mind there could be two or more people with the same last name or address.
2. Removal strategy: ...
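A minimal sketch of both identification routes using pandas drop_duplicates; the file and column names are assumptions:

    import pandas as pd

    df = pd.read_csv('customers.csv')

    # By unique id: keep the first occurrence of each customer_id.
    df = df.drop_duplicates(subset='customer_id', keep='first')

    # By value: fall back to columns such as last name and address,
    # accepting that distinct people can share these.
    df = df.drop_duplicates(subset=['last_name', 'address'])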

Learn Data preparation steps for machine learning just in 5 min..!!!

Machine learning is a hot topic nowadays and will be in the future, and data preparation is key when it comes to machine learning. With good data preparation, a simple ML model can give very good and satisfying results; but if the data is not well prepared and you use a very sophisticated ML algorithm (with good prediction precision), it is going to fail. Your model will just take garbage input and give garbage output. Keep in mind that data preparation often makes more difference than the algorithm itself. The ultimate goal of data preparation is to ensure that the machine learning model works in an optimal way.

Steps of data preparation:
1. Explore to understand problems in the data. Many methods are used to explore data; head(), tail(), shape and describe() are just examples of such methods, and we can also use plots for this purpose (a short sketch of these methods appears below).
2. Remove duplicates
3. Treat missing values
4. Treat errors and outliers
5. Treat null values
6. Scale the features
7. Split the dataset

Now, understand that we don't just do these steps ...
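A minimal sketch of the exploration step in pandas; the auto_prices.csv file name is an assumption:

    import pandas as pd

    df = pd.read_csv('auto_prices.csv')

    print(df.head())      # first few rows
    print(df.tail())      # last few rows
    print(df.shape)       # (rows, columns)
    print(df.describe())  # summary statistics for numeric columns
    print(df.dtypes)      # column types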

What are Frequency Tables for categorical values in a dataframe?

Frequency tables are tables that show how frequently the various categories of a categorical variable occur in the data, how many different categories there are, and which categories those are. They are also useful in classification for finding the frequency of each category of the label variable (column). This helps us separate helpful categories from not-so-helpful ones: suppose some category of a categorical variable occurs just once or twice; then it is not going to be helpful from a statistical point of view.

Let's see how to make frequency tables. First I downloaded the auto_prices dataset, then I took out some categorical columns and created a list of those columns; this list, along with the dataset, is passed to the count_unique function. That function simply loops through each column in the list, counts the number of times each unique value occurs in the column, and finally prints the result. That code produces the frequency table shown in the post.

Examining classes and class imbalances: for cl ...
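The excerpt does not include the code itself, so here is a hedged sketch of what such a count_unique function could look like, built on pandas value_counts; the file and column names are assumptions:

    import pandas as pd

    def count_unique(df, cols):
        """Print a frequency table for each categorical column in cols."""
        for col in cols:
            print('\nFrequency table for ' + col)
            print(df[col].value_counts())

    # Hypothetical usage with the auto_prices dataset.
    auto_prices = pd.read_csv('auto_prices.csv')
    cat_cols = ['make', 'fuel_type', 'body_style']
    count_unique(auto_prices, cat_cols)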

Data visualization for classification

The aim of data visualization is somewhat different for classification than for regression. In classification we have to find how the different attributes (numeric and categorical) are related to the categorical label, or, we could say, to the different categories of the label.

The following are some techniques used in data visualization for classification.

Visualize class separation using a numeric feature: the goal of visualizing data for classification is to understand which feature is useful for class separation. We can use a box plot for this purpose. The code in the post demonstrates how to plot box plots with the different features on one axis and the label, which is bad credit in this case, on the other. num_col is a list of the numeric columns of the dataframe, which is named cred ...
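A hedged sketch of such box plots using seaborn; the credit.csv file, the bad_credit label and the num_col list are assumptions standing in for the post's data, and the label is placed on the x axis, one common layout:

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    credit = pd.read_csv('credit.csv')
    num_col = ['loan_amount', 'age', 'loan_duration_mo']

    # One box plot per numeric feature, split by the class label,
    # to see which features separate the classes well.
    for col in num_col:
        sns.boxplot(x='bad_credit', y=col, data=credit)
        plt.title('Class separation for ' + col)
        plt.show()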

Learn Which Are The Best Plots For Data Visualization in just 5 min?

Python is a cool language when it comes to machine learning, and data visualization is key to building the best machine learning model. Without data visualization, machine learning is just a waste of time; data visualization helps us understand the data and the relationships between features. Here we will learn which are the best plots for visualizing each type of data.

Data is basically of two types:
1. Numeric
2. Categorical

1. Draw the distribution of a single feature. If the feature is of categorical type, then it is better to use bar plots. In the diagram given below you can see the different company names that produce automobiles; these company names are categorical, and the graph shows how many autos each company manufactured. If the feature is numerical, it is better to draw a histogram with bins. In the diagram below you can see that engine size is plotted against the number of autos; notice that engine size is a numerical feature. If the feature is numerical, then there are two more things we can do to ...
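A minimal sketch of both plot types with pandas and matplotlib; the file and column names are assumptions:

    import matplotlib.pyplot as plt
    import pandas as pd

    autos = pd.read_csv('auto_prices.csv')

    # Categorical feature: bar plot of autos per manufacturer.
    autos['make'].value_counts().plot(kind='bar')
    plt.title('Autos per manufacturer')
    plt.show()

    # Numeric feature: histogram of engine size with bins.
    autos['engine_size'].plot(kind='hist', bins=20)
    plt.title('Distribution of engine size')
    plt.show()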