
Data visualization for classification

The aim of data visualization is somewhat different for classification than for regression. In classification, we want to understand how the different attributes (numeric and categorical) relate to the categorical label, that is, to the different classes.
The following are some techniques used for data visualization for classification:
  1. Visualize class separation using numeric features. The goal of visualizing data for classification is to understand which features are useful for separating the classes. Box plots are a good tool for this purpose.
     A common approach is to loop over a list num_col of the numeric columns of a DataFrame named credit and draw one box plot per feature, with the label (bad_credit in this case) on the x axis and the feature values on the y axis. In the resulting plots, 0 on the x axis indicates good credit and 1 indicates bad credit. As an alternative to the box plot, a violin plot can be used to show the separation between the labels.
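The original code is not reproduced here, so the following is a minimal sketch of the loop described above. It assumes a pandas DataFrame named credit with a bad_credit label and uses seaborn; the synthetic data and the column names loan_amount and age are illustrative, not from the original post.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Illustrative stand-in for the credit DataFrame used in the post.
rng = np.random.default_rng(42)
n = 200
credit = pd.DataFrame({
    "loan_amount": rng.normal(5000, 1500, n),
    "age": rng.normal(40, 10, n),
    "bad_credit": rng.integers(0, 2, n),  # 0 = good credit, 1 = bad credit
})
num_col = ["loan_amount", "age"]

# One box plot per numeric feature: label on the x axis, feature on the y axis.
figs = []
for col in num_col:
    fig, ax = plt.subplots()
    sns.boxplot(x="bad_credit", y=col, data=credit, ax=ax)
    ax.set_xlabel("bad_credit (0 = good, 1 = bad)")
    figs.append(fig)
    plt.close(fig)  # use plt.show() instead when working interactively
```

A feature whose boxes for the two classes barely overlap is a promising candidate for class separation.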
  2. Violin plots. The same loop can create a violin plot for each attribute in num_col. A violin plot combines a box plot with a kernel density estimate, so it also shows the shape of each feature's distribution within each class.
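Again the original code is missing, so this is a hedged sketch of the violin-plot variant, with the same illustrative credit DataFrame and column names assumed as above (they are not from the original post).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for non-interactive runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Illustrative stand-in data; 0 = good credit, 1 = bad credit.
rng = np.random.default_rng(0)
n = 200
credit = pd.DataFrame({
    "loan_amount": rng.normal(5000, 1500, n),
    "age": rng.normal(40, 10, n),
    "bad_credit": rng.integers(0, 2, n),
})
num_col = ["loan_amount", "age"]

# One violin plot per numeric feature; the width of each "violin"
# shows the estimated density of the feature within that class.
figs = []
for col in num_col:
    fig, ax = plt.subplots()
    sns.violinplot(x="bad_credit", y=col, data=credit, ax=ax)
    figs.append(fig)
    plt.close(fig)
```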
  3. Visualizing class separation using categorical features. Now it is time to visualize the ability of categorical features to separate the classes. The best way to do this is with side-by-side bar plots.
     The approach is to draw bar plots side by side, one bar per category, for each categorical feature. First, declare a list of the categorical columns (features). Then create a new column named dummy to keep a count of each category. Loop over each categorical column in the dataset, and use a groupby operation to group the data by the label (in this case bad_credit) and the category, counting the number of observations in each group with the count method applied immediately after the groupby. Finally, create a figure, filter the counts for bad_credit equal to 0 and draw them in one subplot, then do the same for bad_credit equal to 1 in a second subplot, so the two bar plots sit side by side.
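The steps above can be sketched as follows. The original code is not available, so this is a minimal reconstruction; the credit DataFrame, its purpose and housing columns, and the values in them are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the credit DataFrame.
credit = pd.DataFrame({
    "purpose": ["car", "education", "car", "appliances", "education", "car"],
    "housing": ["own", "rent", "own", "own", "rent", "rent"],
    "bad_credit": [0, 1, 0, 1, 1, 0],
})

# 1. Declare the list of categorical columns.
cat_cols = ["purpose", "housing"]

# 2. New column to keep a count of each category.
credit["dummy"] = 1

figs = []
for col in cat_cols:                       # 3. Loop over categorical columns.
    # 4. Group by label and category, counting observations per group.
    counts = credit.groupby(["bad_credit", col])["dummy"].count()

    # 5. Side-by-side bar plots: bad_credit == 0 on the left, == 1 on the right.
    fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
    counts.loc[0].plot.bar(ax=ax0, title=f"{col} (bad_credit = 0)")
    counts.loc[1].plot.bar(ax=ax1, title=f"{col} (bad_credit = 1)")
    figs.append(fig)
    plt.close(fig)
```

If the category proportions look very different between the two subplots, the feature is likely useful for classification.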
