K-Means Clustering Algorithm

Introduction

K-Means Clustering is an unsupervised machine learning algorithm that groups an unlabeled dataset into a chosen number of clusters. Here K defines the number of clusters to create: if K=3, there will be three clusters; if K=4, there will be four clusters; and so on.



How Does the K-Means Clustering Algorithm Work in ML?

The working of the K-Means algorithm is explained in the steps below:

Step-1: Choose the value of K, which determines how many clusters will be created.

Step-2: Select K random points as the initial centroids.

Step-3: Assign each data point to its closest centroid, which forms the K clusters.

Step-4: Compute the mean of the points in each cluster (the point that minimizes the variance within that cluster) and place the cluster's new centroid there.

Step-5: Repeat Step-3, that is, reassign each data point to the new closest centroid.

Step-6: If any reassignment occurred, go back to Step-4; otherwise, go to FINISH.

Step-7: The model is now ready to use.
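
To make the steps concrete, here is a minimal from-scratch sketch in NumPy. It is only an illustration, not the code used later in this post: the function name kmeans, the max_iters safety limit, and the seed parameter are assumptions added for the example, and it uses Euclidean distance with the cluster mean as the new centroid.

```python
import numpy as np


def kmeans(X, k, max_iters=100, seed=0):
    """Plain K-Means on an (n_samples, n_features) array (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step-2: pick K distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iters):
        # Step-3 / Step-5: assign every point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step-6: stop as soon as no point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step-4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

Calling kmeans(X, k=3) on a two-column array returns the final centroids and one cluster label per row. In practice a library implementation such as scikit-learn's KMeans is the usual choice.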


We use the Elbow Method to select the value of K. The Elbow Method is one of the most popular ways to find a suitable number of clusters. It relies on the WCSS (Within-Cluster Sum of Squares) value, which measures the total variation within the clusters: the sum of squared distances between each data point and the centroid of its cluster.
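
Written as a formula, WCSS can be expressed as follows. The notation is introduced here only for illustration: C_j is the set of points assigned to cluster j, mu_j is the centroid of that cluster, and K is the number of clusters.

```latex
\mathrm{WCSS} = \sum_{j=1}^{K} \; \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2}
```

WCSS generally shrinks as K grows, since more centroids sit closer to their points, so we look for the value of K after which the decrease slows down sharply.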

When WCSS is plotted against K, the curve shows a sharp bend that looks like the elbow of an arm, which is why it is known as the elbow method. The image below is the graph for the elbow method.



Now we’ll discuss the practical implementation of the K-means algorithm.

Before starting the implementation, let's look at the Mall Customers dataset. It describes customers who visit and spend time in a mall. The dataset has the features CustomerID, Gender, Age, Annual Income (k$), and Spending Score (1-100), but we will use only two of them, Annual Income (k$) and Spending Score (1-100), to solve this unsupervised machine learning problem.


Let's start:

First, we import the libraries our model needs, such as pandas and numpy, as part of data pre-processing. Then we import the dataset used to train the model. Here, we are using the Mall_Customer.csv dataset.
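
A minimal sketch of this step could look like the following. The helper name load_mall_customers is hypothetical, and the default path assumes that Mall_Customer.csv sits in the working directory.

```python
import pandas as pd


def load_mall_customers(path="Mall_Customer.csv"):
    """Read the customer CSV into a DataFrame (path is an assumption)."""
    return pd.read_csv(path)
```

After loading, dataset = load_mall_customers() followed by dataset.head() is a quick way to check that the expected columns are present.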



Because this is a clustering (unsupervised) problem, there is no dependent variable to separate out in the data pre-processing step. We simply extract our two features, Annual Income (k$) and Spending Score (1-100).
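
One way to do the extraction is to select the two columns by name, as sketched below. The helper extract_features and the constant FEATURE_COLUMNS are names made up for this example; selecting by name assumes the CSV headers match the feature names listed above exactly.

```python
import pandas as pd

# Column names as listed in the dataset description above.
FEATURE_COLUMNS = ["Annual Income (k$)", "Spending Score (1-100)"]


def extract_features(dataset):
    """Return the two clustering features as an (n_samples, 2) NumPy array."""
    return dataset[FEATURE_COLUMNS].to_numpy()
```

Selecting by name rather than by column position keeps the code working even if the column order in the file changes.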



Now we'll use the Elbow Method, discussed above, to find the optimal number of clusters. The elbow method draws a plot with WCSS values on the Y-axis and the number of clusters on the X-axis, so we calculate the WCSS for K values ranging from 1 to 10.


We use a for loop to iterate over the values of K from 1 to 10. After executing the code, we get output like the image below, where we can see that the elbow point is at 5. So the number of clusters will be 5.
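
The loop described above can be sketched with scikit-learn as follows. This is an illustration under assumptions, not necessarily the exact original code: the function names compute_wcss and plot_elbow are hypothetical, and the settings init="k-means++", n_init=10 and random_state=42 are choices made for the example. The fitted model's inertia_ attribute holds the WCSS value.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans


def compute_wcss(X, k_values=range(1, 11), random_state=42):
    """Fit K-Means for each K and collect the WCSS (inertia_) values."""
    wcss = []
    for k in k_values:
        model = KMeans(n_clusters=k, init="k-means++", n_init=10,
                       random_state=random_state)
        model.fit(X)
        wcss.append(model.inertia_)  # sum of squared distances to centroids
    return wcss


def plot_elbow(k_values, wcss):
    """Plot WCSS (Y-axis) against the number of clusters (X-axis)."""
    fig, ax = plt.subplots()
    ax.plot(list(k_values), wcss, marker="o")
    ax.set_title("The Elbow Method")
    ax.set_xlabel("Number of clusters (K)")
    ax.set_ylabel("WCSS")
    return ax
```

With the feature array X, calling plot_elbow(range(1, 11), compute_wcss(X)) and then plt.show() displays the curve on which we look for the bend.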



Now we’ll train the model with K=5 to solve the unsupervised problem.
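
A minimal sketch of the training step with scikit-learn is shown below. The function name fit_kmeans is hypothetical, and init="k-means++", n_init=10 and random_state=42 are assumed settings for reproducibility rather than values taken from the original code.

```python
from sklearn.cluster import KMeans


def fit_kmeans(X, n_clusters=5, random_state=42):
    """Fit K-Means and return the model plus one cluster label per row."""
    model = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10,
                   random_state=random_state)
    labels = model.fit_predict(X)  # labels are integers from 0 to n_clusters - 1
    return model, labels
```

The returned model exposes the final centroids as model.cluster_centers_, which is useful when plotting the result.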


The last step is to visualize the clusters. As we have 5 clusters, we will plot each cluster one by one using matplotlib's plt.scatter() function. In the output below, you can see that the data has been grouped into 5 clusters.
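
The plotting step can be sketched as below. The function name plot_clusters, the marker sizes, and the choice to draw the centroids in black are assumptions for the example; X is the two-column feature array, labels holds one cluster index per row, and centers holds the centroids.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_clusters(X, labels, centers):
    """Scatter-plot each cluster in turn, then overlay the centroids."""
    plt.figure()
    for cluster_id in np.unique(labels):
        points = X[labels == cluster_id]  # rows belonging to this cluster
        plt.scatter(points[:, 0], points[:, 1], s=40,
                    label=f"Cluster {cluster_id + 1}")
    plt.scatter(centers[:, 0], centers[:, 1], s=200, c="black", marker="X",
                label="Centroids")
    plt.title("Clusters of customers")
    plt.xlabel("Annual Income (k$)")
    plt.ylabel("Spending Score (1-100)")
    plt.legend()
    return plt.gca()
```

With a fitted scikit-learn model, plot_clusters(X, labels, model.cluster_centers_) followed by plt.show() displays the figure.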



Source Code:

  1. Go to GitHub and download or fork the repo :
  2. Then open the .ipynb file in Jupyter Notebook.


Thank you!


