In this tutorial, we implement a salary predictor using a machine learning algorithm. The model predicts an employee's salary from their years of experience. This is a regression problem, which we solve with a LinearRegression model. The model is built in the following steps:

Import all required libraries

Load the data

Perform EDA

Visualize the data

Prepare the data

Split the data into training and testing sets

Define and train the model

Test the model

Check the accuracy

Save the model

I used a LinearRegression model, which gives about 97% accuracy on the testing data. It is a simple model that beginners can easily understand. Read on for a complete explanation of the project.

Let’s start:

First, import the required libraries: pandas, numpy, seaborn, sklearn, matplotlib, and so on. Then load the data (the Salary.csv file) that will be used to train and test the model, and show the top five rows of employee data. The data has two columns: YearExperience and Salary. We take YearExperience as the input feature and Salary as the output.
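As a rough sketch of the loading step, the snippet below reads CSV data with pandas and prints the first five rows. The real Salary.csv is not reproduced here, so the rows in csv_text are made-up placeholders; only the column names YearExperience and Salary come from the description above.

```python
import io

import pandas as pd

# Made-up placeholder rows standing in for Salary.csv (not the real data).
csv_text = """YearExperience,Salary
1.0,38000
2.0,46500
3.0,54000
4.0,64000
5.0,71500
6.0,82000
"""

# With the real file on disk this would be: pd.read_csv("Salary.csv")
df = pd.read_csv(io.StringIO(csv_text))

top_rows = df.head()  # head() returns the first five rows by default
print(top_rows)
```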

Now perform Exploratory Data Analysis (EDA). First check whether any null values are present, then look at the information about the data (column types and non-null counts), and then describe the data, which shows the mean, standard deviation, minimum, maximum, and so on.
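These three checks could look roughly like the following. The DataFrame is built from made-up placeholder values so the snippet runs on its own; with the real data, df would come from Salary.csv.

```python
import pandas as pd

# Made-up placeholder values standing in for Salary.csv (not the real data).
df = pd.DataFrame({
    "YearExperience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Salary": [38000, 46500, 54000, 64000, 71500,
               82000, 90500, 98000, 108500, 116000],
})

# 1. Count null values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# 2. Print column types and non-null counts (info() prints directly).
df.info()

# 3. Summary statistics: count, mean, std, min, quartiles, max.
summary = df.describe()
print(summary)
```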

Now visualize YearExperience against Salary using the matplotlib scatter function.
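A minimal sketch of such a scatter plot is shown below. The data values are made-up placeholders, and the axis labels and title are my own choices rather than the original plot's.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up placeholder values standing in for Salary.csv (not the real data).
df = pd.DataFrame({
    "YearExperience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Salary": [38000, 46500, 54000, 64000, 71500,
               82000, 90500, 98000, 108500, 116000],
})

fig, ax = plt.subplots()
ax.scatter(df["YearExperience"], df["Salary"])  # one point per employee
ax.set_xlabel("YearExperience")
ax.set_ylabel("Salary")
ax.set_title("Salary vs. years of experience")
```

In a notebook, the matplotlib.use("Agg") line can be dropped and plt.show() called to display the figure.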

Now it's time to prepare the data by dividing it into the independent and dependent features. X stores the independent feature (YearExperience) and y stores the dependent feature (Salary).
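One way to do this with pandas is sketched below, again on made-up placeholder values. Note the double brackets for X: scikit-learn estimators expect a two-dimensional feature table, even when there is only one feature.

```python
import pandas as pd

# Made-up placeholder values standing in for Salary.csv (not the real data).
df = pd.DataFrame({
    "YearExperience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Salary": [38000, 46500, 54000, 64000, 71500,
               82000, 90500, 98000, 108500, 116000],
})

X = df[["YearExperience"]]  # double brackets keep X two-dimensional: (rows, 1)
y = df["Salary"]            # single brackets give a one-dimensional Series

print(X.shape, y.shape)
```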

Then split the data into training and testing sets using the train_test_split function, which takes parameters such as X, y, random_state and test_size. X is the independent feature and y is the dependent feature. random_state seeds the random shuffling so that the same split is produced on every run, and test_size sets the fraction of the data held out for testing. For example, if test_size is 20%, the training size is automatically 80%. train_test_split returns four values: X_train, X_test, y_train and y_test. X_train stores the independent feature and y_train stores the dependent feature for the training set; both are used to train the model. X_test stores the independent feature and y_test stores the dependent feature for the testing set; both are used to test and evaluate the model.
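The split described above can be sketched as follows. The data is a made-up placeholder, and test_size=0.2 and random_state=42 are assumed values for illustration; the original settings are not stated.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up placeholder values standing in for Salary.csv (not the real data).
df = pd.DataFrame({
    "YearExperience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Salary": [38000, 46500, 54000, 64000, 71500,
               82000, 90500, 98000, 108500, 116000],
})
X = df[["YearExperience"]]
y = df["Salary"]

# Assumed settings: hold out 20% for testing, fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 8 training rows and 2 testing rows out of 10
```

Running the split again with the same random_state gives the same rows in each set, which is what makes results comparable between runs.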

Now define the LinearRegression model with its default parameters and train it on the training data (X_train and y_train). Then test the model on the testing data (X_test) and display the predicted and actual values.
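A minimal sketch of defining, training and testing the model is given below. It uses made-up placeholder data and assumed split settings (test_size=0.2, random_state=42), so the printed numbers will not match the results from the real Salary.csv.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up placeholder values standing in for Salary.csv (not the real data).
df = pd.DataFrame({
    "YearExperience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Salary": [38000, 46500, 54000, 64000, 71500,
               82000, 90500, 98000, 108500, 116000],
})
X = df[["YearExperience"]]
y = df["Salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumed settings
)

model = LinearRegression()    # default parameters
model.fit(X_train, y_train)   # learn slope and intercept from the training data

y_pred = model.predict(X_test)  # predicted salaries for the held-out rows
print("Predicted:", y_pred)
print("Actual:   ", y_test.to_numpy())
```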

Now calculate the difference between the actual and predicted salary values, and build a DataFrame that shows the actual salary, the predicted salary and the difference between them.
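Such a comparison table might be built like this. The data is a made-up placeholder, the split settings are assumed, and the column names Actual, Predicted and Difference are my own labels.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up placeholder values standing in for Salary.csv (not the real data).
df = pd.DataFrame({
    "YearExperience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Salary": [38000, 46500, 54000, 64000, 71500,
               82000, 90500, 98000, 108500, 116000],
})
X_train, X_test, y_train, y_test = train_test_split(
    df[["YearExperience"]], df["Salary"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Hypothetical column names; one row per test sample.
actual = y_test.to_numpy()
comparison = pd.DataFrame({
    "Actual": actual,
    "Predicted": y_pred,
    "Difference": actual - y_pred,  # positive means the model under-predicted
})
print(comparison)
```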

Now visualize the training data: draw the best fit line, plot all the training points and look at the error. The vertical distance between a training point and the best fit line is the error of the model on that point, usually called the residual.

Now visualize the testing data in the same way: draw the best fit line, plot all the testing points and see how far each point falls from the line.
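The two plots described above could be drawn side by side as in this sketch. The data is a made-up placeholder, the split settings are assumed, and the colors and titles are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up placeholder values standing in for Salary.csv (not the real data).
df = pd.DataFrame({
    "YearExperience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Salary": [38000, 46500, 54000, 64000, 71500,
               82000, 90500, 98000, 108500, 116000],
})
X_train, X_test, y_train, y_test = train_test_split(
    df[["YearExperience"]], df["Salary"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

fig, (ax_train, ax_test) = plt.subplots(1, 2, figsize=(10, 4))

# Training points with the fitted line.
ax_train.scatter(X_train["YearExperience"], y_train, color="tab:blue")
ax_train.plot(X_train["YearExperience"], model.predict(X_train), color="tab:red")
ax_train.set_title("Training data")

# Testing points against the same fitted line.
ax_test.scatter(X_test["YearExperience"], y_test, color="tab:green")
ax_test.plot(X_test["YearExperience"], model.predict(X_test), color="tab:red")
ax_test.set_title("Testing data")

for ax in (ax_train, ax_test):
    ax.set_xlabel("YearExperience")
    ax.set_ylabel("Salary")
```

The line is the same in both panels because it is fitted once on the training data; only the plotted points change.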

Check the accuracy of the model, which is approximately 98% on the testing data, and also compute the mean squared error and the r2_score from the actual and predicted values. For a regression model, "accuracy" is usually the R² score that the model's score method returns.
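A sketch of these checks follows, on made-up placeholder data with assumed split settings, so the numbers it prints are not the ones reported above. It also shows that LinearRegression's score method and r2_score give the same value.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Made-up placeholder values standing in for Salary.csv (not the real data).
df = pd.DataFrame({
    "YearExperience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Salary": [38000, 46500, 54000, 64000, 71500,
               82000, 90500, 98000, 108500, 116000],
})
X_train, X_test, y_train, y_test = train_test_split(
    df[["YearExperience"]], df["Salary"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

score = model.score(X_test, y_test)       # R^2 on the testing data
mse = mean_squared_error(y_test, y_pred)  # average squared error
r2 = r2_score(y_test, y_pred)             # same quantity as score

print("score:", score)
print("mean squared error:", mse)
print("r2_score:", r2)
```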

The last step is to test the model on custom data. I gave the model 3 years of experience to see what salary it predicts for an employee with that experience. The model predicts a salary of 54851 (about 54.9 thousand).

Then I gave the model 5 years of experience, and it predicts a salary of 72492 (about 72.5 thousand).
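Predicting for custom inputs like these can be sketched as below. The model here is trained on made-up placeholder data, so its predictions will differ from the figures quoted above; the snippet only shows the mechanics and checks that each prediction is the intercept plus the slope times the years of experience.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up placeholder values standing in for Salary.csv (not the real data).
df = pd.DataFrame({
    "YearExperience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Salary": [38000, 46500, 54000, 64000, 71500,
               82000, 90500, 98000, 108500, 116000],
})
model = LinearRegression().fit(df[["YearExperience"]], df["Salary"])

# Custom inputs: same column name as in training, one row per query.
new_data = pd.DataFrame({"YearExperience": [3, 5]})
predictions = model.predict(new_data)

# A linear model's prediction is intercept + slope * years.
manual = model.intercept_ + model.coef_[0] * new_data["YearExperience"].to_numpy()

for years, salary in zip(new_data["YearExperience"], predictions):
    print(f"{years} years -> predicted salary {salary:.0f}")
```

As a side calculation, the two figures quoted above differ by 17641 over two years of experience, which corresponds to a slope of roughly 8820 per additional year, assuming the quoted values are the model's rounded outputs.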