Machine Learning Iris
In this blog, we will apply some machine learning concepts with the help of scikit-learn, a machine learning package, and the Iris dataset, which can be loaded from scikit-learn. We will use NumPy to work with the Iris dataset and Matplotlib for visualization. The Iris dataset is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. The dataset consists of 50 samples from each of three species of Iris:
- Iris setosa
- Iris virginica
- Iris versicolor
There are four features (columns) describing each flower:
- Sepal length (cm)
- Sepal width (cm)
- Petal length (cm)
- Petal width (cm)
The Iris dataset is one of the basic datasets in machine learning. The objective of this post is to predict the species of the Iris flowers in the test data using a trained model. We are using the decision tree from the scikit-learn (sklearn) Python package.
Import Library and Module
First, we will import the required libraries and modules in the Python console. In this machine learning example we will use:
- NumPy: provides support for more efficient numerical computation
- Pandas: a convenient library that supports data frames
- Matplotlib & Seaborn: for visualization
- scikit-learn: machine learning tools
Load Iris Data
Now we will load the iris data from seaborn’s built-in dataset and print the first 5 rows as follows:
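A minimal sketch of this step. seaborn's `sns.load_dataset("iris")` fetches the data over the network, so as an assumption this sketch builds an equivalent table from the copy bundled with scikit-learn instead; the column names are chosen to mirror seaborn's naming, and `load_iris_frame` is a hypothetical helper name.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame():
    """Hypothetical helper: iris as a DataFrame with seaborn-style column names."""
    raw = load_iris()
    frame = pd.DataFrame(
        raw.data,
        columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    )
    # Map the integer class codes (0, 1, 2) to species names
    frame["species"] = raw.target_names[raw.target]
    return frame


iris = load_iris_frame()

# Print the first 5 rows
print(iris.head())
```

With network access, `iris = sns.load_dataset("iris")` should yield a table with the same layout.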
Let's look at the data:
We have 150 samples and 5 columns: the four features plus our target. We can easily print some summary statistics.
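The shape check and the summary statistics mentioned above might look like the sketch below. It assumes the data sits in a pandas DataFrame named `iris`, rebuilt here from scikit-learn's bundled copy so the snippet runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: rebuild the table from scikit-learn's bundled copy of the data
raw = load_iris()
iris = pd.DataFrame(
    raw.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
iris["species"] = raw.target_names[raw.target]

# (rows, columns): 150 samples, 4 features + 1 target column
print(iris.shape)  # (150, 5)

# count, mean, std, min, quartiles and max for each numeric column
summary = iris.describe()
print(summary)

# Number of samples per species
species_counts = iris["species"].value_counts()
print(species_counts)
```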
The features are:
- sepal length
- sepal width
- petal length
- petal width
Split data into training and test sets
We split the data into training and test sets at the beginning of the modelling workflow. Splitting is crucial for getting a realistic estimate of the model’s performance.
First, let’s separate our target (y) from our input features (X):
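A minimal sketch of the separation, assuming the target column is called `species` (seaborn's naming) and the data is rebuilt from scikit-learn's bundled copy.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: rebuild the table from scikit-learn's bundled copy of the data
raw = load_iris()
iris = pd.DataFrame(
    raw.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
iris["species"] = raw.target_names[raw.target]

# Input features: every column except the target
X = iris.drop(columns="species")
# Target: the species label of each sample
y = iris["species"]

print(X.shape, y.shape)
```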
Now we use scikit-learn’s train_test_split function:
We’ll set aside 30% of the data as a test set for evaluating the model. We also set an arbitrary “random state” so that the results can be reproduced.
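The split described above can be sketched as follows. The value `random_state=42` is an arbitrary assumption, not necessarily the seed used for the results in this post; the second call only illustrates that a fixed seed reproduces the same split.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: rebuild the table from scikit-learn's bundled copy of the data
raw = load_iris()
X = pd.DataFrame(
    raw.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
y = pd.Series(raw.target_names[raw.target], name="species")

# Hold out 30% of the samples for testing; 42 is an arbitrary seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)

# Splitting again with the same seed selects exactly the same rows
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
same_split = X_train.index.equals(X_train_again.index)
print(same_split)
```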
Visualization
Now we will plot some graphs to understand the features and the species in the data. We are using seaborn and matplotlib to make these plots.
The above graph is a scatterplot matrix: each of the four iris features is plotted against every other feature, giving 12 scatterplots. In the picture above, we can see that the samples form clusters according to their species.
In the next graph, we will plot the 4 features of the 3 iris species as a barplot:
In the above code, we made a new variable, piris, to make the visualization easier. This picture shows how the three species of iris differ on the basis of the four features.
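One way such a barplot might be produced is sketched below. It assumes `piris` is a long-format ("melted") copy of the data with one row per measurement; the column names `measurement` and `value` and the output file name are hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Assumption: rebuild the table from scikit-learn's bundled copy of the data
raw = load_iris()
iris = pd.DataFrame(
    raw.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
iris["species"] = raw.target_names[raw.target]

# Long format: one row per (sample, feature) pair, 150 * 4 = 600 rows
piris = iris.melt(id_vars="species", var_name="measurement", value_name="value")

# One group of bars per feature, one bar per species (bar height = mean value)
ax = sns.barplot(data=piris, x="measurement", y="value", hue="species")

# Hypothetical output location
bar_path = os.path.join(tempfile.gettempdir(), "iris_barplot.png")
ax.figure.savefig(bar_path)
```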
Decision tree
The decision tree algorithm is a simple supervised learning algorithm used in both regression and classification problems. We will make a decision tree classifier and fit it to the training data (X_train and y_train) to train the model.
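A minimal sketch of training the classifier with scikit-learn's `DecisionTreeClassifier`. The variable name `decision_tree` and the seed 42 are assumptions, and the data is rebuilt from scikit-learn's bundled copy so the snippet runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: rebuild features and target from scikit-learn's bundled copy
raw = load_iris()
X = pd.DataFrame(
    raw.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
y = pd.Series(raw.target_names[raw.target], name="species")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # arbitrary seed
)

# Default hyperparameters; random_state fixes tie-breaking between splits
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)

# Classes the fitted model can predict, and how much each feature was used
print(decision_tree.classes_)
print(dict(zip(X.columns, decision_tree.feature_importances_)))
```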
After fitting the training data, the Decision_tree classifier builds a tree, which it will use to classify the species of the test data. The tree diagram can be created as below.
We are using the graphviz and dot modules to create a dot file, which can be visualized using the Graphviz application. The tree we got is below.
Using the above tree, the classifier will classify our test data. Remember that the above tree was formed by the classifier using the training data.
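As a sketch of how the dot file might be produced, scikit-learn's `export_graphviz` can return the tree in Graphviz dot format. The file name and the styling options are assumptions, and the model is retrained here on an arbitrary split so the snippet is self-contained.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Assumption: rebuild the data and retrain on an arbitrary split (seed 42)
raw = load_iris()
X = pd.DataFrame(
    raw.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
y = pd.Series(raw.target_names[raw.target], name="species")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
decision_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# out_file=None makes export_graphviz return the dot source as a string
dot_source = export_graphviz(
    decision_tree,
    out_file=None,
    feature_names=list(X.columns),
    class_names=[str(c) for c in decision_tree.classes_],
    filled=True,   # colour nodes by majority class
    rounded=True,  # rounded node boxes
)

# Hypothetical file name for the dot file
dot_path = os.path.join(tempfile.gettempdir(), "iris_tree.dot")
with open(dot_path, "w") as f:
    f.write(dot_source)
```

With Graphviz installed, the file can then be rendered from the command line, for example with `dot -Tpng iris_tree.dot -o iris_tree.png`.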
Prediction
We will use the ML model to predict the iris species on the test data.
We passed the X_test data to the model to get its predictions and saved them as y_pred.
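The prediction step can be sketched as below, assuming a classifier trained as in the previous section (retrained here on an arbitrary split so the snippet runs on its own).

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: rebuild the data and retrain on an arbitrary split (seed 42)
raw = load_iris()
X = pd.DataFrame(
    raw.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
y = pd.Series(raw.target_names[raw.target], name="species")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
decision_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# One predicted species label per row of X_test
y_pred = decision_tree.predict(X_test)

# Compare the first few predictions with the true labels
print(pd.DataFrame({"actual": y_test.values, "predicted": y_pred}).head())
```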
Performance
We need to check the performance of our model on the test data. We will use accuracy as the performance measure.
This model got an accuracy score of 95.5556 out of 100.
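A sketch of the accuracy check using scikit-learn's `accuracy_score`. The exact value depends on the train/test split, so with the arbitrary seed assumed here the number printed need not match the figure quoted above; on a 45-sample test set, 95.5556% corresponds to 43 correct predictions.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: rebuild the data and retrain on an arbitrary split (seed 42)
raw = load_iris()
X = pd.DataFrame(
    raw.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
y = pd.Series(raw.target_names[raw.target], name="species")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
decision_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)

# Fraction of test samples whose predicted species matches the true species
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.4f} out of 100")

# Number of correct predictions out of the 45 test samples
n_correct = int((y_pred == y_test.values).sum())
print(f"{n_correct} of {len(y_test)} correct")
```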
Save the model
We need to save our ML model so that we can use it for deployment or in the future. In Python, a model can be saved as a pickle file with the .pkl extension.
We can load this .pkl file as below:
After loading the model, we can use it to make predictions as in the section above.
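Saving and reloading could look like the following sketch, which uses Python's standard `pickle` module. The file name is hypothetical, and the model is retrained here on an arbitrary split so the snippet runs on its own.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: rebuild the data and retrain on an arbitrary split (seed 42)
raw = load_iris()
X = pd.DataFrame(
    raw.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
y = pd.Series(raw.target_names[raw.target], name="species")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
decision_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Hypothetical file name for the saved model
model_path = os.path.join(tempfile.gettempdir(), "iris_decision_tree.pkl")

# Save: serialize the fitted classifier to disk
with open(model_path, "wb") as f:
    pickle.dump(decision_tree, f)

# Load: read the classifier back into a new object
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

# The reloaded model gives the same predictions as the original
same_predictions = (loaded_model.predict(X_test) == decision_tree.predict(X_test)).all()
print(same_predictions)
```

Pickle files should only be loaded from sources you trust, and ideally with the same scikit-learn version that created them.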