Tutorial: Random forest model using Scikit-Learn

Step 1: Import libraries and modules

In [1]:
import numpy as np
In [2]:
import pandas as pd
In [3]:
from sklearn.model_selection import train_test_split
In [4]:
# Next, we'll import the entire preprocessing module. This contains utilities for scaling, transforming, and wrangling data.
from sklearn import preprocessing
In [5]:
#Import random forest model
from sklearn.ensemble import RandomForestRegressor
In [6]:
# We'll only focus on training a random forest and tuning its parameters. Let's move on to importing the tools to help us perform cross-validation.
from sklearn.pipeline import make_pipeline
In [7]:
#Next, let's import some metrics we can use to evaluate our model performance later.
#Import evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score
In [8]:
#And finally, we'll import a way to persist our model for future use.
#Import module for saving scikit-learn models
from sklearn.externals import joblib
Note: joblib is an alternative to Python's pickle package, and we'll use it because it's more efficient for storing large NumPy arrays.
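In recent scikit-learn releases (0.23 and later), sklearn.externals.joblib has been removed, so the import above will fail there. A minimal sketch of the equivalent import on newer versions:
# On scikit-learn >= 0.23, import joblib as its own package instead
# (install it with `pip install joblib` if it isn't already present)
import joblib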

Step 2: Load red wine data.

The Pandas library that we imported is loaded with a whole suite of helpful input/output tools. You can read data from CSV, Excel, SQL, SAS, and many other data formats. Here's a list of all the Pandas IO tools (http://pandas.pydata.org/pandas-docs/stable/io.html). The convenient tool we'll use today is the read_csv() function. Using this function, we can load any CSV file, even from a remote URL!
In [13]:
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url)
In [15]:
print (data.head())
  fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
0   7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5                                                                                                                     
1   7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5                                                                                                                     
2  7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;...                                                                                                                     
3  11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58...                                                                                                                     
4   7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5                                                                                                                     
It looks like the file is semicolon-separated rather than comma-separated, so everything was read into a single column. Let's tell read_csv() to use ';' as the separator and try again:
In [16]:
data = pd.read_csv(dataset_url, sep=';')
print(data.head())
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5  
3      9.8        6  
4      9.4        5  
In [17]:
print(data.shape)
(1599, 12)
In [18]:
print(data.describe())
       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000             1.000000              6.000000     0.990070   
25%       0.070000             7.000000             22.000000     0.995600   
50%       0.079000            14.000000             38.000000     0.996750   
75%       0.090000            21.000000             62.000000     0.997835   
max       0.611000            72.000000            289.000000     1.003690   

                pH    sulphates      alcohol      quality  
count  1599.000000  1599.000000  1599.000000  1599.000000  
mean      3.311113     0.658149    10.422983     5.636023  
std       0.154386     0.169507     1.065668     0.807569  
min       2.740000     0.330000     8.400000     3.000000  
25%       3.210000     0.550000     9.500000     5.000000  
50%       3.310000     0.620000    10.200000     6.000000  
75%       3.400000     0.730000    11.100000     6.000000  
max       4.010000     2.000000    14.900000     8.000000  
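Before moving on, one quick check worth making (an illustrative aside, not part of the original walkthrough) is how the target variable is distributed, since we'll stratify on it in Step 3:
# Count how many wines fall into each quality score
print(data['quality'].value_counts().sort_index())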
Here's the list of all the features:
  • quality (target)
  • fixed acidity
  • volatile acidity
  • citric acid
  • residual sugar
  • chlorides
  • free sulfur dioxide
  • total sulfur dioxide
  • density
  • pH
  • sulphates
  • alcohol
All of the features are numeric, which is convenient. However, they have some very different scales, so let's make a mental note to standardize the data later. As a reminder, for this tutorial, we're cutting out a lot of exploratory data analysis we'd typically recommend. For now, let's move on to splitting the data.

Step 3: Split data into training and test sets

Splitting the data into training and test sets at the beginning of your modeling workflow is crucial for getting a realistic estimate of your model's performance. First, let's separate our target (y) features from our input (X) features:
In [19]:
#Separate target from training features
y = data.quality
X = data.drop('quality', axis=1)
This allows us to take advantage of Scikit-Learn's useful train_test_split function:
In [20]:
#Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size=0.2,
                                                     random_state=123,
                                                     stratify=y)
As you can see, we'll set aside 20% of the data as a test set for evaluating our model. We also set an arbitrary "random state" (a.k.a. seed) so that we can reproduce our results. Finally, it's good practice to stratify your sample by the target variable. This will ensure your training set looks similar to your test set, making your evaluation metrics more reliable.
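If you want to confirm that stratification worked, a minimal sketch (not in the original tutorial) is to compare the proportion of each quality score in the two splits; the distributions should be nearly identical:
# Proportion of each quality score in the training and test sets
print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())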

Step 4: Data Processing

Remember, in Step 2, we made the mental note to standardize our features because they were on different scales.
What is standardization?
Standardization is the process of subtracting the means from each feature and then dividing by the feature standard deviations. Standardization is a common requirement for machine learning tasks. Many algorithms assume that all features are centered around zero and have approximately the same variance.
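To make that definition concrete, here's a minimal sketch of standardizing a single column by hand (an illustrative aside; note that StandardScaler uses the population standard deviation, so its numbers differ very slightly from pandas' default):
# Hand-rolled standardization of one feature: z = (x - mean) / std
alcohol_scaled = (data['alcohol'] - data['alcohol'].mean()) / data['alcohol'].std()
print(alcohol_scaled.mean(), alcohol_scaled.std())  # roughly 0 and 1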
So instead of directly invoking the scale function, we'll be using a feature in Scikit-Learn called the Transformer API. The Transformer API allows you to "fit" a preprocessing step using the training data the same way you'd fit a model and then use the same transformation on future data sets!
Here's what that process looks like:
  1. Fit the transformer on the training set (saving the means and standard deviations)
  2. Apply the transformer to the training set (scaling the training data)
  3. Apply the transformer to the test set (using the same means and standard deviations)
This makes your final estimate of model performance more realistic, and it allows you to insert your preprocessing steps into a cross-validation pipeline (more on this in Step 6).
Here's how you do it:
In [21]:
#Fitting the Transformer API
scaler = preprocessing.StandardScaler().fit(X_train)
Now, the scaler object has the saved means and standard deviations for each feature in the training set. Let's confirm that worked:
In [22]:
#Applying transformer to training data
X_train_scaled = scaler.transform(X_train)
print (X_train_scaled.mean(axis=0))
[  1.16664562e-16  -3.05550043e-17  -8.47206937e-17  -2.22218213e-17
   2.22218213e-17  -6.38877362e-17  -4.16659149e-18  -2.54439854e-15
  -8.70817622e-16  -4.08325966e-16  -1.17220107e-15]
In [24]:
print (X_train_scaled.std(axis=0))
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
In [25]:
#Applying transformer to test data
X_test_scaled = scaler.transform(X_test)
print (X_test_scaled.mean(axis=0))
print (X_test_scaled.std(axis=0))
[ 0.02776704  0.02592492 -0.03078587 -0.03137977 -0.00471876 -0.04413827
 -0.02414174 -0.00293273 -0.00467444 -0.10894663  0.01043391]
[ 1.02160495  1.00135689  0.97456598  0.91099054  0.86716698  0.94193125
  1.03673213  1.03145119  0.95734849  0.83829505  1.0286218 ]
In practice, when we set up the cross-validation pipeline, we won't even need to manually fit the Transformer API. Instead, we'll simply declare the class object, like so:
In [26]:
#Pipeline with preprocessing and model
pipeline = make_pipeline(preprocessing.StandardScaler(),
                         RandomForestRegressor(n_estimators=100))

Step 5: Hyperparameters to tune.

Now it's time to consider the hyperparameters that we'll want to tune for our model.
What are hyperparameters?
There are two types of parameters we need to worry about: model parameters and hyperparameters. Model parameters can be learned directly from the data (e.g. regression coefficients), while hyperparameters cannot. Hyperparameters express "higher-level" structural information about the model, and they are typically set before training the model.
Example: random forest hyperparameters
As an example, let's take our random forest for regression. Within each decision tree, the computer can empirically decide where to create branches based on either mean-squared-error (MSE) or mean-absolute-error (MAE). Therefore, the actual branch locations are model parameters.
However, the algorithm does not know which of the two criteria, MSE or MAE, it should use. The algorithm also cannot decide how many trees to include in the forest. These are examples of hyperparameters that the user must set. We can list the tunable hyperparameters like so:
In [27]:
print (pipeline.get_params())
{'memory': None, 'steps': [('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False))], 'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True), 'randomforestregressor': RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False), 'standardscaler__copy': True, 'standardscaler__with_mean': True, 'standardscaler__with_std': True, 'randomforestregressor__bootstrap': True, 'randomforestregressor__criterion': 'mse', 'randomforestregressor__max_depth': None, 'randomforestregressor__max_features': 'auto', 'randomforestregressor__max_leaf_nodes': None, 'randomforestregressor__min_impurity_decrease': 0.0, 'randomforestregressor__min_impurity_split': None, 'randomforestregressor__min_samples_leaf': 1, 'randomforestregressor__min_samples_split': 2, 'randomforestregressor__min_weight_fraction_leaf': 0.0, 'randomforestregressor__n_estimators': 100, 'randomforestregressor__n_jobs': 1, 'randomforestregressor__oob_score': False, 'randomforestregressor__random_state': None, 'randomforestregressor__verbose': 0, 'randomforestregressor__warm_start': False}
Now, let's declare the hyperparameters we want to tune through cross-validation.
In [28]:
#Declare hyperparameters to tune
hyperparameters = {'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}
As you can see, the format should be a Python dictionary (data structure for key-value pairs) where keys are the hyperparameter names and values are lists of settings to try. The options for parameter values can be found on the documentation page.
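Any key printed by get_params() can go into this dictionary. As a purely illustrative sketch (not part of the original tutorial), a larger grid might look like the following; keep in mind that every extra list multiplies the number of model fits GridSearchCV has to perform:
# Hypothetical, larger grid -- illustrative only
bigger_grid = {
    'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
    'randomforestregressor__max_depth': [None, 5, 3, 1],
    'randomforestregressor__n_estimators': [100, 200],
    'randomforestregressor__min_samples_leaf': [1, 3, 5]
}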

Step 6: Tune model using a cross-validation pipeline.

Now we're almost ready to dive into fitting our models. But first, we need to spend some time talking about cross-validation.
This is one of the most important skills in all of machine learning because it helps you maximize model performance while reducing the chance of overfitting.

What is cross-validation (CV)?

Cross-validation is a process for reliably estimating the performance of a model-building method by training and evaluating the model multiple times using that same method. Practically, in this context the "method" is simply a set of hyperparameters. These are the steps for CV (a short code sketch follows the list):
  1. Split your data into k equal parts, or "folds" (typically k=10).
  2. Train your model on k-1 folds (e.g. the first 9 folds).
  3. Evaluate it on the remaining "hold-out" fold (e.g. the 10th fold).
  4. Perform steps (2) and (3) k times, each time holding out a different fold.
  5. Aggregate the performance across all k folds. This is your performance metric.
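If you just want a plain CV estimate for a fixed set of hyperparameters, Scikit-Learn's cross_val_score helper runs exactly this loop for you. A minimal sketch using the pipeline from Step 4 (an aside, not part of the original tutorial):
from sklearn.model_selection import cross_val_score

# 10-fold CV of the pipeline on the training data; for a regressor the scores are R^2 by default
scores = cross_val_score(pipeline, X_train, y_train, cv=10)
print(scores.mean(), scores.std())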

Why is cross-validation important in machine learning?

Let's say you want to train a random forest regressor. One of the hyperparameters you must tune is the maximum depth allowed for each decision tree in your forest.
How can you decide?
That's where cross-validation comes in. Using only your training set, you can use CV to evaluate different hyperparameters and estimate their effectiveness. This allows you to keep your test set "untainted" and save it for a true hold-out evaluation when you're finally ready to select a model. For example, you can use CV to tune a random forest model, a linear regression model, and a k-nearest neighbors model, using only the training set. Then, you still have the untainted test set to make your final selection between the model families!

So What is a cross-validation "pipeline?"

The best practice when performing CV is to include your data preprocessing steps inside the cross-validation loop. This prevents accidentally tainting your training folds with influential data from your hold-out fold. Here's how the CV pipeline looks after including preprocessing steps:
  1. Split your data into k equal parts, or "folds" (typically k=10).
  2. Preprocess k-1 training folds.
  3. Train your model on the same k-1 folds.
  4. Preprocess the hold-out fold using the same transformations from step (2).
  5. Evaluate your model on the same hold-out fold.
  6. Perform steps (2) - (5) k times, each time holding out a different fold.
  7. Aggregate the performance across all k folds. This is your performance metric.
Fortunately, Scikit-Learn makes it stupidly simple to set this up:
In [31]:
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
In [32]:
# Fit and tune model
clf.fit(X_train, y_train)
Out[32]:
GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decr...mators=100, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'], 'randomforestregressor__max_depth': [None, 5, 3, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
Yes, it's really that easy. GridSearchCV essentially performs cross-validation across the entire "grid" (all possible combinations) of hyperparameters. It takes in your model (in this case, we're using a model pipeline), the hyperparameters you want to tune, and the number of folds to create. Obviously, there's a lot going on under the hood. We've included the pseudo-code above, and we'll cover writing cross-validation from scratch in a separate guide. Now, you can see the best set of parameters found using CV:
In [33]:
print (clf.best_params_)
{'randomforestregressor__max_depth': None, 'randomforestregressor__max_features': 'auto'}
Interestingly, it looks like the default parameters win out for this data set.
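If you're curious about how the other combinations fared, the fitted GridSearchCV object keeps per-combination results in cv_results_; a minimal sketch of inspecting them:
# Mean and spread of the cross-validated score for every hyperparameter combination tried
results = pd.DataFrame(clf.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score']])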
*Tip: It turns out that in practice, random forests don't actually require a lot of tuning. They tend to work pretty well out-of-the-box with a reasonable number of trees. Even so, these same steps can be used when building any type of supervised learning model.

Step 7: Refit on the entire training set.

After you've tuned your hyperparameters appropriately using cross-validation, you can generally get a small performance improvement by refitting the model on the entire training set.
Conveniently, GridSearchCV from sklearn will automatically refit the model with the best set of hyperparameters using the entire training set.
This functionality is ON by default, but you can confirm it:
In [34]:
#Confirm model will be retrained
print(clf.refit)
True
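Under the hood, the refit pipeline lives in clf.best_estimator_, and calling predict on clf simply delegates to it. A short sketch if you want to grab it (or the winning CV score) explicitly:
# The refit pipeline (scaler + random forest) trained on the full training set
best_model = clf.best_estimator_
print(clf.best_score_)  # mean cross-validated score of the best hyperparameter combination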
Now, you can simply use the clf object as your model when applying it to other sets of data. That's what we'll be doing in the next step.

Step 8: Evaluate model pipeline on test data

This step is really straightforward once you understand that the clf object you used to tune the hyperparameters can also be used directly like a model object. Here's how to predict a new set of data:
In [35]:
#Predict a new set of data
y_pred = clf.predict(X_test)
Now we can use the metrics we imported earlier to evaluate our model performance.
In [36]:
print (r2_score(y_test, y_pred))
0.46534632847
In [37]:
print (mean_squared_error(y_test, y_pred))
0.3449978125
Great, so now the question is... is this performance good enough? Well, the rule of thumb is that your very first model probably won't be the best possible model. However, we recommend a combination of three strategies to decide if you're satisfied with your model performance.
  1. Start with the goal of the model. If the model is tied to a business problem, have you successfully solved the problem?
  2. Look in academic literature to get a sense of the current performance benchmarks for specific types of data.
  3. Try to find low-hanging fruit in terms of ways to improve your model (a quick comparison against a naive baseline, as sketched below, is a good first check).
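For a quick sense of scale, here's a minimal sketch (not part of the original tutorial) that compares our MSE against a naive baseline that always predicts the mean quality of the training set:
# Naive baseline: predict the training-set mean quality for every test wine
baseline_pred = np.full(len(y_test), y_train.mean())
print(mean_squared_error(y_test, baseline_pred))  # our model's MSE should be well below this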
There are various ways to improve a model. We'll have more guides that go into detail about how to improve model performance, but here are a few quick things to try:
  1. Try other regression model families (e.g. regularized regression, boosted trees, etc.).
  2. Collect more data if it's cheap to do so.
  3. Engineer smarter features after spending more time on exploratory analysis.
  4. Speak to a domain expert to get more context (...this is a good excuse to go wine tasting!).
As a final note, when you try other families of models, we recommend using the same training and test set as you used to fit the random forest model. That's the best way to get a true apples-to-apples comparison between your models.

Step 9: Save model for future use

Great job completing this tutorial! You've done the hard part, and deserve another glass of wine. Maybe this time you can use your shiny new predictive model to select the bottle. But before you go, let's save your hard work so you can use the model in the future. It's really easy to do so:
In [38]:
#Save model to a .pkl file
joblib.dump(clf, 'rf_regressor.pkl')
Out[38]:
['rf_regressor.pkl']
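As an aside (not in the original tutorial), joblib.dump also accepts a compress argument if disk space matters; a minimal sketch with an illustrative file name:
# Save a compressed copy; compress takes an integer from 0 (off) to 9 (smallest, slowest)
joblib.dump(clf, 'rf_regressor_compressed.pkl', compress=3)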
And that's it. When you want to load the model again, simply use this function:
In [39]:
#Load model from .pkl file
clf2 = joblib.load('rf_regressor.pkl')
In [40]:
# Predict data set using loaded model
clf2.predict(X_test)
Out[40]:
array([ 6.68,  5.92,  4.99,  5.54,  6.54,  5.79,  4.9 ,  4.64,  5.03,
        5.99,  5.23,  5.69,  5.75,  5.  ,  5.76,  5.6 ,  6.65,  5.76,
        5.67,  6.96,  5.5 ,  5.65,  5.  ,  6.  ,  5.93,  5.03,  5.47,
        5.17,  6.01,  5.91,  5.88,  6.69,  6.  ,  5.02,  4.87,  5.91,
        5.06,  5.81,  5.15,  5.85,  4.97,  5.9 ,  6.68,  5.13,  6.16,
        5.39,  5.58,  5.47,  5.08,  6.62,  5.91,  5.26,  5.91,  5.19,
        5.69,  5.81,  5.22,  5.36,  4.98,  5.3 ,  5.33,  5.12,  5.  ,
        5.79,  5.94,  5.24,  6.39,  5.05,  5.2 ,  6.59,  5.67,  5.47,
        5.04,  5.01,  5.26,  6.03,  5.29,  5.09,  5.4 ,  5.26,  6.64,
        5.58,  6.44,  6.68,  5.09,  5.79,  6.52,  6.18,  5.51,  5.98,
        5.86,  5.2 ,  6.39,  5.76,  5.71,  5.72,  6.74,  6.72,  5.47,
        6.82,  5.14,  5.38,  5.18,  6.6 ,  5.06,  4.7 ,  5.76,  5.05,
        5.84,  5.97,  5.76,  5.45,  6.09,  5.52,  4.79,  5.21,  5.93,
        5.  ,  4.9 ,  5.99,  5.8 ,  5.1 ,  5.84,  6.08,  5.26,  5.2 ,
        5.23,  5.86,  5.51,  5.39,  5.98,  6.49,  5.2 ,  5.27,  5.06,
        6.51,  5.04,  5.16,  6.74,  5.4 ,  5.12,  5.06,  5.86,  6.08,
        5.39,  5.4 ,  5.19,  6.69,  5.53,  5.09,  5.62,  5.16,  4.84,
        4.92,  5.22,  5.99,  5.36,  5.78,  5.78,  5.26,  5.48,  5.28,
        5.19,  6.02,  5.11,  6.04,  5.17,  5.08,  5.56,  5.2 ,  5.79,
        4.87,  5.61,  5.05,  5.54,  5.42,  5.03,  5.33,  5.65,  5.01,
        6.03,  5.62,  5.06,  4.97,  5.16,  6.1 ,  5.17,  5.69,  5.3 ,
        5.01,  5.42,  6.67,  5.82,  5.98,  5.47,  5.22,  5.42,  5.09,
        6.12,  4.64,  6.15,  5.07,  5.28,  5.28,  7.03,  5.87,  5.39,
        5.15,  5.39,  6.01,  6.04,  5.94,  5.96,  6.46,  5.8 ,  5.91,
        5.21,  5.17,  5.71,  5.25,  5.22,  6.02,  5.95,  5.68,  6.01,
        5.84,  5.41,  6.19,  5.34,  5.71,  5.41,  5.51,  6.24,  5.83,
        5.04,  4.2 ,  6.67,  6.52,  6.35,  5.2 ,  5.32,  5.46,  5.55,
        6.2 ,  5.93,  5.24,  5.08,  5.54,  5.3 ,  6.49,  5.14,  5.09,
        5.18,  5.11,  5.97,  6.58,  5.73,  5.37,  5.38,  6.38,  5.43,
        6.02,  5.32,  4.79,  5.78,  6.01,  5.89,  5.6 ,  5.26,  5.08,
        5.77,  5.49,  6.58,  6.13,  5.7 ,  5.14,  6.01,  6.64,  5.89,
        5.3 ,  5.6 ,  5.26,  5.35,  6.07,  6.83,  5.31,  6.78,  6.01,
        5.38,  5.43,  6.03,  5.11,  5.25,  6.14,  5.87,  5.97,  6.11,
        5.95,  5.4 ,  5.56,  5.53,  6.35,  5.46,  6.96,  6.83,  6.05,
        6.08,  5.09,  5.29,  5.91,  5.36,  5.36,  5.76,  6.71,  6.79,
        5.23,  5.56,  5.77,  6.05,  5.63])
Congratulations, you've reached the end of this tutorial! We've just completed a whirlwind tour of Scikit-Learn's core functionality, but we've only really scratched the surface. Hopefully you've gained some guideposts to further explore all that sklearn has to offer.

Himanshu Rai
