CPU Performance Data

Exploration

Relative CPU Performance Data

Source: UCI Machine Learning repository (http://www.ics.uci.edu/~mlearn/MLSummary.html).

  1. vendor name: 30 (adviser, amdahl, apollo, basf, bti, burroughs, c.r.d, cambex, cdc, dec, dg, formation, four-phase, gould, honeywell, hp, ibm, ipl, magnuson, microdata, nas, ncr, nixdorf, perkin-elmer, prime, siemens, sperry, stratus, wang)
  2. Model Name: many unique symbols
  3. MYCT: machine cycle time in nanoseconds (integer)
  4. MMIN: minimum main memory in kilobytes (integer)
  5. MMAX: maximum main memory in kilobytes (integer)
  6. CACH: cache memory in kilobytes (integer)
  7. CHMIN: minimum channels in units (integer)
  8. CHMAX: maximum channels in units (integer)
  9. PRP: published relative performance (integer)
  10. ERP: estimated relative performance from the original article (integer)
    The estimated benchmark performance (ERP) was calculated by Prof. Ein-Dor of Tel Aviv University and a colleague around 1988 on an Amdahl 470v/7 CPU.

Load the data

First, let's load the dataset and look at which features are collected for the CPUs:

In the 1982 landscape, Amdahl delivers 9 CPUs in the premium class.

Calculate RMSE for the estimates by Prof. Ein-Dor
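A minimal sketch of the RMSE calculation, assuming the PRP and ERP columns are available as plain sequences. The numbers below are made up for illustration and are not rows from the dataset; `rmse`, `prp_demo` and `erp_demo` are hypothetical names.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only (not taken from the dataset).
prp_demo = [100, 200, 50, 30]
erp_demo = [110, 190, 50, 40]
demo_rmse = rmse(prp_demo, erp_demo)
print(round(demo_rmse, 3))
```

On the real data the same function would be called with the PRP column as the target and the ERP column as the prediction.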

The original ERP vs BYTE's PRP metrics, reported on UCI, were measured as mean deviation percentage. This is not standard today, so later authors have chosen to use RMSE.

  1. Darragh Hanley: "Best Linear Regression RMSE : 68.35" (2015)
  2. Johannes Ledolter: "RMSE : 69.35" with leave-one-out cross-validation (Data Mining and Business Analytics with R).

Apparently, the RMSE results of these later linear regressions did not come close to the accuracy of the original linear regression, whose output is stored in the dataset field ERP (estimated relative performance).

manual split

Isolate the ERP and PRP values before splitting.

feature engineering 1

Feature engineering was not successful, so I'll use the KFold method...

auto split

test_size=0.22, random_state=42
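As a sketch, these two settings are what one would pass to sklearn's `train_test_split`. The arrays below are placeholders (100 rows, 6 columns) standing in for the feature matrix and target; their names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 6 feature columns (stand-ins for MYCT ... CHMAX).
X_demo = np.arange(600, dtype=float).reshape(100, 6)
y_demo = np.arange(100, dtype=float)

# 22% of the rows go to the test set; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.22, random_state=42
)
print(X_train.shape, X_test.shape)
```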

Cycles per time unit vs. cache size

Mean, sd and outliers of the 6 features.

Find interactions between the features

We multiply pairs of features and rank the resulting products by their cross-validated r² score.
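One possible reading of this step, sketched on synthetic data: each pairwise product is used as the single predictor of a linear model, and the pairs are ranked by mean cross-validated r². The feature names, the target and the choice of `LinearRegression` with 5 folds are assumptions, not the notebook's actual setup.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
names = ["a", "b", "c", "d"]                 # hypothetical feature names
X = rng.uniform(1.0, 2.0, size=(200, 4))
# Synthetic target driven by the product a*b, plus a little noise.
y = 5.0 * X[:, 0] * X[:, 1] + rng.normal(scale=0.01, size=200)

ranking = []
for i, j in combinations(range(X.shape[1]), 2):
    product = (X[:, i] * X[:, j]).reshape(-1, 1)
    # Mean cross-validated r² of a one-variable linear model on the product.
    score = cross_val_score(LinearRegression(), product, y, cv=5, scoring="r2").mean()
    ranking.append((names[i], names[j], round(float(score), 3)))

ranking.sort(key=lambda item: item[2], reverse=True)
best_pair = ranking[0][:2]
print(ranking[:3])
```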

The highest correlation is found with the features 'MMIN' and 'CHMAX'.

Adding interactions and transformed variables leads to an extended linear regression model, a polynomial regression. Data scientists rely on testing and experimenting to validate an approach to solving a problem. We redefine the set of predictors in code using interactions and quadratic terms by squaring the variables:

I plot the results to show that some additions help, because the squared error decreases, while others hurt, because they increase the error instead. Adding the feature maximum channels is fine, but adding the square of the feature cycle time does not improve the model.

Let's multiply the more strongly correlated feature pairs: ('MMIN', 'MMAX', 0.928), ('MMIN', 'CHMIN', 0.89), ('MMIN', 'CHMAX', 0.963), ('MMAX', 'CACH', 0.92), ('MMAX', 'CHMIN', 0.908), ('MMAX', 'CHMAX', 0.935). The pair ('MMIN', 'CACH', 0.853) is commented out.

def plot_cv_indices

Find importance of the features

To decide on the importance of the features we are going to use the LassoCV estimator. It has a built-in cross-validator.
The features with the highest absolute coef_ value are considered the most important.
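A minimal sketch of this ranking on synthetic data. The standardization step is an assumption added so that coefficient sizes are comparable across features; the data and the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Synthetic target: feature 0 matters most, feature 1 a little, the rest not at all.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Scale first so that coefficient sizes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5).fit(X_scaled, y)

importance = np.abs(lasso.coef_)
order = np.argsort(importance)[::-1]   # most important feature first
print(order, importance.round(3))
```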

Find importance of 7 features

To decide on the importance of the 6 features plus CYCinv, we again use the LassoCV estimator. The features with the highest absolute coef_ value are considered the most important.

No success here with the new feature: its coefficient is 0.

Select from the model features with the highest score

Now we want to select the two features which are the most important. SelectFromModel() allows for setting the threshold. Only the features with the coef_ higher than the threshold will remain. Here, we want to set the threshold slightly above the third highest coef_ calculated by LassoCV() from our data.
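The threshold trick can be sketched as follows on synthetic data. The margin of 0.01 above the third highest coefficient is an arbitrary illustrative choice, and the data have three informative features of decreasing weight by construction.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
# Synthetic target with three informative features of decreasing weight.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

lasso = LassoCV(cv=5).fit(X, y)
importance = np.abs(lasso.coef_)

# Threshold slightly above the third highest coefficient: two features survive.
third_highest = np.sort(importance)[-3]
threshold = third_highest + 0.01

selector = SelectFromModel(LassoCV(cv=5), threshold=threshold).fit(X, y)
selected = np.flatnonzero(selector.get_support())
X_top2 = selector.transform(X)
print(selected, X_top2.shape)
```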

Plot the two most important features

Finally we will plot the selected two features from the data.

Using α improves the score by 4%!

Principal component analysis (PCA)

The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the “core” of a PCA: The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data along the new feature axes.

X_unscaled

X scaled

Typically, enough components are kept to preserve at least 90% of the variance in the data.
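A sketch of the 90% rule with sklearn, using placeholder data made of six correlated columns. Passing a float between 0 and 1 as `n_components` asks `PCA` for the smallest number of components that reaches that share of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Placeholder data: 6 correlated columns built from 2 latent factors plus noise.
latent = rng.normal(size=(150, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(150, 6))

# Scaling first means PCA works on the correlation structure of the features.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.90).fit(X_scaled)
explained = pca.explained_variance_ratio_
print(pca.n_components_, explained.cumsum().round(3))
```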

EDA: approach nr. 2

Let’s compare how many unique rows we have for the test metrics.

Let's take a look at Amdahl’s competitors in the market and how many product offerings they have.

Next, we look at the individual technical metrics in the sourced data set to see how Amdahl products compare with the competition.

A log transformation was applied in this plot; it reveals that scaling is advisable.

The green points are Amdahl's CPUs.

K-Means clustering

I want to cluster the CPUs into three main groups, which I call Budget, Mid range and Premium.
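A minimal K-Means sketch with three clusters, on placeholder data with three well-separated groups. The scaling step and the `n_init` and `random_state` values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Placeholder data: three loose groups in a 2-feature space.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
labels = kmeans.labels_

# Cluster numbers are arbitrary; naming them Budget / Mid range / Premium
# would require ordering the clusters, for example by their mean PRP.
print(np.bincount(labels))
```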

Ein-Dor is said to have bucketed the benchmark in 3 intervals: "6 to 33", "33 to 72", "72 to 1200".

Curiously, the 3 classes are only found in the highest PRP bin. This is in fact no surprise, as the budget class was overcrowded.

I'll stop EDA here now, and move on to experiment with some recent algorithms.

Deploying some sklearn algorithms

t-SNE Visualization

Random Forest Regressor

Shall I work with the raw values, or the logarithms, or a combination? A or B

I don't like the zeros in rows 200-208. Those are the poorest performers.

Return the coefficient of determination R2 of the prediction.
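For reference, a sketch showing that a regressor's `score` method is the same quantity as R² = 1 - SS_res / SS_tot computed by hand. The data are synthetic and the forest settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=300)  # synthetic target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# score() returns R² on the data it is given.
r2_from_score = rf.score(X_test, y_test)

# The same value by hand: 1 - residual sum of squares / total sum of squares.
y_pred = rf.predict(X_test)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot
print(round(r2_from_score, 4), round(r2_by_hand, 4))
```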

RF regressor metrics

RF regressor predictions vs. ERP

The ERP values were calculated in 1988 with a linear regression method on an Amdahl 470v/7 CPU.

Visualising the Random Forest Regression results

Histogram Gradient Boosting Regressor

This is an experimental regressor in sklearn.

Partial Dependence computation for Gradient Boosting

Let’s now fit a GradientBoostingRegressor and compute the partial dependence plots for either one or two variables at a time.

It appears that tree-based models are naturally robust to monotonic transformations of numerical features.

Note that on this tabular dataset, Gradient Boosting Machines are both significantly faster to train and more accurate than neural networks. It is also significantly cheaper to tune their hyperparameters (the defaults tend to work well, which is often not the case for neural networks).

Finally, as we will see next, computing partial dependence plots for tree-based models is also orders of magnitude faster, making it cheap to compute partial dependence plots for pairs of interacting features:
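A sketch of the computation behind such plots, using `sklearn.inspection.partial_dependence` on synthetic data. The feature indices, the grid resolution and the data are assumptions. Passing two indices in `features` is used here to obtain the joint grid for a pair of features.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
# Synthetic target with an interaction between features 0 and 1.
y = 4 * X[:, 0] * X[:, 1] + X[:, 2] + rng.normal(scale=0.05, size=300)

gbr = GradientBoostingRegressor(random_state=0).fit(X, y)

# One variable at a time: average prediction along a grid over feature 0.
pd_single = partial_dependence(gbr, X, features=[0], kind="average", grid_resolution=20)
# Two variables at a time: average prediction over a 2-D grid of features 0 and 1.
pd_pair = partial_dependence(gbr, X, features=[0, 1], kind="average", grid_resolution=20)

print(pd_single["average"].shape, pd_pair["average"].shape)
```

The arrays under the `"average"` key are what a partial dependence plot draws, as a curve for one feature or as a surface for a pair.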

Gradient Boosting Regressor

Print out the mean absolute error (MAE) with an initial log transform:

Without log transform: Mean Absolute Error: 125.2 s. Accuracy: 49.3 %.

Next, we will split our dataset to use 90% for training and leave the rest for testing. We will also set the regression model parameters.

just taking my own parameters without an extra train split...

Fit regression model

Now we will instantiate the gradient boosting regressor and fit it with our training data. Let’s also look at the mean squared error on the test data.
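A sketch of the split, fit and evaluation steps on synthetic data. The parameter values in `params` are hypothetical; the notebook's own settings are not shown in the text.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 4))
y = 10 * X[:, 0] + 3 * np.sin(6 * X[:, 1]) + rng.normal(scale=0.2, size=400)  # synthetic

# 90% of the rows for training, the remaining 10% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=13)

# Hypothetical settings, chosen for illustration only.
params = {"n_estimators": 300, "max_depth": 3, "learning_rate": 0.05, "random_state": 0}
reg = GradientBoostingRegressor(**params).fit(X_train, y_train)

mse = mean_squared_error(y_test, reg.predict(X_test))
# Reference point: error of always predicting the training mean.
baseline_mse = float(np.mean((y_test - y_train.mean()) ** 2))
print(f"test MSE: {mse:.3f} (mean-only baseline: {baseline_mse:.3f})")
```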

Alternatively, I use my own regressor, which can be configured with extra parameters.

Plot training deviance

Plot feature importance

Careful, impurity-based feature importances can be misleading for high cardinality features (many unique values). As an alternative, the permutation importances of reg can be computed on a held out test set. See Permutation feature importance for more details.
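Both kinds of importance can be computed side by side, as in this sketch on synthetic data. The split size and `n_repeats=10` are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 4))
y = 10 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.2, size=400)  # synthetic target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
reg = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Impurity-based importances come from the training process itself.
impurity_rank = np.argsort(reg.feature_importances_)[::-1]

# Permutation importances: drop in score when one column of the test set is shuffled.
perm = permutation_importance(reg, X_test, y_test, n_repeats=10, random_state=0)
perm_rank = np.argsort(perm.importances_mean)[::-1]
print(impurity_rank, perm_rank)
```

`perm.importances_std` gives the spread over the repeats, which is what the error bars of a permutation plot show.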

For this example, the impurity-based and permutation methods identify the same 2 strongly predictive features but not in the same order. The third most predictive feature, “bp”, is also the same for the 2 methods. The remaining features are less predictive and the error bars of the permutation plot show that they overlap with 0.

Only 6 features: minimum memory size and the number of channels come out on top.
6 features plus PRP: minimum and maximum memory size are the strongest indicators of a higher CPU performance rating.

Decision Tree Regression

The decision tree is a simple machine learning model for getting started with regression tasks.

Background

A decision tree is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label or, in a regression tree, a predicted value. The topmost node in a tree is the root node.

predictions with max_depth of 3 and 5
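A sketch comparing the two depths on a synthetic one-dimensional problem. The data are illustrative; only `max_depth` reflects the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(120, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=120)   # synthetic target

tree_3 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_5 = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

# A depth-d tree has at most 2**d leaves, so it predicts at most 2**d distinct values.
pred_3 = tree_3.predict(X)
pred_5 = tree_5.predict(X)
mse_3 = float(np.mean((y - pred_3) ** 2))
mse_5 = float(np.mean((y - pred_5) ** 2))
print(tree_3.get_n_leaves(), tree_5.get_n_leaves(), round(mse_3, 4), round(mse_5, 4))
```

The deeper tree fits the training data more closely, which on noisy data can also mean it fits the noise.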

predictions based on 2 principal components

We'll plot the predicted performance rate as function of cycle time, maximum memory and minimum memory.
Let’s evaluate the regressor on a grid of points where cycle time and minimum memory are combined:

After that, these two lists are combined using meshgrid, which generates a grid of all the combinations of the values. Finally, the result is passed to the regressor.
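The grid evaluation can be sketched like this. The regressor, the value ranges and the grid sizes are placeholders; the two columns merely stand in for cycle time and minimum memory.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Placeholder training data: column 0 ~ cycle time, column 1 ~ minimum memory.
X = np.column_stack([rng.uniform(20, 1500, 200), rng.uniform(64, 32000, 200)])
y = 1e4 / X[:, 0] + X[:, 1] / 100 + rng.normal(scale=1.0, size=200)
reg = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

# Two lists of candidate values, one per feature.
cycle_values = np.linspace(X[:, 0].min(), X[:, 0].max(), 30)
memory_values = np.linspace(X[:, 1].min(), X[:, 1].max(), 40)

# meshgrid builds every combination; ravel + column_stack turns it into rows for predict().
cc, mm = np.meshgrid(cycle_values, memory_values)
grid_points = np.column_stack([cc.ravel(), mm.ravel()])
grid_pred = reg.predict(grid_points).reshape(cc.shape)
print(cc.shape, grid_points.shape, grid_pred.shape)
```

`grid_pred` has the same shape as the mesh, so it can be passed directly to a contour or image plot.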

Combo Cycle time ns Minimum memory kB

The brown horizontal bar comes from a fast CPU with a low minimum memory, perhaps a transition model with broader hardware compatibility. This led me to review the selection of the 2 principal components:

Combo maximum memory and minimum memory

The predicted performance rate as a function of maximum memory and minimum memory.

MAE

This metric compares N predictions with their target values and returns the average absolute difference. It is already implemented in sklearn:
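A short sketch with made-up numbers, comparing sklearn's `mean_absolute_error` with the formula MAE = (1/N) Σ |y_i − ŷ_i| written out by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Illustrative numbers only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae_sklearn = mean_absolute_error(y_true, y_pred)
mae_by_hand = float(np.mean(np.abs(y_true - y_pred)))   # (1/N) * sum |y_i - yhat_i|
print(mae_sklearn, mae_by_hand)
```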

We can visualize the decision tree using sklearn's tree module ("import tree").

Keras regression method by Sina

Original features

As TensorFlow migrated to 2.0 and moved on to 2.1, 2.2 and 2.3, many changes in the code happened. The TF regression code I had could still run on v. 2.1, but no longer on 2.2. Thus I had to find a recent implementation, and I found a nice one written by Mr. Sina.

The last 9 CPUs perform too poorly and disturb the fit at the high end.

Split data into train/test

Target = trainDataset['PRP']

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense

Model compiling settings

Add a mechanism that stops training if the validation loss is not improving for more than n_idle_epochs.
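The stopping rule itself can be sketched in plain Python, independent of Keras (where this role is played by the `EarlyStopping` callback and its `patience` argument). The function name is hypothetical, the losses are made up, and stopping as soon as the idle count reaches `n_idle_epochs` is an assumption about the exact comparison.

```python
def early_stopping_epoch(val_losses, n_idle_epochs):
    """Return the index of the epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value for n_idle_epochs consecutive epochs.
    """
    best = float("inf")
    idle = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss      # new best: reset the idle counter
            idle = 0
        else:
            idle += 1
            if idle >= n_idle_epochs:
                return epoch
    return None

# Made-up validation losses for illustration.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6]
stop_at = early_stopping_epoch(losses, n_idle_epochs=3)
never = early_stopping_epoch(losses, n_idle_epochs=5)
print(stop_at, never)
```

With a small `n_idle_epochs` the run stops before the late improvement in the last epoch; a larger value would have let training continue.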

model.fit returns a History object (Keras records it through a callback) for each model. This object stores useful information that we want to extract and visualize. Let’s explore what is inside history:

history.history holds the training and validation losses. Let’s visualize the MAE loss for training and validation with the code below:

Root Mean Squared Error

Track the model improvement progression

data with ERP and logged features

I could not reach a decent accuracy with the original features and ERP as target, so I added a layer with a logarithmic conversion.

Let's take the logarithm of the dataframe

Split data into train/test

Target = trainDataset['ERP']

Model compiling settings

Add a mechanism that stops training if the validation loss is not improving for more than n_idle_epochs.

model.fit returns a History object (Keras records it through a callback) for each model. This object stores useful information that we want to extract and visualize. Let’s explore what is inside history:

history.history holds the training and validation losses. Let’s visualize the MAE loss for training and validation with the code below:

Root Mean Squared Error

data with PRP and added features

Why wasn't the StandardScaler used in previous example?

Split data into train/test

Split off the target = trainDataset['PRP']

Add several layers and dropout

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense

Model compiling settings

Add a mechanism that stops training if the validation loss is not improving for more than n_idle_epochs.

history.history

model.fit returns a History object (Keras records it through a callback) for each model. This object stores useful information that we want to extract and visualize. Let’s explore what is inside history:

history.history holds the training and validation losses. Let’s visualize the MAE loss for training and validation with the code below: