CPU Performance Data

Exploration

Relative CPU Performance Data

Source: UCI Machine Learning repository (http://www.ics.uci.edu/~mlearn/MLSummary.html).

  1. vendor name: 30 (adviser, amdahl, apollo, basf, bti, burroughs, c.r.d, cambex, cdc, dec, dg, formation, four-phase, gould, honeywell, hp, ibm, ipl, magnuson, microdata, nas, ncr, nixdorf, perkin-elmer, prime, siemens, sperry, stratus, wang)
  2. Model Name: many unique symbols
  3. MYCT: machine cycle time in nanoseconds (integer)
  4. MMIN: minimum main memory in kilobytes (integer)
  5. MMAX: maximum main memory in kilobytes (integer)
  6. CACH: cache memory in kilobytes (integer)
  7. CHMIN: minimum channels in units (integer)
  8. CHMAX: maximum channels in units (integer)
  9. PRP: published relative performance (integer)
  10. ERP: estimated relative performance from the original article (integer)
    The estimated relative performance (ERP) was calculated by Prof. Ein-Dor of Tel Aviv University and a colleague around 1988 on an Amdahl 470v/7 CPU.

Load the data

First, let's load the dataset and look at which features are collected for the CPUs:
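A minimal sketch of the loading step, using only the standard library. It assumes the file is the headerless, comma-separated `machine.data` from UCI with the ten columns listed above; the function name `load_cpu_rows` is an illustrative assumption and not part of the original notebook.

```python
import csv

# Column names follow the attribute list above; the file itself is assumed
# to have no header row.
COLUMNS = ["vendor", "model", "MYCT", "MMIN", "MMAX",
           "CACH", "CHMIN", "CHMAX", "PRP", "ERP"]
NUMERIC = COLUMNS[2:]


def load_cpu_rows(lines):
    """Parse comma-separated lines into a list of dicts, one per CPU."""
    rows = []
    for record in csv.reader(lines):
        if not record:  # skip blank lines
            continue
        row = dict(zip(COLUMNS, (field.strip() for field in record)))
        for name in NUMERIC:
            row[name] = int(row[name])  # all numeric attributes are integers
        rows.append(row)
    return rows
```

With a local copy of the file (the path is an assumption), `with open("machine.data", newline="") as f: rows = load_cpu_rows(f)` returns one dict per machine.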

In the 1982 landscape, Amdahl supplies 9 CPUs in the premium class.

Calculate RMSE for the estimates by Prof. Ein-Dor

The accuracy of the original ERP against BYTE's PRP, as reported on UCI, was measured as mean percentage deviation. That metric is uncommon today, so later authors have chosen to report RMSE instead:

  1. Darragh Hanley: "Best Linear Regression RMSE : 68.35" (2015)
  2. Johannes Ledolter: "RMSE : 69.35" with leave-one-out cross-validation. (Data Mining and Business Analytics with R)

Apparently, the later linear regressions did not come close to the accuracy of the original linear regression, whose predictions are stored in the dataset field ERP (estimated relative performance).
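For reference, RMSE over n machines is sqrt((1/n) * sum((PRP_i - ERP_i)^2)). The sketch below computes exactly that; the function name `rmse` is an assumption, and no dataset values are hard-coded.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

Applied to the PRP and ERP columns, `rmse(prp_values, erp_values)` gives the baseline against which the later regressions can be compared on the same scale.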

manual split

The ERP and PRP values are isolated first, before the data is split.
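One way to sketch this step: pull PRP (the target) and ERP (the published baseline) out of each record first, so that ERP cannot leak into the features, then split all three with the same shuffled indices. The row format (dicts keyed by the attribute names listed earlier), the function names, and the default fraction and seed are assumptions.

```python
import random

FEATURES = ["MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX"]


def isolate_targets(rows):
    """Separate the six numeric features from PRP (target) and ERP (baseline)."""
    X = [[row[name] for name in FEATURES] for row in rows]
    prp = [row["PRP"] for row in rows]
    erp = [row["ERP"] for row in rows]
    return X, prp, erp


def manual_split(X, prp, erp, test_fraction=0.22, seed=0):
    """Shuffle row indices once and cut them into train and test parts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(values, idx):
        return [values[i] for i in idx]

    # The same index lists are applied to X, PRP and ERP so rows stay aligned.
    train = (pick(X, train_idx), pick(prp, train_idx), pick(erp, train_idx))
    test = (pick(X, test_idx), pick(prp, test_idx), pick(erp, test_idx))
    return train, test
```

Keeping ERP alongside the test targets makes it possible to score the original estimates on exactly the rows used to score a new model.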

feature engineering 1

Feature engineering was not successful, so I'll use the KFold method...
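K-fold cross-validation cuts the rows into k folds and uses each fold once as the test set while the remaining folds are used for training. The generator below is a standard-library sketch of that idea. It is not scikit-learn's `KFold` class, and the name `kfold_indices` and its defaults are assumptions.

```python
import random


def kfold_indices(n_samples, n_splits=5, shuffle=True, seed=42):
    """Yield (train_idx, test_idx) pairs; each index is in exactly one test fold."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # The first (n_samples % n_splits) folds get one extra element.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size
```

Averaging the RMSE over the folds gives an estimate that depends less on one particular train/test split.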

auto split

test_size=0.22, random_state=42

Cycles per time unit vs. cache size

Mean, sd and outliers of the 6 features.
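A possible sketch of such a summary using the `statistics` module. The source does not say how outliers were identified; the 1.5 x IQR rule used here is an assumption, as is the function name `summarize`.

```python
import statistics


def summarize(values):
    """Return mean, sample standard deviation and 1.5*IQR outliers of a column."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "outliers": [v for v in values if v < low or v > high],
    }
```

Calling it once per feature column (MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX) yields the per-feature figures.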