Evaluation in Machine Learning
Offline evaluation: offline evaluation measures the performance of a machine learning model on a fixed evaluation dataset. Offline metrics measure how close the model's predictions on the evaluation dataset are to the ground truth.
Online evaluation: online evaluation measures how the deployed model performs in production. Online evaluation metrics are typically tied to business objectives.
| Task | Offline metrics |
|---|---|
| Classification | Precision, recall, F1 score, accuracy |
| Regression | Mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE) |
| Ranking | Precision@k, recall@k, mean average precision (mAP), normalized discounted cumulative gain (nDCG), mean reciprocal rank (MRR) |
| NLP | BLEU, ROUGE, METEOR, CIDEr, SPICE |
| Problem | Online metrics |
|---|---|
| Ad click prediction | Click-through rate, revenue lift |
| Harmful content detection | Prevalence, valid appeals |
| Video recommendation | Click-through rate, total watch time, number of completed videos |
| Friend recommendation | Number of requests sent per day, number of requests accepted per day |
Precision
In the context of binary classification (Yes/No), precision measures how reliable the model’s positive predictions (i.e., “Yes”) are. In other words, when a positive value is predicted, how often is the prediction correct? We could game this metric by predicting positive only for the single observation we are most confident in.
\[P = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\]
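The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper name `precision`, the toy labels and scores, the 0.5 threshold, and the choice to return 0.0 when nothing is predicted positive are all assumptions made for the example.

```python
def precision(y_true, y_pred):
    """Fraction of positive predictions that are actually positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fp == 0:
        return 0.0  # assumption: define precision as 0.0 with no positive predictions
    return tp / (tp + fp)


# Hypothetical ground-truth labels and model confidence scores.
y_true = [1, 1, 1, 0, 0, 1]
scores = [0.95, 0.60, 0.40, 0.55, 0.10, 0.70]

# Ordinary use: predict positive whenever the score reaches 0.5.
y_pred = [1 if s >= 0.5 else 0 for s in scores]
honest_precision = precision(y_true, y_pred)

# Gaming the metric: predict positive only for the most confident observation.
top_score = max(scores)
y_gamed = [1 if s == top_score else 0 for s in scores]
gamed_precision = precision(y_true, y_gamed)

print(honest_precision)  # 0.75
print(gamed_precision)   # 1.0
```

The gamed classifier reaches perfect precision while flagging only one of the four positive observations, which is why precision is rarely reported on its own.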
Recall
Also called sensitivity. In the context of binary classification (Yes/No), recall measures how “sensitive” the classifier is at detecting positive instances. In other words, of all the truly positive observations in our sample, how many did we “catch”? We could game this metric by always classifying observations as positive.
\[R = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\]
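As a rough sketch of the recall formula and of the "always positive" trick, the snippet below uses plain Python. The helper name `recall`, the toy labels, and the 0.0 fallback when there are no positive observations are assumptions chosen for illustration.

```python
def recall(y_true, y_pred):
    """Fraction of actual positives that the model predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        return 0.0  # assumption: define recall as 0.0 when there are no positives
    return tp / (tp + fn)


# Hypothetical ground truth and model predictions.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]
model_recall = recall(y_true, y_pred)  # catches 2 of the 3 positives

# Gaming the metric: classify every observation as positive.
always_positive = [1] * len(y_true)
gamed_recall = recall(y_true, always_positive)

# When everything is predicted positive, precision collapses to the
# share of positives in the data.
gamed_precision = sum(y_true) / len(y_true)

print(gamed_recall)     # 1.0
print(gamed_precision)  # 0.375
```

Perfect recall obtained this way says nothing about the quality of the model, since most of its positive predictions are wrong.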
Recall vs Precision
Say we are analyzing brain scans and trying to predict whether a person has a tumor (True) or not (False). We feed the scans into our model and it starts guessing.
Precision is the percentage of True guesses that were actually correct. Suppose 10 of 100 images actually show a tumor. If we guess True for just 1 image and that image really does show a tumor, then our precision is 100%! Our results aren’t helpful, however, because we missed the other 9 tumors. We were super precise when we tried, but we didn’t try hard enough.
Recall, or sensitivity, provides another lens through which to view how good our model is. Again, let’s say there are 100 images, 10 with brain tumors, and we correctly guessed that 1 had a brain tumor. Precision is 100%, but recall is only 10%. Perfect recall requires that we catch all 10 tumors!
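The brain scan numbers can be checked with a short Python sketch. The label lists below are constructed purely to match the scenario (100 images, 10 with tumors, a single True guess that happens to be correct); the variable names are illustrative.

```python
n_images = 100
n_tumors = 10

# 1 = tumor (True), 0 = no tumor (False); the first 10 images have tumors.
y_true = [1] * n_tumors + [0] * (n_images - n_tumors)
# The model guesses True for exactly one image, and that image has a tumor.
y_pred = [1] + [0] * (n_images - 1)

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # tumors we caught
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # tumors we missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy scans left alone

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.0%} recall={recall:.0%}")
# TP=1 FP=0 FN=9 TN=90
# precision=100% recall=10%
```

The 9 false negatives leave precision untouched but pull recall down to 10%, which is why the two metrics are usually read together.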