Evaluation in Machine Learning 

Offline evaluation: offline evaluation assesses the performance of machine learning models. Offline metrics measure how close the model's predictions on the evaluation dataset are to the ground truth.
Online evaluation: online evaluation measures how the deployed model performs in production. Online evaluation metrics are typically tied to business objectives.

Offline evaluation metrics
Task	Offline metrics
Classification	Precision, recall, F1 score, accuracy
Regression	Mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE)
Ranking	Precision@k, recall@k, mean average precision (mAP), normalized discounted cumulative gain (nDCG), mean reciprocal rank (MRR)
NLP	BLEU, ROUGE, METEOR, CIDEr, SPICE

Online evaluation metrics
Problem	Online metrics
Ad click prediction	Click-through rate, revenue lift.
Harmful content detection	Prevalence, valid appeals.
Video recommendation	Click-through rate, total watch time, number of completed videos.
Friend recommendation	Number of requests sent per day, number of requests accepted per day.

Precision 

In the context of binary classification (Yes/No), precision measures how reliable the model’s positive predictions (i.e. “Yes”) are. In other words, when a positive value is predicted, how often is the prediction correct? We could game this metric by only returning positive for the single observation we are most confident in.

\[P = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\]
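
As a rough illustration of the formula above, here is a minimal sketch in plain Python. The function name `precision`, the toy labels, and the choice to return 0.0 when nothing is predicted positive are all assumptions made for this example, not part of any particular library.

```python
def precision(y_true, y_pred):
    """Fraction of positive predictions that are actually positive."""
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    # False positives: predicted positive but actually negative.
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    if tp + fp == 0:
        # Assumed convention: no positive predictions means precision 0.0.
        return 0.0
    return tp / (tp + fp)


# Hypothetical labels: 1 = "Yes", 0 = "No".
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(precision(y_true, y_pred))   # 2 TP, 1 FP -> 2/3

# Gaming the metric: predict positive only for one confident observation.
y_gamed = [1, 0, 0, 0, 0, 0]
print(precision(y_true, y_gamed))  # 1 TP, 0 FP -> 1.0
```

The second call shows why precision alone can mislead: the gamed predictions score perfectly even though two of the three positives are never flagged.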

Recall 

Also called sensitivity. In the context of binary classification (Yes/No), recall measures how “sensitive” the classifier is at detecting positive instances. In other words, of all the actually positive observations in our sample, how many did we “catch”? We could game this metric by always classifying observations as positive.

\[R = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\]
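
The recall formula can be sketched the same way in plain Python. The function name `recall`, the toy labels, and the convention of returning 0.0 when the sample contains no positives are assumptions made for this example.

```python
def recall(y_true, y_pred):
    """Fraction of actual positives that the model predicted as positive."""
    # True positives: actually positive and predicted positive.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    # False negatives: actually positive but predicted negative.
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp + fn == 0:
        # Assumed convention: no actual positives means recall 0.0.
        return 0.0
    return tp / (tp + fn)


# Hypothetical labels: 1 = "Yes", 0 = "No".
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(recall(y_true, y_pred))    # 2 TP, 1 FN -> 2/3

# Gaming the metric: classify every observation as positive.
y_always = [1] * len(y_true)
print(recall(y_true, y_always))  # 3 TP, 0 FN -> 1.0
```

Always predicting positive yields perfect recall here, but half of those positive predictions are wrong, which recall does not penalize.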

Recall vs Precision 

Say we are analyzing brain scans and trying to predict whether a person has a tumor (True) or not (False). We feed the scans into our model and it starts guessing.

Precision is the percentage of True guesses that were actually correct. If we guess True for only 1 image out of 100 and that image really does show a tumor, then our precision is 100%! The result isn’t helpful, however, if 10 of the images actually contain tumors, because we missed the other 9. We were super precise when we tried, but we didn’t try hard enough.

Recall, or sensitivity, provides another lens through which to view how good our model is. Again, let’s say there are 100 images, 10 with brain tumors, and we guessed True for only 1 image, which did have a tumor. Precision is 100%, but recall is only 10%. Perfect recall requires that we catch all 10 tumors!
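
The brain-scan numbers can be checked with a short Python sketch. The helper `precision_recall` and the way the 100 labels are laid out (tumors placed first) are assumptions made purely for illustration.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for two equal-length lists of booleans."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    # Assumed convention: an undefined ratio is reported as 0.0.
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec


# 100 scans, of which the first 10 actually show a tumor (True).
y_true = [True] * 10 + [False] * 90

# Cautious model: guesses True for a single scan, which does show a tumor.
cautious = [True] + [False] * 99
print(precision_recall(y_true, cautious))  # (1.0, 0.1)

# Opposite extreme: guess True for every scan.
eager = [True] * 100
print(precision_recall(y_true, eager))     # (0.1, 1.0)
```

The two extremes mirror each other: the cautious model is perfectly precise but catches 1 of 10 tumors, while the eager model catches every tumor but is right on only 10 of its 100 positive guesses.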