Hands-On ML Mastery
Scikit-Learn & TensorFlow Concept Cards
🎯 Target Certifications (Roadmap)
Part 1: The Landscape
A sequential chain of transformations that automates the workflow: Data In -> Preprocessing -> Model ->
Prediction.
The Analogy
Think of an Automated Car Wash. The car (data) goes in dirty. It passes through Step 1
(Soap/Imputer), Step 2 (Scrub/Scaler), and Step 3 (Wax/Model). You don't wash parts of the car manually;
you push the button (`fit()`) and the machine does the rest in order.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
# Define the steps: Name + Transformer/Estimator
pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Fill missing
    ('scaler', StandardScaler()),                   # Scale features
    ('svc', SVC())                                  # The Model
])
# One call manages everything
pipe.fit(X_train, y_train)
pipe.predict(X_new)
The Gotcha!
Never call `fit_transform()` on your Test Set! You must `fit` on Training data (learn the mean/std) and
only `transform` the Test data. Pipelines handle this automatically if you call `predict()` on test
data.
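For intuition, this is roughly what the Pipeline does for you (a minimal sketch; `X_train` and `X_test` are assumed to exist):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those stats; never re-fit on the test set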
Check: Why is 'ColumnTransformer' typically used before a Pipeline?
Because different columns need different washing! `ColumnTransformer` splits
the data: Categorical cols go to OneHotEncoder, Numerical cols go to Imputer/Scaler, then they merge
back together before entering the final Pipeline model.
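A minimal sketch of that split-then-merge idea (the column names here are made up for illustration):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
num_cols = ["age", "income"]   # hypothetical numerical columns
cat_cols = ["city"]            # hypothetical categorical column
preprocess = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), num_cols),
    ('cat', OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
full_pipe = Pipeline([('prep', preprocess), ('svc', SVC())])
full_pipe.fit(X_train, y_train)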
An iterative optimization algorithm used to find the minimum of a function (Cost Function) by taking
small steps proportional to the negative of the gradient.
The Analogy
Walking down a misty mountain. You can't see the bottom (Global Minimum). You feel the
slope under your feet. If it slopes down to the right, you take a step right. The size of the step is
your Learning Rate.
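A bare-bones NumPy sketch of the descent loop for linear regression (illustrative only; assumes `X` is an (m, 1) feature array and `y` an (m, 1) target array):
import numpy as np
eta = 0.1                              # learning rate: the size of each step
theta = np.random.randn(2, 1)          # random starting point on the mountain
X_b = np.c_[np.ones((len(X), 1)), X]   # add the bias (intercept) column
for epoch in range(1000):
    gradients = 2 / len(X_b) * X_b.T @ (X_b @ theta - y)  # the slope under your feet
    theta = theta - eta * gradients                        # one step downhill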
The Gotcha!
If features are on different scales (e.g., Rooms=5, Price=500,000), the "mountain" becomes a long, thin
valley. You bounce back and forth forever trying to reach the bottom. Always Scale Data
(StandardScaler) before using Gradient Descent!
Check: What happens if your 'steps' (Learning Rate) are too big?
Divergence. You jump across the valley to the other side, possibly climbing
higher than where you started. You miss the bottom entirely.
The fundamental trade-off. Bias is error from wrong assumptions (Model too simple).
Variance is sensitivity to noise (Model too complex).
The Analogy
Underfitting (High Bias): Trying to kill Godzilla with a flyswatter. The tool is
just too weak for the job.
Overfitting (High Variance): Connecting the dots on a starry night to draw a
constellation that includes every single star, even the dust. It looks perfect but predicts nothing
about the next patch of sky.
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
# High Bias (Underfitting)
model = LinearRegression()  # Straight line for complex data
# High Variance (Overfitting)
model = DecisionTreeRegressor(max_depth=None)  # Memorizes noise
# Balanced (Regularization)
model = Ridge(alpha=1.0)  # Constrains the model
Check: If your Training Error is low but Validation Error is high, what do you
have?
High Variance (Overfitting). Your model memorized the training book but fails
the test. You need Regularization or more data.
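One quick way to see this in code (a sketch, assuming `X_train` and `y_train` exist):
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor(max_depth=None)
tree.fit(X_train, y_train)
print("Train R^2:", tree.score(X_train, y_train))  # near 1.0: memorized the book
print("CV R^2:", cross_val_score(tree, X_train, y_train, cv=5).mean())  # much lower: fails the test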
Part 2: The Toolbox
A classifier that finds the "widest street" (margin) separating two classes. It cares most about the
points on the edge (Support Vectors).
The Analogy
Imagine a road between two crowds of people. SVM tries to make the road as wide as possible without
touching anyone. The people standing right on the curb are the Support Vectors. If they
move, the road moves. The people in the back don't matter.
from sklearn.svm import SVC
# Kernel Trick: "rbf" lifts data into 3D to separate circles
# C: Inverse regularization.
# Low C = Wider Street (More violations allowed, Low Variance)
# High C = Narrow Street (Strict, High Variance)
model = SVC(kernel="rbf", C=1.0, gamma="scale")
The Gotcha!
SVMs are distance-based. If you forget `StandardScaler`, the variable with larger numbers (e.g. Salary)
will dominate the distance calculation and skew the "street" completely.
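The usual fix is to bundle the scaler with the SVM (a minimal sketch):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
svm_pipe = Pipeline([
    ('scaler', StandardScaler()),        # every feature gets mean 0, std 1
    ('svc', SVC(kernel="rbf", C=1.0))    # distances are now comparable
])
svm_pipe.fit(X_train, y_train)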
An ensemble of many Decision Trees, each trained on a random subset of the data (Bagging) and a random
subset of features.
The Analogy
Wisdom of the Crowd. Asking one expert (Decision Tree) is risky; they might be biased
or quirky. Asking 100 random people (Forest) and taking the average vote cancels out individual quirks
(Variance) and reveals the truth.
from sklearn.ensemble import RandomForestClassifier
# 100 Trees, using all CPU cores (n_jobs=-1)
# OOB Score: Uses leftover data for free validation
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, oob_score=True)
rf.fit(X_train, y_train)
# Feature Importance is free!
print(rf.feature_importances_)
Check: Why doesn't a Random Forest overfit as easily as a Decision Tree?
Because of Bagging. By averaging many uncorrelated trees, the
Variance is reduced. The error of individual trees cancels out.
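For intuition, a Random Forest behaves roughly like bagging many Decision Trees by hand (a sketch; the forest additionally randomizes the features considered at each split):
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# 100 trees, each trained on its own bootstrap sample of the training set
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, n_jobs=-1)
bag.fit(X_train, y_train)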
Principal Component Analysis rotates and projects data onto a lower-dimensional plane that preserves the
most Variance (Information).
The Analogy
The Shadow. You have a 3D object (your hand). You want to project it onto a 2D wall
(shadow) while keeping it recognizable. If you shine the light from the side, the shadow is a thin line
(bad, low variance). If you shine it flat on, you see the full hand shape (good, high variance). PCA
finds the perfect angle for the light.
The Gotcha!
PCA assumes linear correlations. If your data is a rolled-up Swiss Roll (manifold), smashing it flat
will merge layers that shouldn't touch. You might need t-SNE or LLE instead.
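A minimal sketch (assuming `X` is already scaled):
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)           # keep enough components to preserve 95% of the variance
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # how much information each "shadow axis" keeps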
Part 3: The Engine (Neural Nets)
Mathematical functions that introduce non-linearity into the network. Without them, a
deep stack of layers is just one big Linear Regression.
The Analogy
The Gatekeeper. A neuron gathers inputs. The Activation function decides "Should I
fire?".
ReLU (Rectified Linear Unit) is like a bouncer who says: "If you are negative, you
are Zero. If you are positive, pass through unchanged."
`max(0, z)`
from tensorflow import keras
# ReLU is standard for hidden layers (Fast, no vanishing gradient)
# Softmax is standard for Output layer (Multi-class prob)
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])
Check: Why choose ReLU over Sigmoid for deep networks?
Sigmoid squashes numbers between 0 and 1. In deep nets, gradients get
multiplied repeatedly (Chain Rule). 0.5 * 0.5 * 0.5... becomes tiny (Vanishing
Gradient). ReLU doesn't squash positives, keeping gradients healthy.
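A tiny numeric illustration of that chain effect (the sigmoid derivative never exceeds 0.25; ReLU's is 1 for positive inputs):
import numpy as np
sigmoid_grads = np.full(10, 0.25)  # best-case sigmoid derivative across 10 layers
relu_grads = np.full(10, 1.0)      # ReLU derivative for positive inputs
print(np.prod(sigmoid_grads))      # ~9.5e-07: the gradient has vanished
print(np.prod(relu_grads))         # 1.0: the gradient survives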
The logic that updates the weights based on the gradients. Adam (Adaptive Moment Estimation) is the
"default" choice.
The Analogy
The Heavy Ball.
Regular SGD is like a drunk person walking downhill (noisy).
Momentum is like a heavy ball rolling downhill (gains speed).
Adam is a heavy ball with friction that knows the terrain; it speeds up on
straights and slows down carefully for corners.
The Gotcha!
Learning Rate is still the most important hyperparameter. Adam adapts per-parameter, but you still set
the global `learning_rate` (eta). Too high = Explodes. Too low = Sleepy ball.
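A typical Keras setup, reusing the small net from the activation card (a sketch; the loss choice assumes integer class labels):
from tensorflow import keras
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])
# Adam with an explicit learning rate: 1e-3 is Keras's default, and the first knob to tune
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              metrics=["accuracy"])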
Part 4: Deep Architectures
A network that scans images with small "Filters" (Kernels) to detect local features (edges, lines), then
hierarchically combines them into shapes and objects.
The Analogy
The Flashlight. You are looking for a cat in a dark room. You use a small flashlight
(Filter) and scan across the top left, then move right (Stride). You don't try to see the whole room at
once. You find an ear, then an eye, and stitch them together.
model = keras.models.Sequential([
    # 64 Flashlights, 7x7 size.
    keras.layers.Conv2D(64, kernel_size=7, activation="relu", padding="same"),
    # Shrink the image (Max Pooling) - Keep only the strongest feature
    keras.layers.MaxPooling2D(pool_size=2),
    ...
])
Check: What is the purpose of Pooling?
To reduce dimensionality (computation cost) and provide
Invariance. If the cat moves 1 pixel to the right, the Max Pool output likely
remains the same.
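A quick shape check of that shrinking effect (a sketch with a dummy batch of 28x28 feature maps):
import numpy as np
from tensorflow import keras
feature_maps = np.random.rand(1, 28, 28, 64).astype("float32")  # pretend output of a Conv2D layer
pooled = keras.layers.MaxPooling2D(pool_size=2)(feature_maps)
print(pooled.shape)  # (1, 14, 14, 64): half the height and width, same 64 feature maps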
Nets with "Memory". They process inputs one by one, passing a "State" vector from the previous step to
the next. LSTMs (Long Short-Term Memory) fix the "forgetting" problem.
The Analogy
Reading a sentence. When you read the last word of a sentence, you understand it
because you remember the context of the first word. You don't read words in isolation. LSTM is like a
reader with a notepad who writes down important context ("Subject is singular") and carries it forward.
The Gotcha!
RNNs are slow (Sequential processing, hard to parallelize). For long sequences (NLP),
Transformers have largely replaced RNNs. But for Time Series (IoT, stock
prices), RNNs/LSTMs are still valid.
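A minimal Keras sketch for a univariate time series (illustrative shapes: 50 time steps, 1 feature per step):
from tensorflow import keras
model = keras.models.Sequential([
    # 20 LSTM cells, each carrying their "notepad" (cell state) across the 50 steps
    keras.layers.LSTM(20, input_shape=[50, 1]),
    keras.layers.Dense(1)  # predict the next value
])
model.compile(loss="mse", optimizer="adam")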
Taking a model trained on a massive dataset (e.g., ResNet on ImageNet) and reusing its lower layers
(Feature Detectors) for your specific, smaller task.
The Analogy
The Rental Car. You don't build a car engine from scratch to go to the grocery store.
You rent a Ferrari, paint it a different color (Top Layers), and drive it. The engine (Lower Layers)
already knows how to detect curves, lines, and textures.
base_model = keras.applications.ResNet50(weights="imagenet", include_top=False)
base_model.trainable = False # Freeze the Ferrari Engine!
# Add your custom destination
model = keras.models.Sequential([
    base_model,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(10, activation="softmax")
])
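A typical follow-up (a sketch, assuming a 10-class task and existing `X_train`, `y_train`): train the new top first, then optionally unfreeze and fine-tune with a much smaller learning rate.
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, validation_split=0.1)
# Optional fine-tuning: unfreeze the engine, but drive it gently (tiny learning rate)
base_model.trainable = True
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.Adam(learning_rate=1e-5),
              metrics=["accuracy"])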
Hands-On ML Mastery • Generated for Personal Study