Hands-On ML Mastery

Scikit-Learn & TensorFlow Concept Cards


Part 1: The Landscape

The Pipeline

Scikit-Learn
A sequential chain of transformations that automates the workflow: Data In -> Preprocessing -> Model -> Prediction.
The Analogy: Think of an Automated Car Wash. The car (data) goes in dirty. It passes through Step 1 (Soap/Imputer), Step 2 (Scrub/Scaler), and Step 3 (Wax/Model). You don't wash parts of the car manually; you push the button (`fit()`) and the machine does the rest in order.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC

# Define the steps: Name + Transformer/Estimator
pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Fill missing
    ('scaler', StandardScaler()),                   # Scale features
    ('svc', SVC())                                  # The Model
])

# One call manages everything
pipe.fit(X_train, y_train)
pipe.predict(X_new)
The Gotcha! Never call `fit_transform()` on your Test Set! You must `fit` on Training data (learn the mean/std) and only `transform` the Test data. Pipelines handle this automatically if you call `predict()` on test data.
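Outside a Pipeline, the same rule looks like this (a minimal sketch; `X_train` and `X_test` are assumed to already exist):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the Training set only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics; never re-fit on Test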

Check: Why is a 'ColumnTransformer' typically placed before the model in a Pipeline?

Because different columns need different washing! `ColumnTransformer` splits the data: Categorical cols go to OneHotEncoder, Numerical cols go to Imputer/Scaler, then they merge back together before entering the final Pipeline model.
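A minimal sketch of that split; the column names `num_cols` and `cat_cols` are hypothetical placeholders:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC

num_cols = ["rooms", "age"]       # hypothetical numerical columns
cat_cols = ["ocean_proximity"]    # hypothetical categorical column

preprocess = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# The merged output feeds the final model
full_pipe = Pipeline([("prep", preprocess), ("svc", SVC())])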

Gradient Descent

Fundamentals
An iterative optimization algorithm used to find the minimum of a function (Cost Function) by taking small steps proportional to the negative of the gradient.
The Analogy: Walking down a misty mountain. You can't see the bottom (Global Minimum). You feel the slope under your feet. If it slopes down to the right, you take a step right. The size of the step is your Learning Rate.
The Gotcha! If features are on different scales (e.g., Rooms=5, Price=500,000), the "mountain" becomes a long, thin valley. You bounce back and forth forever trying to reach the bottom. Always Scale Data (StandardScaler) before using Gradient Descent!
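Putting that advice into code, here is a minimal NumPy sketch of batch Gradient Descent for linear regression (the data, learning rate `eta`, and iteration count are illustrative assumptions):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 samples, 2 features on wildly different scales
X = np.random.rand(100, 2) * [5, 500_000]
y = X @ np.array([2.0, 0.001]) + np.random.randn(100)

X_scaled = StandardScaler().fit_transform(X)  # flatten the "long, thin valley"
X_b = np.c_[np.ones((100, 1)), X_scaled]      # add bias term

eta = 0.1                                     # learning rate (step size)
theta = np.random.randn(3)                    # random starting point on the mountain

for _ in range(1000):
    gradients = 2 / 100 * X_b.T @ (X_b @ theta - y)  # slope under your feet
    theta -= eta * gradients                         # step downhill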

Check: What happens if your 'steps' (Learning Rate) are too big?

Divergence. You jump across the valley to the other side, possibly climbing higher than where you started. You miss the bottom entirely.

Bias vs Variance

Theory
The central trade-off. Bias is error from wrong assumptions (Model too simple). Variance is sensitivity to noise (Model too complex).
The Analogy:

Underfitting (High Bias): Trying to kill Godzilla with a flyswatter. The tool is just too weak for the job.

Overfitting (High Variance): Connecting the dots on a starry night to draw a constellation that includes every single star, even the dust. It looks perfect but predicts nothing about the next patch of sky.

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# High Bias (Underfitting)
model = LinearRegression()                     # Straight line for complex data

# High Variance (Overfitting)
model = DecisionTreeRegressor(max_depth=None)  # Memorizes noise

# Balanced (Regularization)
model = Ridge(alpha=1.0)                       # Constrains the model

Check: If your Training Error is low but Validation Error is high, what do you have?

High Variance (Overfitting). Your model memorized the training book but fails the test. You need Regularization or more data.
Part 2: The Toolbox

Support Vector Machines (SVM)

Algorithm
A classifier that finds the "widest street" (margin) separating two classes. It cares most about the points on the edge (Support Vectors).
The Analogy: Imagine a road between two crowds of people. SVM tries to make the road as wide as possible without touching anyone. The people standing right on the curb are the Support Vectors. If they move, the road moves. The people in the back don't matter.
from sklearn.svm import SVC

# Kernel Trick: "rbf" lifts data into 3D to separate circles
# C: Inverse regularization.
#   Low C  = Wider Street (More violations allowed, Low Variance)
#   High C = Narrow Street (Strict, High Variance)
model = SVC(kernel="rbf", C=1.0, gamma="scale")
The Gotcha! SVMs are distance-based. If you forget `StandardScaler`, the variable with larger numbers (e.g. Salary) will dominate the distance calculation, skewing the "street" completely.
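A common safeguard is to bundle the scaler and the SVM into one Pipeline (a sketch, assuming `X_train`/`y_train` are already defined):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scaling lives inside the pipeline, so it can never be forgotten
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm_clf.fit(X_train, y_train)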

Random Forests

Ensemble
An ensemble of many Decision Trees, each trained on a random subset of the data (Bagging) and a random subset of features.
The Analogy: Wisdom of the Crowd. Asking one expert (Decision Tree) is risky; they might be biased or quirky. Asking 100 random people (Forest) and taking the average vote cancels out individual quirks (Variance) and reveals the truth.
from sklearn.ensemble import RandomForestClassifier

# 100 Trees, using all CPU cores (n_jobs=-1)
# OOB Score: Uses leftover data for free validation
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, oob_score=True)
rf.fit(X_train, y_train)

# Feature Importance is free!
print(rf.feature_importances_)

Check: Why doesn't a Random Forest overfit as easily as a Decision Tree?

Because of Bagging. By averaging many uncorrelated trees, the Variance is reduced. The error of individual trees cancels out.

PCA (Dimensionality Reduction)

Unsupervised
Principal Component Analysis rotates and projects data onto a lower-dimensional plane that preserves the most Variance (Information).
The Analogy: The Shadow. You have a 3D object (your hand). You want to project it onto a 2D wall (shadow) while keeping it recognizable. If you shine the light from the side, the shadow is a thin line (bad, low variance). If you shine it straight at the palm, you see the full hand shape (good, high variance). PCA finds the perfect angle for the light.
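A minimal scikit-learn sketch of "finding the angle"; `X` is an assumed, already-loaded feature matrix:

from sklearn.decomposition import PCA

# Keep as many components as needed to preserve 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(pca.explained_variance_ratio_)  # share of variance kept by each component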
The Gotcha! PCA assumes linear correlations. If your data is a rolled-up Swiss Roll (manifold), smashing it flat will merge layers that shouldn't touch. You might need t-SNE or LLE instead.
Part 3: The Engine (Neural Nets)

Activation Functions (ReLU)

Deep Learning
Mathematical functions that introduce non-linearity into the network. Without them, a deep stack of layers is just one big Linear Regression.
The Analogy: The Gatekeeper. A neuron gathers inputs. The Activation function decides "Should I fire?".
ReLU (Rectified Linear Unit) is like a bouncer who says: "If you are negative, you are Zero. If you are positive, pass through unchanged." `max(0, z)`
from tensorflow import keras

# ReLU is standard for hidden layers (Fast, no vanishing gradient)
# Softmax is standard for Output layer (Multi-class prob)
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])

Check: Why choose ReLU over Sigmoid for deep networks?

Sigmoid squashes numbers between 0 and 1, and its derivative never exceeds 0.25. In deep nets, gradients get multiplied repeatedly (Chain Rule): 0.25 * 0.25 * 0.25... becomes tiny (Vanishing Gradient). ReLU doesn't squash positives, keeping gradients healthy.

The Optimizer (Adam)

Training
The logic that updates the weights based on the gradients. Adam (Adaptive Moment Estimation) is the "default" choice.
The Analogy: The Heavy Ball.
Regular SGD is like a drunk person walking downhill (noisy).
Momentum is like a heavy ball rolling downhill (gains speed).
Adam is a heavy ball with friction that knows the terrain; it speeds up on straights and slows down carefully for corners.
The Gotcha! Learning Rate is still the most important hyperparameter. Adam adapts per-parameter, but you still set the global `learning_rate` (eta). Too high = Explodes. Too low = Sleepy ball.
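In Keras the optimizer is picked at compile time; a minimal sketch (the layer sizes, loss, and learning rate below are illustrative assumptions):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])

# Adam adapts the step per parameter, but the global learning_rate is still yours to tune
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              metrics=["accuracy"])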
Part 4: Deep Architectures

Convolutional Neural Net (CNN)

Vision
A network that scans images with small "Filters" (Kernels) to detect local features (edges, lines), then hierarchically combines them into shapes and objects.
The Analogy: The Flashlight. You are looking for a cat in a dark room. You use a small flashlight (Filter) and scan across the top left, then move right (Stride). You don't try to see the whole room at once. You find an ear, then an eye, and stitch them together.
model = keras.models.Sequential([
    # 64 Flashlights, 7x7 size.
    keras.layers.Conv2D(64, kernel_size=7, activation="relu", padding="same"),
    # Shrink the image (Max Pooling) - Keep only the strongest feature
    keras.layers.MaxPooling2D(pool_size=2),
    ...
])

Check: What is the purpose of Pooling?

To reduce dimensionality (computation cost) and provide Invariance. If the cat moves 1 pixel to the right, the Max Pool output likely remains the same.

Recurrent Neural Net (RNN) / LSTM

Sequences
Nets with "Memory". They process inputs one by one, passing a "State" vector from the previous step to the next. LSTMs (Long Short-Term Memory) fix the "forgetting" problem.
The Analogy: Reading a sentence. When you read the last word of a sentence, you understand it because you remember the context of the first word. You don't read words in isolation. An LSTM is like a reader with a notepad who writes down important context ("Subject is singular") and carries it forward.
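A minimal Keras sketch of a sequence model; the sequence length (50 time steps) and single feature are illustrative assumptions:

from tensorflow import keras

model = keras.models.Sequential([
    # 20 memory cells; the hidden state (the "notepad") is carried across all 50 steps
    keras.layers.LSTM(20, input_shape=[50, 1]),
    keras.layers.Dense(1)  # e.g. predict the next value in the series
])
model.compile(loss="mse", optimizer="adam")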
The Gotcha! RNNs are slow (Sequential processing, hard to parallelize). For long sequences (NLP), Transformers have largely replaced RNNs. But for Time Series (IoT, stock prices), RNNs/LSTMs are still valid.

Transfer Learning

Efficiency
Taking a model trained on a massive dataset (e.g., ResNet on ImageNet) and reusing its lower layers (Feature Detectors) for your specific, smaller task.
The Analogy: The Rental Car. You don't build a car engine from scratch to go to the grocery store. You rent a Ferrari, paint it a different color (Top Layers), and drive it. The engine (Lower Layers) already knows how to detect curves, lines, and textures.
base_model = keras.applications.ResNet50(weights="imagenet", include_top=False)
base_model.trainable = False  # Freeze the Ferrari Engine!

# Add your custom destination
model = keras.models.Sequential([
    base_model,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(10, activation="softmax")
])
Hands-On ML Mastery • Generated for Personal Study