An Introduction to the PyTorch Training Workflow
Statistical modeling, machine learning, and deep learning all aim to solve the same core problem: fitting an effective function (i.e., a model) to data in order to predict unseen outcomes.
PyTorch breaks this process into clear, controllable, and composable components, providing a unified framework for model definition, training, and evaluation.
1. Where PyTorch Fits in the Modeling Landscape
Before discussing PyTorch, it is helpful to place it within the broader context of machine learning and deep learning.
1.1 Starting from Traditional Machine Learning
In traditional machine learning, a problem is usually described as:
- A two-dimensional dataset, where each row represents one sample
- Multiple columns as features $X$
- One column as the target variable $y$
Here:
- $y$ can be a categorical variable (binary or multi-class classification)
- or a continuous value (regression)
Such problems are commonly solved using scikit-learn. A typical workflow looks like:
```python
model.fit(X_train, y_train)
pred = model.predict(X_test)
```
These models share several common properties:
- Conceptually, they are estimators
- They have clear evaluation metrics (accuracy, recall, precision, RMSE, etc.)
- Their performance can be improved through hyperparameter tuning
From a modeling perspective, a model is essentially a function approximator: given input $X$, it outputs a prediction of $y$.
1.2 Deep Learning and the Role of Tensors
Deep learning does not change the core goal of prediction, but it significantly expands the types of data that can be modeled.
Unlike traditional machine learning, which mainly works with tabular data, deep learning often deals with:
- Images (pixel grids with channel information)
- Text (sequential data)
- Audio and time series (high-dimensional structured data)
These data types are not naturally represented as simple tables. This leads to a key concept in deep learning: the tensor.
A tensor can be understood as:
- A multi-dimensional array
- With built-in support for automatic differentiation
- Efficiently computed on GPUs
In PyTorch, all data, model parameters, model outputs, and loss values are represented as tensors.
Common tensor shapes include:
- One-dimensional (labels): `y.shape == [B]`
- Two-dimensional (tabular features): `x.shape == [B, num_features]`
- Four-dimensional (images): `x_image.shape == [B, 1, 64, 64]`

where:

- `B` is the batch size
- `1` is the number of channels (a grayscale image)
- `64 × 64` is the image resolution
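These shapes can be checked directly. The batch size `B = 32` and the feature count `9` below are arbitrary illustrative values:

```python
import torch

B = 32  # illustrative batch size

y = torch.zeros(B)                   # 1-D: one label per sample
x = torch.randn(B, 9)                # 2-D: tabular features (9 columns, arbitrary)
x_image = torch.randn(B, 1, 64, 64)  # 4-D: grayscale 64×64 images

print(y.shape, x.shape, x_image.shape)
```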
2. One-hot Encoding and Categorical Variables
In many deep learning tasks, inputs include categorical features such as type, status, or class.
If categories are encoded as integers:
```
A = 0, B = 1, C = 2
```
the model may incorrectly assume that:
- C is “larger” than A
- B is “closer” to C than to A
Such numerical relationships are usually meaningless.
To avoid this, one-hot encoding is commonly used. A categorical variable is expanded into multiple binary variables:
```
C → [0, 0, 1]
```
With this representation, the model only learns which category is active, without assuming any order or distance between categories.
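PyTorch provides this encoding as `torch.nn.functional.one_hot`; a minimal sketch using the A/B/C example above:

```python
import torch
import torch.nn.functional as F

# A = 0, B = 1, C = 2
labels = torch.tensor([0, 1, 2])

one_hot = F.one_hot(labels, num_classes=3)
print(one_hot)
# the row for C is [0, 0, 1]
```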
3. Gradients, Functions, and Optimization: An Intuitive View
Once data is represented as tensors, each sample can be seen as a point in a high-dimensional space.
Modeling a dataset then becomes the task of fitting a function in that space:
$$f_\theta(X) \approx y$$
where:
- $\theta$ represents the model parameters
- The loss function measures how far predictions are from true values
The gradient is the partial derivative of the loss with respect to the parameters. It describes:
- The direction in which the loss increases
- How sensitive the loss is to small parameter changes
Gradient descent and its variants (such as Adam) use this information to update parameters and gradually reduce the loss.
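A minimal sketch of this idea: fitting a single weight `w` to toy data with plain gradient descent, letting autograd compute the gradient. The data, learning rate, and step count are illustrative:

```python
import torch

# Toy data generated from y = 2x, so the optimal weight is 2.0
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])

w = torch.tensor(0.0, requires_grad=True)

for _ in range(100):
    loss = ((w * x - y) ** 2).mean()  # MSE loss
    loss.backward()                   # autograd computes d(loss)/dw
    with torch.no_grad():
        w -= 0.05 * w.grad            # step against the gradient
        w.grad.zero_()                # clear the gradient for the next iteration

print(w.item())  # converges toward 2.0
```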
4. The Overall PyTorch Training Workflow
PyTorch organizes the modeling process into a clear pipeline:
```
Data (Dataset)
→ Batch loading (DataLoader)
→ Model (nn.Module)
→ Loss function
→ Optimizer
→ Training loop
→ Evaluation loop
```
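Wired together on synthetic data, the pipeline looks roughly like this; the shapes, model, and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

X = torch.randn(100, 9)                                   # Data
y = torch.randn(100, 1)
loader = DataLoader(TensorDataset(X, y),
                    batch_size=16, shuffle=True)          # Batch loading
model = nn.Linear(9, 1)                                   # Model
criterion = nn.MSELoss()                                  # Loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # Optimizer

for xb, yb in loader:                                     # Training loop (one epoch)
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()
    optimizer.step()
```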
5. Common Imports in PyTorch
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
```
Although torch already contains all submodules, these imports are commonly written separately because:
- `torch.nn` contains neural network components
- `DataLoader` is used frequently in training code
This improves readability and reduces repetitive code.
6. Dataset and DataLoader: Getting Data into the Model
6.1 Dataset: Defining a Single Sample
A Dataset defines how to return one training sample given an index.
Example:
```python
import pandas as pd
from torch.utils.data import Dataset

class WaterDataset(Dataset):
    def __init__(self, csv_path):
        super().__init__()
        df = pd.read_csv(csv_path)
        # float32 matches PyTorch's default parameter dtype
        self.data = df.to_numpy(dtype="float32")

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        features = self.data[idx, :-1]  # all columns except the last
        label = self.data[idx, -1]      # the last column is the target
        return features, label
```
Here:
- `Dataset` is a base class provided by PyTorch
- `__len__` defines the dataset size
- `__getitem__` defines how one sample is retrieved
The internal data format depends on the data source (CSV, images, tensors, etc.).
6.2 DataLoader: From Single Samples to Batches
Neural networks are usually trained in batches:
```python
train_loader = DataLoader(dataset, batch_size=64, shuffle=True)
```
Reasons include:
- Training with one sample at a time is inefficient
- Fixed sample order can lead to poor generalization
With a DataLoader, the training loop only needs to handle:
```python
for features, labels in train_loader:
    ...
```
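A quick check of what the loop actually receives, using a synthetic in-memory dataset (`TensorDataset` stands in for a custom `Dataset`):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.randn(200, 9)          # 200 samples, 9 features each
labels = torch.randint(0, 2, (200,))    # binary labels

loader = DataLoader(TensorDataset(features, labels),
                    batch_size=64, shuffle=True)

xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # first batch: [64, 9] features and [64] labels
```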
7. Models (nn.Module): The Core of PyTorch
In PyTorch, a model is not a simple function. It is an object that:
- Contains learnable parameters
- Supports automatic differentiation
- Can be saved and loaded
Models are therefore defined as classes that inherit from nn.Module.
7.1 Defining a Model with forward
Below is a simple multilayer perceptron (MLP) with three fully connected layers.
The forward method explicitly defines how data flows through the network.
```python
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```
All learnable parameters are automatically registered by layers such as nn.Linear.
7.2 Simplifying Models with nn.Sequential
When a network is a simple linear pipeline, nn.Sequential can be used to reduce boilerplate code.
```python
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(9, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 1)
        )

    def forward(self, x):
        return self.net(x)
```
Conceptually:
Input → linear transform → activation → transform → output
nn.Sequential only simplifies layer organization; the forward method still defines the data flow.
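A quick shape check of this pipeline (rebuilt inline here so the snippet is self-contained):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(9, 16), nn.ReLU(),
    nn.Linear(16, 8), nn.ReLU(),
    nn.Linear(8, 1),
)

x = torch.randn(4, 9)  # a batch of 4 samples with 9 features
print(net(x).shape)    # torch.Size([4, 1])
```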
8. Loss Functions: What the Model Optimizes
A loss function measures how different predictions are from true values.
For multi-class classification, cross-entropy loss is commonly used:
```python
criterion = nn.CrossEntropyLoss()
loss = criterion(outputs, labels)
```
Intuitively:
- More uncertain or incorrect predictions lead to larger loss
- Better predictions lead to smaller loss
Other common losses include:
- `MSELoss` for regression
- `BCELoss` / `BCEWithLogitsLoss` for binary classification
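The intuition above can be verified directly: a prediction that is confidently correct yields a small cross-entropy, while a uniform (maximally uncertain) prediction over three classes yields log(3). The logit values are illustrative:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
labels = torch.tensor([2])  # the true class is index 2

confident = torch.tensor([[0.1, 0.2, 5.0]])  # logits strongly favoring class 2
uniform = torch.tensor([[1.0, 1.0, 1.0]])    # equal logits: maximal uncertainty

print(criterion(confident, labels).item())  # small loss
print(criterion(uniform, labels).item())    # log(3) ≈ 1.099
```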
9. Optimizers: Updating Parameters
During forward propagation, a model only produces predictions. Parameter updates are handled by the optimizer.
```python
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```
Here:
- `model.parameters()` specifies which parameters to update
- `lr` controls the update step size
Adam (Adaptive Moment Estimation) is a first-order stochastic gradient-based optimizer. It adapts learning rates for individual parameters and is often stable in practice.
Other commonly used optimizers include:
- `torch.optim.SGD`: the standard form of stochastic gradient descent, usually combined with momentum
- `torch.optim.AdamW`: an improved version of Adam, widely used in modern Transformer-based models
- `torch.optim.RMSprop`: an optimizer that was commonly used in early RNN and sequence learning tasks
- `torch.optim.Adagrad`: often effective when features are sparse
The optimizer defines how gradient information is used to update parameters.
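A single update makes this concrete. Plain SGD is used here because its rule is easy to follow by hand; the starting weight and learning rate are illustrative:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()  # gradient is 2w = 2.0
loss.backward()
optimizer.step()       # SGD update: w ← w - lr * grad = 1.0 - 0.1 * 2.0

print(w.item())  # ≈ 0.8
```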
10. Training Loop: Where Learning Happens
The training loop repeatedly performs:
- Prediction
- Loss computation
- Parameter updates
```python
for epoch in range(num_epochs):
    for data in dataloader:
        optimizer.zero_grad()
        features, labels = data
        predictions = model(features)
        loss = criterion(predictions, labels)
        loss.backward()
        optimizer.step()
```
Key concepts here are:
- Backpropagation: computing gradients using the chain rule
- Gradient descent and its variants: updating parameters in the direction that reduces loss
11. Evaluation Loop: Assessing the Model
After training, the model is evaluated using metrics similar to traditional machine learning:
- Accuracy
- Precision / recall
- Macro / micro / weighted averages
During evaluation, the model is set to evaluation mode:
```python
model.eval()
with torch.no_grad():
    ...
```
Here `model.eval()` switches layers such as dropout and batch normalization to their inference behavior, while `torch.no_grad()` disables gradient tracking to save memory and computation.
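The effect of `model.eval()` is easy to observe with a dropout layer: in training mode two forward passes differ randomly, while in eval mode dropout is a no-op and the output is deterministic:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.randn(1, 4)

model.eval()  # dropout becomes the identity
with torch.no_grad():
    a = model(x)
    b = model(x)

print(torch.equal(a, b))  # True: deterministic in eval mode
```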
Appendix: A Minimal, Runnable Example (ML vs DL)
To illustrate when deep learning is more suitable than traditional machine learning, the appendix compares:
- Logistic Regression on raw pixels
- A small CNN trained directly on images
The example uses the MNIST handwritten digit dataset and can be run directly in a local Jupyter Notebook.
```
%pip -q install torch torchvision scikit-learn numpy
```
```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

transform = transforms.ToTensor()
train_ds = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_ds = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

# Subsample for speed; the seed is fixed for reproducibility
rng = np.random.RandomState(42)
train_idx = rng.choice(len(train_ds), size=12000, replace=False)
test_idx = rng.choice(len(test_ds), size=2000, replace=False)

train_sub = Subset(train_ds, train_idx)
test_sub = Subset(test_ds, test_idx)
```
Machine Learning: Logistic Regression
```python
def to_numpy_flat(dataset_subset):
    X_list, y_list = [], []
    for x, y in dataset_subset:
        X_list.append(x.view(-1).numpy())  # flatten 28*28 = 784 pixels
        y_list.append(y)
    return np.stack(X_list), np.array(y_list)

X_train, y_train = to_numpy_flat(train_sub)
X_test, y_test = to_numpy_flat(test_sub)

# multi-class handling is inferred automatically in recent scikit-learn versions
logreg = LogisticRegression(max_iter=200, solver="lbfgs")
logreg.fit(X_train, y_train)

pred = logreg.predict(X_test)
acc_logreg = accuracy_score(y_test, pred)
print(f"Logistic Regression accuracy: {acc_logreg:.3f}")
```
Deep Learning: CNN
```python
train_loader = DataLoader(train_sub, batch_size=64, shuffle=True)
test_loader = DataLoader(test_sub, batch_size=256, shuffle=False)

class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten()
        )
        self.classifier = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)

model = SmallCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```
```python
for epoch in range(3):
    model.train()
    running_loss = 0.0
    n = 0
    for xb, yb in train_loader:
        optimizer.zero_grad()
        logits = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * yb.size(0)
        n += yb.size(0)
    print(f"Epoch {epoch+1} - Loss: {running_loss/n:.4f}")
```
Evaluation: Accuracy
```python
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for xb, yb in test_loader:
        preds = model(xb).argmax(dim=1)
        correct += (preds == yb).sum().item()
        total += yb.size(0)

acc_cnn = correct / total
print(f"CNN accuracy: {acc_cnn:.3f}")
print(f"Performance gap (CNN - LogReg): {acc_cnn - acc_logreg:.3f}")
```