An Introduction to the PyTorch Training Workflow

Summary

Statistical modeling, machine learning, and deep learning all aim to solve the same core problem: fitting an effective function (i.e., a model) to data in order to predict unseen outcomes.

PyTorch breaks this process into clear, controllable, and composable components, providing a unified framework for model definition, training, and evaluation.

1. Where PyTorch Fits in the Modeling Landscape

Before discussing PyTorch, it is helpful to place it within the broader context of machine learning and deep learning.

1.1 Traditional Machine Learning

In traditional machine learning, a problem is usually described as:

  • A two-dimensional dataset, where each row represents one sample
  • Multiple columns as features $X$
  • One column as the target variable $y$

Here:

  • $y$ can be a categorical variable (binary or multi-class classification)
  • or a continuous value (regression)

Such problems are commonly solved using scikit-learn. A typical workflow looks like:

model.fit(X_train, y_train)
pred = model.predict(X_test)

These models share several common properties:

  • Conceptually, they are estimators
  • They have clear evaluation metrics (accuracy, recall, precision, RMSE, etc.)
  • Their performance can be improved through hyperparameter tuning

From a modeling perspective, a model is essentially a function approximator: given input $X$, it outputs a prediction of $y$.
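As a concrete toy instance of this workflow, the sketch below fits a logistic regression on synthetic, linearly separable data (the data and all names here are illustrative, not from any real dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data: 100 samples, 2 features
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # label depends linearly on the features

model = LogisticRegression()
model.fit(X, y)                 # learn a function approximator f(X) ≈ y
pred = model.predict(X)
print((pred == y).mean())       # training accuracy, close to 1.0 here
```

Because the decision boundary is linear by construction, the model recovers it almost perfectly; real problems are, of course, messier.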

1.2 Deep Learning and the Role of Tensors

Deep learning does not change the core goal of prediction, but it significantly expands the types of data that can be modeled.

Unlike traditional machine learning, which mainly works with tabular data, deep learning often deals with:

  • Images (pixel grids with channel information)
  • Text (sequential data)
  • Audio and time series (high-dimensional structured data)

These data types are not naturally represented as simple tables. This leads to a key concept in deep learning: the tensor.

A tensor can be understood as:

  • A multi-dimensional array
  • With built-in support for automatic differentiation
  • Efficiently computed on GPUs

In PyTorch, all data, model parameters, model outputs, and loss values are represented as tensors.

Common tensor shapes include:

  • One-dimensional (labels)

    y.shape == (B,)

  • Two-dimensional (tabular features)

    x.shape == (B, num_features)

  • Four-dimensional (images)

    x_image.shape == (B, 1, 64, 64)
    

where:

  • B is the batch size
  • 1 is the number of channels (grayscale image)
  • 64 × 64 is the image resolution
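These shapes can be checked directly. A minimal sketch (B = 32 and 9 features are arbitrary choices for illustration):

```python
import torch

B = 32  # batch size

y = torch.zeros(B)                    # one-dimensional: labels
x = torch.zeros(B, 9)                 # two-dimensional: tabular features
x_image = torch.zeros(B, 1, 64, 64)   # four-dimensional: grayscale images

print(y.shape)        # torch.Size([32])
print(x.shape)        # torch.Size([32, 9])
print(x_image.shape)  # torch.Size([32, 1, 64, 64])
```

Note that `tensor.shape` returns a `torch.Size`, which is a tuple subclass, so it compares equal to a plain tuple of the same dimensions.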

2. One-hot Encoding and Categorical Variables

In many deep learning tasks, inputs include categorical features such as type, status, or class.

If categories are encoded as integers:

A = 0, B = 1, C = 2

the model may incorrectly assume that:

  • C is “larger” than A
  • B is “closer” to C than to A

Such numerical relationships are usually meaningless.

To avoid this, one-hot encoding is commonly used. A categorical variable is expanded into multiple binary variables:

C → [0, 0, 1]

With this representation, the model only learns which category is active, without assuming any order or distance between categories.
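PyTorch provides this transformation out of the box via `torch.nn.functional.one_hot`; a small sketch using the A/B/C encoding above:

```python
import torch
import torch.nn.functional as F

# Integer-encoded categories: A = 0, B = 1, C = 2
labels = torch.tensor([0, 1, 2])

one_hot = F.one_hot(labels, num_classes=3)
print(one_hot)
# tensor([[1, 0, 0],
#         [0, 1, 0],
#         [0, 0, 1]])
```

Each category becomes an independent binary column, so no spurious ordering or distance is implied.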

3. Gradients, Functions, and Optimization: An Intuitive View

Once data is represented as tensors, each sample can be seen as a point in a high-dimensional space.

Modeling a dataset then becomes the task of fitting a function in that space:

$$f_\theta(X) \approx y$$

where:

  • $\theta$ represents the model parameters
  • The loss function measures how far predictions are from true values

The gradient is the vector of partial derivatives of the loss with respect to the parameters. It describes:

  • The direction in which the loss increases fastest
  • How sensitive the loss is to small parameter changes

Gradient descent and its variants (such as Adam) use this information to update parameters and gradually reduce the loss.
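This intuition can be made concrete with PyTorch's autograd on a single scalar parameter (the loss $(\theta - 1)^2$ and the step size 0.1 are arbitrary illustrative choices):

```python
import torch

# A single parameter theta, tracked by autograd
theta = torch.tensor(3.0, requires_grad=True)

loss = (theta - 1.0) ** 2   # loss is minimized at theta = 1
loss.backward()             # computes d(loss)/d(theta) = 2 * (theta - 1)

print(theta.grad)           # tensor(4.)

# One gradient-descent step: move against the gradient
with torch.no_grad():
    theta -= 0.1 * theta.grad
print(theta)                # now 2.6, closer to the minimum at 1.0
```

Repeating this step drives the loss down; optimizers like Adam automate and refine exactly this update.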

4. The Overall PyTorch Training Workflow

PyTorch organizes the modeling process into a clear pipeline:

Data (Dataset)
→ Batch loading (DataLoader)
→ Model (nn.Module)
→ Loss function
→ Optimizer
→ Training loop
→ Evaluation loop

5. Common Imports in PyTorch

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

Although these submodules are all part of the torch package, the imports are commonly written out separately because:

  • torch.nn contains neural network components
  • DataLoader is used frequently in training code

This improves readability and reduces repetitive code.

6. Dataset and DataLoader: Getting Data into the Model

6.1 Dataset: Defining a Single Sample

A Dataset defines how to return one training sample given an index.

Example:

import pandas as pd
from torch.utils.data import Dataset

class WaterDataset(Dataset):
    def __init__(self, csv_path):
        super().__init__()
        df = pd.read_csv(csv_path)
        self.data = df.to_numpy()

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        features = self.data[idx, :-1]  # all columns except the last
        label = self.data[idx, -1]      # last column is the target
        return features, label

Here:

  • Dataset is a base class provided by PyTorch
  • __len__ defines the dataset size
  • __getitem__ defines how one sample is retrieved

The internal data format depends on the data source (CSV, images, tensors, etc.).
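To see the class in action, one can feed it a tiny synthetic CSV (the file name `toy.csv`, the column names, and the random data are hypothetical; the class is repeated here so the sketch is self-contained):

```python
import numpy as np
import pandas as pd
from torch.utils.data import Dataset

class WaterDataset(Dataset):
    def __init__(self, csv_path):
        super().__init__()
        self.data = pd.read_csv(csv_path).to_numpy()

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        return self.data[idx, :-1], self.data[idx, -1]

# Write a tiny synthetic CSV: 5 rows, 3 features plus a target column
df = pd.DataFrame(np.random.rand(5, 4), columns=["f1", "f2", "f3", "target"])
df.to_csv("toy.csv", index=False)

ds = WaterDataset("toy.csv")
print(len(ds))             # 5 samples
features, label = ds[0]
print(features.shape)      # (3,) -- one sample's feature vector
```

Indexing the dataset returns exactly one `(features, label)` pair, which is all the DataLoader needs.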

6.2 DataLoader: From Single Samples to Batches

Neural networks are usually trained in batches:

train_loader = DataLoader(dataset, batch_size=64, shuffle=True)

Reasons include:

  • Training with one sample at a time is inefficient
  • Fixed sample order can lead to poor generalization

With a DataLoader, the training loop only needs to handle:

for features, labels in train_loader:
    ...
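The batching behavior is easy to inspect with a synthetic `TensorDataset` (the sizes 100, 9, and 64 are arbitrary illustrative choices):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic dataset: 100 samples with 9 features each
X = torch.randn(100, 9)
y = torch.randint(0, 2, (100,))
dataset = TensorDataset(X, y)

loader = DataLoader(dataset, batch_size=64, shuffle=True)

batch_sizes = []
for features, labels in loader:
    batch_sizes.append(features.shape[0])
print(batch_sizes)  # [64, 36]: one full batch, then a final partial batch
```

Note that `shuffle=True` shuffles which samples land in each batch every epoch, and the last batch is smaller when the dataset size is not a multiple of the batch size.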

7. Models (nn.Module): The Core of PyTorch

In PyTorch, a model is not a simple function. It is an object that:

  • Contains learnable parameters
  • Supports automatic differentiation
  • Can be saved and loaded

Models are therefore defined as classes that inherit from nn.Module.

7.1 Defining a Model with forward

Below is a simple multilayer perceptron (MLP) with three fully connected layers.

The forward method explicitly defines how data flows through the network.

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

All learnable parameters are automatically registered by layers such as nn.Linear.
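A quick sanity check of the shapes and parameter registration (the class is repeated so the sketch runs on its own; the batch size 32 is arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

model = Net()
x = torch.randn(32, 9)        # a batch of 32 samples, 9 features each
out = model(x)
print(out.shape)              # torch.Size([32, 1])

# Parameters registered by the three Linear layers:
# (9*16 + 16) + (16*8 + 8) + (8*1 + 1) = 305
n_params = sum(p.numel() for p in model.parameters())
print(n_params)               # 305
```

Calling `model(x)` invokes `forward` under the hood, and `model.parameters()` already yields every weight and bias without any manual bookkeeping.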

7.2 Simplifying Models with nn.Sequential

When a network is a simple linear pipeline, nn.Sequential can be used to reduce boilerplate code.

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(9, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 1)
        )

    def forward(self, x):
        return self.net(x)

Conceptually:

Input → linear transform → activation → transform → output

nn.Sequential only simplifies layer organization; the forward method still defines the data flow.

8. Loss Functions: What the Model Optimizes

A loss function measures how different predictions are from true values.

For multi-class classification, cross-entropy loss is commonly used:

criterion = nn.CrossEntropyLoss()
loss = criterion(outputs, labels)

Intuitively:

  • More uncertain or incorrect predictions lead to larger loss
  • Better predictions lead to smaller loss

Other common losses include:

  • MSELoss for regression
  • BCELoss / BCEWithLogitsLoss for binary classification
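The intuition "worse predictions, larger loss" can be verified directly with cross-entropy on hand-picked logits (the three-class setup and the specific logit values are illustrative):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Three classes; the true class is index 2
labels = torch.tensor([2])

confident = torch.tensor([[0.1, 0.2, 5.0]])   # strongly favors class 2
uncertain = torch.tensor([[1.0, 1.0, 1.0]])   # no preference at all

loss_good = criterion(confident, labels)
loss_bad = criterion(uncertain, labels)
print(loss_good.item() < loss_bad.item())      # True
```

For the uniform logits the loss is exactly log(3) ≈ 1.099; the confident prediction drives it close to zero. Note that `CrossEntropyLoss` expects raw logits, not softmax probabilities.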

9. Optimizers: Updating Parameters

During forward propagation, a model only produces predictions. Parameter updates are handled by the optimizer.

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Here:

  • model.parameters() specifies which parameters to update
  • lr controls the update step size

Adam (Adaptive Moment Estimation) is a first-order stochastic gradient-based optimizer. It adapts learning rates for individual parameters and is often stable in practice.

Other commonly used optimizers include:

  • torch.optim.SGD: the standard form of stochastic gradient descent, usually combined with momentum
  • torch.optim.AdamW: an improved version of Adam, widely used in modern Transformer-based models
  • torch.optim.RMSprop: an optimizer that was commonly used in early RNN and sequence learning tasks
  • torch.optim.Adagrad: often effective when features are sparse

The optimizer defines how gradient information is used to update parameters.
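A minimal sketch of this division of labor, using Adam to minimize a simple quadratic over one scalar parameter (the loss, learning rate, and step count are arbitrary illustrative choices):

```python
import torch

# One scalar parameter; the loss (w - 2)^2 is minimized at w = 2
w = torch.tensor(10.0, requires_grad=True)
optimizer = torch.optim.Adam([w], lr=0.1)

for _ in range(500):
    optimizer.zero_grad()       # clear gradients from the previous step
    loss = (w - 2.0) ** 2
    loss.backward()             # autograd fills in w.grad
    optimizer.step()            # Adam uses w.grad to update w

print(w.item())                 # close to 2.0
```

The forward pass and `backward()` only produce gradients; it is `optimizer.step()` that actually changes the parameter.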

10. Training Loop: Where Learning Happens

The training loop repeatedly performs:

  • Prediction
  • Loss computation
  • Parameter updates

for epoch in range(num_epochs):
    for data in dataloader:
        optimizer.zero_grad()
        features, labels = data
        predictions = model(features)
        loss = criterion(predictions, labels)
        loss.backward()
        optimizer.step()

Key concepts here are:

  • Backpropagation: computing gradients using the chain rule
  • Gradient descent and its variants: updating parameters in the direction that reduces loss
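Putting the pieces together, the loop can be tried end-to-end on synthetic data. Everything here — the generated regression data, the small `nn.Sequential` model, and the hyperparameters — is an illustrative choice, not a prescribed setup:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression data: target is the sum of the features plus noise
X = torch.randn(256, 9)
y = X.sum(dim=1, keepdim=True) + 0.1 * torch.randn(256, 1)

dataloader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(9, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

first_loss, last_loss = None, None
for epoch in range(20):
    epoch_loss = 0.0
    for features, labels in dataloader:
        optimizer.zero_grad()
        predictions = model(features)
        loss = criterion(predictions, labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if first_loss is None:
        first_loss = epoch_loss
    last_loss = epoch_loss

print(last_loss < first_loss)   # True: training reduced the loss
```

Each epoch repeats the same predict / compute-loss / backpropagate / update cycle, and the accumulated epoch loss drops as the parameters improve.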

11. Evaluation Loop: Assessing the Model

After training, the model is evaluated using metrics similar to traditional machine learning:

  • Accuracy
  • Precision / recall
  • Macro / micro / weighted averages

During evaluation, the model is set to evaluation mode:

model.eval()
with torch.no_grad():
    ...

Here model.eval() switches layers such as dropout and batch normalization to inference behavior, while torch.no_grad() disables gradient tracking, saving memory and ensuring deterministic evaluation.
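The effect of evaluation mode is easy to demonstrate with a dropout layer (this toy model and its sizes are illustrative choices):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5), nn.Linear(4, 1))
x = torch.randn(8, 4)

model.eval()                    # dropout becomes a no-op in eval mode
with torch.no_grad():
    out1 = model(x)
    out2 = model(x)

print(torch.equal(out1, out2))  # True: eval-mode outputs are deterministic
```

In `model.train()` mode, by contrast, dropout randomly zeroes activations, so repeated calls on the same input would generally differ.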

Appendix: A Minimal, Runnable Example (ML vs DL)

To illustrate when deep learning is more suitable than traditional machine learning, the appendix compares:

  • Logistic Regression on raw pixels
  • A small CNN trained directly on images

The example uses the MNIST handwritten digit dataset and can be run directly in a local Jupyter Notebook.

%pip -q install torch torchvision scikit-learn numpy
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

transform = transforms.ToTensor()
train_ds = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_ds  = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

# Subsample for speed: 12k training and 2k test images
rng = np.random.RandomState(42)
train_idx = rng.choice(len(train_ds), size=12000, replace=False)
test_idx  = rng.choice(len(test_ds), size=2000, replace=False)

train_sub = Subset(train_ds, train_idx)
test_sub  = Subset(test_ds, test_idx)

Machine Learning: Logistic Regression

def to_numpy_flat(dataset_subset):
    X_list, y_list = [], []
    for x, y in dataset_subset:
        X_list.append(x.view(-1).numpy())  # flatten each image: 28*28 = 784
        y_list.append(y)
    return np.stack(X_list), np.array(y_list)

X_train, y_train = to_numpy_flat(train_sub)
X_test, y_test = to_numpy_flat(test_sub)

logreg = LogisticRegression(max_iter=200, solver="lbfgs")
logreg.fit(X_train, y_train)

pred = logreg.predict(X_test)
acc_logreg = accuracy_score(y_test, pred)
print(f"Logistic Regression accuracy: {acc_logreg:.3f}")

Deep Learning: CNN

train_loader = DataLoader(train_sub, batch_size=64, shuffle=True)
test_loader  = DataLoader(test_sub, batch_size=256, shuffle=False)

class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten()
        )
        self.classifier = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)

model = SmallCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(3):
    model.train()
    running_loss = 0.0
    n = 0
    for xb, yb in train_loader:
        optimizer.zero_grad()
        logits = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * yb.size(0)
        n += yb.size(0)
    print(f"Epoch {epoch+1} - Loss: {running_loss/n:.4f}")

Evaluation: Accuracy

model.eval()
correct = 0
total = 0
with torch.no_grad():
    for xb, yb in test_loader:
        preds = model(xb).argmax(dim=1)
        correct += (preds == yb).sum().item()
        total += yb.size(0)

acc_cnn = correct / total
print(f"CNN accuracy: {acc_cnn:.3f}")
print(f"Performance gap (CNN - LogReg): {acc_cnn - acc_logreg:.3f}")