# PyTorch learning notes

**about**

learning notes from the Geek Time PyTorch course

**basics**

**numpy**

for reshape, all elements are first unrolled according to the memory layout, then rearranged into the target shape

to sum along axis 0, add the values vertically (down the columns); to sum along axis 1, add them horizontally (across each row)

```
import numpy as np

array1 = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])
total_0_axis = np.sum(array1, axis=0)
print(f'Sum of elements at 0-axis is {total_0_axis}')
# Sum of elements at 0-axis is [ 9 12]
total_1_axis = np.sum(array1, axis=1)
print(f'Sum of elements at 1-axis is {total_1_axis}')
# Sum of elements at 1-axis is [ 3  7 11]
```

use `newaxis` to create a new axis before doing concatenation

a shallow copy is a view of the original array: it shares the values but not the shape, i.e. changing the shape of the shallow copy will not change the shape of the original, but changing a value through the shallow copy also changes the value in the original
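
a quick sketch of both behaviors:

```
import numpy as np

a = np.array([1, 2, 3])      # shape (3,)
col = a[:, np.newaxis]       # shape (3, 1), ready for column-wise concatenation
print(col.shape)             # (3, 1)

v = a.view()                 # shallow copy (a view of a)
v.shape = (1, 3)             # reshaping the view ...
print(a.shape)               # ... leaves the original shape (3,) untouched
v[0, 0] = 99                 # but writing through the view ...
print(a[0])                  # ... changes the original value: 99
```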

use argmax to find the index of the maximum probability, and argsort to get the indices sorted by value
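
for example, with a toy probability vector:

```
import numpy as np

probs = np.array([0.1, 0.6, 0.3])
print(np.argmax(probs))          # 1, index of the highest probability
print(np.argsort(probs))         # [0 2 1], indices in ascending order of value
print(np.argsort(probs)[::-1])   # [1 2 0], descending: best predictions first
```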

**tensor**

use permute to reorder the dimensions of a tensor

permute(a, b, c) means moving axis a to position 0, axis b to position 1, and axis c to position 2

```
import torch

x = torch.rand(2, 3, 4)
x = x.permute(2, 1, 0)
print(x.shape)
# torch.Size([4, 3, 2])
```

unlike permute, transpose can only swap two axes

after using transpose or permute, the storage is no longer contiguous, so we cannot use view to change its shape; use reshape instead (or call contiguous() first)
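
a short illustration:

```
import torch

x = torch.rand(2, 3)
y = x.transpose(0, 1)        # shape (3, 2), storage no longer contiguous
print(y.is_contiguous())     # False
# y.view(6)                  # would raise a RuntimeError
z = y.reshape(6)             # reshape copies when needed, so this works
z = y.contiguous().view(6)   # or make a contiguous copy first, then view
```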

use unsqueeze to insert a dimension of size 1 at the given position

```
import torch

x = torch.rand(2, 1, 3)
y = x.unsqueeze(2)
print(y.shape)
# torch.Size([2, 1, 1, 3])
```

use cat to concatenate tensors; dim=0 means concatenating along the rows

```
torch.cat(tensors, dim=0, out=None)
```

unlike cat, stack creates a new dimension when concatenating tensors

```
torch.stack(inputs, dim=0)
```
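
comparing the two on toy tensors:

```
import torch

a = torch.zeros(2, 3)
b = torch.ones(2, 3)
print(torch.cat((a, b), dim=0).shape)    # torch.Size([4, 3]), rows concatenated
print(torch.cat((a, b), dim=1).shape)    # torch.Size([2, 6]), columns concatenated
print(torch.stack((a, b), dim=0).shape)  # torch.Size([2, 2, 3]), new leading dimension
```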

chunk splits the tensor evenly; `chunks` is the number of divisions and must be an int. E.g. for a 1-D tensor of size 10 with chunks=2, we get a tuple of two size-5 tensors. If the size is not divisible, e.g. size 17 with chunks=4, then 17/4 = 4.25 is rounded up with ceil to 5, so the final chunk sizes are 5, 5, 5, 2

```
torch.chunk(input, chunks, dim=0)
```
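
verifying the not-divisible case:

```
import torch

x = torch.arange(17)
pieces = torch.chunk(x, chunks=4, dim=0)
print([p.shape[0] for p in pieces])   # [5, 5, 5, 2]: ceil(17/4) = 5 per chunk, remainder last
```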

if instead of the number of divisions we want to specify the size of each division, use split

```
torch.split(tensor, split_size_or_sections, dim=0)
```

unbind returns a tuple of all slices along the given dimension, with that dimension removed; it is like chunk with chunks equal to the dimension size, or split with split_size_or_sections = 1, except each slice is squeezed

```
torch.unbind(input, dim=0)
```
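
for example:

```
import torch

x = torch.arange(10)
print([t.shape[0] for t in torch.split(x, [3, 3, 4])])  # [3, 3, 4], explicit section sizes
print([t.shape[0] for t in torch.split(x, 5)])          # [5, 5], fixed size per section

m = torch.rand(3, 4)
rows = torch.unbind(m, dim=0)    # tuple of 3 tensors, each of shape (4,)
print(len(rows), rows[0].shape)  # 3 torch.Size([4])
```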

we can use `index_select` or `masked_select` to select elements; index_select picks entries by index along a dimension, while masked_select only selects those elements where the mask is true (returning a flattened 1-D tensor)
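
for example:

```
import torch

x = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])
idx = torch.tensor([0, 2])
print(torch.index_select(x, dim=1, index=idx))  # columns 0 and 2: [[1, 3], [4, 6]]

mask = x > 3
print(torch.masked_select(x, mask))             # tensor([4, 5, 6]), flattened 1-D result
```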

**data handling**

we can inherit the Dataset class to load a dataset, i.e. implement `__init__()`, `__len__()`, and `__getitem__()`

```
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data_tensor, target_tensor):
        self.data_tensor = data_tensor
        self.target_tensor = target_tensor

    def __len__(self):
        return self.data_tensor.size(0)

    def __getitem__(self, index):
        return self.data_tensor[index], self.target_tensor[index]
```

`DataLoader` is an iterator that takes a Dataset instance as input and generates batches of training samples according to `batch_size`; it can load data with multiple worker processes (`num_workers`)

```
from torch.utils.data import DataLoader

tensor_dataloader = DataLoader(dataset=my_dataset,
                               batch_size=2,
                               shuffle=True,
                               num_workers=0)
```
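
putting the two together, a minimal end-to-end sketch with toy data (shapes are arbitrary):

```
import torch
from torch.utils.data import DataLoader

# toy data: 10 samples with 3 features each, binary targets
data = torch.randn(10, 3)
targets = torch.randint(0, 2, (10,))
my_dataset = MyDataset(data, targets)  # the Dataset subclass defined above

tensor_dataloader = DataLoader(dataset=my_dataset, batch_size=2, shuffle=True)
for batch_data, batch_target in tensor_dataloader:
    print(batch_data.shape, batch_target.shape)  # torch.Size([2, 3]) torch.Size([2])
```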

`torchvision` provides ways to load common datasets, preprocess images, and use well-known networks; e.g. to load MNIST, use

```
import torchvision

mnist_dataset = torchvision.datasets.MNIST(
    root='./data',
    train=True,
    transform=None,
    target_transform=None,
    download=True
)
```

`train=True` means load the training set, otherwise the test set; transform preprocesses the images, target_transform preprocesses the labels

the return type is `torchvision.datasets.mnist.MNIST`, a subclass of Dataset, so it provides `__len__`, `__getitem__`, etc.

use `transforms.ToTensor()` to convert a PIL.Image or numpy.ndarray to a Tensor, and `transforms.ToPILImage(mode=None)` to convert a Tensor back to a PIL.Image

in transforms, standardization (`transforms.Normalize`) means `output = (input - mean)/std`

the purpose is to make sure all images have a similar distribution, so that training is more likely to converge
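
a typical preprocessing pipeline combining the two (the mean/std here are the commonly quoted MNIST statistics):

```
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToTensor(),                                # PIL.Image -> Tensor in [0, 1]
    transforms.Normalize(mean=(0.1307,), std=(0.3081,)),  # output = (input - mean) / std
])
```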

use a pretrained model for fine-tuning

```
import torch
import torchvision.models as models

googlenet = models.googlenet(pretrained=True)       # load ImageNet-pretrained weights
fc_in_features = googlenet.fc.in_features
googlenet.fc = torch.nn.Linear(fc_in_features, 10)  # replace the head with a 10-class layer
```

each epoch trains on the entire dataset; each step trains on one mini-batch

**convolution**

use `Conv2d`; when `padding='same'`, the output size equals the input size

```
class torch.nn.Conv2d(
    in_channels,
    out_channels,
    kernel_size,
    stride=1,
    padding=0,
    dilation=1,
    groups=1,
    bias=True,
    padding_mode='zeros',
    device=None,
    dtype=None
)
```

depthwise separable convolution consists of a depthwise (DW) conv followed by a pointwise (PW) conv

input size is m x h x w, output size is n x h' x w'

DW has m kernels, each of size 3x3, one per input channel

PW has n 1x1 kernels, each with m channels, applied after DW

this is used to reduce the number of parameters in the model

in Conv2d, if groups != 1, the input channels are divided into groups; when groups = in_channels it is DW, and the input and output channel counts of DW both equal the input channel count of the data

for PW, the kernel size is 1, the input channel count equals the output channel count of DW (= the input channel count of the original data), and the output channel count equals the desired n
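
a sketch with Conv2d and groups (m = 16 and n = 32 are arbitrary example values):

```
import torch
import torch.nn as nn

m, n = 16, 32                                   # input channels m, output channels n
depthwise = nn.Conv2d(m, m, kernel_size=3,
                      padding=1, groups=m)      # m 3x3 kernels, one per input channel
pointwise = nn.Conv2d(m, n, kernel_size=1)      # n 1x1 kernels, each with m channels

x = torch.rand(1, m, 28, 28)
y = pointwise(depthwise(x))
print(y.shape)                                  # torch.Size([1, 32, 28, 28])
```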

dilation enlarges the receptive field by inserting gaps (zeros) between the kernel elements; dilated convolutions are often used for pixel-wise segmentation outputs

**visualization**

**tensorboard**

to use TensorBoard, we need a SummaryWriter

```
torch.utils.tensorboard.writer.SummaryWriter(log_dir=None)
```

to record a scalar, use add_scalar

tag is the name of the data

scalar_value is a float holding the value

global_step is the number of training steps

walltime is the timestamp

```
add_scalar(tag, scalar_value, global_step=None, walltime=None)
```
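
a minimal usage sketch (the log_dir and the fake loss loop are arbitrary):

```
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='./runs/demo')
for step in range(100):
    fake_loss = 1.0 / (step + 1)   # stand-in for a real training loss
    writer.add_scalar('train/loss', fake_loss, global_step=step)
writer.close()
# then inspect with: tensorboard --logdir=./runs
```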

use add_image to record an image

img_tensor is a Tensor or numpy array

dataformats is the layout, e.g. CHW means channel x height x width

```
add_image(tag, img_tensor, global_step=None, walltime=None, dataformats='CHW')
```

**visdom**

example

```
from visdom import Visdom
import numpy as np
import time

viz = Visdom()
viz.line([0.], [0], win='train_loss', opts=dict(title='train_loss'))
for n_iter in range(10):
    loss = 0.2 * np.random.randn() + 1
    viz.line([loss], [n_iter], win='train_loss', update='append')
    time.sleep(0.5)

img = np.zeros((3, 100, 100))
img[0] = np.arange(0, 10000).reshape(100, 100) / 10000
img[1] = 1 - np.arange(0, 10000).reshape(100, 100) / 10000
viz.image(img)
```

**distributed training**

get the GPU device

```
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```

to use multiple GPUs on one machine, we need `DataParallel` (DP)

device_ids are the GPUs used for training; output_device is the one that gathers the output, default is GPU 0

the losses are computed concurrently, but they are all gathered on the output device, so that GPU carries more workload

```
torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)
```

in the forward pass, the data is split across the GPUs, and the model is copied to each GPU

to use multiple GPUs on multiple machines, we need `DistributedDataParallel` (DDP)

DP uses a single process (with multiple threads) to control the GPUs, and in backprop the default GPU updates the parameters, so its workload is higher

DDP uses multiple processes, one per GPU; it uses DistributedSampler to load data so the shards do not overlap, and in backprop each process updates its own copy of the parameters

DDP can also be used on one machine with multiple GPUs

DDP concepts

`group` is a process group, default is 1

`world_size` is the total number of processes

`rank` identifies each process; the master process has rank 0

DDP step 1, call `init_process_group`

backend is `nccl`, used for GPU training

init_method is `env://`, i.e. initialized from environment variables

world_size is the total number of processes

rank is the rank of the current process

group_name is the name of the group

```
torch.distributed.init_process_group(backend="nccl", init_method=None, world_size=-1, rank=-1, group_name='')
```

DDP step 2, wrap the model and send it to the GPUs

```
torch.nn.parallel.DistributedDataParallel(module, device_ids=None, output_device=None)
net = torch.nn.parallel.DistributedDataParallel(net)
```

DDP step 3, use `DistributedSampler` to split the data across the GPUs

in DDP we do not scatter data from a main GPU; instead, each process loads its own shard

when we use a sampler, do not also pass shuffle=True to the DataLoader

```
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
data_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler)
```
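
a minimal single-machine multi-GPU sketch tying the three steps together; it assumes the script is launched with `torchrun`, which sets the RANK / WORLD_SIZE / LOCAL_RANK environment variables that `env://` reads, and reuses `train_dataset` and `batch_size` from above:

```
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# step 1: one process per GPU, initialized from environment variables
dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# step 2: move the model to this process's GPU and wrap it
net = torch.nn.Linear(10, 2).cuda(local_rank)   # stand-in for a real model
net = DistributedDataParallel(net, device_ids=[local_rank])

# step 3: each process loads its own non-overlapping shard of the data
train_sampler = DistributedSampler(train_dataset)
data_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler)
```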

**image classification**

**VGG**

use 3x3 kernels to replace the 11x11, 7x7, and 5x5 kernels

a 5x5 kernel is replaced by 2 layers of 3x3, reducing the parameter count: 25 -> 2x9 = 18

the model is deeper and can extract more nonlinear features

**GoogLeNet**

objects have different sizes in different images, so we want multi-scale kernels

use 1x1, 3x3, and 5x5 kernels in the inception module

use 1x1 kernels to reduce the parameter count

**ResNet**

when the model goes deeper, gradients tend to vanish or explode and accuracy degrades

a 56-layer model performs worse than a 20-layer one even in training error, which shows the deeper model cannot easily learn the identity mapping f(x) = x

use shortcut connections to keep the identity mapping

**EfficientNet**

scales model depth, width, and input resolution jointly to achieve the best accuracy while reducing FLOPS

**image segmentation**

unlike classification, the feature maps in segmentation have a size similar to the original image

transposed convolution is used to decode the features, so that the output is close to the original image size; it works as follows:

pad the input feature map with zeros

flip the kernel vertically and horizontally (a 180-degree rotation)

do an ordinary convolution with stride = 1, padding = 0

```
class torch.nn.ConvTranspose2d(in_channels,
                               out_channels,
                               kernel_size,
                               stride=1,
                               padding=0,
                               output_padding=0,
                               groups=1,
                               bias=True,
                               dilation=1)
```

Dice loss measures overlap between the prediction P and the ground truth G: Dice = 2|P ∩ G| / (|P| + |G|), and the loss is 1 - Dice

when the model outputs probabilities instead of hard 0/1 predictions, it is the soft Dice loss: |P ∩ G| is approximated by the dot product of the probability map and the ground truth
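
a sketch of the soft Dice loss for binary segmentation (the eps term is a common numerical-stability convention):

```
import torch

def soft_dice_loss(probs, target, eps=1e-6):
    # probs:  predicted probabilities, shape (N, H, W)
    # target: ground-truth mask of 0s and 1s, same shape
    intersection = (probs * target).sum(dim=(1, 2))   # approximates |P ∩ G|
    denominator = probs.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    dice = (2 * intersection + eps) / (denominator + eps)
    return 1 - dice.mean()
```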

evaluation uses mIoU, where m means the mean over all classes

**NLP**

to find keywords:

we can use term frequency-inverse document frequency (TF-IDF): the more often a word appears in a document, the more important it is, but the more documents of the corpus it appears in, the less important it is

TextRank is similar to PageRank: if a webpage is linked by many other pages it is more important, and if a page is important, the pages it links to also become more important

Latent Dirichlet Allocation (LDA), which is topic-based

n-gram model: each word depends on the previous words, e.g. a 2-gram means w2 depends on w1

use a vector to represent each word (a word embedding); the distance between vectors represents word similarity

attention: the input is a query, keys, and values; the output is the attention value

the query is the output state of the previous time step, z_{t-1}, and the keys and values are the hidden states

the output depends on the similarity between Q and K

Attention(Q, K, V) = softmax(sim(Q, K)) V
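
a sketch using scaled dot product as the sim function (one common choice; the shapes here are arbitrary example values):

```
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # similarity of Q and K via scaled dot product, then a weighted sum of V
    sim = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(sim, dim=-1) @ v

q = torch.rand(1, 5, 64)   # (batch, query length, dim)
k = torch.rand(1, 7, 64)   # (batch, key length, dim)
v = torch.rand(1, 7, 64)
print(attention(q, k, v).shape)  # torch.Size([1, 5, 64])
```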
