Pytorch learning notes

Mo Shan
Apr 23
about

  • Geek Time PyTorch course learning notes

basics

numpy

  • reshape first unrolls all elements according to the memory layout, then rearranges them into the final shape

  • to sum along axis 0, sum the values vertically (down each column); to sum along axis 1, sum the values horizontally (across each row)

import numpy as np

array1 = np.array(
    [[1, 2],
     [3, 4],
     [5, 6]])

total_0_axis = np.sum(array1, axis=0)
print(f'Sum of elements at 0-axis is {total_0_axis}')

total_1_axis = np.sum(array1, axis=1)
print(f'Sum of elements at 1-axis is {total_1_axis}')
Output:

Sum of elements at 0-axis is [ 9 12]
Sum of elements at 1-axis is [ 3  7 11]
  • use np.newaxis to add an axis before doing concatenation, as in the sketch below
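
For example, a minimal sketch with two made-up 1-D arrays:

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# np.newaxis turns each 1-D array into a 1 x 3 row,
# so the two rows can be concatenated along the new axis 0
stacked = np.concatenate([a[np.newaxis, :], b[np.newaxis, :]], axis=0)
print(stacked.shape)  # (2, 3)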

  • a shallow copy is a view of the original array: it does not share the shape but does share the values. Changing the shape of the view leaves the original's shape unchanged, but changing a value in the view also changes the original

  • use argmax or argsort to find the index of the maximum probability

tensor

  • use permute to reorder the dimensions of a tensor

    • x.permute(a, b, c) moves axis a to position 0, axis b to position 1, and axis c to position 2

import torch

x = torch.rand(2, 3, 4)
x = x.permute(2, 1, 0)
x.shape
torch.Size([4, 3, 2])
  • unlike permute, transpose can only swap two axes

  • after using transpose or permute, the underlying storage is no longer contiguous, so we cannot use view to change the shape; use reshape instead (or call contiguous() first), as sketched below
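
A minimal sketch of the difference:

import torch

x = torch.rand(2, 3)
y = x.t()  # transpose: shape (3, 2), storage no longer contiguous
# y.view(6) would raise a RuntimeError here
z = y.reshape(6)            # reshape works (copies if necessary)
w = y.contiguous().view(6)  # or make it contiguous first, then view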

  • use unsqueeze to insert a dimension of size 1

x = torch.rand(2, 1, 3)
y = x.unsqueeze(2)
y.shape
torch.Size([2, 1, 1, 3])
  • cat concatenates tensors; dim=0 means concatenating along the rows (vertically)

torch.cat(tensors, dim=0, out=None)
  • stack creates a new dimension when concatenating tensors; contrast with cat in the sketch below

torch.stack(inputs, dim=0)
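
A minimal sketch contrasting cat and stack:

import torch

a = torch.zeros(2, 3)
b = torch.ones(2, 3)

torch.cat([a, b], dim=0).shape    # torch.Size([4, 3]): rows appended
torch.stack([a, b], dim=0).shape  # torch.Size([2, 2, 3]): new dimension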
  • chunk splits the tensor evenly; chunks is the number of pieces and must be an int, e.g. splitting a 1-D tensor of size 10 into 2 chunks gives a tuple of two size-5 tensors

    • if the size is not divisible, e.g. input size 17 with 4 chunks, then 17/4 = 4.25 is rounded up with ceil to 5, so the chunk sizes are 5, 5, 5, 2

torch.chunk(input, chunks, dim=0)
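
For example, the size-17 case:

import torch

x = torch.arange(17)
# ceil(17 / 4) = 5, so the chunk sizes are 5, 5, 5, 2
[t.numel() for t in torch.chunk(x, chunks=4)]  # [5, 5, 5, 2]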
  • if instead of the number of divisions we want to specify the size of each division, use split

torch.split(tensor, split_size_or_sections, dim=0)
  • unbind is roughly equivalent to chunk with chunks equal to the dimension size, or to split with split_size_or_sections = 1, except that unbind also removes the dimension it slices along

torch.unbind(input, dim=0)
  • we can use index_select or masked_select to select elements; masked_select keeps only the elements where the mask is True, as in the sketch below
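
A minimal sketch of both:

import torch

x = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])

# index_select picks whole slices along a dimension
torch.index_select(x, dim=0, index=torch.tensor([1]))  # tensor([[4, 5, 6]])

# masked_select keeps only elements where the mask is True,
# and always returns a flattened 1-D tensor
torch.masked_select(x, x > 3)  # tensor([4, 5, 6])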

data handling

  • we can subclass Dataset to load a dataset, i.e. implement __init__(), __len__(), and __getitem__()

import torch 
from torch.utils.data import Dataset 

class MyDataset(Dataset): 
    def __init__(self, data_tensor, target_tensor): 
        self.data_tensor = data_tensor 
        self.target_tensor = target_tensor 
    def __len__(self): 
        return self.data_tensor.size(0)
    def __getitem__(self, index): 
        return self.data_tensor[index], self.target_tensor[index]
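
A minimal usage sketch with made-up tensors; this my_dataset is what the DataLoader example below consumes:

data_tensor = torch.randn(10, 3)            # 10 samples, 3 features each
target_tensor = torch.randint(0, 2, (10,))  # 10 made-up binary labels
my_dataset = MyDataset(data_tensor, target_tensor)
print(len(my_dataset))  # 10
print(my_dataset[0])    # (first sample, first label)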
  • DataLoader is an iterable that takes a Dataset instance as input, generates training samples according to batch_size, and can load data with multiple worker processes (num_workers)

from torch.utils.data import DataLoader
tensor_dataloader = DataLoader(dataset=my_dataset, 
                        batch_size=2,
                        shuffle=True,
                        num_workers=0)
  • torchvision provides dataset loading, image preprocessing, and some common network architectures; e.g. to load MNIST, use

import torchvision

mnist_dataset = torchvision.datasets.MNIST(
            root='./data',
            train=True,
            transform=None,
            target_transform=None,
            download=True
    )
  • train=True loads the training set, otherwise the test set; transform preprocesses the images, and target_transform preprocesses the labels

    • the return type is torchvision.datasets.mnist.MNIST, a subclass of Dataset, so it provides __len__, __getitem__, etc.

  • use transforms.ToTensor() to convert a PIL.Image or numpy.ndarray to a Tensor, and transforms.ToPILImage(mode=None) to convert a tensor back to a PIL.Image

  • in transforms, standardization means output = (input - mean) / std

    • the purpose is to give all images a similar distribution, so that training is more likely to converge; see the sketch below
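
A minimal preprocessing sketch; the mean/std values here are the commonly used ImageNet statistics, shown only as an example:

from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToTensor(),  # PIL.Image -> float tensor in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])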

  • use a pretrained model for fine-tuning

import torch
import torchvision.models as models

# load GoogLeNet with ImageNet-pretrained weights
googlenet = models.googlenet(pretrained=True)

# replace the final fully connected layer with a new 10-class head
fc_in_features = googlenet.fc.in_features
googlenet.fc = torch.nn.Linear(fc_in_features, 10)

  • each epoch trains on the entire dataset; each step trains on one mini-batch

convolution

  • use Conv2d; with padding='same', the output size equals the input size (only supported when stride is 1)

class torch.nn.Conv2d(
        in_channels,
        out_channels,
        kernel_size,
        stride=1,
        padding=0,
        dilation=1,
        groups=1,
        bias=True,
        padding_mode='zeros',
        device=None,
        dtype=None
    )
  • depthwise separable convolution consists of a depthwise (DW) conv followed by a pointwise (PW) conv

    • the input size is m x h x w and the output size is n x h' x w'

    • DW has m kernels, each of size 3x3, one per input channel

    • PW has n 1x1 convs, each with m channels, applied after DW

    • this is used to reduce the number of parameters in the model

  • in Conv2d, if groups != 1, the input channels are divided into groups; when groups = in_channels it is a DW conv, whose input and output channel counts both equal the input channel count of the data

  • for PW, the kernel size is 1, the input channel count equals the output channel count of DW (= the input channel count of the original data), and the output channel count equals the desired n, as sketched below
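
A sketch of a depthwise separable block with made-up channel counts m = 16, n = 32:

import torch.nn as nn

m, n = 16, 32

# depthwise: groups = in_channels, one 3x3 kernel per input channel
depthwise = nn.Conv2d(m, m, kernel_size=3, padding=1, groups=m)
# pointwise: n 1x1 convolutions, each spanning all m channels
pointwise = nn.Conv2d(m, n, kernel_size=1)

depthwise_separable = nn.Sequential(depthwise, pointwise)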

  • dilation is used for dense pixel-wise outputs such as segmentation; the kernel is expanded by inserting zeros between its elements, which enlarges the receptive field without adding parameters

visualization

tensorboard

  • to use TensorBoard, we need a SummaryWriter

torch.utils.tensorboard.writer.SummaryWriter(log_dir=None)
  • to record a scalar, use add_scalar

    • tag is the name of the data

    • scalar_value is a float holding the value

    • global_step is the training step count

    • walltime is the timestamp

add_scalar(tag, scalar_value, global_step=None, walltime=None)
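
A minimal sketch (the log directory and loss values are made up):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='./runs/example')
for step in range(100):
    loss = 1.0 / (step + 1)  # dummy value for illustration
    writer.add_scalar('train_loss', loss, global_step=step)
writer.close()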
  • use add_image to record image

    • img_tensor is a Tensor or numpy array

    • dataformats is the layout, e.g. CHW means channel x height x width

add_image(tag, img_tensor, global_step=None, walltime=None, dataformats='CHW')

visdom

  • example

from visdom import Visdom 
import numpy as np 
import time 

viz = Visdom()
viz.line([0.], [0], win='train_loss', opts=dict(title='train_loss'))

for n_iter in range(10): 
    loss = 0.2 * np.random.randn() + 1 
    viz.line([loss], [n_iter], win='train_loss', update='append')
    time.sleep(0.5)

img = np.zeros((3, 100, 100))
img[0] = np.arange(0, 10000).reshape(100, 100) / 10000 
img[1] = 1 - np.arange(0, 10000).reshape(100, 100) / 10000
viz.image(img)

distributed training

  • get gpu

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
  • to use multiple GPUs on one machine, we need DataParallel (DP)

    • device_ids are the GPUs used for training; output_device is the one that holds the output, GPU 0 by default

    • the losses are computed in parallel, but they are all gathered and summed on the output device, so that GPU carries more workload

torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)
  • in the forward pass, the data is split across the GPUs, but the model is copied to each GPU, as in the sketch below
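
A minimal sketch with a toy model:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # toy model for illustration
if torch.cuda.device_count() > 1:
    # each batch is split across the GPUs; outputs gathered on GPU 0
    model = nn.DataParallel(model, device_ids=[0, 1], output_device=0)
model = model.to(device)  # device from the snippet above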

  • to use multiple GPUs on multiple machines, we need DistributedDataParallel (DDP)

    • DP uses a single process to control multiple GPUs; in backprop the default GPU updates the parameters, so its workload is higher

    • DDP uses one process per GPU; it uses DistributedSampler to load the data so that the shards do not overlap, and in backprop each GPU updates its own copy of the parameters

    • DDP can also be used on one machine with multiple GPUs

  • DDP concepts

    • group is a process group; by default there is a single group

    • world_size is the total number of processes

    • rank identifies each process; the master process has rank 0

  • DDP step 1: call init_process_group

    • backend is nccl for GPU training

    • init_method is env://, i.e. initialized from environment variables

    • world_size is the total number of processes in the job

    • rank is the rank of the current process

    • group_name is the name of the group

torch.distributed.init_process_group(backend="nccl", init_method=None, world_size=-1, rank=-1, group_name='')
  • DDP step 2: wrap the model so it is replicated across the GPUs

torch.nn.parallel.DistributedDataParallel(module, device_ids=None, output_device=None)
net = torch.nn.parallel.DistributedDataParallel(net)
  • DDP step 3: use DistributedSampler to split the data across the GPUs

    • in DDP we do not scatter the data from a main GPU; instead, each process loads its own shard

    • when a sampler is passed, do not also set shuffle=True in the DataLoader

train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
data_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler)
  • tutorial https://pyimagesearch.com/2021/10/18/introduction-to-distributed-training-in-pytorch/

image classification

VGG

  • uses 3x3 kernels to replace the 11x11, 7x7, and 5x5 kernels

  • one 5x5 kernel is replaced by two stacked 3x3 layers, reducing the parameter count: 25 -> 2 x 9 = 18

  • the deeper model, with more nonlinearities, can extract richer features

GoogLeNet

  • objects have different sizes in different images, so we want multi-scale kernels

  • uses 1x1, 3x3, and 5x5 kernels in the inception module

  • uses 1x1 kernels to reduce the parameter count

ResNet

  • as the model gets deeper, it tends to overfit, and gradients vanish or explode

  • a 56-layer model performs worse than a 20-layer one, showing that a plain network cannot easily learn the identity mapping f(x) = x

  • use shortcut connections to preserve the identity mapping, as in the sketch below
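
A minimal sketch of a residual block with an identity shortcut:

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        # the shortcut adds x back, so learning f(x) = x only requires
        # driving the conv weights toward zero
        return self.relu(out + x)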

EfficientNet

  • jointly scales model depth, width, and input resolution to achieve the best accuracy while reducing FLOPS

image segmentation

  • unlike classification, the feature maps in segmentation have a size similar to the original image

  • transpose convolution is used to decode the features, so that the output is close to the original image size

    • pad the input feature map with zeros

    • flip the kernel vertically and horizontally

    • convolve with stride = 1, padding = 0

class torch.nn.ConvTranspose2d(in_channels,
                               out_channels,
                               kernel_size,
                               stride=1,
                               padding=0,
                               output_padding=0,
                               groups=1,
                               bias=True,
                               dilation=1)
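
A minimal upsampling sketch with made-up sizes:

import torch
import torch.nn as nn

x = torch.rand(1, 8, 16, 16)
# kernel_size=2, stride=2 doubles the spatial resolution
up = nn.ConvTranspose2d(8, 4, kernel_size=2, stride=2)
up(x).shape  # torch.Size([1, 4, 32, 32])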
  • Dice loss measures overlap: Dice = 2|P ∩ G| / (|P| + |G|), and the loss is 1 - Dice

    • when the model outputs probabilities instead of 0 or 1, it is the soft Dice loss: |P ∩ G| is approximated by the dot product of the predicted probabilities and the ground truth, as in the sketch below
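
A minimal sketch of soft Dice loss, assuming probs and target are (N, H, W) tensors:

import torch

def soft_dice_loss(probs, target, eps=1e-6):
    # |P intersect G| approximated by the elementwise product
    intersection = (probs * target).sum(dim=(1, 2))
    union = probs.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    dice = (2 * intersection + eps) / (union + eps)
    return 1 - dice.mean()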

  • evaluation uses mIoU, where m stands for the mean over all classes

NLP

  • to find keywords (see the TF-IDF sketch after this list)

    • use term frequency-inverse document frequency (TF-IDF): the more often a word appears in a document, the more important it is, but the more often it appears across the whole corpus, the less important it is

    • TextRank is similar to PageRank: if a webpage is linked by many other pages, it is more important, and if a page is important, the pages it links to also become more important

    • Latent Dirichlet Allocation (LDA) finds keywords based on topics
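
A minimal TF-IDF sketch, assuming each document is a list of tokens:

import math

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)           # term frequency in this doc
    df = sum(1 for d in corpus if term in d)  # docs containing the term
    idf = math.log(len(corpus) / (1 + df))    # +1 avoids division by zero
    return tf * idf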

  • in an n-gram model, each word depends on the previous n-1 words, e.g. 2-gram means w2 depends on w1

  • use a vector to represent a word; the distance between vectors represents word similarity

  • attention: the inputs are query, key, and value, and the output is the attention value

    • in seq2seq attention, the query is the decoder output state from the previous step, z_{t-1}, and the keys and values are the encoder hidden states

    • the output depends on the similarity between Q and K

    • Attention(Q, K, V) = softmax(sim(Q, K))V
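
A minimal sketch with sim(Q, K) chosen as the scaled dot product:

import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q, k, v: (seq_len, d_k); scores scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ v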
