# PyTorch learning notes

**about**

learning notes from the Geek Time PyTorch course

**basics**

**numpy**

for reshape, all elements are first unrolled according to the memory layout, then rearranged into the target shape

to sum along axis 0, add the values vertically (down the columns); to sum along axis 1, add them horizontally (across each row)

```
import numpy as np

array1 = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])
total_0_axis = np.sum(array1, axis=0)
print(f'Sum of elements at 0-axis is {total_0_axis}')
# Sum of elements at 0-axis is [ 9 12]
total_1_axis = np.sum(array1, axis=1)
print(f'Sum of elements at 1-axis is {total_1_axis}')
# Sum of elements at 1-axis is [ 3  7 11]
```

use `newaxis` to create a new axis before doing concatenation

a shallow copy is a view of the original array: it shares the values but not the shape, i.e. changing the shape of the shallow copy will not change the shape of the original, but changing a value through the shallow copy also changes the value in the original
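
a quick sketch of both behaviors:

```
import numpy as np

a = np.array([1, 2, 3])      # shape (3,)
col = a[:, np.newaxis]       # shape (3, 1), ready for column-wise concatenation
print(col.shape)             # (3, 1)

v = a.view()                 # shallow copy (a view of a)
v.shape = (1, 3)             # reshaping the view ...
print(a.shape)               # ... leaves the original shape (3,) untouched
v[0, 0] = 99                 # but writing through the view ...
print(a[0])                  # ... changes the original value: 99
```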

use argmax to find the index of the maximum probability, and argsort to get the indices sorted by value
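
for example, with a toy probability vector:

```
import numpy as np

probs = np.array([0.1, 0.6, 0.3])
print(np.argmax(probs))          # 1, index of the highest probability
print(np.argsort(probs))         # [0 2 1], indices in ascending order of value
print(np.argsort(probs)[::-1])   # [1 2 0], descending: best predictions first
```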

**tensor**

use permute to reorder the dimensions of a tensor

permute(a, b, c) means moving axis a to position 0, axis b to position 1, and axis c to position 2

```
import torch

x = torch.rand(2, 3, 4)
x = x.permute(2, 1, 0)
print(x.shape)
# torch.Size([4, 3, 2])
```

unlike permute, transpose can only swap two axes

after using transpose or permute, the storage is no longer contiguous, so we cannot use view to change its shape; use reshape instead (or call contiguous() first)
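
a short illustration:

```
import torch

x = torch.rand(2, 3)
y = x.transpose(0, 1)        # shape (3, 2), storage no longer contiguous
print(y.is_contiguous())     # False
# y.view(6)                  # would raise a RuntimeError
z = y.reshape(6)             # reshape copies when needed, so this works
z = y.contiguous().view(6)   # or make a contiguous copy first, then view
```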

use unsqueeze to insert a dimension of size 1 at the given position

```
import torch

x = torch.rand(2, 1, 3)
y = x.unsqueeze(2)
print(y.shape)
# torch.Size([2, 1, 1, 3])
```

use cat to concatenate tensors; dim=0 means concatenating along the rows

```
torch.cat(tensors, dim=0, out=None)
```

unlike cat, stack creates a new dimension when concatenating tensors

```
torch.stack(inputs, dim=0)
```
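
comparing the two on toy tensors:

```
import torch

a = torch.zeros(2, 3)
b = torch.ones(2, 3)
print(torch.cat((a, b), dim=0).shape)    # torch.Size([4, 3]), rows concatenated
print(torch.cat((a, b), dim=1).shape)    # torch.Size([2, 6]), columns concatenated
print(torch.stack((a, b), dim=0).shape)  # torch.Size([2, 2, 3]), new leading dimension
```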

chunk splits the tensor evenly; `chunks` is the number of divisions and must be an int. E.g. for a 1-D tensor of size 10 with chunks=2, we get a tuple of two size-5 tensors. If the size is not divisible, e.g. size 17 with chunks=4, then 17/4 = 4.25 is rounded up with ceil to 5, so the final chunk sizes are 5, 5, 5, 2

```
torch.chunk(input, chunks, dim=0)
```
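
verifying the not-divisible case:

```
import torch

x = torch.arange(17)
pieces = torch.chunk(x, chunks=4, dim=0)
print([p.shape[0] for p in pieces])   # [5, 5, 5, 2]: ceil(17/4) = 5 per chunk, remainder last
```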

if instead of the number of divisions we want to specify the size of each division, use split

```
torch.split(tensor, split_size_or_sections, dim=0)
```

unbind returns a tuple of all slices along the given dimension, with that dimension removed; it is like chunk with chunks equal to the dimension size, or split with split_size_or_sections = 1, except each slice is squeezed

```
torch.unbind(input, dim=0)
```
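
for example:

```
import torch

x = torch.arange(10)
print([t.shape[0] for t in torch.split(x, [3, 3, 4])])  # [3, 3, 4], explicit section sizes
print([t.shape[0] for t in torch.split(x, 5)])          # [5, 5], fixed size per section

m = torch.rand(3, 4)
rows = torch.unbind(m, dim=0)    # tuple of 3 tensors, each of shape (4,)
print(len(rows), rows[0].shape)  # 3 torch.Size([4])
```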

we can use `index_select` or `masked_select` to select elements; index_select picks entries by index along a dimension, while masked_select only selects those elements where the mask is true (returning a flattened 1-D tensor)
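
for example:

```
import torch

x = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])
idx = torch.tensor([0, 2])
print(torch.index_select(x, dim=1, index=idx))  # columns 0 and 2: [[1, 3], [4, 6]]

mask = x > 3
print(torch.masked_select(x, mask))             # tensor([4, 5, 6]), flattened 1-D result
```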

**data handling**

we can inherit the Dataset class to load a dataset, i.e. implement `__init__()`, `__len__()`, and `__getitem__()`

```
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data_tensor, target_tensor):
        self.data_tensor = data_tensor
        self.target_tensor = target_tensor

    def __len__(self):
        return self.data_tensor.size(0)

    def __getitem__(self, index):
        return self.data_tensor[index], self.target_tensor[index]
```

`DataLoader` is an iterator that takes a Dataset instance as input and generates batches of training samples according to `batch_size`; it can load data with multiple worker processes (`num_workers`)

```
from torch.utils.data import DataLoader

tensor_dataloader = DataLoader(dataset=my_dataset,
                               batch_size=2,
                               shuffle=True,
                               num_workers=0)
```
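
putting the two together, a minimal end-to-end sketch with toy data (shapes are arbitrary):

```
import torch
from torch.utils.data import DataLoader

# toy data: 10 samples with 3 features each, binary targets
data = torch.randn(10, 3)
targets = torch.randint(0, 2, (10,))
my_dataset = MyDataset(data, targets)  # the Dataset subclass defined above

tensor_dataloader = DataLoader(dataset=my_dataset, batch_size=2, shuffle=True)
for batch_data, batch_target in tensor_dataloader:
    print(batch_data.shape, batch_target.shape)  # torch.Size([2, 3]) torch.Size([2])
```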

`torchvision` provides ways to load common datasets, preprocess images, and use well-known networks; e.g. to load MNIST, use

```
import torchvision

mnist_dataset = torchvision.datasets.MNIST(
    root='./data',
    train=True,
    transform=None,
    target_transform=None,
    download=True
)
```

`train=True` means load the training set, otherwise the test set; transform preprocesses the images, target_transform preprocesses the labels

the return type is `torchvision.datasets.mnist.MNIST`, a subclass of Dataset, so it provides `__len__`, `__getitem__`, etc.

use `transforms.ToTensor()` to convert a PIL.Image or numpy.ndarray to a Tensor, and `transforms.ToPILImage(mode=None)` to convert a Tensor back to a PIL.Image

in transforms, standardization (`transforms.Normalize`) means `output = (input - mean)/std`

the purpose is to make sure all images have a similar distribution, so that training is more likely to converge
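
a typical preprocessing pipeline combining the two (the mean/std here are the commonly quoted MNIST statistics):

```
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToTensor(),                                # PIL.Image -> Tensor in [0, 1]
    transforms.Normalize(mean=(0.1307,), std=(0.3081,)),  # output = (input - mean) / std
])
```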

use a pretrained model for fine-tuning

```
import torch
import torchvision.models as models

googlenet = models.googlenet(pretrained=True)       # load ImageNet-pretrained weights
fc_in_features = googlenet.fc.in_features
googlenet.fc = torch.nn.Linear(fc_in_features, 10)  # replace the head with a 10-class layer
```

each epoch trains on the entire dataset; each step trains on one mini-batch

**convolution**

use `Conv2d`; when `padding='same'`, the output size equals the input size

```
class torch.nn.Conv2d(
    in_channels,
    out_channels,
    kernel_size,
    stride=1,
    padding=0,
    dilation=1,
    groups=1,
    bias=True,
    padding_mode='zeros',
    device=None,
    dtype=None
)
```

depthwise separable convolution consists of a depthwise (DW) conv followed by a pointwise (PW) conv

input size is m x h x w, output size is n x h' x w'

DW has m kernels, each of size 3x3, one per input channel

PW has n 1x1 kernels, each with m channels, applied after DW

this is used to reduce the number of parameters in the model

in Conv2d, if groups != 1, the input channels are divided into groups; when groups = in_channels it is DW, and the input and output channel counts of DW both equal the input channel count of the data

for PW, the kernel size is 1, the input channel count equals the output channel count of DW (= the input channel count of the original data), and the output channel count equals the desired n
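
a sketch with Conv2d and groups (m = 16 and n = 32 are arbitrary example values):

```
import torch
import torch.nn as nn

m, n = 16, 32                                   # input channels m, output channels n
depthwise = nn.Conv2d(m, m, kernel_size=3,
                      padding=1, groups=m)      # m 3x3 kernels, one per input channel
pointwise = nn.Conv2d(m, n, kernel_size=1)      # n 1x1 kernels, each with m channels

x = torch.rand(1, m, 28, 28)
y = pointwise(depthwise(x))
print(y.shape)                                  # torch.Size([1, 32, 28, 28])
```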

dilation enlarges the receptive field by inserting gaps (zeros) between the kernel elements; dilated convolutions are often used for pixel-wise segmentation outputs

**visualization**

**tensorboard**

to use TensorBoard, we need a SummaryWriter

```
torch.utils.tensorboard.writer.SummaryWriter(log_dir=None)
```

to record a scalar, use add_scalar

tag is the name of the data

scalar_value is a float holding the value

global_step is the number of training steps

walltime is the timestamp

```
add_scalar(tag, scalar_value, global_step=None, walltime=None)
```
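
a minimal usage sketch (the log_dir and the fake loss loop are arbitrary):

```
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='./runs/demo')
for step in range(100):
    fake_loss = 1.0 / (step + 1)   # stand-in for a real training loss
    writer.add_scalar('train/loss', fake_loss, global_step=step)
writer.close()
# then inspect with: tensorboard --logdir=./runs
```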

use add_image to record an image

img_tensor is a Tensor or numpy array

dataformats is the layout, e.g. CHW means channel x height x width

```
add_image(tag, img_tensor, global_step=None, walltime=None, dataformats='CHW')
```

**visdom**

example

```
from visdom import Visdom
import numpy as np
import time

viz = Visdom()
viz.line([0.], [0], win='train_loss', opts=dict(title='train_loss'))
for n_iter in range(10):
    loss = 0.2 * np.random.randn() + 1
    viz.line([loss], [n_iter], win='train_loss', update='append')
    time.sleep(0.5)

img = np.zeros((3, 100, 100))
img[0] = np.arange(0, 10000).reshape(100, 100) / 10000
img[1] = 1 - np.arange(0, 10000).reshape(100, 100) / 10000
viz.image(img)
```

**distributed training**

get the GPU device

```
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```

to use multiple GPUs on one machine, we need `DataParallel` (DP)

device_ids are the GPUs used for training; output_device is the one that gathers the output, default is GPU 0

the losses are computed concurrently, but they are all gathered on the output device, so that GPU carries more workload

```
torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)
```

in the forward pass, the data is split across the GPUs, and the model is copied to each GPU

to use multiple GPUs on multiple machines, we need `DistributedDataParallel` (DDP)

DP uses a single process (with multiple threads) to control the GPUs, and in backprop the default GPU updates the parameters, so its workload is higher

DDP uses multiple processes, one per GPU; it uses DistributedSampler to load data so the shards do not overlap, and in backprop each process updates its own copy of the parameters

DDP can also be used on one machine with multiple GPUs

DDP concepts

`group` is a process group, default is 1

`world_size` is the total number of processes

`rank` identifies each process; the master process has rank 0

DDP step 1, call `init_process_group`

backend is `nccl`, used for GPU training

init_method is `env://`, i.e. initialized from environment variables

world_size is the total number of processes

rank is the rank of the current process

group_name is the name of the group

```
torch.distributed.init_process_group(backend="nccl", init_method=None, world_size=-1, rank=-1, group_name='')
```

DDP step 2, wrap the model and send it to the GPUs

```
torch.nn.parallel.DistributedDataParallel(module, device_ids=None, output_device=None)
net = torch.nn.parallel.DistributedDataParallel(net)
```

DDP step 3, use `DistributedSampler` to split the data across the GPUs

in DDP we do not scatter data from a main GPU; instead, each process loads its own shard

when we use a sampler, do not also pass shuffle=True to the DataLoader

```
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
data_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler)
```
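
a minimal single-machine multi-GPU sketch tying the three steps together; it assumes the script is launched with `torchrun`, which sets the RANK / WORLD_SIZE / LOCAL_RANK environment variables that `env://` reads, and reuses `train_dataset` and `batch_size` from above:

```
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# step 1: one process per GPU, initialized from environment variables
dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# step 2: move the model to this process's GPU and wrap it
net = torch.nn.Linear(10, 2).cuda(local_rank)   # stand-in for a real model
net = DistributedDataParallel(net, device_ids=[local_rank])

# step 3: each process loads its own non-overlapping shard of the data
train_sampler = DistributedSampler(train_dataset)
data_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler)
```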

**image classification**

**VGG**

use 3x3 kernels to replace the 11x11, 7x7, and 5x5 kernels

a 5x5 kernel is replaced by 2 layers of 3x3, reducing the parameter count: 25 -> 2x9 = 18

the model is deeper and can extract more nonlinear features

**GoogLeNet**

objects have different sizes in different images, so we want multi-scale kernels

use 1x1, 3x3, and 5x5 kernels in the inception module

use 1x1 kernels to reduce the parameter count

**ResNet**

when the model goes deeper, gradients tend to vanish or explode and accuracy degrades

a 56-layer model performs worse than a 20-layer one even in training error, which shows the deeper model cannot easily learn the identity mapping f(x) = x

use shortcut connections to keep the identity mapping

**EfficientNet**

scales model depth, width, and input resolution jointly to achieve the best accuracy while reducing FLOPS

**image segmentation**

unlike classification, the feature maps in segmentation have a size similar to the original image

transposed convolution is used to decode the features, so that the output is close to the original image size; it works as follows:

pad the input feature map with zeros

flip the kernel vertically and horizontally (a 180-degree rotation)

do an ordinary convolution with stride = 1, padding = 0

```
class torch.nn.ConvTranspose2d(in_channels,
                               out_channels,
                               kernel_size,
                               stride=1,
                               padding=0,
                               output_padding=0,
                               groups=1,
                               bias=True,
                               dilation=1)
```

Dice loss measures overlap between the prediction P and the ground truth G: Dice = 2|P ∩ G| / (|P| + |G|), and the loss is 1 - Dice

when the model outputs probabilities instead of hard 0/1 predictions, it is the soft Dice loss: |P ∩ G| is approximated by the dot product of the probability map and the ground truth
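
a sketch of the soft Dice loss for binary segmentation (the eps term is a common numerical-stability convention):

```
import torch

def soft_dice_loss(probs, target, eps=1e-6):
    # probs:  predicted probabilities, shape (N, H, W)
    # target: ground-truth mask of 0s and 1s, same shape
    intersection = (probs * target).sum(dim=(1, 2))   # approximates |P ∩ G|
    denominator = probs.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    dice = (2 * intersection + eps) / (denominator + eps)
    return 1 - dice.mean()
```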

evaluation uses mIoU, where m means the mean over all classes

**NLP**

to find keywords:

we can use term frequency-inverse document frequency (TF-IDF): the more often a word appears in a document, the more important it is, but the more documents of the corpus it appears in, the less important it is

TextRank is similar to PageRank: if a webpage is linked by many other pages it is more important, and if a page is important, the pages it links to also become more important

Latent Dirichlet Allocation (LDA), which is topic-based

n-gram model: each word depends on the previous words, e.g. a 2-gram means w2 depends on w1

use a vector to represent each word (a word embedding); the distance between vectors represents word similarity

attention: the input is a query, keys, and values; the output is the attention value

the query is the output state of the previous time step, z_{t-1}, and the keys and values are the hidden states

the output depends on the similarity between Q and K

Attention(Q, K, V) = softmax(sim(Q, K)) V
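
a sketch using scaled dot product as the sim function (one common choice; the shapes here are arbitrary example values):

```
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # similarity of Q and K via scaled dot product, then a weighted sum of V
    sim = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(sim, dim=-1) @ v

q = torch.rand(1, 5, 64)   # (batch, query length, dim)
k = torch.rand(1, 7, 64)   # (batch, key length, dim)
v = torch.rand(1, 7, 64)
print(attention(q, k, v).shape)  # torch.Size([1, 5, 64])
```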
