# PyTorch learning notes

**about**

learning notes from the Geek Time PyTorch course

**basics**

**numpy**

for reshape, all elements are first unrolled according to the memory layout, then rearranged into the target shape

to sum along axis 0, add the values vertically (down the columns); to sum along axis 1, add them horizontally (across each row)

```
import numpy as np

array1 = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])
total_0_axis = np.sum(array1, axis=0)
print(f'Sum of elements at 0-axis is {total_0_axis}')
# Sum of elements at 0-axis is [ 9 12]
total_1_axis = np.sum(array1, axis=1)
print(f'Sum of elements at 1-axis is {total_1_axis}')
# Sum of elements at 1-axis is [ 3  7 11]
```

use `newaxis` to create a new axis before doing concatenation

a shallow copy is a view of the original array: it shares the values but not the shape, i.e. changing the shape of the shallow copy will not change the shape of the original, but changing a value through the shallow copy also changes the value in the original
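
a quick sketch of both behaviors:

```
import numpy as np

a = np.array([1, 2, 3])      # shape (3,)
col = a[:, np.newaxis]       # shape (3, 1), ready for column-wise concatenation
print(col.shape)             # (3, 1)

v = a.view()                 # shallow copy (a view of a)
v.shape = (1, 3)             # reshaping the view ...
print(a.shape)               # ... leaves the original shape (3,) untouched
v[0, 0] = 99                 # but writing through the view ...
print(a[0])                  # ... changes the original value: 99
```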

use argmax to find the index of the maximum probability, and argsort to get the indices sorted by value
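
for example, with a toy probability vector:

```
import numpy as np

probs = np.array([0.1, 0.6, 0.3])
print(np.argmax(probs))          # 1, index of the highest probability
print(np.argsort(probs))         # [0 2 1], indices in ascending order of value
print(np.argsort(probs)[::-1])   # [1 2 0], descending: best predictions first
```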

**tensor**

use permute to reorder the dimensions of a tensor

permute(a, b, c) means moving axis a to position 0, axis b to position 1, and axis c to position 2

```
import torch

x = torch.rand(2, 3, 4)
x = x.permute(2, 1, 0)
print(x.shape)
# torch.Size([4, 3, 2])
```

unlike permute, transpose can only swap two axes

after using transpose or permute, the storage is no longer contiguous, so we cannot use view to change its shape; use reshape instead (or call contiguous() first)
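
a short illustration:

```
import torch

x = torch.rand(2, 3)
y = x.transpose(0, 1)        # shape (3, 2), storage no longer contiguous
print(y.is_contiguous())     # False
# y.view(6)                  # would raise a RuntimeError
z = y.reshape(6)             # reshape copies when needed, so this works
z = y.contiguous().view(6)   # or make a contiguous copy first, then view
```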

use unsqueeze to insert a dimension of size 1 at the given position

```
import torch

x = torch.rand(2, 1, 3)
y = x.unsqueeze(2)
print(y.shape)
# torch.Size([2, 1, 1, 3])
```

use cat to concatenate tensors; dim=0 means concatenating along the rows

```
torch.cat(tensors, dim=0, out=None)
```

unlike cat, stack creates a new dimension when concatenating tensors

```
torch.stack(inputs, dim=0)
```
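
comparing the two on toy tensors:

```
import torch

a = torch.zeros(2, 3)
b = torch.ones(2, 3)
print(torch.cat((a, b), dim=0).shape)    # torch.Size([4, 3]), rows concatenated
print(torch.cat((a, b), dim=1).shape)    # torch.Size([2, 6]), columns concatenated
print(torch.stack((a, b), dim=0).shape)  # torch.Size([2, 2, 3]), new leading dimension
```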

chunk splits the tensor evenly; `chunks` is the number of divisions and must be an int. E.g. for a 1-D tensor of size 10 with chunks=2, we get a tuple of two size-5 tensors. If the size is not divisible, e.g. size 17 with chunks=4, then 17/4 = 4.25 is rounded up with ceil to 5, so the final chunk sizes are 5, 5, 5, 2

```
torch.chunk(input, chunks, dim=0)
```
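
verifying the not-divisible case:

```
import torch

x = torch.arange(17)
pieces = torch.chunk(x, chunks=4, dim=0)
print([p.shape[0] for p in pieces])   # [5, 5, 5, 2]: ceil(17/4) = 5 per chunk, remainder last
```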

if instead of the number of divisions we want to specify the size of each division, use split

```
torch.split(tensor, split_size_or_sections, dim=0)
```

unbind returns a tuple of all slices along the given dimension, with that dimension removed; it is like chunk with chunks equal to the dimension size, or split with split_size_or_sections = 1, except each slice is squeezed

```
torch.unbind(input, dim=0)
```
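
for example:

```
import torch

x = torch.arange(10)
print([t.shape[0] for t in torch.split(x, [3, 3, 4])])  # [3, 3, 4], explicit section sizes
print([t.shape[0] for t in torch.split(x, 5)])          # [5, 5], fixed size per section

m = torch.rand(3, 4)
rows = torch.unbind(m, dim=0)    # tuple of 3 tensors, each of shape (4,)
print(len(rows), rows[0].shape)  # 3 torch.Size([4])
```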

we can use `index_select` or `masked_select` to select elements; index_select picks entries by index along a dimension, while masked_select only selects those elements where the mask is true (returning a flattened 1-D tensor)
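
for example:

```
import torch

x = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])
idx = torch.tensor([0, 2])
print(torch.index_select(x, dim=1, index=idx))  # columns 0 and 2: [[1, 3], [4, 6]]

mask = x > 3
print(torch.masked_select(x, mask))             # tensor([4, 5, 6]), flattened 1-D result
```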

**data handling**

we can inherit the Dataset class to load a dataset, i.e. implement `__init__()`, `__len__()`, and `__getitem__()`

```
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data_tensor, target_tensor):
        self.data_tensor = data_tensor
        self.target_tensor = target_tensor

    def __len__(self):
        return self.data_tensor.size(0)

    def __getitem__(self, index):
        return self.data_tensor[index], self.target_tensor[index]
```

`DataLoader` is an iterator that takes a Dataset instance as input and generates batches of training samples according to `batch_size`; it can load data with multiple worker processes (`num_workers`)

```
from torch.utils.data import DataLoader

tensor_dataloader = DataLoader(dataset=my_dataset,
                               batch_size=2,
                               shuffle=True,
                               num_workers=0)
```
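
putting the two together, a minimal end-to-end sketch with toy data (shapes are arbitrary):

```
import torch
from torch.utils.data import DataLoader

# toy data: 10 samples with 3 features each, binary targets
data = torch.randn(10, 3)
targets = torch.randint(0, 2, (10,))
my_dataset = MyDataset(data, targets)  # the Dataset subclass defined above

tensor_dataloader = DataLoader(dataset=my_dataset, batch_size=2, shuffle=True)
for batch_data, batch_target in tensor_dataloader:
    print(batch_data.shape, batch_target.shape)  # torch.Size([2, 3]) torch.Size([2])
```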

`torchvision` provides ways to load common datasets, preprocess images, and use well-known networks; e.g. to load MNIST, use

```
import torchvision

mnist_dataset = torchvision.datasets.MNIST(
    root='./data',
    train=True,
    transform=None,
    target_transform=None,
    download=True
)
```

`train=True` means load the training set, otherwise the test set; transform preprocesses the images, target_transform preprocesses the labels

the return type is `torchvision.datasets.mnist.MNIST`, a subclass of Dataset, so it provides `__len__`, `__getitem__`, etc.

use `transforms.ToTensor()` to convert a PIL.Image or numpy.ndarray to a Tensor, and `transforms.ToPILImage(mode=None)` to convert a Tensor back to a PIL.Image

in transforms, standardization (`transforms.Normalize`) means `output = (input - mean)/std`

the purpose is to make sure all images have a similar distribution, so that training is more likely to converge
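
a typical preprocessing pipeline combining the two (the mean/std here are the commonly quoted MNIST statistics):

```
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToTensor(),                                # PIL.Image -> Tensor in [0, 1]
    transforms.Normalize(mean=(0.1307,), std=(0.3081,)),  # output = (input - mean) / std
])
```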

use a pretrained model for fine-tuning

```
import torch
import torchvision.models as models

googlenet = models.googlenet(pretrained=True)       # load ImageNet-pretrained weights
fc_in_features = googlenet.fc.in_features
googlenet.fc = torch.nn.Linear(fc_in_features, 10)  # replace the head with a 10-class layer
```

each epoch trains on the entire dataset; each step trains on one mini-batch

**convolution**

use `Conv2d`; when `padding='same'`, the output size equals the input size

```
class torch.nn.Conv2d(
    in_channels,
    out_channels,
    kernel_size,
    stride=1,
    padding=0,
    dilation=1,
    groups=1,
    bias=True,
    padding_mode='zeros',
    device=None,
    dtype=None
)
```

depthwise separable convolution consists of a depthwise (DW) conv followed by a pointwise (PW) conv

input size is m x h x w, output size is n x h' x w'

DW has m kernels, each of size 3x3, one per input channel

PW has n 1x1 kernels, each with m channels, applied after DW

this is used to reduce the number of parameters in the model

in Conv2d, if groups != 1, the input channels are divided into groups; when groups = in_channels it is DW, and the input and output channel counts of DW both equal the input channel count of the data

for PW, the kernel size is 1, the input channel count equals the output channel count of DW (= the input channel count of the original data), and the output channel count equals the desired n
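
a sketch with Conv2d and groups (m = 16 and n = 32 are arbitrary example values):

```
import torch
import torch.nn as nn

m, n = 16, 32                                   # input channels m, output channels n
depthwise = nn.Conv2d(m, m, kernel_size=3,
                      padding=1, groups=m)      # m 3x3 kernels, one per input channel
pointwise = nn.Conv2d(m, n, kernel_size=1)      # n 1x1 kernels, each with m channels

x = torch.rand(1, m, 28, 28)
y = pointwise(depthwise(x))
print(y.shape)                                  # torch.Size([1, 32, 28, 28])
```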

dilation enlarges the receptive field by inserting gaps (zeros) between the kernel elements; dilated convolutions are often used for pixel-wise segmentation outputs

**visualization**

**tensorboard**

to use TensorBoard, we need a SummaryWriter

```
torch.utils.tensorboard.writer.SummaryWriter(log_dir=None)
```

to record a scalar, use add_scalar

tag is the name of the data

scalar_value is a float holding the value

global_step is the number of training steps

walltime is the timestamp

```
add_scalar(tag, scalar_value, global_step=None, walltime=None)
```
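
a minimal usage sketch (the log_dir and the fake loss loop are arbitrary):

```
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='./runs/demo')
for step in range(100):
    fake_loss = 1.0 / (step + 1)   # stand-in for a real training loss
    writer.add_scalar('train/loss', fake_loss, global_step=step)
writer.close()
# then inspect with: tensorboard --logdir=./runs
```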

use add_image to record an image

img_tensor is a Tensor or numpy array

dataformats is the layout, e.g. CHW means channel x height x width

```
add_image(tag, img_tensor, global_step=None, walltime=None, dataformats='CHW')
```

**visdom**

example

```
from visdom import Visdom
import numpy as np
import time

viz = Visdom()
viz.line([0.], [0], win='train_loss', opts=dict(title='train_loss'))
for n_iter in range(10):
    loss = 0.2 * np.random.randn() + 1
    viz.line([loss], [n_iter], win='train_loss', update='append')
    time.sleep(0.5)

img = np.zeros((3, 100, 100))
img[0] = np.arange(0, 10000).reshape(100, 100) / 10000
img[1] = 1 - np.arange(0, 10000).reshape(100, 100) / 10000
viz.image(img)
```

**distributed training**

get the GPU device

```
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```

to use multiple GPUs on one machine, we need `DataParallel` (DP)

device_ids are the GPUs used for training; output_device is the one that gathers the output, default is GPU 0

the losses are computed concurrently, but they are all gathered on the output device, so that GPU carries more workload

```
torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)
```

in the forward pass, the data is split across the GPUs, and the model is copied to each GPU

to use multiple GPUs on multiple machines, we need `DistributedDataParallel` (DDP)

DP uses a single process (with multiple threads) to control the GPUs, and in backprop the default GPU updates the parameters, so its workload is higher

DDP uses multiple processes, one per GPU; it uses DistributedSampler to load data so the shards do not overlap, and in backprop each process updates its own copy of the parameters

DDP can also be used on one machine with multiple GPUs

DDP concepts

`group` is a process group, default is 1

`world_size` is the total number of processes

`rank` identifies each process; the master process has rank 0

DDP step 1, call `init_process_group`

backend is `nccl`, used for GPU training

init_method is `env://`, i.e. initialized from environment variables

world_size is the total number of processes

rank is the rank of the current process

group_name is the name of the group

```
torch.distributed.init_process_group(backend="nccl", init_method=None, world_size=-1, rank=-1, group_name='')
```

DDP step 2, wrap the model and send it to the GPUs

```
torch.nn.parallel.DistributedDataParallel(module, device_ids=None, output_device=None)
net = torch.nn.parallel.DistributedDataParallel(net)
```

DDP step 3, use `DistributedSampler` to split the data across the GPUs

in DDP we do not scatter data from a main GPU; instead, each process loads its own shard

when we use a sampler, do not also pass shuffle=True to the DataLoader

```
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
data_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler)
```
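
a minimal single-machine multi-GPU sketch tying the three steps together; it assumes the script is launched with `torchrun`, which sets the RANK / WORLD_SIZE / LOCAL_RANK environment variables that `env://` reads, and reuses `train_dataset` and `batch_size` from above:

```
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# step 1: one process per GPU, initialized from environment variables
dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# step 2: move the model to this process's GPU and wrap it
net = torch.nn.Linear(10, 2).cuda(local_rank)   # stand-in for a real model
net = DistributedDataParallel(net, device_ids=[local_rank])

# step 3: each process loads its own non-overlapping shard of the data
train_sampler = DistributedSampler(train_dataset)
data_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler)
```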

**image classification**

**VGG**

use 3x3 kernels to replace the 11x11, 7x7, and 5x5 kernels

a 5x5 kernel is replaced by 2 layers of 3x3, reducing the parameter count: 25 -> 2x9 = 18

the model is deeper and can extract more nonlinear features

**GoogLeNet**

objects have different sizes in different images, so we want multi-scale kernels

use 1x1, 3x3, and 5x5 kernels in the inception module

use 1x1 kernels to reduce the parameter count

**ResNet**

when the model goes deeper, gradients tend to vanish or explode and accuracy degrades

a 56-layer model performs worse than a 20-layer one even in training error, which shows the deeper model cannot easily learn the identity mapping f(x) = x

use shortcut connections to keep the identity mapping

**EfficientNet**

scales model depth, width, and input resolution jointly to achieve the best accuracy while reducing FLOPS

**image segmentation**

unlike classification, the feature maps in segmentation have a size similar to the original image

transposed convolution is used to decode the features, so that the output is close to the original image size; it works as follows:

pad the input feature map with zeros

flip the kernel vertically and horizontally (a 180-degree rotation)

do an ordinary convolution with stride = 1, padding = 0

```
class torch.nn.ConvTranspose2d(in_channels,
                               out_channels,
                               kernel_size,
                               stride=1,
                               padding=0,
                               output_padding=0,
                               groups=1,
                               bias=True,
                               dilation=1)
```

Dice loss measures overlap between the prediction P and the ground truth G: Dice = 2|P ∩ G| / (|P| + |G|), and the loss is 1 - Dice

when the model outputs probabilities instead of hard 0/1 predictions, it is the soft Dice loss: |P ∩ G| is approximated by the dot product of the probability map and the ground truth
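
a sketch of the soft Dice loss for binary segmentation (the eps term is a common numerical-stability convention):

```
import torch

def soft_dice_loss(probs, target, eps=1e-6):
    # probs:  predicted probabilities, shape (N, H, W)
    # target: ground-truth mask of 0s and 1s, same shape
    intersection = (probs * target).sum(dim=(1, 2))   # approximates |P ∩ G|
    denominator = probs.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    dice = (2 * intersection + eps) / (denominator + eps)
    return 1 - dice.mean()
```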

evaluation uses mIoU, where m means the mean over all classes

**NLP**

to find keywords:

we can use term frequency-inverse document frequency (TF-IDF): the more often a word appears in a document, the more important it is, but the more documents of the corpus it appears in, the less important it is

TextRank is similar to PageRank: if a webpage is linked by many other pages it is more important, and if a page is important, the pages it links to also become more important

Latent Dirichlet Allocation (LDA), which is topic-based

n-gram model: each word depends on the previous words, e.g. a 2-gram means w2 depends on w1

use a vector to represent each word (a word embedding); the distance between vectors represents word similarity

attention: the input is a query, keys, and values; the output is the attention value

the query is the output state of the previous time step, z_{t-1}, and the keys and values are the hidden states

the output depends on the similarity between Q and K

Attention(Q, K, V) = softmax(sim(Q, K)) V
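
a sketch using scaled dot product as the sim function (one common choice; the shapes here are arbitrary example values):

```
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # similarity of Q and K via scaled dot product, then a weighted sum of V
    sim = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(sim, dim=-1) @ v

q = torch.rand(1, 5, 64)   # (batch, query length, dim)
k = torch.rand(1, 7, 64)   # (batch, key length, dim)
v = torch.rand(1, 7, 64)
print(attention(q, k, v).shape)  # torch.Size([1, 5, 64])
```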
