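YOLOv11 Improvements | Backbone | Using the Lightweight MobileViTv2 Network as the YOLOv11 Backbone for a Lighter Model

1. Introduction

The improvement covered in this article is MobileViTv2, the second version of the MobileViT series. MobileViT was proposed as a challenger to MobileNet: it is a lightweight, general-purpose vision transformer designed for mobile devices. It combines the strengths of convolutional neural networks (CNNs) and vision transformers (ViTs), aiming to cut parameter count and latency while keeping accuracy high. With its MobileViT block and multi-scale training method, MobileViT achieves strong results on several vision tasks.

You are welcome to subscribe to my column and learn YOLO together!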



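Contents

1. Introduction

2. Principles

3. Core Code

4. Adding MobileViTv2 Step by Step

4.1 Modification 1

4.2 Modification 2

4.3 Modification 3

4.4 Modification 4

4.5 Modification 5

4.6 Modification 6

4.7 Modification 7

4.8 Modification 8

Note!!! An additional modification!

Fixing the FLOPs printout problem

Notes!!!

5. MobileViTv2 YAML File

5.1 MobileViTv2 YAML file

5.2 Training script

6. Successful Run Record

7. Summary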


2. Principles


Official paper: https://arxiv.org/abs/2206.02680

Official code: https://github.com/apple/ml-cvnets


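The MobileViTv2 paper, "Separable Self-attention for Mobile Vision Transformers" by Sachin Mehta and Mohammad Rastegari, introduces a separable self-attention method with linear complexity to address the high latency of MobileViT on resource-constrained devices. The paper reports state-of-the-art results for MobileViTv2 on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection. With about three million parameters, MobileViTv2 reaches 75.6% top-1 accuracy on ImageNet, about 1% higher than MobileViT, while running 3.2 times faster on a mobile device.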

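Main contributions and features

1. Separable self-attention: a self-attention method with linear complexity, O(k) in the number of tokens k, that computes attention with element-wise operations and is therefore well suited to resource-constrained devices.

2. Higher efficiency: compared with conventional multi-head self-attention (MHA), whose cost grows as O(k^2), the method lowers computational complexity and cost and speeds up inference on resource-constrained devices.

3. Strong performance: MobileViTv2 achieves excellent results on a range of mobile vision tasks, demonstrating its effectiveness and practicality as a lightweight vision transformer.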
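The separable self-attention idea in point 1 can be sketched in a few lines. This is a minimal, dependency-free illustration on plain Python lists, not the implementation used in this article (that one is the PyTorch `LinearSelfAttention` class in Section 3). The function name, the weight arguments `w_i`, `w_k`, `w_v`, and the omission of biases and of the output projection are all assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w: List[Vector], x: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def separable_self_attention(
    tokens: List[Vector],  # k tokens, each of dimension d
    w_i: Vector,           # hypothetical weights mapping a token to one scalar score
    w_k: List[Vector],     # hypothetical d x d key projection
    w_v: List[Vector],     # hypothetical d x d value projection
) -> List[Vector]:
    d = len(w_i)
    # 1) One scalar score per token, normalised across the k tokens.
    scores = softmax([sum(w * x for w, x in zip(w_i, t)) for t in tokens])
    # 2) Context vector: score-weighted sum of the projected keys (a single d-vector).
    keys = [matvec(w_k, t) for t in tokens]
    context = [sum(s * key[j] for s, key in zip(scores, keys)) for j in range(d)]
    # 3) Broadcast the context vector onto ReLU(values), element-wise.
    values = [matvec(w_v, t) for t in tokens]
    return [[max(0.0, v[j]) * context[j] for j in range(d)] for v in values]
```

Note that no k x k attention matrix is ever built: every step touches each token once, which is where the linear dependence on k comes from.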

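Summary: with its separable self-attention mechanism, MobileViTv2 stays lightweight while delivering high performance on multiple mobile vision tasks, which makes it particularly suitable for deployment on mobile devices with limited compute. The open-source code released with the paper further supports research and development in this area.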
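To make the efficiency claim concrete, here is a rough back-of-the-envelope count of multiply-adds for the attention step alone. It is only a sketch under simplifying assumptions: input and output projections, softmax, and constant factors are ignored, the function names are made up for illustration, and the numbers are not latency measurements.

```python
def mha_attention_cost(k: int, d: int) -> int:
    """Approximate multiply-adds for QK^T plus (attention x V): two k x k x d products."""
    return 2 * k * k * d


def separable_attention_cost(k: int, d: int) -> int:
    """Approximate multiply-adds for scores, the weighted key sum, and the value broadcast."""
    return 3 * k * d


if __name__ == "__main__":
    d = 64
    for k in (256, 512, 1024):
        print(k, mha_attention_cost(k, d), separable_attention_cost(k, d))
```

Doubling the number of tokens k quadruples the first count but only doubles the second, which is the quadratic-versus-linear gap described above.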


3. Core Code

See Section 4 for how to use the code!
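Before the full listing, a small worked example of the channel-rounding helper it relies on may help. The `make_divisible` function below is copied from the listing; the `BASE_WIDTHS` table and the loop are an illustrative assumption that mirrors how the "2xx_small" configuration scales its stage widths with `width_multiplier = 0.5`.

```python
def make_divisible(v, divisor=8, min_value=None):
    """Round v to the nearest multiple of divisor, never dropping more than 10% below v."""
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v


WIDTH_MULTIPLIER = 0.5
# (base width, divisor) for layer1..layer5, as used by the "2xx_small" branch
BASE_WIDTHS = [(64, 16), (128, 8), (256, 8), (384, 8), (512, 8)]

stage_widths = [
    int(make_divisible(base * WIDTH_MULTIPLIER, divisor=div))
    for base, div in BASE_WIDTHS
]
print(stage_widths)        # [32, 64, 128, 192, 256]
print(make_divisible(50))  # 48: rounded to the nearest multiple of 8
print(make_divisible(10))  # 16: 8 would be more than 10% below 10, so it is bumped up
```

With a multiplier of 0.5 every scaled width is already a multiple of its divisor, so the rounding only matters for other multipliers.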

"""
original code from apple:
https://github.com/apple/ml-cvnets/blob/main/cvnets/models/classification/mobilevit.py
"""
import math
import numpy as np
import torch
import torch.nn as nn
from torch import Tensor
from torch.nn import functional as F
from typing import Tuple, Dict, Sequence
from typing import Union, Optional

__all__ = ['mobile_vit2_xx_small']


def make_divisible(
    v: Union[float, int],
    divisor: Optional[int] = 8,
    min_value: Optional[Union[float, int]] = None,
) -> Union[float, int]:
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    :param v:
    :param divisor:
    :param min_value:
    :return:
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v


def bound_fn(
    min_val: Union[float, int], max_val: Union[float, int], value: Union[float, int]
) -> Union[float, int]:
    return max(min_val, min(max_val, value))


# Note: the original default ("xxs") matched no branch below and always raised
# NotImplementedError, so the default is set to a valid mode here.
def get_config(mode: str = "xx_small") -> dict:
    width_multiplier = 0.5
    ffn_multiplier = 2
    layer_0_dim = bound_fn(min_val=16, max_val=64, value=32 * width_multiplier)
    layer_0_dim = int(make_divisible(layer_0_dim, divisor=8, min_value=16))
    # print("layer_0_dim: ", layer_0_dim)
    if mode == "xx_small":
        mv2_exp_mult = 2
        config = {
            "layer1": {
                "out_channels": 16,
                "expand_ratio": mv2_exp_mult,
                "num_blocks": 1,
                "stride": 1,
                "block_type": "mv2",
            },
            "layer2": {
                "out_channels": 24,
                "expand_ratio": mv2_exp_mult,
                "num_blocks": 3,
                "stride": 2,
                "block_type": "mv2",
            },
            "layer3": {  # 28x28
                "out_channels": 48,
                "transformer_channels": 64,
                "ffn_dim": 128,
                "transformer_blocks": 2,
                "patch_h": 2,  # 8,
                "patch_w": 2,  # 8,
                "stride": 2,
                "mv_expand_ratio": mv2_exp_mult,
                "num_heads": 4,
                "block_type": "mobilevit",
            },
            "layer4": {  # 14x14
                "out_channels": 64,
                "transformer_channels": 80,
                "ffn_dim": 160,
                "transformer_blocks": 4,
                "patch_h": 2,  # 4,
                "patch_w": 2,  # 4,
                "stride": 2,
                "mv_expand_ratio": mv2_exp_mult,
                "num_heads": 4,
                "block_type": "mobilevit",
            },
            "layer5": {  # 7x7
                "out_channels": 80,
                "transformer_channels": 96,
                "ffn_dim": 192,
                "transformer_blocks": 3,
                "patch_h": 2,
                "patch_w": 2,
                "stride": 2,
                "mv_expand_ratio": mv2_exp_mult,
                "num_heads": 4,
                "block_type": "mobilevit",
            },
            "last_layer_exp_factor": 4,
            "cls_dropout": 0.1
        }
    elif mode == "x_small":
        mv2_exp_mult = 4
        config = {
            "layer1": {
                "out_channels": 32,
                "expand_ratio": mv2_exp_mult,
                "num_blocks": 1,
                "stride": 1,
                "block_type": "mv2",
            },
            "layer2": {
                "out_channels": 48,
                "expand_ratio": mv2_exp_mult,
                "num_blocks": 3,
                "stride": 2,
                "block_type": "mv2",
            },
            "layer3": {  # 28x28
                "out_channels": 64,
                "transformer_channels": 96,
                "ffn_dim": 192,
                "transformer_blocks": 2,
                "patch_h": 2,
                "patch_w": 2,
                "stride": 2,
                "mv_expand_ratio": mv2_exp_mult,
                "num_heads": 4,
                "block_type": "mobilevit",
            },
            "layer4": {  # 14x14
                "out_channels": 80,
                "transformer_channels": 120,
                "ffn_dim": 240,
                "transformer_blocks": 4,
                "patch_h": 2,
                "patch_w": 2,
                "stride": 2,
                "mv_expand_ratio": mv2_exp_mult,
                "num_heads": 4,
                "block_type": "mobilevit",
            },
            "layer5": {  # 7x7
                "out_channels": 96,
                "transformer_channels": 144,
                "ffn_dim": 288,
                "transformer_blocks": 3,
                "patch_h": 2,
                "patch_w": 2,
                "stride": 2,
                "mv_expand_ratio": mv2_exp_mult,
                "num_heads": 4,
                "block_type": "mobilevit",
            },
            "last_layer_exp_factor": 4,
            "cls_dropout": 0.1
        }
    elif mode == "small":
        mv2_exp_mult = 4
        config = {
            "layer1": {
                "out_channels": 32,
                "expand_ratio": mv2_exp_mult,
                "num_blocks": 1,
                "stride": 1,
                "block_type": "mv2",
            },
            "layer2": {
                "out_channels": 64,
                "expand_ratio": mv2_exp_mult,
                "num_blocks": 3,
                "stride": 2,
                "block_type": "mv2",
            },
            "layer3": {  # 28x28
                "out_channels": 96,
                "transformer_channels": 144,
                "ffn_dim": 288,
                "transformer_blocks": 2,
                "patch_h": 2,
                "patch_w": 2,
                "stride": 2,
                "mv_expand_ratio": mv2_exp_mult,
                "num_heads": 4,
                "block_type": "mobilevit",
            },
            "layer4": {  # 14x14
                "out_channels": 128,
                "transformer_channels": 192,
                "ffn_dim": 384,
                "transformer_blocks": 4,
                "patch_h": 2,
                "patch_w": 2,
                "stride": 2,
                "mv_expand_ratio": mv2_exp_mult,
                "num_heads": 4,
                "block_type": "mobilevit",
            },
            "layer5": {  # 7x7
                "out_channels": 160,
                "transformer_channels": 240,
                "ffn_dim": 480,
                "transformer_blocks": 3,
                "patch_h": 2,
                "patch_w": 2,
                "stride": 2,
                "mv_expand_ratio": mv2_exp_mult,
                "num_heads": 4,
                "block_type": "mobilevit",
            },
            "last_layer_exp_factor": 4,
            "cls_dropout": 0.1
        }
    elif mode == "2xx_small":
        mv2_exp_mult = 2
        config = {
            "layer0": {
                "img_channels": 3,
                "out_channels": layer_0_dim,
            },
            "layer1": {
                "out_channels": int(make_divisible(64 * width_multiplier, divisor=16)),
                "expand_ratio": mv2_exp_mult,
                "num_blocks": 1,
                "stride": 1,
                "block_type": "mv2",
            },
            "layer2": {
                "out_channels": int(make_divisible(128 * width_multiplier, divisor=8)),
                "expand_ratio": mv2_exp_mult,
                "num_blocks": 2,
                "stride": 2,
                "block_type": "mv2",
            },
            "layer3": {  # 28x28
                "out_channels": int(make_divisible(256 * width_multiplier, divisor=8)),
                "attn_unit_dim": int(make_divisible(128 * width_multiplier, divisor=8)),
                "ffn_multiplier": ffn_multiplier,
                "attn_blocks": 2,
                "patch_h": 2,
                "patch_w": 2,
                "stride": 2,
                "mv_expand_ratio": mv2_exp_mult,
                "block_type": "mobilevit",
            },
            "layer4": {  # 14x14
                "out_channels": int(make_divisible(384 * width_multiplier, divisor=8)),
                "attn_unit_dim": int(make_divisible(192 * width_multiplier, divisor=8)),
                "ffn_multiplier": ffn_multiplier,
                "attn_blocks": 4,
                "patch_h": 2,
                "patch_w": 2,
                "stride": 2,
                "mv_expand_ratio": mv2_exp_mult,
                "block_type": "mobilevit",
            },
            "layer5": {  # 7x7
                "out_channels": int(make_divisible(512 * width_multiplier, divisor=8)),
                "attn_unit_dim": int(make_divisible(256 * width_multiplier, divisor=8)),
                "ffn_multiplier": ffn_multiplier,
                "attn_blocks": 3,
                "patch_h": 2,
                "patch_w": 2,
                "stride": 2,
                "mv_expand_ratio": mv2_exp_mult,
                "block_type": "mobilevit",
            },
            "last_layer_exp_factor": 4,
        }
    else:
        raise NotImplementedError
    for k in ["layer1", "layer2", "layer3", "layer4", "layer5"]:
        config[k].update({"dropout": 0.1, "ffn_dropout": 0.0, "attn_dropout": 0.0})
    return config


class ConvLayer(nn.Module):
    """
    Applies a 2D convolution over an input
    Args:
        in_channels (int): :math:`C_{in}` from an expected input of size :math:`(N, C_{in}, H_{in}, W_{in})`
        out_channels (int): :math:`C_{out}` from an expected output of size :math:`(N, C_{out}, H_{out}, W_{out})`
        kernel_size (Union[int, Tuple[int, int]]): Kernel size for convolution.
        stride (Union[int, Tuple[int, int]]): Stride for convolution. Default: 1
        groups (Optional[int]): Number of groups in convolution. Default: 1
        bias (Optional[bool]): Use bias. Default: ``False``
        use_norm (Optional[bool]): Use normalization layer after convolution. Default: ``True``
        use_act (Optional[bool]): Use activation layer after convolution (or convolution and normalization).
            Default: ``True``
    Shape:
        - Input: :math:`(N, C_{in}, H_{in}, W_{in})`
        - Output: :math:`(N, C_{out}, H_{out}, W_{out})`
    .. note::
        For depth-wise convolution, `groups=C_{in}=C_{out}`.
    """

    def __init__(
        self,
        in_channels: int,  # number of input channels
        out_channels: int,  # number of output channels
        kernel_size: Union[int, Tuple[int, int]],  # convolution kernel size
        stride: Optional[Union[int, Tuple[int, int]]] = 1,  # stride
        groups: Optional[int] = 1,  # number of groups (grouped convolution)
        bias: Optional[bool] = False,  # whether to use a bias term
        use_norm: Optional[bool] = True,  # whether to apply normalization
        use_act: Optional[bool] = True,  # whether to apply an activation function
    ) -> None:
        super().__init__()
        if isinstance(kernel_size, int):
            kernel_size = (kernel_size, kernel_size)
        if isinstance(stride, int):
            stride = (stride, stride)
        assert isinstance(kernel_size, tuple)
        assert isinstance(stride, tuple)
        # "same"-style padding for odd kernel sizes
        padding = (
            int((kernel_size[0] - 1) / 2),
            int((kernel_size[1] - 1) / 2),
        )
        block = nn.Sequential()
        conv_layer = nn.Conv2d(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            groups=groups,
            padding=padding,
            bias=bias
        )
        block.add_module(name="conv", module=conv_layer)
        if use_norm:
            norm_layer = nn.BatchNorm2d(num_features=out_channels, momentum=0.1)  # BatchNorm2d
            block.add_module(name="norm", module=norm_layer)
        if use_act:
            act_layer = nn.SiLU()  # Swish activation
            block.add_module(name="act", module=act_layer)
        self.block = block

    def forward(self, x: Tensor) -> Tensor:
        return self.block(x)


class MultiHeadAttention(nn.Module):
    """
    This layer applies a multi-head self- or cross-attention as described in
    `Attention is all you need <https://arxiv.org/abs/1706.03762>`_ paper
    Args:
        embed_dim (int): :math:`C_{in}` from an expected input of size :math:`(N, P, C_{in})`
        num_heads (int): Number of heads in multi-head attention
        attn_dropout (float): Attention dropout. Default: 0.0
        bias (bool): Use bias or not. Default: ``True``
    Shape:
        - Input: :math:`(N, P, C_{in})` where :math:`N` is batch size, :math:`P` is number of patches,
          and :math:`C_{in}` is input embedding dim
        - Output: same shape as the input
    """

    def __init__(
        self,
        embed_dim: int,
        num_heads: int,
        attn_dropout: float = 0.0,
        bias: bool = True,
        *args,
        **kwargs
    ) -> None:
        super().__init__()
        if embed_dim % num_heads != 0:
            raise ValueError(
                "Embedding dim must be divisible by number of heads in {}. Got: embed_dim={} and num_heads={}".format(
                    self.__class__.__name__, embed_dim, num_heads
                )
            )
        self.qkv_proj = nn.Linear(in_features=embed_dim, out_features=3 * embed_dim, bias=bias)
        self.attn_dropout = nn.Dropout(p=attn_dropout)
        self.out_proj = nn.Linear(in_features=embed_dim, out_features=embed_dim, bias=bias)
        self.head_dim = embed_dim // num_heads
        self.scaling = self.head_dim ** -0.5
        self.softmax = nn.Softmax(dim=-1)
        self.num_heads = num_heads
        self.embed_dim = embed_dim

    def forward(self, x_q: Tensor) -> Tensor:
        # [N, P, C]
        b_sz, n_patches, in_channels = x_q.shape
        # self-attention
        # [N, P, C] -> [N, P, 3C] -> [N, P, 3, h, c] where C = hc
        qkv = self.qkv_proj(x_q).reshape(b_sz, n_patches, 3, self.num_heads, -1)
        # [N, P, 3, h, c] -> [N, h, 3, P, c]
        qkv = qkv.transpose(1, 3).contiguous()
        # [N, h, 3, P, c] -> [N, h, P, c] x 3
        query, key, value = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2]
        query = query * self.scaling
        # [N, h, P, c] -> [N, h, c, P]
        key = key.transpose(-1, -2)
        # QK^T
        # [N, h, P, c] x [N, h, c, P] -> [N, h, P, P]
        attn = torch.matmul(query, key)
        attn = self.softmax(attn)
        attn = self.attn_dropout(attn)
        # weighted sum
        # [N, h, P, P] x [N, h, P, c] -> [N, h, P, c]
        out = torch.matmul(attn, value)
        # [N, h, P, c] -> [N, P, h, c] -> [N, P, C]
        out = out.transpose(1, 2).reshape(b_sz, n_patches, -1)
        out = self.out_proj(out)
        return out


class TransformerEncoder(nn.Module):
    """
    This class defines the pre-norm `Transformer encoder <https://arxiv.org/abs/1706.03762>`_
    Args:
        embed_dim (int): :math:`C_{in}` from an expected input of size :math:`(N, P, C_{in})`
        ffn_latent_dim (int): Inner dimension of the FFN
        num_heads (int) : Number of heads in multi-head attention. Default: 8
        attn_dropout (float): Dropout rate for attention in multi-head attention. Default: 0.0
        dropout (float): Dropout rate. Default: 0.0
        ffn_dropout (float): Dropout between FFN layers. Default: 0.0
    Shape:
        - Input: :math:`(N, P, C_{in})` where :math:`N` is batch size, :math:`P` is number of patches,
          and :math:`C_{in}` is input embedding dim
        - Output: same shape as the input
    """

    def __init__(
        self,
        embed_dim: int,
        ffn_latent_dim: int,
        num_heads: Optional[int] = 8,
        attn_dropout: Optional[float] = 0.0,
        dropout: Optional[float] = 0.0,
        ffn_dropout: Optional[float] = 0.0,
        *args,
        **kwargs
    ) -> None:
        super().__init__()
        attn_unit = MultiHeadAttention(
            embed_dim,
            num_heads,
            attn_dropout=attn_dropout,
            bias=True
        )
        self.pre_norm_mha = nn.Sequential(
            nn.LayerNorm(embed_dim),
            attn_unit,
            nn.Dropout(p=dropout)
        )
        self.pre_norm_ffn = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(in_features=embed_dim, out_features=ffn_latent_dim, bias=True),
            nn.SiLU(),
            nn.Dropout(p=ffn_dropout),
            nn.Linear(in_features=ffn_latent_dim, out_features=embed_dim, bias=True),
            nn.Dropout(p=dropout)
        )
        self.embed_dim = embed_dim
        self.ffn_dim = ffn_latent_dim
        self.ffn_dropout = ffn_dropout
        self.std_dropout = dropout

    def forward(self, x: Tensor) -> Tensor:
        # multi-head attention
        res = x
        x = self.pre_norm_mha(x)
        x = x + res
        # feed forward network
        x = x + self.pre_norm_ffn(x)
        return x


class LinearSelfAttention(nn.Module):
    """
    This layer applies a self-attention with linear complexity, as described in `MobileViTv2 <https://arxiv.org/abs/2206.02680>`_ paper.
    This layer can be used for self- as well as cross-attention.
    Args:
        embed_dim (int): :math:`C` from an expected input of size :math:`(B, C, P, N)`
        attn_dropout (Optional[float]): Dropout value for context scores. Default: 0.0
        bias (Optional[bool]): Use bias in learnable layers. Default: True
    Shape:
        - Input: :math:`(B, C, P, N)` where :math:`B` is the batch size, :math:`C` is the input channels,
          :math:`P` is the number of pixels in the patch, and :math:`N` is the number of patches
        - Output: same as the input
    .. note::
        For MobileViTv2, we unfold the feature map [B, C, H, W] into [B, C, P, N] where P is the number of pixels
        in a patch and N is the number of patches. Because channel is the first dimension in this unfolded tensor,
        we use point-wise convolution (instead of a linear layer). This avoids a transpose operation (which may be
        expensive on resource-constrained devices) that may be required to convert the unfolded tensor from
        channel-first to channel-last format in case of a linear layer.
    """

    def __init__(self,
                 embed_dim: int,
                 attn_dropout: Optional[float] = 0.0,
                 bias: Optional[bool] = True,
                 *args,
                 **kwargs) -> None:
        super().__init__()
        self.attn_dropout = nn.Dropout(p=attn_dropout)
        # One point-wise conv produces query (1 channel), key (d channels) and value (d channels)
        self.qkv_proj = ConvLayer(
            in_channels=embed_dim,
            out_channels=embed_dim * 2 + 1,
            kernel_size=1,
            bias=bias,
            use_norm=False,
            use_act=False
        )
        self.out_proj = ConvLayer(
            in_channels=embed_dim,
            out_channels=embed_dim,
            bias=bias,
            kernel_size=1,
            use_norm=False,
            use_act=False,
        )
        self.embed_dim = embed_dim

    def forward(self, x: Tensor, x_prev: Optional[Tensor] = None, *args, **kwargs) -> Tensor:
        if x_prev is None:
            return self._forward_self_attn(x, *args, **kwargs)
        else:
            return self._forward_cross_attn(x, x_prev, *args, **kwargs)

    def _forward_self_attn(self, x: Tensor, *args, **kwargs) -> Tensor:
        # [B, C, P, N] --> [B, 1 + 2d, P, N]
        qkv = self.qkv_proj(x)
        # [B, 1 + 2d, P, N] --> [B, 1, P, N], [B, d, P, N], [B, d, P, N]
        # Query --> [B, 1, P, N]
        # Key, value --> [B, d, P, N]
        query, key, value = torch.split(
            qkv, [1, self.embed_dim, self.embed_dim], dim=1
        )
        # Softmax over the patch dimension N
        context_scores = F.softmax(query, dim=-1)
        context_scores = self.attn_dropout(context_scores)
        # Compute context vector
        # [B, d, P, N] x [B, 1, P, N] -> [B, d, P, N]
        context_vector = key * context_scores
        # [B, d, P, N] --> [B, d, P, 1]
        context_vector = context_vector.sum(dim=-1, keepdim=True)
        # combine context vector with values
        # [B, d, P, N] * [B, d, P, 1] --> [B, d, P, N]
        out = F.relu(value) * context_vector.expand_as(value)
        out = self.out_proj(out)
        return out

    def _forward_cross_attn(
            self, x: Tensor, x_prev: Optional[Tensor] = None, *args, **kwargs) -> Tensor:
        # x --> [B, C, P, N]
        # x_prev --> [B, C, P, M]
        batch_size, in_dim, kv_patch_area, kv_num_patches = x.shape
        # The listing as published read these sizes from x again, which made the check below
        # trivially true; they are taken from x_prev here so the assertion is meaningful.
        q_patch_area, q_num_patches = x_prev.shape[-2:]
        assert (
            kv_patch_area == q_patch_area
        ), "The number of pixels in a patch for the query and key-value tensors must be the same"
        # compute query, key, and value
        # [B, C, P, M] --> [B, 1 + d, P, M]
        qk = F.conv2d(
            x_prev,
            weight=self.qkv_proj.block.conv.weight[: self.embed_dim + 1, ...],
            bias=self.qkv_proj.block.conv.bias[: self.embed_dim + 1, ...],
        )
        # [B, 1 + d, P, M] --> [B, 1, P, M], [B, d, P, M]
        query, key = torch.split(qk, split_size_or_sections=[1, self.embed_dim], dim=1)
        # [B, C, P, N] --> [B, d, P, N]
        value = F.conv2d(
            x,
            weight=self.qkv_proj.block.conv.weight[self.embed_dim + 1:, ...],
            bias=self.qkv_proj.block.conv.bias[self.embed_dim + 1:, ...],
        )
        context_scores = F.softmax(query, dim=-1)
        context_scores = self.attn_dropout(context_scores)
        context_vector = key * context_scores
        context_vector = torch.sum(context_vector, dim=-1, keepdim=True)
        out = F.relu(value) * context_vector.expand_as(value)
        out = self.out_proj(out)
        return out


class LinearAttnFFN(nn.Module):
    """
    This class defines the pre-norm transformer encoder with linear self-attention in `MobileViTv2 <https://arxiv.org/abs/2206.02680>`_ paper
    Args:
        embed_dim (int): :math:`C_{in}` from an expected input of size :math:`(B, C_{in}, P, N)`
        ffn_latent_dim (int): Inner dimension of the FFN
        attn_dropout (Optional[float]): Dropout rate for attention in multi-head attention. Default: 0.0
        dropout (Optional[float]): Dropout rate. Default: 0.1
        ffn_dropout (Optional[float]): Dropout between FFN layers. Default: 0.0
    Shape:
        - Input: :math:`(B, C_{in}, P, N)` where :math:`B` is batch size, :math:`C_{in}` is input embedding dim,
          :math:`P` is number of pixels in a patch, and :math:`N` is number of patches,
        - Output: same shape as the input
    """

    def __init__(
        self,
        embed_dim: int,
        ffn_latent_dim: int,
        attn_dropout: Optional[float] = 0.0,
        dropout: Optional[float] = 0.1,
        ffn_dropout: Optional[float] = 0.0,
        *args,
        **kwargs
    ) -> None:
        super().__init__()
        attn_unit = LinearSelfAttention(
            embed_dim=embed_dim, attn_dropout=attn_dropout, bias=True
        )
        # GroupNorm with a single group normalizes over (C, P, N) per sample
        self.pre_norm_attn = nn.Sequential(
            nn.GroupNorm(num_channels=embed_dim, num_groups=1),
            attn_unit,
            nn.Dropout(p=dropout)
        )
        self.pre_norm_ffn = nn.Sequential(
            nn.GroupNorm(num_channels=embed_dim, num_groups=1),
            ConvLayer(
                in_channels=embed_dim,
                out_channels=ffn_latent_dim,
                kernel_size=1,
                stride=1,
                bias=True,
                use_norm=False,
                use_act=True,
            ),
            nn.Dropout(p=ffn_dropout),
            ConvLayer(
                in_channels=ffn_latent_dim,
                out_channels=embed_dim,
                kernel_size=1,
                stride=1,
                bias=True,
                use_norm=False,
                use_act=False,
            ),
            nn.Dropout(p=dropout)
        )
        self.embed_dim = embed_dim
        self.ffn_dim = ffn_latent_dim
        self.ffn_dropout = ffn_dropout
        self.std_dropout = dropout

    def forward(self,
                x: Tensor, x_prev: Optional[Tensor] = None, *args, **kwargs
                ) -> Tensor:
        if x_prev is None:
            # self-attention
            x = x + self.pre_norm_attn(x)
        else:
            # cross-attention
            res = x
            x = self.pre_norm_attn[0](x)  # norm
            x = self.pre_norm_attn[1](x, x_prev)  # attn
            x = self.pre_norm_attn[2](x)  # drop
            x = x + res  # residual
        x = x + self.pre_norm_ffn(x)
        return x


class Identity(nn.Module):
    """
    This is a place-holder and returns the same tensor.
    """

    def __init__(self):
        super(Identity, self).__init__()

    def forward(self, x: Tensor) -> Tensor:
        return x

    def profile_module(self, x: Tensor) -> Tuple[Tensor, float, float]:
        return x, 0.0, 0.0


class InvertedResidual(nn.Module):
    """
    This class implements the inverted residual block, as described in `MobileNetv2 <https://arxiv.org/abs/1801.04381>`_ paper
    Args:
        in_channels (int): :math:`C_{in}` from an expected input of size :math:`(N, C_{in}, H_{in}, W_{in})`
        out_channels (int): :math:`C_{out}` from an expected output of size :math:`(N, C_{out}, H_{out}, W_{out})`
        stride (int): Use convolutions with a stride. Default: 1
        expand_ratio (Union[int, float]): Expand the input channels by this factor in depth-wise conv
        skip_connection (Optional[bool]): Use skip-connection. Default: True
    Shape:
        - Input: :math:`(N, C_{in}, H_{in}, W_{in})`
        - Output: :math:`(N, C_{out}, H_{out}, W_{out})`
    .. note::
        If `in_channels != out_channels` or `stride > 1`, the skip connection is not used
    """

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        stride: int,
        expand_ratio: Union[int, float],  # expansion factor: how many times the hidden layer widens the channels
        skip_connection: Optional[bool] = True,  # whether to use a skip connection
    ) -> None:
        assert stride in [1, 2]
        hidden_dim = make_divisible(int(round(in_channels * expand_ratio)), 8)
        super().__init__()
        block = nn.Sequential()
        if expand_ratio != 1:
            block.add_module(
                name="exp_1x1",
                module=ConvLayer(
                    in_channels=in_channels,
                    out_channels=hidden_dim,
                    kernel_size=1
                ),
            )
        block.add_module(
            name="conv_3x3",
            module=ConvLayer(
                in_channels=hidden_dim,
                out_channels=hidden_dim,
                stride=stride,
                kernel_size=3,
                groups=hidden_dim  # depth-wise convolution
            ),
        )
        block.add_module(
            name="red_1x1",
            module=ConvLayer(
                in_channels=hidden_dim,
                out_channels=out_channels,
                kernel_size=1,
                use_act=False,  # no activation after the final projection layer
                use_norm=True,
            ),
        )
        self.block = block
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.exp = expand_ratio
        self.stride = stride
        self.use_res_connect = (
            self.stride == 1 and in_channels == out_channels and skip_connection
        )

    def forward(self, x: Tensor, *args, **kwargs) -> Tensor:
        if self.use_res_connect:  # add the residual connection when shapes match
            return x + self.block(x)
        else:
            return self.block(x)


class MobileViTBlock(nn.Module):
    """
    This class defines the `MobileViT block <https://arxiv.org/abs/2110.02178?context=cs.LG>`_
    Args:
        in_channels (int): :math:`C_{in}` from an expected input of size :math:`(N, C_{in}, H, W)`
        transformer_dim (int): Input dimension to the transformer unit
        ffn_dim (int): Dimension of the FFN block
        n_transformer_blocks (int): Number of transformer blocks. Default: 2
        head_dim (int): Head dimension in the multi-head attention. Default: 32
        attn_dropout (float): Dropout in multi-head attention. Default: 0.0
        dropout (float): Dropout rate. Default: 0.0
        ffn_dropout (float): Dropout between FFN layers in transformer. Default: 0.0
        patch_h (int): Patch height for unfolding operation. Default: 8
        patch_w (int): Patch width for unfolding operation. Default: 8
        conv_ksize (int): Kernel size to learn local representations in MobileViT block. Default: 3
    """

    def __init__(
        self,
        in_channels: int,  # number of input channels
        transformer_dim: int,  # embedding dimension of each token fed to the transformer
        ffn_dim: int,  # hidden dimension of the feed-forward network
        n_transformer_blocks: int = 2,  # number of transformer blocks
        head_dim: int = 32,
        attn_dropout: float = 0.0,
        dropout: float = 0.0,
        ffn_dropout: float = 0.0,
        patch_h: int = 8,
        patch_w: int = 8,
        conv_ksize: Optional[int] = 3,  # convolution kernel size
        *args,
        **kwargs
    ) -> None:
        super().__init__()
        conv_3x3_in = ConvLayer(
            in_channels=in_channels,
            out_channels=in_channels,
            kernel_size=conv_ksize,
            stride=1
        )
        conv_1x1_in = ConvLayer(
            in_channels=in_channels,
            out_channels=transformer_dim,
            kernel_size=1,
            stride=1,
            use_norm=False,
            use_act=False
        )
        conv_1x1_out = ConvLayer(
            in_channels=transformer_dim,
            out_channels=in_channels,
            kernel_size=1,
            stride=1
        )
        conv_3x3_out = ConvLayer(
            in_channels=2 * in_channels,
            out_channels=in_channels,
            kernel_size=conv_ksize,
            stride=1
        )
        self.local_rep = nn.Sequential()
        self.local_rep.add_module(name="conv_3x3", module=conv_3x3_in)
        self.local_rep.add_module(name="conv_1x1", module=conv_1x1_in)
        assert transformer_dim % head_dim == 0  # transformer_dim must be divisible by head_dim
        num_heads = transformer_dim // head_dim
        global_rep = [
            TransformerEncoder(
                embed_dim=transformer_dim,
                ffn_latent_dim=ffn_dim,
                num_heads=num_heads,
                attn_dropout=attn_dropout,
                dropout=dropout,
                ffn_dropout=ffn_dropout
            )
            for _ in range(n_transformer_blocks)
        ]
        global_rep.append(nn.LayerNorm(transformer_dim))
        self.global_rep = nn.Sequential(*global_rep)
        self.conv_proj = conv_1x1_out
        self.fusion = conv_3x3_out
        self.patch_h = patch_h
        self.patch_w = patch_w
        self.patch_area = self.patch_w * self.patch_h
        self.cnn_in_dim = in_channels
        self.cnn_out_dim = transformer_dim
        self.n_heads = num_heads
        self.ffn_dim = ffn_dim
        self.dropout = dropout
        self.attn_dropout = attn_dropout
        self.ffn_dropout = ffn_dropout
        self.n_blocks = n_transformer_blocks
        self.conv_ksize = conv_ksize

    def unfolding(self, x: Tensor) -> Tuple[Tensor, Dict]:
        patch_w, patch_h = self.patch_w, self.patch_h
        patch_area = patch_w * patch_h
        batch_size, in_channels, orig_h, orig_w = x.shape
        # Round H and W up to multiples of the patch size; used below to decide whether to interpolate
        new_h = int(math.ceil(orig_h / self.patch_h) * self.patch_h)
        new_w = int(math.ceil(orig_w / self.patch_w) * self.patch_w)
        interpolate = False
        if new_w != orig_w or new_h != orig_h:
            # Note: Padding can be done, but then it needs to be handled in attention function.
            x = F.interpolate(x, size=(new_h, new_w), mode="bilinear", align_corners=False)
            interpolate = True
        # number of patches along width and height
        num_patch_w = new_w // patch_w  # n_w
        num_patch_h = new_h // patch_h  # n_h
        num_patches = num_patch_h * num_patch_w  # N
        # [B, C, H, W] -> [B * C * n_h, p_h, n_w, p_w]
        x = x.reshape(batch_size * in_channels * num_patch_h, patch_h, num_patch_w, patch_w)
        # [B * C * n_h, p_h, n_w, p_w] -> [B * C * n_h, n_w, p_h, p_w]
        x = x.transpose(1, 2)
        # [B * C * n_h, n_w, p_h, p_w] -> [B, C, N, P] where P = p_h * p_w and N = n_h * n_w
        x = x.reshape(batch_size, in_channels, num_patches, patch_area)
        # [B, C, N, P] -> [B, P, N, C]
        x = x.transpose(1, 3)
        # [B, P, N, C] -> [BP, N, C]
        x = x.reshape(batch_size * patch_area, num_patches, -1)
        info_dict = {
            "orig_size": (orig_h, orig_w),
            "batch_size": batch_size,
            "interpolate": interpolate,
            "total_patches": num_patches,
            "num_patches_w": num_patch_w,
            "num_patches_h": num_patch_h,
        }
        return x, info_dict

    def folding(self, x: Tensor, info_dict: Dict) -> Tensor:
        n_dim = x.dim()
        assert n_dim == 3, "Tensor should be of shape BPxNxC. Got: {}".format(
            x.shape
        )
        # [BP, N, C] --> [B, P, N, C]
        # Make x contiguous so that it can be reshaped with view()
        x = x.contiguous().view(
            # first dimension: batch size
            info_dict["batch_size"],
            # second dimension: number of pixels per patch
            self.patch_area,
            # third dimension: total number of patches per image
            info_dict["total_patches"],
            # last dimension (channels) is left unchanged
            -1
        )
        batch_size, pixels, num_patches, channels = x.size()
        num_patch_h = info_dict["num_patches_h"]
        num_patch_w = info_dict["num_patches_w"]
        # [B, P, N, C] -> [B, C, N, P]
        x = x.transpose(1, 3)
        # [B, C, N, P] -> [B*C*n_h, n_w, p_h, p_w]
        x = x.reshape(batch_size * channels * num_patch_h, num_patch_w, self.patch_h, self.patch_w)
        # [B*C*n_h, n_w, p_h, p_w] -> [B*C*n_h, p_h, n_w, p_w]
        x = x.transpose(1, 2)
        # [B*C*n_h, p_h, n_w, p_w] -> [B, C, H, W]
        x = x.reshape(batch_size, channels, num_patch_h * self.patch_h, num_patch_w * self.patch_w)
  893. if info_dict["interpolate"]:
  894. x = F.interpolate(
  895. x,
  896. size=info_dict["orig_size"],
  897. mode="bilinear",
  898. align_corners=False,
  899. )
  900. return x
  901. def forward(self, x: Tensor) -> Tensor:
  902. res = x
  903. fm = self.local_rep(x) # [4, 64, 28, 28]
  904. # convert feature map to patches
  905. patches, info_dict = self.unfolding(fm) # [16, 196, 64]
  906. # print(patches.shape)
  907. # learn global representations
  908. for transformer_layer in self.global_rep:
  909. patches = transformer_layer(patches)
  910. # [B x Patch x Patches x C] -> [B x C x Patches x Patch]
  911. # Patch 所有的条状Patch的数量
  912. # Patches 每个条状Patch的长度
  913. fm = self.folding(x=patches, info_dict=info_dict)
  914. fm = self.conv_proj(fm)
  915. fm = self.fusion(torch.cat((res, fm), dim=1))
  916. return fm

class MobileViTBlockV2(nn.Module):
    """
    This class defines the `MobileViTv2 <https://arxiv.org/abs/2206.02680>`_ block

    Args:
        in_channels (int): :math:`C_{in}` from an expected input of size :math:`(N, C_{in}, H, W)`
        attn_unit_dim (int): Input dimension to the attention unit
        ffn_multiplier (int, float or sequence): Expand the input dimensions by this factor in FFN. Default is 2.
        n_transformer_blocks (Optional[int]): Number of attention units. Default: 2
        attn_dropout (Optional[float]): Dropout in the attention layer. Default: 0.0
        dropout (Optional[float]): Dropout rate. Default: 0.0
        ffn_dropout (Optional[float]): Dropout between FFN layers in transformer. Default: 0.0
        patch_h (Optional[int]): Patch height for unfolding operation. Default: 8
        patch_w (Optional[int]): Patch width for unfolding operation. Default: 8
        conv_ksize (Optional[int]): Kernel size to learn local representations in MobileViT block. Default: 3
    """

    def __init__(self,
                 in_channels: int,
                 attn_unit_dim: int,
                 ffn_multiplier: Optional[Union[Sequence[Union[int, float]], int, float]] = 2.0,
                 n_transformer_blocks: Optional[int] = 2,
                 attn_dropout: Optional[float] = 0.0,
                 dropout: Optional[float] = 0.0,
                 ffn_dropout: Optional[float] = 0.0,
                 patch_h: Optional[int] = 8,
                 patch_w: Optional[int] = 8,
                 conv_ksize: Optional[int] = 3,
                 *args,
                 **kwargs) -> None:
        super(MobileViTBlockV2, self).__init__()
        cnn_out_dim = attn_unit_dim

        # depthwise convolution (groups == in_channels) for local features
        conv_3x3_in = ConvLayer(
            in_channels=in_channels,
            out_channels=in_channels,
            kernel_size=conv_ksize,
            stride=1,
            use_norm=True,
            use_act=True,
            groups=in_channels,
        )
        # pointwise convolution projecting to the attention dimension
        conv_1x1_in = ConvLayer(
            in_channels=in_channels,
            out_channels=cnn_out_dim,
            kernel_size=1,
            stride=1,
            use_norm=False,
            use_act=False,
        )

        self.local_rep = nn.Sequential(conv_3x3_in, conv_1x1_in)

        self.global_rep, attn_unit_dim = self._build_attn_layer(
            d_model=attn_unit_dim,
            ffn_mult=ffn_multiplier,
            n_layers=n_transformer_blocks,
            attn_dropout=attn_dropout,
            dropout=dropout,
            ffn_dropout=ffn_dropout,
        )

        # pointwise convolution projecting back to the input channel count
        self.conv_proj = ConvLayer(
            in_channels=cnn_out_dim,
            out_channels=in_channels,
            kernel_size=1,
            stride=1,
            use_norm=True,
            use_act=False,
        )

        self.patch_h = patch_h
        self.patch_w = patch_w
        self.patch_area = self.patch_w * self.patch_h

        self.cnn_in_dim = in_channels
        self.cnn_out_dim = cnn_out_dim
        self.transformer_in_dim = attn_unit_dim
        self.dropout = dropout
        self.attn_dropout = attn_dropout
        self.ffn_dropout = ffn_dropout
        self.n_blocks = n_transformer_blocks
        self.conv_ksize = conv_ksize

    def _build_attn_layer(self,
                          d_model: int,
                          ffn_mult: Union[Sequence, int, float],
                          n_layers: int,
                          attn_dropout: float,
                          dropout: float,
                          ffn_dropout: float,
                          attn_norm_layer: str = "layer_norm_2d",
                          *args,
                          **kwargs) -> Tuple[nn.Module, int]:
        if isinstance(ffn_mult, Sequence) and len(ffn_mult) == 2:
            ffn_dims = (
                np.linspace(ffn_mult[0], ffn_mult[1], n_layers, dtype=float) * d_model
            )
        elif isinstance(ffn_mult, Sequence) and len(ffn_mult) == 1:
            ffn_dims = [ffn_mult[0] * d_model] * n_layers
        elif isinstance(ffn_mult, (int, float)):
            ffn_dims = [ffn_mult * d_model] * n_layers
        else:
            raise NotImplementedError

        # round each FFN width down to a multiple of 16
        ffn_dims = [int((d // 16) * 16) for d in ffn_dims]

        global_rep = [
            LinearAttnFFN(
                embed_dim=d_model,
                ffn_latent_dim=ffn_dims[block_idx],
                attn_dropout=attn_dropout,
                dropout=dropout,
                ffn_dropout=ffn_dropout,
            )
            for block_idx in range(n_layers)
        ]
        global_rep.append(nn.GroupNorm(1, d_model))

        return nn.Sequential(*global_rep), d_model

    def forward(
        self, x: Union[Tensor, Tuple[Tensor]], *args, **kwargs
    ) -> Union[Tensor, Tuple[Tensor, Tensor]]:
        if isinstance(x, tuple) and len(x) == 2:
            # for spatio-temporal data (e.g., videos)
            return self.forward_temporal(x=x[0], x_prev=x[1])
        elif isinstance(x, Tensor):
            # for image data
            return self.forward_spatial(x)
        else:
            raise NotImplementedError

    def forward_spatial(self, x: Tensor, *args, **kwargs) -> Tensor:
        x = self.resize_input_if_needed(x)

        # learn local representations, then global representations on all patches
        fm = self.local_rep(x)
        patches, output_size = self.unfolding_pytorch(fm)
        # print(f"original x.shape = {patches.shape}")
        patches = self.global_rep(patches)

        # [B x Patch x Patches x C] --> [B x C x Patches x Patch]
        fm = self.folding_pytorch(patches=patches, output_size=output_size)
        fm = self.conv_proj(fm)

        return fm

    def forward_temporal(
        self, x: Tensor, x_prev: Optional[Tensor] = None
    ) -> Union[Tensor, Tuple[Tensor, Tensor]]:
        x = self.resize_input_if_needed(x)

        fm = self.local_rep(x)
        patches, output_size = self.unfolding_pytorch(fm)

        for global_layer in self.global_rep:
            if isinstance(global_layer, LinearAttnFFN):
                patches = global_layer(x=patches, x_prev=x_prev)
            else:
                patches = global_layer(patches)

        fm = self.folding_pytorch(patches=patches, output_size=output_size)
        fm = self.conv_proj(fm)

        return fm, patches

    def resize_input_if_needed(self, x: Tensor) -> Tensor:
        # print(f"original x.shape = {x.shape}")
        batch_size, in_channels, orig_h, orig_w = x.shape
        if orig_h % self.patch_h != 0 or orig_w % self.patch_w != 0:
            new_h = int(math.ceil(orig_h / self.patch_h) * self.patch_h)
            new_w = int(math.ceil(orig_w / self.patch_w) * self.patch_w)
            x = F.interpolate(
                x, size=(new_h, new_w), mode="bilinear", align_corners=True
            )
        # print(f"changed x.shape = {x.shape}")
        return x

    def unfolding_pytorch(self, feature_map: Tensor) -> Tuple[Tensor, Tuple[int, int]]:
        batch_size, in_channels, img_h, img_w = feature_map.shape

        # [B, C, H, W] --> [B, C, P, N]
        patches = F.unfold(
            feature_map,
            kernel_size=(self.patch_h, self.patch_w),
            stride=(self.patch_h, self.patch_w),
        )
        patches = patches.reshape(
            batch_size, in_channels, self.patch_h * self.patch_w, -1
        )

        return patches, (img_h, img_w)

    def folding_pytorch(self, patches: Tensor, output_size: Tuple[int, int]) -> Tensor:
        batch_size, in_dim, patch_size, n_patches = patches.shape

        # [B, C, P, N] --> [B, C * P, N]
        patches = patches.reshape(batch_size, in_dim * patch_size, n_patches)

        feature_map = F.fold(
            patches,
            output_size=output_size,
            kernel_size=(self.patch_h, self.patch_w),
            stride=(self.patch_h, self.patch_w),
        )

        return feature_map


class MobileViTV2(nn.Module):
    """
    This class defines the `MobileViTv2 <https://arxiv.org/abs/2206.02680>`_ architecture
    """

    def __init__(self, model_cfg: Dict, num_classes: int = 1000):
        super().__init__()

        image_channels = model_cfg["layer0"]["img_channels"]
        out_channels = model_cfg["layer0"]["out_channels"]

        self.model_conf_dict = dict()
        self.conv_1 = ConvLayer(
            in_channels=image_channels,
            out_channels=out_channels,
            kernel_size=3,
            stride=2,
            use_norm=True,
            use_act=True,
        )
        self.model_conf_dict["conv1"] = {"in": image_channels, "out": out_channels}

        in_channels = out_channels
        self.layer_1, out_channels = self._make_layer(
            input_channel=in_channels, cfg=model_cfg["layer1"]
        )
        self.model_conf_dict["layer1"] = {"in": in_channels, "out": out_channels}

        in_channels = out_channels
        self.layer_2, out_channels = self._make_layer(
            input_channel=in_channels, cfg=model_cfg["layer2"]
        )
        self.model_conf_dict["layer2"] = {"in": in_channels, "out": out_channels}

        in_channels = out_channels
        self.layer_3, out_channels = self._make_layer(
            input_channel=in_channels, cfg=model_cfg["layer3"]
        )
        self.model_conf_dict["layer3"] = {"in": in_channels, "out": out_channels}

        in_channels = out_channels
        self.layer_4, out_channels = self._make_layer(
            input_channel=in_channels, cfg=model_cfg["layer4"]
        )
        self.model_conf_dict["layer4"] = {"in": in_channels, "out": out_channels}

        in_channels = out_channels
        self.layer_5, out_channels = self._make_layer(
            input_channel=in_channels, cfg=model_cfg["layer5"]  # may be frozen for fine-tuning
        )
        self.model_conf_dict["layer5"] = {"in": in_channels, "out": out_channels}

        self.conv_1x1_exp = Identity()
        self.model_conf_dict["exp_before_cls"] = {
            "in": out_channels,
            "out": out_channels,
        }

        self.classifier = nn.Sequential()  # may be frozen for fine-tuning
        self.classifier.add_module(name="global_pool", module=nn.AdaptiveAvgPool2d(1))
        self.classifier.add_module(name="flatten", module=nn.Flatten())
        self.classifier.add_module(name="fc", module=nn.Linear(in_features=out_channels, out_features=num_classes))

        self.apply(self.init_parameters)

        # Channel count of each returned feature map, read by parse_model (see section 4).
        self.width_list = [i.size(1) for i in self.forward(torch.randn(1, 3, 640, 640))]

    def _make_layer(self, input_channel, cfg: Dict, dilate: Optional[bool] = False) -> Tuple[nn.Sequential, int]:
        block_type = cfg.get("block_type", "mobilevit")
        if block_type.lower() == "mobilevit":
            return self._make_mit_layer(
                input_channel=input_channel, cfg=cfg
            )
        else:
            return self._make_mobilenet_layer(
                input_channel=input_channel, cfg=cfg
            )

    @staticmethod
    def _make_mobilenet_layer(
            input_channel: int, cfg: Dict) -> Tuple[nn.Sequential, int]:
        output_channels = cfg.get("out_channels")
        num_blocks = cfg.get("num_blocks", 2)
        expand_ratio = cfg.get("expand_ratio", 4)
        block = []

        for i in range(num_blocks):
            # only the first block of the stage may downsample
            stride = cfg.get("stride", 1) if i == 0 else 1

            layer = InvertedResidual(
                in_channels=input_channel,
                out_channels=output_channels,
                stride=stride,
                expand_ratio=expand_ratio
            )
            block.append(layer)
            input_channel = output_channels

        return nn.Sequential(*block), input_channel

    @staticmethod
    def _make_mit_layer(input_channel: int, cfg: Dict) -> Tuple[nn.Sequential, int]:
        block = []
        stride = cfg.get("stride", 1)

        if stride == 2:
            layer = InvertedResidual(
                in_channels=input_channel,
                out_channels=cfg.get("out_channels"),
                stride=stride,
                expand_ratio=cfg.get("mv_expand_ratio", 4),
            )

            block.append(layer)
            input_channel = cfg.get("out_channels")

        attn_unit_dim = cfg["attn_unit_dim"]
        ffn_multiplier = cfg.get("ffn_multiplier")

        # Note: MobileViTBlockV2 has no `out_channels` or `ff_dropout` parameter
        # (its FFN dropout argument is named `ffn_dropout`), so these two keyword
        # arguments are absorbed by **kwargs and have no effect.
        block.append(
            MobileViTBlockV2(
                in_channels=input_channel,
                out_channels=cfg.get("out_channels"),
                attn_unit_dim=attn_unit_dim,
                ffn_multiplier=ffn_multiplier,
                n_transformer_blocks=cfg.get("attn_blocks", 1),
                patch_h=cfg.get("patch_h", 2),
                patch_w=cfg.get("patch_w", 2),
                dropout=cfg.get("dropout", 0.1),
                attn_dropout=cfg.get("attn_dropout", 0.1),
                ff_dropout=cfg.get("ff_dropout", 0.1),
                conv_ksize=3,
            )
        )

        return nn.Sequential(*block), input_channel

    @staticmethod
    def init_parameters(m):
        if isinstance(m, nn.Conv2d):
            if m.weight is not None:
                nn.init.kaiming_normal_(m.weight, mode="fan_out")
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, (nn.GroupNorm, nn.BatchNorm2d, nn.LayerNorm)):
            if m.weight is not None:
                nn.init.ones_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, (nn.Linear,)):
            if m.weight is not None:
                nn.init.trunc_normal_(m.weight, mean=0.0, std=0.02)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        else:
            pass

    def forward(self, x: torch.Tensor) -> Sequence[torch.Tensor]:
        # Keep one feature map per spatial resolution: a later stage with the
        # same H x W overwrites the earlier entry.
        unique_tensors = {}
        stages = (self.conv_1, self.layer_1, self.layer_2,
                  self.layer_3, self.layer_4, self.layer_5)
        for stage in stages:
            x = stage(x)
            height, width = x.shape[2], x.shape[3]
            unique_tensors[(height, width)] = x
        # Return the last four (lowest-resolution) feature maps as a list.
        result_list = list(unique_tensors.values())[-4:]
        return result_list


def mobile_vit2_xx_small(num_classes: int = 1000):
    # pretrain weight link
    # https://docs-assets.developer.apple.com/ml-research/models/cvnets/classification/mobilevit2_xxs.pt
    config = get_config("2xx_small")
    m = MobileViTV2(config, num_classes=num_classes)
    return m


if __name__ == "__main__":
    # Generating Sample image
    image_size = (1, 3, 640, 640)
    image = torch.rand(*image_size)

    # Model
    model = mobile_vit2_xx_small()

    out = model(image)
    print(out)
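
To make the shape comments in `unfolding_pytorch` concrete, the sketch below computes the `[B, C, P, N]` shape that `F.unfold` followed by the reshape would produce. It is plain Python with no PyTorch dependency. The helper name `unfolded_shape` and the sample sizes are illustrative assumptions and are not part of the original code.

```python
def unfolded_shape(batch, channels, height, width, patch_h, patch_w):
    """Shape of the patch tensor produced by MobileViTBlockV2.unfolding_pytorch.

    Assumes height and width are already multiples of the patch size,
    which resize_input_if_needed ensures before unfolding.
    """
    if height % patch_h or width % patch_w:
        raise ValueError("feature map must be divisible by the patch size")
    pixels_per_patch = patch_h * patch_w                    # P
    num_patches = (height // patch_h) * (width // patch_w)  # N
    return (batch, channels, pixels_per_patch, num_patches)


# Hypothetical example: a 64-channel 80x80 feature map with 2x2 patches
print(unfolded_shape(1, 64, 80, 80, 2, 2))  # (1, 64, 4, 1600)
```

Each of the P pixel positions inside a patch becomes its own row, so the attention layers mix information across the N patches rather than within a single patch.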

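4. Adding MobileViTv2 step by step

4.1 Modification 1

The first step is to create the files. Under the ultralytics/nn folder, create a directory named 'Addmodules', then create a new .py file inside it and paste the core code into it.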

4bc3564d8ca046abbba7a8ce935b31fa.png


4.2 Modification 2

Second, create a new file named '__init__.py' in the same directory and import our module inside it, as shown in the figure below.

a06c3446106a45a597556e9500007a2a.png
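
Since the figures are not reproduced here, the following sketch shows one way the layout from modifications 1 and 2 could be created. The file name `MobileViTv2.py` and the wildcard import line are assumptions (the original only shows them in screenshots); adjust them to match your own file name.

```python
from pathlib import Path
import tempfile


def scaffold_addmodules(repo_root: Path, module_file: str = "MobileViTv2.py") -> Path:
    """Create ultralytics/nn/Addmodules with an __init__.py that re-exports the module."""
    pkg = repo_root / "ultralytics" / "nn" / "Addmodules"
    pkg.mkdir(parents=True, exist_ok=True)
    # Placeholder file: paste the core code from section 3 into it.
    (pkg / module_file).touch()
    stem = Path(module_file).stem
    # Assumed import style; it relies on the module defining __all__.
    (pkg / "__init__.py").write_text(f"from .{stem} import *\n", encoding="utf-8")
    return pkg


# Demo in a throwaway directory so that no real checkout is touched
with tempfile.TemporaryDirectory() as tmp:
    pkg = scaffold_addmodules(Path(tmp))
    print(sorted(p.name for p in pkg.iterdir()))
```

In a real project, `repo_root` would be the root of your Ultralytics checkout rather than a temporary directory.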


4.3 Modification 3

Third, open the file 'ultralytics/nn/tasks.py', then import and register our module there.

67b28bda87e44d3285f0241acd165256.png


4.4 Modification 4

Add the following two lines of code.

1655de23b1834dfca4f304336f0f2c19.png


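4.5 Modification 5

Go to roughly line 700 and something (see the figure for the exact location) and modify the code as shown, adding the part inside the red box. Note that the entry has no parentheses; it is only the function name.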

e4599befab4e446ba1468f45ac258505.png

elif m in {mobile_vit2_xx_small}:  # add your own backbone functions to this set; the rest stays the same
    m = m(*args)
    c2 = m.width_list  # list of output channels
    backbone = True


4.6 Modification 6

Both red boxes below need to be changed.

dbfbc13f92c647ef976876222cb2280e.png

if isinstance(c2, list):
    m_ = m
    m_.backbone = True
else:
    m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)  # module
    t = str(m)[8:-2].replace('__main__.', '')  # module type
m.np = sum(x.numel() for x in m_.parameters())  # number params
m_.i, m_.f, m_.type = i + 4 if backbone else i, f, t  # attach index, 'from' index, type
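
The expression `i + 4 if backbone else i` shifts every layer index by 4 once the backbone has been seen, because the single backbone entry in the YAML occupies slots 0 to 4. The sketch below reproduces that arithmetic for the first few layers of the YAML in section 5.1. The helper `effective_index` is an illustrative name, not an Ultralytics function.

```python
def effective_index(yaml_index: int, backbone_seen: bool, offset: int = 4) -> int:
    """Mirror of `i + 4 if backbone else i` from the modified parse_model."""
    return yaml_index + offset if backbone_seen else yaml_index


# First layers of the YAML in section 5.1, in order (backbone, then head)
layers = ["mobile_vit2_xx_small", "SPPF", "C2PSA", "nn.Upsample", "Concat", "C3k2"]
for yaml_index, name in enumerate(layers):
    # `backbone` is set to True while entry 0 is processed and stays True afterwards
    print(f"{name:<22} yaml index {yaml_index} -> layer index {effective_index(yaml_index, True)}")
```

The printed indices (SPPF at 5, C2PSA at 6, the first C3k2 at 9) agree with the index comments in the YAML file, which is why the head can refer to layers 6, 9, 12, 15 and 18.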


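4.7 Modification 7

The following part also needs to be modified; follow my version exactly.

e715dd894b0b4720ba9fdf31c86f6bff.png

The code is as follows; simply replace the original code with it.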

if verbose:
    LOGGER.info(f'{i:>3}{str(f):>20}{n_:>3}{m.np:10.0f} {t:<45}{str(args):<30}')  # print
save.extend(x % (i + 4 if backbone else i) for x in ([f] if isinstance(f, int) else f) if x != -1)  # append to savelist
layers.append(m_)
if i == 0:
    ch = []
if isinstance(c2, list):
    ch.extend(c2)
    if len(c2) != 5:
        ch.insert(0, 0)
else:
    ch.append(c2)
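
The channel bookkeeping above can be tried in isolation. In this minimal sketch the function name `register_channels` and the channel values are hypothetical; the real `width_list` depends on the MobileViTv2 configuration.

```python
def register_channels(ch, c2):
    """Append a layer's output channels to `ch`, following modification 7."""
    if isinstance(c2, list):
        ch.extend(c2)
        if len(c2) != 5:
            ch.insert(0, 0)  # placeholder so the backbone outputs sit at indices 1-4
    else:
        ch.append(c2)
    return ch


ch = []
register_channels(ch, [64, 128, 256, 512])  # hypothetical backbone width_list
register_channels(ch, 1024)                 # a following single-output layer
print(ch)  # [0, 64, 128, 256, 512, 1024]
```

With the placeholder in slot 0, `ch[f]` for `f` in 1 to 4 returns the channel count of the matching backbone feature map, which is what later layers such as Concat look up.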


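4.8 Modification 8

Modification 8 differs from the previous ones: it changes part of the forward pass and is no longer inside the parse_model method.

You can check the line numbers in the figure. We are still in the same tasks.py file. Several forward methods in this file look very similar, so be careful to pick the right one: it is the one at around line 70. The code is provided below and can be copied and pasted directly. I will make a video about this part when I have time.

The code is as follows: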

def _predict_once(self, x, profile=False, visualize=False, embed=None):
    """
    Perform a forward pass through the network.

    Args:
        x (torch.Tensor): The input tensor to the model.
        profile (bool): Print the computation time of each layer if True, defaults to False.
        visualize (bool): Save the feature maps of the model if True, defaults to False.
        embed (list, optional): A list of feature vectors/embeddings to return.

    Returns:
        (torch.Tensor): The last output of the model.
    """
    y, dt, embeddings = [], [], []  # outputs
    for m in self.model:
        if m.f != -1:  # if not from previous layer
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
        if profile:
            self._profile_one_layer(m, x, dt)
        if hasattr(m, 'backbone'):
            x = m(x)
            if len(x) != 5:  # pad to 5 entries (indices 0-4)
                x.insert(0, None)
            for index, i in enumerate(x):
                if index in self.save:
                    y.append(i)
                else:
                    y.append(None)
            x = x[-1]  # pass the last output on to the next layer
        else:
            x = m(x)  # run
            y.append(x if m.i in self.save else None)  # save output
        if visualize:
            feature_visualization(x, m.type, m.i, save_dir=visualize)
        if embed and m.i in embed:
            embeddings.append(nn.functional.adaptive_avg_pool2d(x, (1, 1)).squeeze(-1).squeeze(-1))  # flatten
            if m.i == max(embed):
                return torch.unbind(torch.cat(embeddings, 1), dim=0)
    return x
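
The backbone branch of `_predict_once` can be illustrated without tensors. In the sketch below strings stand in for feature maps; the labels and the helper name `collect_backbone_outputs` are hypothetical, while the `save` set is the one that the YAML in section 5.1 would produce under the index arithmetic of modification 7.

```python
def collect_backbone_outputs(outputs, save):
    """Pad the backbone outputs to 5 slots and keep only those listed in `save`."""
    outputs = list(outputs)
    if len(outputs) != 5:
        outputs.insert(0, None)  # slot 0 is a placeholder
    y = [feat if idx in save else None for idx, feat in enumerate(outputs)]
    return y, outputs[-1]  # the last feature map feeds the next layer


# 'from' indices referenced by the Concat and Detect layers of the YAML
save = {2, 3, 6, 9, 12, 15, 18}
y, x = collect_backbone_outputs(["P2", "P3", "P4", "P5"], save)
print(y)  # [None, None, 'P3', 'P4', None]
print(x)  # P5
```

Only the feature maps at indices 2 and 3 are kept, since those are the backbone outputs that the head concatenates; the rest are replaced by `None` to save memory.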

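This completes the modifications. There are many details, so be careful not to replace more code than necessary and not to skip any step. Either mistake will make the run fail, and the resulting errors are very hard to track down.


Note: an additional modification

Regular readers know that most of my modifications follow the same steps. This network needs one extra step involving the parameter s: change the s shown below to 640 and the model will run correctly.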

0574667f1dec40dbb0429703364c3b22.png
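
Assuming the `s` in the figure is the side length of the dummy input that the detection model runs through the network to derive its strides (each stride being `s` divided by the height of the corresponding output feature map), the arithmetic looks like the sketch below. The function name and the feature-map heights are illustrative assumptions.

```python
def strides_from_feature_maps(input_size, feature_heights):
    """Stride of each detection level: input size divided by feature-map height."""
    return [input_size / h for h in feature_heights]


# With a 640x640 dummy input and feature maps of height 80, 40 and 20
print(strides_from_feature_maps(640, [80, 40, 20]))  # [8.0, 16.0, 32.0]
```

Using 640 here matches the `torch.randn(1, 3, 640, 640)` input with which `MobileViTV2.__init__` computes `width_list`.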


Fixing the FLOPs printout

Open the file 'ultralytics/utils/torch_utils.py' and modify it as shown in the figure below; otherwise the FLOPs may fail to print.


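Notes

If you get a shape-mismatch error during validation, you can fix the image size of the validation set as follows.

Open the file ultralytics/models/yolo/detect/train.py. In the build_dataset function of the DetectionTrainer class, change the argument rect=mode == 'val' to rect=False.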

51b306b8f9304447ad81470a679377f8.png
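
One plausible source of this mismatch, judging from `resize_input_if_needed` in the core code, is that `MobileViTBlockV2` rounds the feature-map size up to a multiple of the patch size and does not resize it back. The sketch below reproduces that rounding in plain Python; the helper name and the sample sizes are illustrative.

```python
import math


def resized_hw(height, width, patch_h=2, patch_w=2):
    """Feature-map size after the rounding done in resize_input_if_needed."""
    new_h = math.ceil(height / patch_h) * patch_h
    new_w = math.ceil(width / patch_w) * patch_w
    return new_h, new_w


print(resized_hw(20, 20))  # (20, 20): square 640 input at stride 32, unchanged
print(resized_hw(21, 20))  # (22, 20): an odd height gets rounded up
```

With rectangular validation batches the feature-map height or width can be odd, so the backbone output no longer lines up with the upsampled head features. Setting rect=False keeps the inputs square.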


5. The MobileViTv2 yaml file

5.1 The MobileViTv2 yaml file

Training summary for this version: YOLO11-mobileNetV2 summary: 575 layers, 3,022,364 parameters, 3,022,348 gradients, 9.8 GFLOPs

# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs

# The original comment lists the variants "mobile_vit_small, mobile_vit_x_small, mobile_vit_xx_small";
# the core code in section 3 exports mobile_vit2_xx_small, which is used below.
# YOLO11n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, mobile_vit2_xx_small, []] # 0-4 P1/2
  - [-1, 1, SPPF, [1024, 5]] # 5
  - [-1, 2, C2PSA, [1024]] # 6

# YOLO11n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 3], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [512, False]] # 9

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 2], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [256, False]] # 12 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 9], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [512, False]] # 15 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 6], 1, Concat, [1]] # cat head P5
  - [-1, 2, C3k2, [1024, True]] # 18 (P5/32-large)

  - [[12, 15, 18], 1, Detect, [nc]] # Detect(P3, P4, P5)

5.2 Training script

You can copy my training script and run it.

import warnings
warnings.filterwarnings('ignore')
from ultralytics import YOLO

if __name__ == '__main__':
    # Replace the file name below with the path to the yaml file from section 5.1.
    model = YOLO('yolov8-MLLA.yaml')
    # To switch the model scale, change the yaml name above, e.g. yolov8s.yaml selects v8s.
    # Likewise, if a modified yaml is named yolov8-XXX.yaml, pass yolov8l-XXX.yaml to use the l scale
    # (change the name passed to YOLO above, not the name of the config file itself).
    # model.load('yolov8n.pt')  # load pretrained weights; not recommended for research, since it makes accuracy gains harder to obtain
    model.train(data=r"C:\Users\Administrator\PycharmProjects\yolov5-master\yolov5-master\Construction Site Safety.v30-raw-images_latestversion.yolov8\data.yaml",
                # For other tasks, edit task in 'ultralytics/cfg/default.yaml' (detect, segment, classify, pose)
                cache=False,
                imgsz=640,
                epochs=150,
                single_cls=False,  # whether this is single-class detection
                batch=16,
                close_mosaic=0,
                workers=0,
                device='0',
                optimizer='SGD',  # using SGD
                # resume='runs/train/exp21/weights/last.pt',  # to resume training, set the path to last.pt
                amp=True,  # disable amp if the training loss becomes NaN
                project='runs/train',
                name='exp',
                )


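6. Successful run record

Below is a screenshot of a successful run. One epoch of training has completed; the image was too large to capture the second epoch.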


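7. Summary

That concludes this article. I recommend my column on effective YOLOv11 improvements. The column is newly opened and currently has an average quality score of 98. Going forward I will reproduce papers from the latest top conferences and also fill in some older improvement mechanisms. If this article helped you, subscribe to the column and follow it for further updates.

bd80c2385d0548e9a87edc73f9261794.gif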