YOLOv11 Improvements | Fusion Series | Fusing Damo-YOLO's RepGFPN with SwinTransformer for a Lightweight YOLOv11 (the backbone can be swapped for any of the 20+ backbones in this column)

1. Introduction

This article presents a fusion improvement: combining Damo-YOLO's RepGFPN neck with a SwinTransformer backbone. The SwinTransformer used here can be replaced with any of the 20+ backbones covered in this column.

You are welcome to subscribe to my column and learn YOLO together!

Training info: YOLO11-SwinTransformer-RepGFPN summary: 353 layers, 2,736,687 parameters, 2,736,671 gradients, 6.3 GFLOPs
Baseline: YOLO11 summary: 319 layers, 2,591,010 parameters, 2,590,994 gradients, 6.4 GFLOPs

Table of Contents

1. Introduction

2. Background

3. Damo-YOLO Core Code and Usage

3.1 Modification 1

3.2 Modification 2

3.3 Modification 3

3.4 Modification 4

4. SwinTransformer Core Code and Usage

4.1 Modification 1

4.2 Modification 2

4.3 Modification 3

4.4 Modification 4

4.5 Modification 5

4.6 Modification 6

4.7 Modification 7

4.8 Modification 8

4.9 Modification 9

5. Fusion Usage Tutorial

6. Screenshot of a Successful Run

7. Summary


2. Background

The theory is not reproduced here; readers who want it can find it in the corresponding articles.

Damo-YOLO: link to the Damo-YOLO article

SwinTransformer: YOLOv11 Improvements | Backbone Series | SwinTransformer Vision Transformer Object Detection Network (compatible with all YOLOv11 models) - CSDN blog


3. Damo-YOLO Core Code and Usage

Below is the core code of the GFPN neck. Copy it into the 'ultralytics/nn' directory: create a file there (I named it GFPN) and paste the code in. The remaining usage steps are described later.

```python
import torch
import torch.nn as nn
import numpy as np


class swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)


def autopad(k, p=None, d=1):  # kernel, padding, dilation
    """Pad to 'same' shape outputs."""
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p


class Conv(nn.Module):
    """Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)."""
    default_act = swish()  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        """Initialize Conv layer with given arguments including activation."""
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        """Apply convolution, batch normalization and activation to input tensor."""
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        """Apply convolution and activation without batch normalization (fused inference)."""
        return self.act(self.conv(x))


class RepConv(nn.Module):
    default_act = swish()  # default activation

    def __init__(self, c1, c2, k=3, s=1, p=1, g=1, d=1, act=True, bn=False, deploy=False):
        """Initializes a re-parameterizable convolution with 3x3, 1x1 and optional identity (BN) branches."""
        super().__init__()
        assert k == 3 and p == 1
        self.g = g
        self.c1 = c1
        self.c2 = c2
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

        self.bn = nn.BatchNorm2d(num_features=c1) if bn and c2 == c1 and s == 1 else None
        self.conv1 = Conv(c1, c2, k, s, p=p, g=g, act=False)
        self.conv2 = Conv(c1, c2, 1, s, p=(p - k // 2), g=g, act=False)

    def forward_fuse(self, x):
        """Forward process (fused, single-branch)."""
        return self.act(self.conv(x))

    def forward(self, x):
        """Forward process."""
        id_out = 0 if self.bn is None else self.bn(x)
        return self.act(self.conv1(x) + self.conv2(x) + id_out)

    def get_equivalent_kernel_bias(self):
        """Returns equivalent kernel and bias by adding 3x3 kernel, 1x1 kernel and identity kernel with their biases."""
        kernel3x3, bias3x3 = self._fuse_bn_tensor(self.conv1)
        kernel1x1, bias1x1 = self._fuse_bn_tensor(self.conv2)
        kernelid, biasid = self._fuse_bn_tensor(self.bn)
        return kernel3x3 + self._pad_1x1_to_3x3_tensor(kernel1x1) + kernelid, bias3x3 + bias1x1 + biasid

    def _pad_1x1_to_3x3_tensor(self, kernel1x1):
        """Pads a 1x1 tensor to a 3x3 tensor."""
        if kernel1x1 is None:
            return 0
        else:
            return torch.nn.functional.pad(kernel1x1, [1, 1, 1, 1])

    def _fuse_bn_tensor(self, branch):
        """Generates appropriate kernels and biases for convolution by fusing branches of the neural network."""
        if branch is None:
            return 0, 0
        if isinstance(branch, Conv):
            kernel = branch.conv.weight
            running_mean = branch.bn.running_mean
            running_var = branch.bn.running_var
            gamma = branch.bn.weight
            beta = branch.bn.bias
            eps = branch.bn.eps
        elif isinstance(branch, nn.BatchNorm2d):
            if not hasattr(self, 'id_tensor'):
                input_dim = self.c1 // self.g
                kernel_value = np.zeros((self.c1, input_dim, 3, 3), dtype=np.float32)
                for i in range(self.c1):
                    kernel_value[i, i % input_dim, 1, 1] = 1
                self.id_tensor = torch.from_numpy(kernel_value).to(branch.weight.device)
            kernel = self.id_tensor
            running_mean = branch.running_mean
            running_var = branch.running_var
            gamma = branch.weight
            beta = branch.bias
            eps = branch.eps
        std = (running_var + eps).sqrt()
        t = (gamma / std).reshape(-1, 1, 1, 1)
        return kernel * t, beta - running_mean * gamma / std

    def fuse_convs(self):
        """Combines two convolution layers into a single layer and removes unused attributes from the class."""
        if hasattr(self, 'conv'):
            return
        kernel, bias = self.get_equivalent_kernel_bias()
        self.conv = nn.Conv2d(in_channels=self.conv1.conv.in_channels,
                              out_channels=self.conv1.conv.out_channels,
                              kernel_size=self.conv1.conv.kernel_size,
                              stride=self.conv1.conv.stride,
                              padding=self.conv1.conv.padding,
                              dilation=self.conv1.conv.dilation,
                              groups=self.conv1.conv.groups,
                              bias=True).requires_grad_(False)
        self.conv.weight.data = kernel
        self.conv.bias.data = bias
        for para in self.parameters():
            para.detach_()
        self.__delattr__('conv1')
        self.__delattr__('conv2')
        if hasattr(self, 'nm'):
            self.__delattr__('nm')
        if hasattr(self, 'bn'):
            self.__delattr__('bn')
        if hasattr(self, 'id_tensor'):
            self.__delattr__('id_tensor')


class BasicBlock_3x3_Reverse(nn.Module):
    def __init__(self, ch_in, ch_hidden_ratio, ch_out, shortcut=True):
        super(BasicBlock_3x3_Reverse, self).__init__()
        assert ch_in == ch_out
        ch_hidden = int(ch_in * ch_hidden_ratio)
        self.conv1 = Conv(ch_hidden, ch_out, 3, s=1)
        self.conv2 = RepConv(ch_in, ch_hidden, 3, s=1)
        self.shortcut = shortcut

    def forward(self, x):
        y = self.conv2(x)
        y = self.conv1(y)
        if self.shortcut:
            return x + y
        else:
            return y


class SPP(nn.Module):
    def __init__(self, ch_in, ch_out, k, pool_size):
        super(SPP, self).__init__()
        self.pool = []
        for i, size in enumerate(pool_size):
            pool = nn.MaxPool2d(kernel_size=size,
                                stride=1,
                                padding=size // 2,
                                ceil_mode=False)
            self.add_module('pool{}'.format(i), pool)
            self.pool.append(pool)
        self.conv = Conv(ch_in, ch_out, k)

    def forward(self, x):
        outs = [x]
        for pool in self.pool:
            outs.append(pool(x))
        y = torch.cat(outs, axis=1)
        y = self.conv(y)
        return y


class CSPStage(nn.Module):
    def __init__(self,
                 ch_in,
                 ch_out,
                 n,
                 block_fn='BasicBlock_3x3_Reverse',
                 ch_hidden_ratio=1.0,
                 act='silu',
                 spp=False):
        super(CSPStage, self).__init__()
        split_ratio = 2
        ch_first = int(ch_out // split_ratio)
        ch_mid = int(ch_out - ch_first)
        self.conv1 = Conv(ch_in, ch_first, 1)
        self.conv2 = Conv(ch_in, ch_mid, 1)
        self.convs = nn.Sequential()

        next_ch_in = ch_mid
        for i in range(n):
            if block_fn == 'BasicBlock_3x3_Reverse':
                self.convs.add_module(
                    str(i),
                    BasicBlock_3x3_Reverse(next_ch_in,
                                           ch_hidden_ratio,
                                           ch_mid,
                                           shortcut=True))
            else:
                raise NotImplementedError
            if i == (n - 1) // 2 and spp:
                self.convs.add_module('spp', SPP(ch_mid * 4, ch_mid, 1, [5, 9, 13]))
            next_ch_in = ch_mid
        self.conv3 = Conv(ch_mid * n + ch_first, ch_out, 1)

    def forward(self, x):
        y1 = self.conv1(x)
        y2 = self.conv2(x)
        mid_out = [y1]
        for conv in self.convs:
            y2 = conv(y2)
            mid_out.append(y2)
        y = torch.cat(mid_out, axis=1)
        y = self.conv3(y)
        return y
```
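The RepConv fusion above relies on the fact that a convolution followed by BatchNorm is still an affine map of its input, so the two can be collapsed into a single convolution at inference time. Here is a minimal scalar sketch of that algebra (all numeric values are made up for illustration; the real `_fuse_bn_tensor` does the same per channel):

```python
import math

# Scalar sketch of the Conv+BN fusion used in RepConv._fuse_bn_tensor:
#   BN(w * x) = gamma * (w * x - mean) / sqrt(var + eps) + beta
# is affine in x, so it equals w' * x + b' with
#   w' = w * gamma / sqrt(var + eps)
#   b' = beta - mean * gamma / sqrt(var + eps)
w, x = 0.8, 1.7                   # one conv weight and one input value
gamma, beta = 1.3, -0.2           # BN affine parameters
mean, var, eps = 0.5, 2.0, 1e-5   # BN running statistics

std = math.sqrt(var + eps)
bn_out = gamma * (w * x - mean) / std + beta  # original two-step computation

w_fused = w * gamma / std                     # fused convolution weight
b_fused = beta - mean * gamma / std           # fused convolution bias
fused_out = w_fused * x + b_fused             # single fused step

assert abs(bn_out - fused_out) < 1e-9         # both paths agree
```

The 1x1 branch is handled the same way after zero-padding its kernel to 3x3, which is why `get_equivalent_kernel_bias` can simply sum the three fused kernels.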


3.1 Modification 1

First, create the file: under the 'ultralytics/nn' folder, create a directory named 'Addmodules', then create a new .py file inside it and paste the core code in.


3.2 Modification 2

Second, create a new .py file named '__init__.py' in that directory, then import our module inside it as shown in the figure.


3.3 Modification 3

Third, open the file 'ultralytics/nn/tasks.py' and import and register our module there.


3.4 Modification 4

Add the registration inside parse_model exactly as I show.

At this point the modifications are complete; you can copy the yaml file below and run it.


4. SwinTransformer Core Code and Usage

Below is the core code of SwinTransformer; the usage steps are described afterwards.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as checkpoint
import numpy as np
from timm.models.layers import DropPath, to_2tuple, trunc_normal_

__all__ = ['SwinTransformer']


class Mlp(nn.Module):
    """ Multilayer perceptron."""

    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x


def window_partition(x, window_size):
    """
    Args:
        x: (B, H, W, C)
        window_size (int): window size
    Returns:
        windows: (num_windows*B, window_size, window_size, C)
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
    return windows


def window_reverse(windows, window_size, H, W):
    """
    Args:
        windows: (num_windows*B, window_size, window_size, C)
        window_size (int): Window size
        H (int): Height of image
        W (int): Width of image
    Returns:
        x: (B, H, W, C)
    """
    B = int(windows.shape[0] / (H * W / window_size / window_size))
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
    return x


class WindowAttention(nn.Module):
    """ Window based multi-head self attention (W-MSA) module with relative position bias.
    It supports both of shifted and non-shifted window.

    Args:
        dim (int): Number of input channels.
        window_size (tuple[int]): The height and width of the window.
        num_heads (int): Number of attention heads.
        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True
        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set
        attn_drop (float, optional): Dropout ratio of attention weight. Default: 0.0
        proj_drop (float, optional): Dropout ratio of output. Default: 0.0
    """

    def __init__(self, dim, window_size, num_heads, qkv_bias=True, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.dim = dim
        self.window_size = window_size  # Wh, Ww
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5

        # define a parameter table of relative position bias
        self.relative_position_bias_table = nn.Parameter(
            torch.zeros((2 * window_size[0] - 1) * (2 * window_size[1] - 1), num_heads))  # 2*Wh-1 * 2*Ww-1, nH

        # get pair-wise relative position index for each token inside the window
        coords_h = torch.arange(self.window_size[0])
        coords_w = torch.arange(self.window_size[1])
        coords = torch.stack(torch.meshgrid([coords_h, coords_w]))  # 2, Wh, Ww
        coords_flatten = torch.flatten(coords, 1)  # 2, Wh*Ww
        relative_coords = coords_flatten[:, :, None] - coords_flatten[:, None, :]  # 2, Wh*Ww, Wh*Ww
        relative_coords = relative_coords.permute(1, 2, 0).contiguous()  # Wh*Ww, Wh*Ww, 2
        relative_coords[:, :, 0] += self.window_size[0] - 1  # shift to start from 0
        relative_coords[:, :, 1] += self.window_size[1] - 1
        relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1
        relative_position_index = relative_coords.sum(-1)  # Wh*Ww, Wh*Ww
        self.register_buffer("relative_position_index", relative_position_index)

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

        trunc_normal_(self.relative_position_bias_table, std=.02)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x, mask=None):
        """ Forward function.

        Args:
            x: input features with shape of (num_windows*B, N, C)
            mask: (0/-inf) mask with shape of (num_windows, Wh*Ww, Wh*Ww) or None
        """
        B_, N, C = x.shape
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)

        q = q * self.scale
        attn = (q @ k.transpose(-2, -1))

        relative_position_bias = self.relative_position_bias_table[self.relative_position_index.view(-1)].view(
            self.window_size[0] * self.window_size[1], self.window_size[0] * self.window_size[1], -1)  # Wh*Ww,Wh*Ww,nH
        relative_position_bias = relative_position_bias.permute(2, 0, 1).contiguous()  # nH, Wh*Ww, Wh*Ww
        attn = attn + relative_position_bias.unsqueeze(0)

        if mask is not None:
            nW = mask.shape[0]
            attn = attn.view(B_ // nW, nW, self.num_heads, N, N) + mask.unsqueeze(1).unsqueeze(0)
            attn = attn.view(-1, self.num_heads, N, N)
            attn = self.softmax(attn)
        else:
            attn = self.softmax(attn)

        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x


class SwinTransformerBlock(nn.Module):
    """ Swin Transformer Block.

    Args:
        dim (int): Number of input channels.
        num_heads (int): Number of attention heads.
        window_size (int): Window size.
        shift_size (int): Shift size for SW-MSA.
        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True
        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set.
        drop (float, optional): Dropout rate. Default: 0.0
        attn_drop (float, optional): Attention dropout rate. Default: 0.0
        drop_path (float, optional): Stochastic depth rate. Default: 0.0
        act_layer (nn.Module, optional): Activation layer. Default: nn.GELU
        norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm
    """

    def __init__(self, dim, num_heads, window_size=7, shift_size=0,
                 mlp_ratio=4., qkv_bias=True, qk_scale=None, drop=0., attn_drop=0., drop_path=0.,
                 act_layer=nn.GELU, norm_layer=nn.LayerNorm):
        super().__init__()
        self.dim = dim
        self.num_heads = num_heads
        self.window_size = window_size
        self.shift_size = shift_size
        self.mlp_ratio = mlp_ratio
        assert 0 <= self.shift_size < self.window_size, "shift_size must in 0-window_size"

        self.norm1 = norm_layer(dim)
        self.attn = WindowAttention(
            dim, window_size=to_2tuple(self.window_size), num_heads=num_heads,
            qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop)

        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)

        self.H = None
        self.W = None

    def forward(self, x, mask_matrix):
        """ Forward function.

        Args:
            x: Input feature, tensor size (B, H*W, C).
            H, W: Spatial resolution of the input feature.
            mask_matrix: Attention mask for cyclic shift.
        """
        B, L, C = x.shape
        H, W = self.H, self.W
        assert L == H * W, "input feature has wrong size"

        shortcut = x
        x = self.norm1(x)
        x = x.view(B, H, W, C)

        # pad feature maps to multiples of window size
        pad_l = pad_t = 0
        pad_r = (self.window_size - W % self.window_size) % self.window_size
        pad_b = (self.window_size - H % self.window_size) % self.window_size
        x = F.pad(x, (0, 0, pad_l, pad_r, pad_t, pad_b))
        _, Hp, Wp, _ = x.shape

        # cyclic shift
        if self.shift_size > 0:
            shifted_x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))
            attn_mask = mask_matrix.type(x.dtype)
        else:
            shifted_x = x
            attn_mask = None

        # partition windows
        x_windows = window_partition(shifted_x, self.window_size)  # nW*B, window_size, window_size, C
        x_windows = x_windows.view(-1, self.window_size * self.window_size, C)  # nW*B, window_size*window_size, C

        # W-MSA/SW-MSA
        attn_windows = self.attn(x_windows, mask=attn_mask)  # nW*B, window_size*window_size, C

        # merge windows
        attn_windows = attn_windows.view(-1, self.window_size, self.window_size, C)
        shifted_x = window_reverse(attn_windows, self.window_size, Hp, Wp)  # B H' W' C

        # reverse cyclic shift
        if self.shift_size > 0:
            x = torch.roll(shifted_x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))
        else:
            x = shifted_x

        if pad_r > 0 or pad_b > 0:
            x = x[:, :H, :W, :].contiguous()

        x = x.view(B, H * W, C)

        # FFN
        x = shortcut + self.drop_path(x)
        x = x + self.drop_path(self.mlp(self.norm2(x)))

        return x


class PatchMerging(nn.Module):
    """ Patch Merging Layer

    Args:
        dim (int): Number of input channels.
        norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm
    """

    def __init__(self, dim, norm_layer=nn.LayerNorm):
        super().__init__()
        self.dim = dim
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)
        self.norm = norm_layer(4 * dim)

    def forward(self, x, H, W):
        """ Forward function.

        Args:
            x: Input feature, tensor size (B, H*W, C).
            H, W: Spatial resolution of the input feature.
        """
        B, L, C = x.shape
        assert L == H * W, "input feature has wrong size"

        x = x.view(B, H, W, C)

        # padding
        pad_input = (H % 2 == 1) or (W % 2 == 1)
        if pad_input:
            x = F.pad(x, (0, 0, 0, W % 2, 0, H % 2))

        x0 = x[:, 0::2, 0::2, :]  # B H/2 W/2 C
        x1 = x[:, 1::2, 0::2, :]  # B H/2 W/2 C
        x2 = x[:, 0::2, 1::2, :]  # B H/2 W/2 C
        x3 = x[:, 1::2, 1::2, :]  # B H/2 W/2 C
        x = torch.cat([x0, x1, x2, x3], -1)  # B H/2 W/2 4*C
        x = x.view(B, -1, 4 * C)  # B H/2*W/2 4*C

        x = self.norm(x)
        x = self.reduction(x)

        return x


class BasicLayer(nn.Module):
    """ A basic Swin Transformer layer for one stage.

    Args:
        dim (int): Number of feature channels
        depth (int): Depths of this stage.
        num_heads (int): Number of attention head.
        window_size (int): Local window size. Default: 7.
        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.
        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True
        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set.
        drop (float, optional): Dropout rate. Default: 0.0
        attn_drop (float, optional): Attention dropout rate. Default: 0.0
        drop_path (float | tuple[float], optional): Stochastic depth rate. Default: 0.0
        norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm
        downsample (nn.Module | None, optional): Downsample layer at the end of the layer. Default: None
        use_checkpoint (bool): Whether to use checkpointing to save memory. Default: False.
    """

    def __init__(self,
                 dim,
                 depth,
                 num_heads,
                 window_size=7,
                 mlp_ratio=4.,
                 qkv_bias=True,
                 qk_scale=None,
                 drop=0.,
                 attn_drop=0.,
                 drop_path=0.,
                 norm_layer=nn.LayerNorm,
                 downsample=None,
                 use_checkpoint=False):
        super().__init__()
        self.window_size = window_size
        self.shift_size = window_size // 2
        self.depth = depth
        self.use_checkpoint = use_checkpoint

        # build blocks
        self.blocks = nn.ModuleList([
            SwinTransformerBlock(
                dim=dim,
                num_heads=num_heads,
                window_size=window_size,
                shift_size=0 if (i % 2 == 0) else window_size // 2,
                mlp_ratio=mlp_ratio,
                qkv_bias=qkv_bias,
                qk_scale=qk_scale,
                drop=drop,
                attn_drop=attn_drop,
                drop_path=drop_path[i] if isinstance(drop_path, list) else drop_path,
                norm_layer=norm_layer)
            for i in range(depth)])

        # patch merging layer
        if downsample is not None:
            self.downsample = downsample(dim=dim, norm_layer=norm_layer)
        else:
            self.downsample = None

    def forward(self, x, H, W):
        """ Forward function.

        Args:
            x: Input feature, tensor size (B, H*W, C).
            H, W: Spatial resolution of the input feature.
        """
        # calculate attention mask for SW-MSA
        Hp = int(np.ceil(H / self.window_size)) * self.window_size
        Wp = int(np.ceil(W / self.window_size)) * self.window_size
        img_mask = torch.zeros((1, Hp, Wp, 1), device=x.device)  # 1 Hp Wp 1
        h_slices = (slice(0, -self.window_size),
                    slice(-self.window_size, -self.shift_size),
                    slice(-self.shift_size, None))
        w_slices = (slice(0, -self.window_size),
                    slice(-self.window_size, -self.shift_size),
                    slice(-self.shift_size, None))
        cnt = 0
        for h in h_slices:
            for w in w_slices:
                img_mask[:, h, w, :] = cnt
                cnt += 1

        mask_windows = window_partition(img_mask, self.window_size)  # nW, window_size, window_size, 1
        mask_windows = mask_windows.view(-1, self.window_size * self.window_size)
        attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
        attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(attn_mask == 0, float(0.0))

        for blk in self.blocks:
            blk.H, blk.W = H, W
            if self.use_checkpoint:
                x = checkpoint.checkpoint(blk, x, attn_mask)
            else:
                x = blk(x, attn_mask)
        if self.downsample is not None:
            x_down = self.downsample(x, H, W)
            Wh, Ww = (H + 1) // 2, (W + 1) // 2
            return x, H, W, x_down, Wh, Ww
        else:
            return x, H, W, x, H, W


class PatchEmbed(nn.Module):
    """ Image to Patch Embedding

    Args:
        patch_size (int): Patch token size. Default: 4.
        in_chans (int): Number of input image channels. Default: 3.
        embed_dim (int): Number of linear projection output channels. Default: 96.
        norm_layer (nn.Module, optional): Normalization layer. Default: None
    """

    def __init__(self, patch_size=4, in_chans=3, embed_dim=96, norm_layer=None):
        super().__init__()
        patch_size = to_2tuple(patch_size)
        self.patch_size = patch_size

        self.in_chans = in_chans
        self.embed_dim = embed_dim

        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        if norm_layer is not None:
            self.norm = norm_layer(embed_dim)
        else:
            self.norm = None

    def forward(self, x):
        """Forward function."""
        # padding
        _, _, H, W = x.size()
        if W % self.patch_size[1] != 0:
            x = F.pad(x, (0, self.patch_size[1] - W % self.patch_size[1]))
        if H % self.patch_size[0] != 0:
            x = F.pad(x, (0, 0, 0, self.patch_size[0] - H % self.patch_size[0]))

        x = self.proj(x)  # B C Wh Ww
        if self.norm is not None:
            Wh, Ww = x.size(2), x.size(3)
            x = x.flatten(2).transpose(1, 2)
            x = self.norm(x)
            x = x.transpose(1, 2).view(-1, self.embed_dim, Wh, Ww)

        return x


class SwinTransformer(nn.Module):
    """ Swin Transformer backbone.
    A PyTorch impl of : `Swin Transformer: Hierarchical Vision Transformer using Shifted Windows` -
    https://arxiv.org/pdf/2103.14030

    Args:
        pretrain_img_size (int): Input image size for training the pretrained model,
            used in absolute position embedding. Default 224.
        patch_size (int | tuple(int)): Patch size. Default: 4.
        in_chans (int): Number of input image channels. Default: 3.
        embed_dim (int): Number of linear projection output channels. Default: 96.
        depths (tuple[int]): Depths of each Swin Transformer stage.
        num_heads (tuple[int]): Number of attention head of each stage.
        window_size (int): Window size. Default: 7.
        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.
        qkv_bias (bool): If True, add a learnable bias to query, key, value. Default: True
        qk_scale (float): Override default qk scale of head_dim ** -0.5 if set.
        drop_rate (float): Dropout rate.
        attn_drop_rate (float): Attention dropout rate. Default: 0.
        drop_path_rate (float): Stochastic depth rate. Default: 0.2.
        norm_layer (nn.Module): Normalization layer. Default: nn.LayerNorm.
        ape (bool): If True, add absolute position embedding to the patch embedding. Default: False.
        patch_norm (bool): If True, add normalization after patch embedding. Default: True.
        out_indices (Sequence[int]): Output from which stages.
        frozen_stages (int): Stages to be frozen (stop grad and set eval mode).
            -1 means not freezing any parameters.
        use_checkpoint (bool): Whether to use checkpointing to save memory. Default: False.
    """

    def __init__(self,
                 factor=0.5,
                 depth_factor=0.5,
                 pretrain_img_size=224,
                 patch_size=4,
                 in_chans=3,
                 embed_dim=96,
                 depths=[2, 2, 6, 2],
                 num_heads=[3, 6, 12, 24],
                 window_size=7,
                 mlp_ratio=4.,
                 qkv_bias=True,
                 qk_scale=None,
                 drop_rate=0.,
                 attn_drop_rate=0.,
                 drop_path_rate=0.2,
                 norm_layer=nn.LayerNorm,
                 ape=False,
                 patch_norm=True,
                 out_indices=(0, 1, 2, 3),
                 frozen_stages=-1,
                 use_checkpoint=False):
        super().__init__()
        embed_dim = int(embed_dim * factor)
        depths = [max(1, int(dim * depth_factor)) for dim in depths]
        self.pretrain_img_size = pretrain_img_size
        self.num_layers = len(depths)
        self.embed_dim = embed_dim
        self.ape = ape
        self.patch_norm = patch_norm
        self.out_indices = out_indices
        self.frozen_stages = frozen_stages

        # split image into non-overlapping patches
        self.patch_embed = PatchEmbed(
            patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim,
            norm_layer=norm_layer if self.patch_norm else None)

        # absolute position embedding
        if self.ape:
            pretrain_img_size = to_2tuple(pretrain_img_size)
            patch_size = to_2tuple(patch_size)
            patches_resolution = [pretrain_img_size[0] // patch_size[0], pretrain_img_size[1] // patch_size[1]]

            self.absolute_pos_embed = nn.Parameter(
                torch.zeros(1, embed_dim, patches_resolution[0], patches_resolution[1]))
            trunc_normal_(self.absolute_pos_embed, std=.02)

        self.pos_drop = nn.Dropout(p=drop_rate)

        # stochastic depth
        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]  # stochastic depth decay rule

        # build layers
        self.layers = nn.ModuleList()
        for i_layer in range(self.num_layers):
            layer = BasicLayer(
                dim=int(embed_dim * 2 ** i_layer),
                depth=depths[i_layer],
                num_heads=num_heads[i_layer],
                window_size=window_size,
                mlp_ratio=mlp_ratio,
                qkv_bias=qkv_bias,
                qk_scale=qk_scale,
                drop=drop_rate,
                attn_drop=attn_drop_rate,
                drop_path=dpr[sum(depths[:i_layer]):sum(depths[:i_layer + 1])],
                norm_layer=norm_layer,
                downsample=PatchMerging if (i_layer < self.num_layers - 1) else None,
                use_checkpoint=use_checkpoint)
            self.layers.append(layer)

        num_features = [int(embed_dim * 2 ** i) for i in range(self.num_layers)]
        self.num_features = num_features

        # add a norm layer for each output
        for i_layer in out_indices:
            layer = norm_layer(num_features[i_layer])
            layer_name = f'norm{i_layer}'
            self.add_module(layer_name, layer)

        self.width_list = [i.size(1) for i in self.forward(torch.randn(1, 3, 640, 640))]

    def forward(self, x):
        """Forward function."""
        x = self.patch_embed(x)

        Wh, Ww = x.size(2), x.size(3)
        if self.ape:
            # interpolate the position embedding to the corresponding size
            absolute_pos_embed = F.interpolate(self.absolute_pos_embed, size=(Wh, Ww), mode='bicubic')
            x = (x + absolute_pos_embed).flatten(2).transpose(1, 2)  # B Wh*Ww C
        else:
            x = x.flatten(2).transpose(1, 2)
        x = self.pos_drop(x)

        outs = []
        for i in range(self.num_layers):
            layer = self.layers[i]
            x_out, H, W, x, Wh, Ww = layer(x, Wh, Ww)

            if i in self.out_indices:
                norm_layer = getattr(self, f'norm{i}')
                x_out = norm_layer(x_out)

                out = x_out.view(-1, H, W, self.num_features[i]).permute(0, 3, 1, 2).contiguous()
                outs.append(out)
        return outs
```
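The `window_partition` / `window_reverse` pair above is a pure reshaping operation: it cuts the feature map into non-overlapping windows for attention and then stitches them back losslessly. Here is a pure-Python sketch of the same index arithmetic on a 2-D grid of token ids (batch and channel dimensions dropped for clarity; the helper names mirror the functions above but are simplified re-implementations, not the real ones):

```python
# Sketch of window_partition / window_reverse on an (H, W) grid of token ids.
def window_partition(x, ws):
    """Split an H x W grid into non-overlapping ws x ws windows (row-major order)."""
    H, W = len(x), len(x[0])
    return [[[x[hi * ws + r][wi * ws + c] for c in range(ws)] for r in range(ws)]
            for hi in range(H // ws) for wi in range(W // ws)]

def window_reverse(windows, ws, H, W):
    """Reassemble the windows back into the original H x W grid."""
    x = [[None] * W for _ in range(H)]
    for n, win in enumerate(windows):
        hi, wi = divmod(n, W // ws)
        for r in range(ws):
            for c in range(ws):
                x[hi * ws + r][wi * ws + c] = win[r][c]
    return x

grid = [[h * 8 + w for w in range(8)] for h in range(8)]  # 8x8 token grid
wins = window_partition(grid, 4)              # four non-overlapping 4x4 windows
assert len(wins) == 4
assert wins[0][0] == [0, 1, 2, 3]             # top-left window keeps the first row segment
assert window_reverse(wins, 4, 8, 8) == grid  # lossless round trip
```

This is why `SwinTransformerBlock` can pad to a multiple of `window_size`, attend within windows, and then crop back to `(H, W)` without losing any positions.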



4.1 Modification 1

First, create the file: under the 'ultralytics/nn' folder, create a directory named 'Addmodules', then create a new .py file inside it and paste the core code in.


4.2 Modification 2

Second, create a new .py file named '__init__.py' in that directory, then import our module inside it as shown in the figure.


4.3 Modification 3

Third, open the file 'ultralytics/nn/tasks.py' and import and register our module there.


4.4 Modification 4

Add the two lines of code shown in the figure!


4.5 Modification 5

Find the spot at roughly line 700 or so (see the figure for the exact location) and add the part inside the red box. Note there are no parentheses; these are just class names.

```python
elif m in {SwinTransformer}:  # this elif branch is new; add your registered backbone classes to the set
    m = m(*args)
    c2 = m.width_list  # return the channel list
    backbone = True
```


4.6 Modification 6

Both red boxes below need to be changed.

```python
if isinstance(c2, list):
    m_ = m
    m_.backbone = True
else:
    m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)  # module
t = str(m)[8:-2].replace('__main__.', '')  # module type
m.np = sum(x.numel() for x in m_.parameters())  # number params
m_.i, m_.f, m_.type = i + 4 if backbone else i, f, t  # attach index, 'from' index, type
```


4.7 Modification 7

The following also needs to be modified; follow my version exactly.

Replace the original code with the code below.

```python
if verbose:
    LOGGER.info(f'{i:>3}{str(f):>20}{n_:>3}{m.np:10.0f}  {t:<45}{str(args):<30}')  # print
save.extend(x % (i + 4 if backbone else i) for x in ([f] if isinstance(f, int) else f) if x != -1)  # append to savelist
layers.append(m_)
if i == 0:
    ch = []
if isinstance(c2, list):
    ch.extend(c2)
    if len(c2) != 5:
        ch.insert(0, 0)
else:
    ch.append(c2)
```
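To see what this channel bookkeeping does, here is a pure-Python sketch: when a backbone module reports a list of channel widths via `width_list`, the whole list is appended to `ch` and padded to five entries so later layers can index the P1-P5 outputs uniformly (the `FakeBackbone` class and its numbers are invented for illustration):

```python
# Sketch of the parse_model channel bookkeeping for a list-valued c2.
class FakeBackbone:
    width_list = [64, 128, 256, 512]  # one entry per output scale (P2..P5 here)

backbone = True
c2 = FakeBackbone.width_list

ch = []
if isinstance(c2, list):
    ch.extend(c2)
    if len(c2) != 5:      # pad to 5 entries so indices line up with P1..P5
        ch.insert(0, 0)
else:
    ch.append(c2)

assert ch == [0, 64, 128, 256, 512]

# The i + 4 offset exists because the single backbone entry in the yaml
# expands to 5 cached outputs, shifting every later layer index by 4.
i = 0
assert (i + 4 if backbone else i) == 4
```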


4.8 Modification 8

This modification is different from the previous ones: it changes part of the forward pass, and we have now left the parse_model method.

You can see the line numbers in the figure; we are still in the same tasks.py file. Note that several forward-pass methods look very similar, so be careful not to pick the wrong one: it is the one around line 70 or so! The code is provided below so you can copy and paste it directly; when I have time I will make a video on this part.


The code is as follows:

```python
def _predict_once(self, x, profile=False, visualize=False, embed=None):
    """
    Perform a forward pass through the network.

    Args:
        x (torch.Tensor): The input tensor to the model.
        profile (bool): Print the computation time of each layer if True, defaults to False.
        visualize (bool): Save the feature maps of the model if True, defaults to False.
        embed (list, optional): A list of feature vectors/embeddings to return.

    Returns:
        (torch.Tensor): The last output of the model.
    """
    y, dt, embeddings = [], [], []  # outputs
    for m in self.model:
        if m.f != -1:  # if not from previous layer
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
        if profile:
            self._profile_one_layer(m, x, dt)
        if hasattr(m, 'backbone'):
            x = m(x)
            if len(x) != 5:  # pad the output list to 5 entries (indices 0-4)
                x.insert(0, None)
            for index, i in enumerate(x):
                if index in self.save:
                    y.append(i)
                else:
                    y.append(None)
            x = x[-1]  # pass the backbone's last output to the next layer
        else:
            x = m(x)  # run
            y.append(x if m.i in self.save else None)  # save output
        if visualize:
            feature_visualization(x, m.type, m.i, save_dir=visualize)
        if embed and m.i in embed:
            embeddings.append(nn.functional.adaptive_avg_pool2d(x, (1, 1)).squeeze(-1).squeeze(-1))  # flatten
            if m.i == max(embed):
                return torch.unbind(torch.cat(embeddings, 1), dim=0)
    return x
```
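To see why the padded output list and the `i + 4` index offset from Modification 6 fit together, here is a minimal pure-Python mock of this forward-pass logic. The `Layer` class, the lambdas and all the numbers are invented for illustration; only the control flow mirrors the `_predict_once` code above:

```python
# Mock of the modified forward pass: a "backbone" layer returns a list of
# feature maps, each cached in y so head layers can fetch them by index.
class Layer:
    def __init__(self, i, f, fn, backbone=False):
        self.i, self.f, self.fn = i, f, fn
        if backbone:
            self.backbone = True

    def __call__(self, x):
        return self.fn(x)

save = {1, 2, 3, 4}  # indices whose outputs later layers will fetch
model = [
    Layer(0, -1, lambda x: [x + 1, x + 2, x + 3, x + 4], backbone=True),  # 4 scales
    Layer(5, 2, lambda x: x * 10),  # "head" layer reading the cached output at index 2
]

x, y = 0, []
for m in model:
    if m.f != -1:           # fetch input from an earlier cached output
        x = y[m.f]
    if hasattr(m, 'backbone'):
        x = m(x)
        if len(x) != 5:
            x.insert(0, None)      # pad so the outputs sit at indices 0..4
        for index, out in enumerate(x):
            y.append(out if index in save else None)
        x = x[-1]                  # deepest feature feeds the next layer
    else:
        x = m(x)
        y.append(x if m.i in save else None)

assert y[:5] == [None, 1, 2, 3, 4]  # backbone outputs cached at indices 0..4
assert x == 20                       # head layer consumed y[2] == 2
```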

At this point the modifications are complete, but there are many details involved. Be very careful not to replace more code than necessary, and do not skip any step; either mistake will make the run fail with errors that are hard to trace!


4.9 Modification 9

Find the file 'ultralytics/utils/torch_utils.py' and modify it as shown in the figure; otherwise the GFLOPs may fail to print.


5. Fusion Usage Tutorial

Training info: YOLO11-SwinTransformer-RepGFPN summary: 353 layers, 2,736,687 parameters, 2,736,671 gradients, 6.3 GFLOPs

```yaml
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs

# In [-1, 1, SwinTransformer, [0.25, 0.5]] below, the first argument (0.25) is the
# channel width factor: use 0.25 for YOLOv11n, 0.5 for YOLOv11s, 1.0 for YOLOv11m/l
# and 1.5 for YOLOv11x, matching the YOLO scale you train.
# The second argument (0.5) is the model depth factor.

# YOLO11n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, SwinTransformer, [0.25, 0.5]] # 0-4 (cached outputs: 1 P2, 2 P3, 3 P4, 4 P5)
  - [-1, 1, SPPF, [1024, 5]] # 5
  - [-1, 2, C2PSA, [1024]] # 6

# YOLO11n head
head:
  - [-1, 1, Conv, [512, 1, 1]] # 7
  - [3, 1, Conv, [512, 3, 2]] # 8
  - [[-1, -2], 1, Concat, [1]] # 9
  - [-1, 2, C3k2, [512, False]] # 10
  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 11
  - [2, 1, Conv, [256, 3, 2]] # 12
  - [[-2, -1, 3], 1, Concat, [1]] # 13
  - [-1, 2, C3k2, [512, False]] # 14
  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 15
  - [[-1, 2], 1, Concat, [1]] # 16
  - [-1, 2, C3k2, [256, False]] # 17
  - [-1, 1, Conv, [256, 3, 2]] # 18
  - [[-1, 14], 1, Concat, [1]] # 19
  - [-1, 2, C3k2, [512, False]] # 20
  - [14, 1, Conv, [256, 3, 2]] # 21
  - [20, 1, Conv, [256, 3, 2]] # 22
  - [[10, 21, -1], 1, Concat, [1]] # 23
  - [-1, 2, C3k2, [1024, True]] # 24
  - [[17, 20, 24], 1, Detect, [nc]] # Detect(P3, P4, P5)
```
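The width factor in the yaml maps directly to the channel widths the backbone reports through `width_list`: the `SwinTransformer` class above computes `embed_dim = int(embed_dim * factor)` and doubles the channels at every stage. A quick sketch of the resulting per-stage widths (pure Python, using the class defaults `embed_dim=96` and four stages):

```python
# Per-stage channel widths exposed by the SwinTransformer backbone for a given
# width factor (first yaml argument); each stage doubles the channels.
base_embed_dim, num_stages = 96, 4

def stage_widths(factor):
    embed_dim = int(base_embed_dim * factor)
    return [embed_dim * 2 ** i for i in range(num_stages)]

assert stage_widths(0.25) == [24, 48, 96, 192]   # YOLOv11n setting
assert stage_widths(0.5) == [48, 96, 192, 384]   # YOLOv11s setting
assert stage_widths(1.0) == [96, 192, 384, 768]  # YOLOv11m/l setting
```

These are the values that flow into `ch` in parse_model, so picking the wrong factor for your YOLO scale silently changes every downstream channel count.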


6. Screenshot of a Successful Run


7. Summary

This concludes the formal content of this article. Here I recommend my YOLOv11 effective-improvement column, which is newly opened with an average quality score of 97. I will keep reproducing papers from the latest top conferences and supplement older improvement mechanisms. If this article helped you, please subscribe to the column and follow along for more updates!