YOLOv11 Improvements | Backbone | CSWin Transformer: a cross-shaped-window backbone for object detection (works with every YOLOv11 scale)

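1. Introduction

The improvement covered in this article is CSWin Transformer. Built on the Transformer architecture, it introduces a cross-shaped window self-attention mechanism that processes horizontal and vertical stripes of the image in parallel; together the two sets of stripes form a cross-shaped window, which keeps computation efficient. It also proposes Locally-enhanced Positional Encoding (LePE), which handles local positional information better. I use it to replace the feature-extraction network (backbone) of YOLOv11 so that the model extracts more useful features. In my experiments this backbone improved accuracy on large, medium, and small objects. The backbone also comes in several variants, which you can switch between in the source code. This article first outlines how the architecture works and then shows how to add the network to the model.

(The backbone in this article can be rescaled to match the YOLOv11 N, S, M, L, and X variants, which makes it even more lightweight.)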


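Contents

1. Introduction
2. How CSWin Transformer Works
2.1 Basic principles of CSWin Transformer
2.2 Cross-shaped window self-attention
2.3 Locally-enhanced Positional Encoding
2.4 Downstream-task friendly
3. Core code of CSWinTransformer
4. Step-by-step: adding CSWinTransformer
4.1 Modification 1
4.2 Modification 2
4.3 Modification 3
4.4 Modification 4
4.5 Modification 5
4.6 Modification 6
4.7 Modification 7
4.8 Modification 8
Note: an extra modification
Fixing the missing GFLOPs printout
Caveats
5. YAML file for CSWinTransformer
5.1 CSWinTransformer yaml file, version 1
5.2 Training script
6. Record of a successful run
7. Summary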


2. How CSWin Transformer Works

Paper: official paper link

Code: official code link


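2.1 Basic principles of CSWin Transformer

CSWin Transformer builds on the Transformer architecture and introduces a cross-shaped window self-attention mechanism that processes the horizontal and vertical stripes of an image in parallel, forming a cross-shaped window that keeps computation efficient. It also proposes Locally-enhanced Positional Encoding (LePE), which handles local positional information better, supports arbitrary input resolutions, and is friendly to downstream tasks. According to the paper, these designs let CSWin Transformer report better results than the prior methods it was compared with on vision tasks such as image classification and object detection.

The basic principles of CSWin Transformer can be summarized as follows:

1. Cross-shaped window self-attention: self-attention is computed inside windows that extend along the horizontal and vertical directions and together form a cross, which improves efficiency.
2. Locally-enhanced Positional Encoding (LePE): a new positional encoding scheme that handles local positional information better and supports input of any resolution.
3. Downstream-task friendly: LePE makes CSWin Transformer particularly well suited to downstream vision tasks.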


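2.2 Cross-shaped window self-attention

Cross-shaped window self-attention is one of the core features of CSWin Transformer. It splits the attention heads into two groups that process the horizontal and the vertical stripes of the image in parallel. The model can therefore focus on important features inside the cross-shaped region while avoiding the high computational cost of global self-attention. This keeps a balance between local and global information and improves speed and efficiency.

The figure below compares the different self-attention mechanisms discussed for CSWin Transformer:

The figure shows how CSWin Transformer splits the attention heads into two groups, one attending within horizontal stripes and the other within vertical stripes, so that both run in parallel and together form a cross-shaped window. This design strikes a better balance between computational cost and model performance. The figure also shows other variants, ranging from full attention to local attention, alongside CSWin's own strategy.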
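
To make the cross shape concrete, here is a minimal pure-Python sketch (not taken from the official code) that lists which positions one token can reach when the two head groups are combined. The function name `cross_window` and the 40x40 grid with stripe width 8 are illustrative assumptions.

```python
def cross_window(row, col, height, width, stripe):
    """Positions visible to the token at (row, col) across both head groups.

    One head group attends inside the horizontal stripe that contains `row`,
    the other inside the vertical stripe that contains `col`.
    """
    h0 = (row // stripe) * stripe  # first row of the horizontal stripe
    horizontal = {(r, c) for r in range(h0, h0 + stripe) for c in range(width)}
    w0 = (col // stripe) * stripe  # first column of the vertical stripe
    vertical = {(r, c) for r in range(height) for c in range(w0, w0 + stripe)}
    return horizontal | vertical


# Illustrative numbers: a 40x40 token grid with stripe width 8.
visible = cross_window(10, 20, height=40, width=40, stripe=8)
print(len(visible), "of", 40 * 40, "positions are visible to this token")
```

Each head group only computes attention inside its own stripe, so the cost grows with the stripe size instead of with the full grid, while the union of the two stripes still spans the whole height and width.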


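2.3 Locally-enhanced Positional Encoding

Locally-enhanced Positional Encoding (LePE) is the positional encoding mechanism introduced by CSWin Transformer. It improves on how existing encoding schemes handle local positional information. Unlike conventional positional encodings, LePE is designed to strengthen the model's awareness of local image regions, and it supports input of any resolution. This makes CSWin Transformer more flexible when the input size varies, which is particularly useful for downstream vision tasks.

The next figure shows the overall architecture of CSWin Transformer and the details of a single CSWin Transformer block.

It illustrates how cross-shaped window self-attention and LePE are integrated into the different stages of the network and how they are implemented inside one Transformer block. The model has four stages, each made of several CSWin Transformer blocks, and every block contains cross-shaped window self-attention with LePE. From one stage to the next the spatial resolution of the feature map is halved while the number of channels doubles, which lets the network capture increasingly complex features. The right-hand side of the figure details the inside of one block: the MLP (multi-layer perceptron), LN (layer normalization), and the cross-shaped window self-attention at its core.

The following figure compares different positional encoding mechanisms: APE, CPE, RPE, and the LePE used in CSWin Transformer. It shows that LePE acts directly on V (the values) of the self-attention and runs as a parallel branch. Because the positional information enters the attention computation this directly, LePE gives the model a finer sense of local position than the other schemes. This helps in vision tasks, where the model needs to understand how the parts of an image are positioned relative to each other.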
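
Written as a formula, the parallel branch on V described above can be sketched as follows. The notation is an assumption chosen to match the code in this article, where `get_v` is a 3x3 depthwise convolution applied to V and its output `lepe` is added to `attn @ v`; here $d$ is the per-head channel dimension.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V + \mathrm{DWConv}(V)
```

The second term depends only on V and on a small spatial neighborhood, which is why it does not tie the model to one fixed input resolution.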


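2.4 Downstream-task friendly

Being downstream-task friendly means that a model or technique is easy to apply to later, task-specific processing. In CSWin Transformer, the design of LePE supports input of any resolution, so the model adapts more easily to different vision tasks such as image classification, object detection, and semantic segmentation. CSWin Transformer can therefore be applied to datasets with different resolutions without complicated re-tuning or extra preprocessing, which lowers the effort of using it for downstream tasks.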


3. Core code of CSWinTransformer

See Section 4 for how to use the code.

    # ------------------------------------------
    # CSWin Transformer
    # Copyright (c) Microsoft Corporation.
    # Licensed under the MIT License.
    # written By Xiaoyi Dong
    # ------------------------------------------
    import torch
    import torch.nn as nn
    from timm.data import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
    from timm.models.layers import DropPath, trunc_normal_
    from timm.models.registry import register_model
    from einops.layers.torch import Rearrange
    import torch.utils.checkpoint as checkpoint
    import numpy as np

    # Backbone variants that can be referenced from the yaml file
    __all__ = ['CSWin_64_12211_tiny_224', 'CSWin_64_24322_small_224',
               'CSWin_96_24322_base_224', 'CSWin_144_24322_large_224']


    def _make_divisible(v, divisor, min_value=None):
        """
        This function is taken from the original tf repo.
        It ensures that all layers have a channel number that is divisible by 8
        It can be seen here:
        https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
        :param v:
        :param divisor:
        :param min_value:
        :return:
        """
        if min_value is None:
            min_value = divisor
        new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
        # Make sure that round down does not go down by more than 10%.
        if new_v < 0.9 * v:
            new_v += divisor
        return new_v


    def _cfg(url='', **kwargs):
        return {
            'url': url,
            'num_classes': 1000, 'input_size': (3, 640, 640), 'pool_size': None,
            'crop_pct': .9, 'interpolation': 'bicubic',
            'mean': IMAGENET_DEFAULT_MEAN, 'std': IMAGENET_DEFAULT_STD,
            'first_conv': 'patch_embed.proj', 'classifier': 'head',
            **kwargs
        }


    default_cfgs = {
        'cswin_224': _cfg(),
        'cswin_384': _cfg(
            crop_pct=1.0
        ),
    }


    class Mlp(nn.Module):
        def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
            super().__init__()
            out_features = out_features or in_features
            hidden_features = hidden_features or in_features
            self.fc1 = nn.Linear(in_features, hidden_features)
            self.act = act_layer()
            self.fc2 = nn.Linear(hidden_features, out_features)
            self.drop = nn.Dropout(drop)

        def forward(self, x):
            x = self.fc1(x)
            x = self.act(x)
            x = self.drop(x)
            x = self.fc2(x)
            x = self.drop(x)
            return x


    class LePEAttention(nn.Module):
        def __init__(self, dim, resolution, idx, split_size=7, dim_out=None, num_heads=8, attn_drop=0., proj_drop=0.,
                     qk_scale=None):
            super().__init__()
            self.dim = dim
            self.dim_out = dim_out or dim
            self.resolution = resolution
            self.split_size = split_size
            self.num_heads = num_heads
            head_dim = dim // num_heads
            # NOTE scale factor was wrong in my original version, can set manually to be compat with prev weights
            self.scale = qk_scale or head_dim ** -0.5
            if idx == -1:  # last stage: one window covers the whole feature map
                H_sp, W_sp = self.resolution, self.resolution
            elif idx == 0:  # stripes of full height and split_size width
                H_sp, W_sp = self.resolution, self.split_size
            elif idx == 1:  # stripes of split_size height and full width
                W_sp, H_sp = self.resolution, self.split_size
            else:
                raise ValueError(f"unsupported stripe mode idx={idx}")
            self.H_sp = H_sp
            self.W_sp = W_sp
            # Depthwise 3x3 convolution that produces the LePE term from V
            self.get_v = nn.Conv2d(dim, dim, kernel_size=3, stride=1, padding=1, groups=dim)
            self.attn_drop = nn.Dropout(attn_drop)

        def im2cswin(self, x):
            B, N, C = x.shape
            H = W = int(np.sqrt(N))
            x = x.transpose(-2, -1).contiguous().view(B, C, H, W)
            x = img2windows(x, self.H_sp, self.W_sp)
            x = x.reshape(-1, self.H_sp * self.W_sp, self.num_heads, C // self.num_heads).permute(0, 2, 1, 3).contiguous()
            return x

        def get_lepe(self, x, func):
            B, N, C = x.shape
            H = W = int(np.sqrt(N))
            x = x.transpose(-2, -1).contiguous().view(B, C, H, W)
            H_sp, W_sp = self.H_sp, self.W_sp
            x = x.view(B, C, H // H_sp, H_sp, W // W_sp, W_sp)
            x = x.permute(0, 2, 4, 1, 3, 5).contiguous().reshape(-1, C, H_sp, W_sp)  # B', C, H', W'
            lepe = func(x)  # B', C, H', W'
            lepe = lepe.reshape(-1, self.num_heads, C // self.num_heads, H_sp * W_sp).permute(0, 1, 3, 2).contiguous()
            x = x.reshape(-1, self.num_heads, C // self.num_heads, self.H_sp * self.W_sp).permute(0, 1, 3, 2).contiguous()
            return x, lepe

        def forward(self, qkv):
            """
            x: B L C
            """
            q, k, v = qkv[0], qkv[1], qkv[2]
            # Img2Window
            H = W = self.resolution
            B, L, C = q.shape
            assert L == H * W, "flatten img_tokens has wrong size"
            q = self.im2cswin(q)
            k = self.im2cswin(k)
            v, lepe = self.get_lepe(v, self.get_v)
            q = q * self.scale
            attn = (q @ k.transpose(-2, -1))  # B head N C @ B head C N --> B head N N
            attn = nn.functional.softmax(attn, dim=-1, dtype=attn.dtype)
            attn = self.attn_drop(attn)
            x = (attn @ v) + lepe  # B head N N @ B head N C, plus the LePE branch
            x = x.transpose(1, 2).reshape(-1, self.H_sp * self.W_sp, C)
            # Window2Img
            x = windows2img(x, self.H_sp, self.W_sp, H, W).view(B, -1, C)  # B H' W' C
            return x


    class CSWinBlock(nn.Module):
        def __init__(self, dim, reso, num_heads,
                     split_size=7, mlp_ratio=4., qkv_bias=False, qk_scale=None,
                     drop=0., attn_drop=0., drop_path=0.,
                     act_layer=nn.GELU, norm_layer=nn.LayerNorm,
                     last_stage=False):
            super().__init__()
            self.dim = dim
            self.num_heads = num_heads
            self.patches_resolution = reso
            self.split_size = split_size
            self.mlp_ratio = mlp_ratio
            self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
            self.norm1 = norm_layer(dim)
            if self.patches_resolution == split_size:
                last_stage = True
            if last_stage:
                self.branch_num = 1
            else:
                self.branch_num = 2
            self.proj = nn.Linear(dim, dim)
            self.proj_drop = nn.Dropout(drop)
            if last_stage:
                # A single branch that attends over the whole feature map
                self.attns = nn.ModuleList([
                    LePEAttention(
                        dim, resolution=self.patches_resolution, idx=-1,
                        split_size=split_size, num_heads=num_heads, dim_out=dim,
                        qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop)
                    for i in range(self.branch_num)])
            else:
                # Two branches: each gets half of the channels and half of the heads
                self.attns = nn.ModuleList([
                    LePEAttention(
                        dim // 2, resolution=self.patches_resolution, idx=i,
                        split_size=split_size, num_heads=num_heads // 2, dim_out=dim // 2,
                        qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop)
                    for i in range(self.branch_num)])
            mlp_hidden_dim = int(dim * mlp_ratio)
            self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
            self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, out_features=dim, act_layer=act_layer,
                           drop=drop)
            self.norm2 = norm_layer(dim)

        def forward(self, x):
            """
            x: B, H*W, C
            """
            H = W = self.patches_resolution
            B, L, C = x.shape
            assert L == H * W, "flatten img_tokens has wrong size"
            img = self.norm1(x)
            qkv = self.qkv(img).reshape(B, -1, 3, C).permute(2, 0, 1, 3)
            if self.branch_num == 2:
                x1 = self.attns[0](qkv[:, :, :, :C // 2])
                x2 = self.attns[1](qkv[:, :, :, C // 2:])
                attened_x = torch.cat([x1, x2], dim=2)
            else:
                attened_x = self.attns[0](qkv)
            attened_x = self.proj(attened_x)
            x = x + self.drop_path(attened_x)
            x = x + self.drop_path(self.mlp(self.norm2(x)))
            return x


    def img2windows(img, H_sp, W_sp):
        """
        img: B C H W
        """
        B, C, H, W = img.shape
        img_reshape = img.view(B, C, H // H_sp, H_sp, W // W_sp, W_sp)
        img_perm = img_reshape.permute(0, 2, 4, 3, 5, 1).contiguous().reshape(-1, H_sp * W_sp, C)
        return img_perm


    def windows2img(img_splits_hw, H_sp, W_sp, H, W):
        """
        img_splits_hw: B' H W C
        """
        B = int(img_splits_hw.shape[0] / (H * W / H_sp / W_sp))
        img = img_splits_hw.view(B, H // H_sp, W // W_sp, H_sp, W_sp, -1)
        img = img.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
        return img


    class Merge_Block(nn.Module):
        """Downsample between stages: a stride-2 convolution halves the resolution."""

        def __init__(self, dim, dim_out, norm_layer=nn.LayerNorm):
            super().__init__()
            self.conv = nn.Conv2d(dim, dim_out, 3, 2, 1)
            self.norm = norm_layer(dim_out)

        def forward(self, x):
            B, new_HW, C = x.shape
            H = W = int(np.sqrt(new_HW))
            x = x.transpose(-2, -1).contiguous().view(B, C, H, W)
            x = self.conv(x)
            B, C = x.shape[:2]
            x = x.view(B, C, -1).transpose(-2, -1).contiguous()
            x = self.norm(x)
            return x


    class CSWinTransformer(nn.Module):
        """ Vision Transformer with support for patch or hybrid CNN input stage
        """

        def __init__(self, factor=0.5, depth_factor=0.5, img_size=640, patch_size=16, in_chans=3, num_classes=1000,
                     embed_dim=96, depth=[2, 2, 6, 2],
                     split_size=[3, 5, 7],
                     num_heads=12, mlp_ratio=4., qkv_bias=True, qk_scale=None, drop_rate=0., attn_drop_rate=0.,
                     drop_path_rate=0., hybrid_backbone=None, norm_layer=nn.LayerNorm, use_chk=False):
            super().__init__()
            # Scale width and depth to match the YOLO model size
            embed_dim = int(embed_dim * factor)
            depth = [max(1, int(dim * depth_factor)) for dim in depth]
            self.use_chk = use_chk
            self.num_classes = num_classes
            self.num_features = self.embed_dim = embed_dim  # num_features for consistency with other models
            heads = num_heads
            self.stage1_conv_embed = nn.Sequential(
                nn.Conv2d(in_chans, embed_dim, 7, 4, 2),
                Rearrange('b c h w -> b (h w) c', h=img_size // 4, w=img_size // 4),
                nn.LayerNorm(embed_dim)
            )
            curr_dim = embed_dim
            dpr = [x.item() for x in torch.linspace(0, drop_path_rate, int(np.sum(depth)))]  # stochastic depth decay rule
            self.stage1 = nn.ModuleList([
                CSWinBlock(
                    dim=curr_dim, num_heads=heads[0], reso=img_size // 4, mlp_ratio=mlp_ratio,
                    qkv_bias=qkv_bias, qk_scale=qk_scale, split_size=split_size[0],
                    drop=drop_rate, attn_drop=attn_drop_rate,
                    drop_path=dpr[i], norm_layer=norm_layer)
                for i in range(depth[0])])
            self.merge1 = Merge_Block(curr_dim, curr_dim * 2)
            curr_dim = curr_dim * 2
            self.stage2 = nn.ModuleList(
                [CSWinBlock(
                    dim=curr_dim, num_heads=heads[1], reso=img_size // 8, mlp_ratio=mlp_ratio,
                    qkv_bias=qkv_bias, qk_scale=qk_scale, split_size=split_size[1],
                    drop=drop_rate, attn_drop=attn_drop_rate,
                    drop_path=dpr[np.sum(depth[:1]) + i], norm_layer=norm_layer)
                    for i in range(depth[1])])
            self.merge2 = Merge_Block(curr_dim, curr_dim * 2)
            curr_dim = curr_dim * 2
            temp_stage3 = []
            temp_stage3.extend(
                [CSWinBlock(
                    dim=curr_dim, num_heads=heads[2], reso=img_size // 16, mlp_ratio=mlp_ratio,
                    qkv_bias=qkv_bias, qk_scale=qk_scale, split_size=split_size[2],
                    drop=drop_rate, attn_drop=attn_drop_rate,
                    drop_path=dpr[np.sum(depth[:2]) + i], norm_layer=norm_layer)
                    for i in range(depth[2])])
            self.stage3 = nn.ModuleList(temp_stage3)
            self.merge3 = Merge_Block(curr_dim, curr_dim * 2)
            curr_dim = curr_dim * 2
            self.stage4 = nn.ModuleList(
                [CSWinBlock(
                    dim=curr_dim, num_heads=heads[3], reso=img_size // 32, mlp_ratio=mlp_ratio,
                    qkv_bias=qkv_bias, qk_scale=qk_scale, split_size=split_size[-1],
                    drop=drop_rate, attn_drop=attn_drop_rate,
                    drop_path=dpr[np.sum(depth[:-1]) + i], norm_layer=norm_layer, last_stage=True)
                    for i in range(depth[-1])])
            self.norm = norm_layer(curr_dim)
            self.out_dim = curr_dim  # needed by reset_classifier
            # Classifier head
            self.head = nn.Linear(curr_dim, num_classes) if num_classes > 0 else nn.Identity()
            trunc_normal_(self.head.weight, std=0.02)
            self.apply(self._init_weights)
            # Channel count of every returned feature map; read by parse_model in tasks.py
            self.width_list = [i.size(1) for i in self.forward(torch.randn(1, in_chans, img_size, img_size))]

        def _init_weights(self, m):
            if isinstance(m, nn.Linear):
                trunc_normal_(m.weight, std=.02)
                if isinstance(m, nn.Linear) and m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, (nn.LayerNorm, nn.BatchNorm2d)):
                nn.init.constant_(m.bias, 0)
                nn.init.constant_(m.weight, 1.0)

        @torch.jit.ignore
        def no_weight_decay(self):
            return {'pos_embed', 'cls_token'}

        def get_classifier(self):
            return self.head

        def reset_classifier(self, num_classes, global_pool=''):
            if self.num_classes != num_classes:
                print('reset head to', num_classes)
                self.num_classes = num_classes
                self.head = nn.Linear(self.out_dim, num_classes) if num_classes > 0 else nn.Identity()
                self.head = self.head.cuda()
                trunc_normal_(self.head.weight, std=.02)
                if self.head.bias is not None:
                    nn.init.constant_(self.head.bias, 0)

        def _to_feature_map(self, x):
            """Convert tokens (B, H*W, C) into a feature map (B, C, H, W)."""
            B, L, C = x.shape
            H = W = int(L ** 0.5)
            # Transpose before reshaping so that channels and positions are not mixed up
            return x.transpose(1, 2).reshape(B, C, H, W)

        def forward(self, x):
            x = self.stage1_conv_embed(x)
            unique_tensors = {}
            for blk in self.stage1:
                x = checkpoint.checkpoint(blk, x) if self.use_chk else blk(x)
            y = self._to_feature_map(x)
            unique_tensors[(y.shape[2], y.shape[3])] = y
            for pre, blocks in zip([self.merge1, self.merge2, self.merge3],
                                   [self.stage2, self.stage3, self.stage4]):
                x = pre(x)
                for blk in blocks:
                    x = checkpoint.checkpoint(blk, x) if self.use_chk else blk(x)
                y = self._to_feature_map(x)
                unique_tensors[(y.shape[2], y.shape[3])] = y
            # One feature map per stage (strides 4, 8, 16, 32)
            result_list = list(unique_tensors.values())[-4:]
            return result_list


    def _conv_filter(state_dict, patch_size=16):
        """ convert patch embedding weight from manual patchify + linear proj to conv"""
        out_dict = {}
        for k, v in state_dict.items():
            if 'patch_embed.proj.weight' in k:
                v = v.reshape((v.shape[0], 3, patch_size, patch_size))
            out_dict[k] = v
        return out_dict


    ### 224 models
    @register_model
    def CSWin_64_12211_tiny_224(factor, depth_factor, **kwargs):
        model = CSWinTransformer(factor=factor, depth_factor=depth_factor, patch_size=4, embed_dim=64, depth=[1, 2, 21, 1],
                                 split_size=[1, 2, 8, 8], num_heads=[2, 4, 8, 16], mlp_ratio=4., **kwargs)
        model.default_cfg = default_cfgs['cswin_224']
        return model


    @register_model
    def CSWin_64_24322_small_224(factor, depth_factor, **kwargs):
        model = CSWinTransformer(factor=factor, depth_factor=depth_factor, patch_size=4, embed_dim=64, depth=[2, 4, 32, 2],
                                 split_size=[1, 2, 8, 8], num_heads=[2, 4, 8, 16], mlp_ratio=4., **kwargs)
        model.default_cfg = default_cfgs['cswin_224']
        return model


    @register_model
    def CSWin_96_24322_base_224(factor, depth_factor, **kwargs):
        model = CSWinTransformer(factor=factor, depth_factor=depth_factor, patch_size=4, embed_dim=96, depth=[2, 4, 32, 2],
                                 split_size=[1, 2, 8, 8], num_heads=[4, 8, 16, 32], mlp_ratio=4., **kwargs)
        model.default_cfg = default_cfgs['cswin_224']
        return model


    @register_model
    def CSWin_144_24322_large_224(factor, depth_factor, **kwargs):
        model = CSWinTransformer(factor=factor, depth_factor=depth_factor, patch_size=4, embed_dim=144, depth=[2, 4, 32, 2],
                                 split_size=[1, 2, 8, 8], num_heads=[6, 12, 24, 24], mlp_ratio=4., **kwargs)
        model.default_cfg = default_cfgs['cswin_224']
        return model


    if __name__ == '__main__':
        model = CSWin_64_12211_tiny_224(factor=0.25, depth_factor=0.5)
        inputs = torch.randn((1, 3, 640, 640))
        for i in model(inputs):
            print(i.size())


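4. Step-by-step: adding CSWinTransformer

4.1 Modification 1

The first step is to create the files. Under the ultralytics/nn folder, create a directory named 'Addmodules'. Inside it, create a new .py file and paste the core code from Section 3 into it.


4.2 Modification 2

Next, create a new file named '__init__.py' in the same directory and import our module in it, as shown in the figure below.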
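
Since the figures for these two steps are not available here, the layout can be sketched with a small helper. This is only an illustration under assumptions: the module file name `CSWinTransformer.py` and the helper name `scaffold_addmodules` are hypothetical, and the `from .CSWinTransformer import *` line is one plausible way to write the import.

```python
from pathlib import Path
import tempfile


def scaffold_addmodules(repo_root: Path, module_name: str = "CSWinTransformer") -> Path:
    """Create ultralytics/nn/Addmodules with a module file and an __init__.py."""
    pkg = repo_root / "ultralytics" / "nn" / "Addmodules"
    pkg.mkdir(parents=True, exist_ok=True)

    # Hypothetical file name; the core code from Section 3 goes in here.
    module_file = pkg / f"{module_name}.py"
    if not module_file.exists():
        module_file.write_text("# paste the core code from Section 3 here\n", encoding="utf-8")

    # __init__.py re-exports everything the module defines.
    init_file = pkg / "__init__.py"
    line = f"from .{module_name} import *\n"
    existing = init_file.read_text(encoding="utf-8") if init_file.exists() else ""
    if line not in existing:
        init_file.write_text(existing + line, encoding="utf-8")
    return pkg


if __name__ == "__main__":
    # Demonstrated in a temporary directory; pass your repository root instead.
    with tempfile.TemporaryDirectory() as tmp:
        pkg = scaffold_addmodules(Path(tmp))
        print(sorted(p.name for p in pkg.iterdir()))
```

Creating the two files by hand works just as well; the helper only makes the expected directory structure explicit.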


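4.3 Modification 3

The third step is to open 'ultralytics/nn/tasks.py', then import and register our module there.


4.4 Modification 4

Add the two lines of code shown in the figure.


4.5 Modification 5

Go to roughly line 700-something (see the figure for the exact location) and follow the figure: add the part inside the red box. Note that the entries are function names only, without parentheses.

    # List your own backbone functions here; everything below stays the same.
    elif m in {CSWin_64_12211_tiny_224, CSWin_64_24322_small_224, CSWin_96_24322_base_224, CSWin_144_24322_large_224}:
        m = m(*args)
        c2 = m.width_list  # list of output channels, one per stage
        backbone = True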


4.6 Modification 6

Both red boxes below need to be changed.

    if isinstance(c2, list):
        m_ = m
        m_.backbone = True
    else:
        m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)  # module
        t = str(m)[8:-2].replace('__main__.', '')  # module type
    m.np = sum(x.numel() for x in m_.parameters())  # number params
    m_.i, m_.f, m_.type = i + 4 if backbone else i, f, t  # attach index, 'from' index, type


4.7 Modification 7

The following part also needs to be changed; follow my version exactly.

Replace the original code with the code below.

    if verbose:
        LOGGER.info(f'{i:>3}{str(f):>20}{n_:>3}{m.np:10.0f}  {t:<45}{str(args):<30}')  # print
    save.extend(x % (i + 4 if backbone else i) for x in ([f] if isinstance(f, int) else f) if x != -1)  # append to savelist
    layers.append(m_)
    if i == 0:
        ch = []
    if isinstance(c2, list):
        ch.extend(c2)
        if len(c2) != 5:
            ch.insert(0, 0)
    else:
        ch.append(c2)
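
The channel bookkeeping at the end of this snippet is easier to follow in isolation. The sketch below mirrors that logic in plain Python; `update_channels` is a hypothetical helper name, and the widths `[16, 32, 64, 128]` and the value 256 for the next layer are assumed N-scale numbers used only for illustration.

```python
def update_channels(ch, c2, layer_index):
    """Mirror the channel-list update from the snippet above (illustrative only)."""
    if layer_index == 0:
        ch = []  # the first yaml row resets the list
    if isinstance(c2, list):
        ch.extend(c2)  # a backbone contributes one entry per stage
        if len(c2) != 5:
            ch.insert(0, 0)  # placeholder so the backbone always fills slots 0-4
    else:
        ch.append(c2)  # ordinary layers contribute a single entry
    return ch


# Backbone row: four stage outputs plus the placeholder at index 0.
ch = update_channels([3], [16, 32, 64, 128], layer_index=0)
# Next yaml row (for example SPPF) appends one more entry and becomes layer 5.
ch = update_channels(ch, 256, layer_index=1)
print(ch)
```

With this layout, index 2 holds the stride-8 stage and index 3 the stride-16 stage, which is why the head in the yaml file concatenates with layers 2 and 3.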


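4.8 Modification 8

Modification 8 differs from the previous ones: it changes part of the forward pass, so we have now left the parse_model function.

You can read the line numbers off the figure. We are still in tasks.py; every modification is in the same file. Several forward functions in that file look very similar, so make sure you edit the right one: it is the one at around line 70-something. The code is provided below, so you can simply copy and paste it. When I have time I will record a video about this step.

The code is as follows: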

    def _predict_once(self, x, profile=False, visualize=False, embed=None):
        """
        Perform a forward pass through the network.
        Args:
            x (torch.Tensor): The input tensor to the model.
            profile (bool): Print the computation time of each layer if True, defaults to False.
            visualize (bool): Save the feature maps of the model if True, defaults to False.
            embed (list, optional): A list of feature vectors/embeddings to return.
        Returns:
            (torch.Tensor): The last output of the model.
        """
        y, dt, embeddings = [], [], []  # outputs
        for m in self.model:
            if m.f != -1:  # if not from previous layer
                x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
            if profile:
                self._profile_one_layer(m, x, dt)
            if hasattr(m, 'backbone'):
                x = m(x)
                if len(x) != 5:  # 0 - 5
                    x.insert(0, None)
                for index, i in enumerate(x):
                    if index in self.save:
                        y.append(i)
                    else:
                        y.append(None)
                x = x[-1]  # pass the last output on to the next layer
            else:
                x = m(x)  # run
                y.append(x if m.i in self.save else None)  # save output
            if visualize:
                feature_visualization(x, m.type, m.i, save_dir=visualize)
            if embed and m.i in embed:
                embeddings.append(nn.functional.adaptive_avg_pool2d(x, (1, 1)).squeeze(-1).squeeze(-1))  # flatten
                if m.i == max(embed):
                    return torch.unbind(torch.cat(embeddings, 1), dim=0)
        return x
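
The backbone branch of this function can be sketched on its own, with strings standing in for tensors. The helper name `collect_backbone_outputs` is hypothetical, and the `save` set is an assumption derived from the layer indices referenced by the yaml file in Section 5.

```python
def collect_backbone_outputs(outputs, save):
    """Pad the backbone outputs to five slots and keep only the ones listed in `save`."""
    outputs = list(outputs)
    if len(outputs) != 5:
        outputs.insert(0, None)  # slot 0 is only a placeholder
    y = [out if index in save else None for index, out in enumerate(outputs)]
    return y, outputs[-1]  # the deepest feature map feeds the next layer


# Strings stand in for the stride-4/8/16/32 feature maps.
y, x = collect_backbone_outputs(["s4", "s8", "s16", "s32"], save={2, 3, 6, 9, 12, 15, 18})
print(y, x)
```

Only the maps that later layers actually read (indices 2 and 3 here) are kept in `y`; the other slots hold `None`, so that layer indices in the yaml file still line up.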

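That completes the modifications. There are many details here, so take care not to replace any extra code, and do not skip a single step; either mistake makes the run fail, and the resulting errors are very hard to track down.


Note: an extra modification

Regular readers know that most of my modifications follow the same steps. This network needs one extra step: there is a parameter named s, shown below; change it to 640 and the model runs correctly.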
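
A likely reason for this step, offered here as an assumption because the figure is not available: in Ultralytics, `s` is the side length of the dummy input that the detection model uses to work out its strides (256 by default), while the backbone in Section 3 fixes its token grid from `img_size=640` when it is built. The sketch below only reproduces the size arithmetic of the stem convolution; the helper names are hypothetical.

```python
def conv_out(size, kernel=7, stride=4, padding=2):
    """Output side length of the stem convolution nn.Conv2d(in_chans, embed_dim, 7, 4, 2)."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_matches(input_size, img_size=640):
    """True if an input of this size yields the token grid the backbone was built for."""
    expected = img_size // 4  # `reso` passed to the stage-1 blocks
    return conv_out(input_size) == expected


for size in (640, 256):
    print(size, conv_out(size), stem_matches(size))
```

A 256-pixel input gives a 64x64 grid instead of the expected 160x160, so the fixed-size rearrange and the `L == H * W` assertion cannot hold; feeding 640 keeps the two in agreement.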


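Fixing the missing GFLOPs printout

Open 'ultralytics/utils/torch_utils.py' and modify it as shown in the figure below; otherwise the computation cost (GFLOPs) may fail to print.


Caveats

If you hit a shape-mismatch error during validation, you can fix the image size of the validation set as follows:

Open ultralytics/models/yolo/detect/train.py. In the build_dataset function of the DetectionTrainer class, change the argument rect=mode == 'val' to rect=False.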
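
One plausible explanation, stated as an assumption: with rectangular validation batches the images are no longer square, but the backbone code recovers the grid with `H = W = int(np.sqrt(N))`, which only works when the token count is a perfect square. The helper below is a hypothetical illustration, and the 640x448 batch shape is an invented example.

```python
import math


def square_grid_side(num_tokens):
    """Return H (= W) if the tokens form a square grid, otherwise None."""
    side = math.isqrt(num_tokens)
    return side if side * side == num_tokens else None


# A square 640x640 image gives a 160x160 token grid after the stride-4 stem.
print(square_grid_side(160 * 160))
# A hypothetical rectangular 640x448 batch would give 160x112 tokens.
print(square_grid_side(160 * 112))
```

Setting rect=False keeps validation images square, so the square-grid assumption continues to hold.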


5. YAML file for CSWinTransformer

Copy the following yaml file and run it.


5.1 CSWinTransformer yaml file, version 1

Training info for this version: YOLO11-CSWinTransformer summary: 472 layers, 2,486,455 parameters, 2,486,439 gradients, 5.9 GFLOPs

Usage: # In [-1, 1, CSWin_64_12211_tiny_224, [0.25, 0.5]] below, 0.25 is the channel scaling factor: 0.25 for YOLOv11n, 0.5 for YOLOv11s, 1 for YOLOv11m, 1 for YOLOv11l, and 1.5 for YOLOv11x. Set it to match the YOLO version you train.

# 0.5 is the depth factor of the model.
# Supported variants: [ 'CSWin_64_12211_tiny_224', "CSWin_64_24322_small_224", "CSWin_96_24322_base_224", "CSWin_144_24322_large_224"]
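
The effect of the two factors can be checked without building the model. The sketch below repeats the two scaling lines from the constructor in Section 3 (`int(embed_dim * factor)` and `max(1, int(d * depth_factor))`) and assumes, as in that code, that the channel count doubles at every merge block; `scaled_variant` is a hypothetical helper name.

```python
def scaled_variant(embed_dim, depth, factor, depth_factor):
    """Per-stage channels and block counts after applying the two scaling factors."""
    width = int(embed_dim * factor)
    depths = [max(1, int(d * depth_factor)) for d in depth]
    channels = [width * 2 ** stage for stage in range(4)]  # doubled at each merge block
    return channels, depths


# CSWin_64_12211_tiny_224 with the N-scale arguments [0.25, 0.5]
print(scaled_variant(64, [1, 2, 21, 1], factor=0.25, depth_factor=0.5))
# The same variant with the S-scale width factor 0.5
print(scaled_variant(64, [1, 2, 21, 1], factor=0.5, depth_factor=0.5))
```

The channel list is what the backbone reports as `width_list`, so it also determines the input widths of the layers that follow in the yaml file.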

    # Ultralytics YOLO 🚀, AGPL-3.0 license
    # YOLO11 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

    # Parameters
    nc: 80 # number of classes
    scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
      # [depth, width, max_channels]
      n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
      s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
      m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
      l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
      x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs

    # In [-1, 1, CSWin_64_12211_tiny_224, [0.25, 0.5]] below, 0.25 is the channel scaling factor:
    # 0.25 for YOLOv11n, 0.5 for YOLOv11s, 1 for YOLOv11m, 1 for YOLOv11l, 1.5 for YOLOv11x. Set it to match the YOLO version you train.
    # 0.5 is the depth factor of the model.
    # Supported variants: __all__ = [ 'CSWin_64_12211_tiny_224', "CSWin_64_24322_small_224", "CSWin_96_24322_base_224", "CSWin_144_24322_large_224"]

    # YOLO11n backbone
    backbone:
      # [from, repeats, module, args]
      - [-1, 1, CSWin_64_12211_tiny_224, [0.25, 0.5]] # 0-4 P1/2; this one row stands for four layers, so do not let the yaml layout limit your thinking
      - [-1, 1, SPPF, [1024, 5]] # 5
      - [-1, 2, C2PSA, [1024]] # 6

    # YOLO11n head
    head:
      - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
      - [[-1, 3], 1, Concat, [1]] # cat backbone P4
      - [-1, 2, C3k2, [512, False]] # 9

      - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
      - [[-1, 2], 1, Concat, [1]] # cat backbone P3
      - [-1, 2, C3k2, [256, False]] # 12 (P3/8-small)

      - [-1, 1, Conv, [256, 3, 2]]
      - [[-1, 9], 1, Concat, [1]] # cat head P4
      - [-1, 2, C3k2, [512, False]] # 15 (P4/16-medium)

      - [-1, 1, Conv, [512, 3, 2]]
      - [[-1, 6], 1, Concat, [1]] # cat head P5
      - [-1, 2, C3k2, [1024, True]] # 18 (P5/32-large)

      - [[12, 15, 18], 1, Detect, [nc]] # Detect(P3, P4, P5)


5.2 Training script

    import warnings
    warnings.filterwarnings('ignore')
    from ultralytics import YOLO

    if __name__ == '__main__':
        # Point this at the yaml file from Section 5.1
        model = YOLO(r'replace with the path to the yaml file from Section 5.1')
        # model.load('yolov8n.pt')  # loading pretrain weights
        model.train(data=r'replace with the path to your dataset yaml file',
                    # For other tasks, open 'ultralytics/cfg/default.yaml' and set task to detect, segment, classify, or pose
                    cache=False,
                    imgsz=640,
                    epochs=150,
                    single_cls=False,  # whether this is single-class detection
                    batch=4,
                    close_mosaic=10,
                    workers=0,
                    device='0',
                    optimizer='SGD',  # using SGD
                    # resume='',  # to resume training, set this to the path of last.pt
                    amp=False,  # disable amp if the training loss becomes NaN
                    project='runs/train',
                    name='exp',
                    )


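6. Record of a successful run

Below is a screenshot of a successful run. One epoch of training has completed; the image is too large to capture the second epoch. After these changes the printed output has a small problem, but it does not affect any functionality. I will fix it later when I have time.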


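7. Summary

That concludes this article. I recommend my column on effective YOLOv11 improvements. The column is newly opened and currently has an average quality score of 98. I will keep reproducing papers from the latest top conferences and will also fill in some older improvement mechanisms. The column is free to read for now, so follow it early. If this article helped you, subscribe to the column and watch for further updates.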