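YOLOv11 Improvements | Secondary Innovation | Upgrading the Dynamic Unified Detection Head DynamicHead with Deformable Convolution DCNv3 (Exclusive First Release)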

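1. Introduction

This article presents an improvement that replaces the core operator of DynamicHead with DCNv3. DynamicHead (DyHead) is built around DCNv2. DCNv3 has since been released as the successor to v2 and is expected to perform better, so I swapped it in as the core mechanism, which amounts to an upgrade of DyHead. In my experiments it also outperforms the plain version. This combination is my own work, and as far as I know nobody else has published it. Several readers of my one-to-one DyHead reproduction have already reported accuracy gains, and the DCNv3 version described here gained three to four points on my data, so it is worth trying as a stronger alternative to the basic DyHead. The head is also well suited to use in papers.

You are welcome to subscribe to my column and learn YOLO together!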


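Contents

1. Introduction
2. How DCN Works
3. Core Code of DynamicDCNv3Head
4. How to Add DynamicDCNv3Head
4.1 Modification 1
4.2 Modification 2
4.3 Modification 3
4.4 Modification 4
4.5 Modification 5
4.6 Modification 6
4.7 Modification 7
4.8 Modification 8
4.9 Modification 9
5. YAML File for the DynamicDCNv3Head Detection Head
6. Successful Run Record
7. Summary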


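2. How DCN Works

Let us start with the overall concept. DCN stands for Deformable Convolutional Networks. It is a convolutional module used in object detection and image segmentation that introduces a deformable convolution operation to improve the model's ability to handle geometric deformation of objects.

What is a deformable convolution? The figure below makes it clear.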

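The figure above shows the sampling locations of a standard convolution and of a deformable convolution. In the standard convolution (a), the sampling locations form a regular grid (green points): when the kernel is applied, it samples the input feature map at fixed grid positions.

In the deformable convolution (b), the sampling locations are deformed by offsets (dark blue points), with the augmented offsets drawn as light blue arrows. The kernel is no longer restricted to the regular grid and can sample the input feature map wherever the offsets point.

Deformable convolution generalizes various transformations such as scale, (anisotropic) aspect ratio, and rotation, which are shown as special cases in (c) and (d). It can therefore adapt to different kinds of transformations, which strengthens the model's ability to handle object deformation.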
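As a rough illustration of these special cases, the sketch below builds the nine sampling positions of a 3×3 kernel and computes offsets that reproduce a scale change, an aspect-ratio change, and a rotation. The helper names (`regular_grid`, `scale_offsets`, `rotation_offsets`, `apply_offsets`) are made up for this example. In an actual deformable convolution the offsets are predicted by a learned layer, not computed in closed form.

```python
import math


def regular_grid():
    """Relative (dy, dx) sampling positions of a standard 3x3 kernel."""
    return [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def scale_offsets(grid, sy, sx):
    """Offsets that stretch the grid by sy vertically and sx horizontally."""
    return [((sy - 1) * dy, (sx - 1) * dx) for dy, dx in grid]


def rotation_offsets(grid, angle_deg):
    """Offsets that rotate the grid by angle_deg around the kernel center."""
    a = math.radians(angle_deg)
    offsets = []
    for dy, dx in grid:
        ry = dy * math.cos(a) + dx * math.sin(a)
        rx = -dy * math.sin(a) + dx * math.cos(a)
        offsets.append((ry - dy, rx - dx))
    return offsets


def apply_offsets(grid, offsets):
    """Deformed sampling positions: regular position plus its offset."""
    return [(dy + oy, dx + ox) for (dy, dx), (oy, ox) in zip(grid, offsets)]


grid = regular_grid()
scaled = apply_offsets(grid, scale_offsets(grid, 2.0, 2.0))     # scale change
stretched = apply_offsets(grid, scale_offsets(grid, 1.0, 3.0))  # aspect ratio
rotated = apply_offsets(grid, rotation_offsets(grid, 90.0))     # rotation

print("scaled corners:", scaled[0], scaled[-1])
print("stretched corners:", stretched[0], stretched[-1])
print("rotated first point:", rotated[0])
```

With a uniform scale of 2 the positions match a dilated 3×3 kernel, which shows that dilated convolution is one fixed choice of offsets among the many a deformable convolution can learn.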

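In short, a standard convolution samples on a regular grid, while a deformable convolution adds offsets to sample irregularly, which gives it stronger generalization across shape transformations such as scale, aspect ratio, and rotation.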

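Looking at the same idea from a three-dimensional view should be more intuitive.

The input feature map is on the left and the output feature map is on the right. The kernel is 3x3, so a 3x3 region of the input is mapped to a single 1x1 location in the output. The question is how that 3x3 region is chosen. A standard convolution always uses the regular square. A deformable convolution adds an offset that is computed separately for each of the nine points, which changes where each point in the region is sampled, so the kernel can pick up points that may carry richer features and thereby improve detection.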
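The computation of one output value can be sketched in plain Python as follows. This is a minimal illustration under simplifying assumptions: a single channel, a 3×3 kernel, no modulation mask (the mask was introduced in DCNv2), and offsets supplied by hand instead of predicted by a layer. The names `bilinear` and `deform_conv_at` are made up for this example. Because offsets are fractional, each sample is read with bilinear interpolation.

```python
import math


def bilinear(feat, y, x):
    """Bilinearly interpolate feat (a list of rows) at fractional (y, x).

    Positions outside the feature map contribute zero.
    """
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feat[yy][xx]
    return value


def deform_conv_at(feat, weights, cy, cx, offsets):
    """One output value of a 3x3 deformable convolution centered at (cy, cx).

    weights: 9 kernel weights, offsets: 9 (dy, dx) pairs, one per kernel point.
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for w_k, (gy, gx), (oy, ox) in zip(weights, grid, offsets):
        out += w_k * bilinear(feat, cy + gy + oy, cx + gx + ox)
    return out


feat = [[float(r * 5 + c) for c in range(5)] for r in range(5)]  # 5x5 ramp
weights = [1.0 / 9] * 9                                          # box filter
zero = [(0.0, 0.0)] * 9         # no offsets: behaves like a standard convolution
shift = [(0.0, 0.5)] * 9        # every point moved half a pixel to the right

standard = deform_conv_at(feat, weights, 2, 2, zero)
shifted = deform_conv_at(feat, weights, 2, 2, shift)
print(standard, shifted)
```

With all offsets at zero the result is the ordinary 3×3 convolution output. Moving every point half a pixel to the right raises the result by 0.5 on this ramp, since the values grow by 1 per column.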

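Next, let us look at how deformable convolution behaves in real detection scenes. The images below show activation units on the background, on a small object, and on a large object. The red points are the locations the deformable filters sample from.

Each image triplet shows the sampling locations of three levels of 3×3 deformable filters (9 × 9 × 9 = 729 red points per image) for three activation units (green points) located on the background (left), a small object (middle), and a large object (right).

The purpose of the figure is to show how the sampling locations change with object scale. In the background image on the left, the sampling locations spread mainly over the background. In the small-object image in the middle, they shift toward the small object and cluster more densely around it. In the large-object image on the right, they expand further to cover the whole object and capture its details and deformation.

These examples show that the sampling locations of a deformable convolution adapt to the scale of the target, so features are captured more accurately on objects of different sizes. This improves the model's sensitivity to targets at different scales and makes it better suited to multi-scale detection, which is also why I consider the approach suitable for a wide range of detection targets.

The figure above may make this more intuitive.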


3. Core Code of DynamicDCNv3Head

See section 4 for how to use this code.

import copy
import math
import warnings

import torch
import torch.nn.functional as F
from mmcv.ops import ModulatedDeformConv2d
from torch import nn
from torch.nn.init import xavier_uniform_, constant_
from ultralytics.nn.modules import DFL
from ultralytics.utils.tal import dist2bbox, make_anchors

__all__ = ['DynamicDCNv3Head']


def _get_reference_points(spatial_shapes, device, kernel_h, kernel_w, dilation_h, dilation_w, pad_h=0, pad_w=0, stride_h=1, stride_w=1):
    _, H_, W_, _ = spatial_shapes
    H_out = (H_ - (dilation_h * (kernel_h - 1) + 1)) // stride_h + 1
    W_out = (W_ - (dilation_w * (kernel_w - 1) + 1)) // stride_w + 1

    ref_y, ref_x = torch.meshgrid(
        torch.linspace(
            # pad_h + 0.5,
            # H_ - pad_h - 0.5,
            (dilation_h * (kernel_h - 1)) // 2 + 0.5,
            (dilation_h * (kernel_h - 1)) // 2 + 0.5 + (H_out - 1) * stride_h,
            H_out,
            dtype=torch.float32,
            device=device),
        torch.linspace(
            # pad_w + 0.5,
            # W_ - pad_w - 0.5,
            (dilation_w * (kernel_w - 1)) // 2 + 0.5,
            (dilation_w * (kernel_w - 1)) // 2 + 0.5 + (W_out - 1) * stride_w,
            W_out,
            dtype=torch.float32,
            device=device))
    ref_y = ref_y.reshape(-1)[None] / H_
    ref_x = ref_x.reshape(-1)[None] / W_

    ref = torch.stack((ref_x, ref_y), -1).reshape(
        1, H_out, W_out, 1, 2)

    return ref


def _generate_dilation_grids(spatial_shapes, kernel_h, kernel_w, dilation_h, dilation_w, group, device):
    _, H_, W_, _ = spatial_shapes
    points_list = []
    x, y = torch.meshgrid(
        torch.linspace(
            -((dilation_w * (kernel_w - 1)) // 2),
            -((dilation_w * (kernel_w - 1)) // 2) +
            (kernel_w - 1) * dilation_w, kernel_w,
            dtype=torch.float32,
            device=device),
        torch.linspace(
            -((dilation_h * (kernel_h - 1)) // 2),
            -((dilation_h * (kernel_h - 1)) // 2) +
            (kernel_h - 1) * dilation_h, kernel_h,
            dtype=torch.float32,
            device=device))

    points_list.extend([x / W_, y / H_])
    grid = torch.stack(points_list, -1).reshape(-1, 1, 2).\
        repeat(1, group, 1).permute(1, 0, 2)
    grid = grid.reshape(1, 1, 1, group * kernel_h * kernel_w, 2)

    return grid


def dcnv3_core_pytorch(
        input, offset, mask, kernel_h,
        kernel_w, stride_h, stride_w, pad_h,
        pad_w, dilation_h, dilation_w, group,
        group_channels, offset_scale):
    # for debug and test only,
    # need to use cuda version instead
    input = F.pad(
        input,
        [0, 0, pad_h, pad_h, pad_w, pad_w])
    N_, H_in, W_in, _ = input.shape
    _, H_out, W_out, _ = offset.shape

    ref = _get_reference_points(
        input.shape, input.device, kernel_h, kernel_w, dilation_h, dilation_w, pad_h, pad_w, stride_h, stride_w)
    grid = _generate_dilation_grids(
        input.shape, kernel_h, kernel_w, dilation_h, dilation_w, group, input.device)
    spatial_norm = torch.tensor([W_in, H_in]).reshape(1, 1, 1, 2).\
        repeat(1, 1, 1, group*kernel_h*kernel_w).to(input.device)

    sampling_locations = (ref + grid * offset_scale).repeat(N_, 1, 1, 1, 1).flatten(3, 4) + \
        offset * offset_scale / spatial_norm

    P_ = kernel_h * kernel_w
    sampling_grids = 2 * sampling_locations - 1
    # N_, H_in, W_in, group*group_channels -> N_, H_in*W_in, group*group_channels -> N_, group*group_channels, H_in*W_in -> N_*group, group_channels, H_in, W_in
    input_ = input.view(N_, H_in*W_in, group*group_channels).transpose(1, 2).\
        reshape(N_*group, group_channels, H_in, W_in)
    # N_, H_out, W_out, group*P_*2 -> N_, H_out*W_out, group, P_, 2 -> N_, group, H_out*W_out, P_, 2 -> N_*group, H_out*W_out, P_, 2
    sampling_grid_ = sampling_grids.view(N_, H_out*W_out, group, P_, 2).transpose(1, 2).\
        flatten(0, 1)
    # N_*group, group_channels, H_out*W_out, P_
    sampling_input_ = F.grid_sample(
        input_, sampling_grid_, mode='bilinear', padding_mode='zeros', align_corners=False)

    # (N_, H_out, W_out, group*P_) -> N_, H_out*W_out, group, P_ -> (N_, group, H_out*W_out, P_) -> (N_*group, 1, H_out*W_out, P_)
    mask = mask.view(N_, H_out*W_out, group, P_).transpose(1, 2).\
        reshape(N_*group, 1, H_out*W_out, P_)
    output = (sampling_input_ * mask).sum(-1).view(N_,
                                                   group*group_channels, H_out*W_out)

    return output.transpose(1, 2).reshape(N_, H_out, W_out, -1).contiguous()


class to_channels_first(nn.Module):

    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x.permute(0, 3, 1, 2)


class to_channels_last(nn.Module):

    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x.permute(0, 2, 3, 1)


def build_norm_layer(dim,
                     norm_layer,
                     in_format='channels_last',
                     out_format='channels_last',
                     eps=1e-6):
    layers = []
    if norm_layer == 'BN':
        if in_format == 'channels_last':
            layers.append(to_channels_first())
        layers.append(nn.BatchNorm2d(dim))
        if out_format == 'channels_last':
            layers.append(to_channels_last())
    elif norm_layer == 'LN':
        if in_format == 'channels_first':
            layers.append(to_channels_last())
        layers.append(nn.LayerNorm(dim, eps=eps))
        if out_format == 'channels_first':
            layers.append(to_channels_first())
    else:
        raise NotImplementedError(
            f'build_norm_layer does not support {norm_layer}')
    return nn.Sequential(*layers)


def build_act_layer(act_layer):
    if act_layer == 'ReLU':
        return nn.ReLU(inplace=True)
    elif act_layer == 'SiLU':
        return nn.SiLU(inplace=True)
    elif act_layer == 'GELU':
        return nn.GELU()

    raise NotImplementedError(f'build_act_layer does not support {act_layer}')


def _is_power_of_2(n):
    if (not isinstance(n, int)) or (n < 0):
        raise ValueError(
            "invalid input for _is_power_of_2: {} (type: {})".format(n, type(n)))

    return (n & (n - 1) == 0) and n != 0


class CenterFeatureScaleModule(nn.Module):
    def forward(self,
                query,
                center_feature_scale_proj_weight,
                center_feature_scale_proj_bias):
        center_feature_scale = F.linear(query,
                                        weight=center_feature_scale_proj_weight,
                                        bias=center_feature_scale_proj_bias).sigmoid()
        return center_feature_scale


class DCNv3_pytorch(nn.Module):
    def __init__(
            self,
            channels=64,
            kernel_size=3,
            dw_kernel_size=None,
            stride=1,
            pad=1,
            dilation=1,
            group=4,
            offset_scale=1.0,
            act_layer='GELU',
            norm_layer='LN',
            center_feature_scale=False):
        """
        DCNv3 Module
        :param channels
        :param kernel_size
        :param stride
        :param pad
        :param dilation
        :param group
        :param offset_scale
        :param act_layer
        :param norm_layer
        """
        super().__init__()
        if channels % group != 0:
            raise ValueError(
                f'channels must be divisible by group, but got {channels} and {group}')
        _d_per_group = channels // group
        dw_kernel_size = dw_kernel_size if dw_kernel_size is not None else kernel_size
        # you'd better set _d_per_group to a power of 2 which is more efficient in our CUDA implementation
        if not _is_power_of_2(_d_per_group):
            warnings.warn(
                "You'd better set channels in DCNv3 to make the dimension of each attention head a power of 2 "
                "which is more efficient in our CUDA implementation.")

        self.offset_scale = offset_scale
        self.channels = channels
        self.kernel_size = kernel_size
        self.dw_kernel_size = dw_kernel_size
        self.stride = stride
        self.dilation = dilation
        self.pad = pad
        self.group = group
        self.group_channels = channels // group
        self.center_feature_scale = center_feature_scale

        self.dw_conv = nn.Sequential(
            nn.Conv2d(
                channels,
                channels,
                kernel_size=dw_kernel_size,
                stride=1,
                padding=(dw_kernel_size - 1) // 2,
                groups=channels),
            build_norm_layer(
                channels,
                norm_layer,
                'channels_first',
                'channels_last'),
            build_act_layer(act_layer))
        self.offset = nn.Linear(
            channels,
            group * kernel_size * kernel_size * 2)
        self.mask = nn.Linear(
            channels,
            group * kernel_size * kernel_size)
        self.input_proj = nn.Linear(channels, channels)
        self.output_proj = nn.Linear(channels, channels)
        self._reset_parameters()

        if center_feature_scale:
            self.center_feature_scale_proj_weight = nn.Parameter(
                torch.zeros((group, channels), dtype=torch.float))
            self.center_feature_scale_proj_bias = nn.Parameter(
                torch.tensor(0.0, dtype=torch.float).view((1,)).repeat(group, ))
            self.center_feature_scale_module = CenterFeatureScaleModule()

    def _reset_parameters(self):
        constant_(self.offset.weight.data, 0.)
        constant_(self.offset.bias.data, 0.)
        constant_(self.mask.weight.data, 0.)
        constant_(self.mask.bias.data, 0.)
        xavier_uniform_(self.input_proj.weight.data)
        constant_(self.input_proj.bias.data, 0.)
        xavier_uniform_(self.output_proj.weight.data)
        constant_(self.output_proj.bias.data, 0.)

    def forward(self, input):
        """
        :param input (N, C, H, W)
        :return output (N, C, H, W)
        """
        # Work in channels-last layout (N, H, W, C) internally
        input = input.permute(0, 2, 3, 1)
        N, H, W, _ = input.shape

        x = self.input_proj(input)
        x_proj = x

        x1 = input.permute(0, 3, 1, 2)
        x1 = self.dw_conv(x1)
        offset = self.offset(x1)
        mask = self.mask(x1).reshape(N, H, W, self.group, -1)
        mask = F.softmax(mask, -1).reshape(N, H, W, -1)

        x = dcnv3_core_pytorch(
            x, offset, mask,
            self.kernel_size, self.kernel_size,
            self.stride, self.stride,
            self.pad, self.pad,
            self.dilation, self.dilation,
            self.group, self.group_channels,
            self.offset_scale)
        if self.center_feature_scale:
            center_feature_scale = self.center_feature_scale_module(
                x1, self.center_feature_scale_proj_weight, self.center_feature_scale_proj_bias)
            # N, H, W, groups -> N, H, W, groups, 1 -> N, H, W, groups, _d_per_group -> N, H, W, channels
            center_feature_scale = center_feature_scale[..., None].repeat(
                1, 1, 1, 1, self.channels // self.group).flatten(-2)
            x = x * (1 - center_feature_scale) + x_proj * center_feature_scale
        x = self.output_proj(x).permute(0, 3, 1, 2)

        return x


def _make_divisible(v, divisor, min_value=None):
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v


class h_swish(nn.Module):
    def __init__(self, inplace=False):
        super(h_swish, self).__init__()
        self.inplace = inplace

    def forward(self, x):
        return x * F.relu6(x + 3.0, inplace=self.inplace) / 6.0


class h_sigmoid(nn.Module):
    def __init__(self, inplace=True, h_max=1):
        super(h_sigmoid, self).__init__()
        self.relu = nn.ReLU6(inplace=inplace)
        self.h_max = h_max

    def forward(self, x):
        return self.relu(x + 3) * self.h_max / 6


class DYReLU(nn.Module):
    def __init__(self, inp, oup, reduction=4, lambda_a=1.0, K2=True, use_bias=True, use_spatial=False,
                 init_a=[1.0, 0.0], init_b=[0.0, 0.0]):
        super(DYReLU, self).__init__()
        self.oup = oup
        self.lambda_a = lambda_a * 2
        self.K2 = K2
        self.avg_pool = nn.AdaptiveAvgPool2d(1)

        self.use_bias = use_bias
        if K2:
            self.exp = 4 if use_bias else 2
        else:
            self.exp = 2 if use_bias else 1
        self.init_a = init_a
        self.init_b = init_b

        # determine squeeze
        if reduction == 4:
            squeeze = inp // reduction
        else:
            squeeze = _make_divisible(inp // reduction, 4)
        # print('reduction: {}, squeeze: {}/{}'.format(reduction, inp, squeeze))
        # print('init_a: {}, init_b: {}'.format(self.init_a, self.init_b))

        self.fc = nn.Sequential(
            nn.Linear(inp, squeeze),
            nn.ReLU(inplace=True),
            nn.Linear(squeeze, oup * self.exp),
            h_sigmoid()
        )
        if use_spatial:
            self.spa = nn.Sequential(
                nn.Conv2d(inp, 1, kernel_size=1),
                nn.BatchNorm2d(1),
            )
        else:
            self.spa = None

    def forward(self, x):
        if isinstance(x, list):
            x_in = x[0]
            x_out = x[1]
        else:
            x_in = x
            x_out = x
        b, c, h, w = x_in.size()
        y = self.avg_pool(x_in).view(b, c)
        y = self.fc(y).view(b, self.oup * self.exp, 1, 1)
        if self.exp == 4:
            a1, b1, a2, b2 = torch.split(y, self.oup, dim=1)
            a1 = (a1 - 0.5) * self.lambda_a + self.init_a[0]  # 1.0
            a2 = (a2 - 0.5) * self.lambda_a + self.init_a[1]

            b1 = b1 - 0.5 + self.init_b[0]
            b2 = b2 - 0.5 + self.init_b[1]
            out = torch.max(x_out * a1 + b1, x_out * a2 + b2)
        elif self.exp == 2:
            if self.use_bias:  # bias but not PL
                a1, b1 = torch.split(y, self.oup, dim=1)
                a1 = (a1 - 0.5) * self.lambda_a + self.init_a[0]  # 1.0
                b1 = b1 - 0.5 + self.init_b[0]
                out = x_out * a1 + b1
            else:
                a1, a2 = torch.split(y, self.oup, dim=1)
                a1 = (a1 - 0.5) * self.lambda_a + self.init_a[0]  # 1.0
                a2 = (a2 - 0.5) * self.lambda_a + self.init_a[1]
                out = torch.max(x_out * a1, x_out * a2)
        elif self.exp == 1:
            a1 = y
            a1 = (a1 - 0.5) * self.lambda_a + self.init_a[0]  # 1.0
            out = x_out * a1

        if self.spa:
            ys = self.spa(x_in).view(b, -1)
            ys = F.softmax(ys, dim=1).view(b, 1, h, w) * h * w
            ys = F.hardtanh(ys, 0, 3, inplace=True)/3
            out = out * ys

        return out


class Conv3x3Norm(torch.nn.Module):
    def __init__(self, in_channels, out_channels, stride):
        super(Conv3x3Norm, self).__init__()
        self.stride = stride
        # DCNv3 handles the stride-1 branches, DCNv2 the stride-2 (downsampling) branch
        self.dcnv3 = DCNv3_pytorch(in_channels, kernel_size=3, stride=stride)
        self.dcnv2 = ModulatedDeformConv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        self.bn = nn.GroupNorm(num_groups=16, num_channels=out_channels)

    def forward(self, input, **kwargs):
        if self.stride == 2:
            x = self.dcnv2(input.contiguous(), **kwargs)
        else:
            x = self.dcnv3(input)
        x = self.bn(x)
        return x


class DyConv(nn.Module):
    def __init__(self, in_channels=256, out_channels=256, conv_func=Conv3x3Norm):
        super(DyConv, self).__init__()

        self.DyConv = nn.ModuleList()
        self.DyConv.append(conv_func(in_channels, out_channels, 1))
        self.DyConv.append(conv_func(in_channels, out_channels, 1))
        self.DyConv.append(conv_func(in_channels, out_channels, 2))

        self.AttnConv = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, 1, kernel_size=1),
            nn.ReLU(inplace=True))

        self.h_sigmoid = h_sigmoid()
        self.relu = DYReLU(in_channels, out_channels)
        # 27 channels = 18 offsets (2 per point of a 3x3 kernel) + 9 mask values
        self.offset = nn.Conv2d(in_channels, 27, kernel_size=3, stride=1, padding=1)
        self.init_weights()

    def init_weights(self):
        for m in self.DyConv.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.normal_(m.weight.data, 0, 0.01)
                if m.bias is not None:
                    m.bias.data.zero_()
        for m in self.AttnConv.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.normal_(m.weight.data, 0, 0.01)
                if m.bias is not None:
                    m.bias.data.zero_()

    def forward(self, x):
        next_x = {}
        feature_names = list(x.keys())
        for level, name in enumerate(feature_names):

            feature = x[name]

            offset_mask = self.offset(feature)
            offset = offset_mask[:, :18, :, :]
            mask = offset_mask[:, 18:, :, :].sigmoid()
            conv_args = dict(offset=offset, mask=mask)

            temp_fea = [self.DyConv[1](feature, **conv_args)]
            if level > 0:
                temp_fea.append(self.DyConv[2](x[feature_names[level - 1]], **conv_args))
            if level < len(x) - 1:
                input = x[feature_names[level + 1]]
                temp_fea.append(F.interpolate(self.DyConv[0](input, **conv_args),
                                              size=[feature.size(2), feature.size(3)]))
            attn_fea = []
            res_fea = []
            for fea in temp_fea:
                res_fea.append(fea)
                attn_fea.append(self.AttnConv(fea))

            res_fea = torch.stack(res_fea)
            spa_pyr_attn = self.h_sigmoid(torch.stack(attn_fea))
            mean_fea = torch.mean(res_fea * spa_pyr_attn, dim=0, keepdim=False)
            next_x[name] = self.relu(mean_fea)

        return next_x


def autopad(k, p=None, d=1):  # kernel, padding, dilation
    """Pad to 'same' shape outputs."""
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p


class Conv(nn.Module):
    """Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)."""
    default_act = nn.SiLU()  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        """Initialize Conv layer with given arguments including activation."""
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        """Apply convolution, batch normalization and activation to input tensor."""
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        """Apply convolution and activation only (batch normalization is skipped)."""
        return self.act(self.conv(x))


class DWConv(Conv):
    """Depth-wise convolution."""

    def __init__(self, c1, c2, k=1, s=1, d=1, act=True):  # ch_in, ch_out, kernel, stride, dilation, activation
        """Initialize Depth-wise convolution with given parameters."""
        super().__init__(c1, c2, k, s, g=math.gcd(c1, c2), d=d, act=act)


class DynamicDCNv3Head(nn.Module):
    """YOLOv8 Detect head for detection models. CSDNSnu77"""

    dynamic = False  # force grid reconstruction
    export = False  # export mode
    end2end = False  # end2end
    max_det = 300  # max_det
    shape = None
    anchors = torch.empty(0)  # init
    strides = torch.empty(0)  # init

    def __init__(self, nc=80, ch=()):
        """Initializes the YOLOv8 detection layer with specified number of classes and channels."""
        super().__init__()
        self.nc = nc  # number of classes
        self.nl = len(ch)  # number of detection layers
        self.reg_max = 16  # DFL channels (ch[0] // 16 to scale 4/8/12/16/20 for n/s/m/l/x)
        self.no = nc + self.reg_max * 4  # number of outputs per anchor
        self.stride = torch.zeros(self.nl)  # strides computed during build
        c2, c3 = max((16, ch[0] // 4, self.reg_max * 4)), max(ch[0], min(self.nc, 100))  # channels
        self.cv2 = nn.ModuleList(
            nn.Sequential(Conv(x, c2, 3), Conv(c2, c2, 3), nn.Conv2d(c2, 4 * self.reg_max, 1)) for x in ch
        )
        self.cv3 = nn.ModuleList(
            nn.Sequential(
                nn.Sequential(DWConv(x, x, 3), Conv(x, c3, 1)),
                nn.Sequential(DWConv(c3, c3, 3), Conv(c3, c3, 1)),
                nn.Conv2d(c3, self.nc, 1),
            )
            for x in ch
        )
        self.dfl = DFL(self.reg_max) if self.reg_max > 1 else nn.Identity()
        dyhead_tower = []
        for i in range(self.nl):
            channel = ch[i]
            dyhead_tower.append(
                DyConv(
                    channel,
                    channel,
                    conv_func=Conv3x3Norm,
                )
            )
        self.add_module('dyhead_tower', nn.Sequential(*dyhead_tower))
        if self.end2end:
            self.one2one_cv2 = copy.deepcopy(self.cv2)
            self.one2one_cv3 = copy.deepcopy(self.cv3)

    def forward(self, x):
        """Concatenates and returns predicted bounding boxes and class probabilities."""
        tensor_dict = {i: tensor for i, tensor in enumerate(x)}
        x = self.dyhead_tower(tensor_dict)
        x = list(x.values())
        if self.end2end:
            return self.forward_end2end(x)

        for i in range(self.nl):
            x[i] = torch.cat((self.cv2[i](x[i]), self.cv3[i](x[i])), 1)
        if self.training:  # Training path
            return x
        y = self._inference(x)
        return y if self.export else (y, x)

    def forward_end2end(self, x):
        """
        Performs forward pass of the v10Detect module.
        Args:
            x (tensor): Input tensor.
        Returns:
            (dict, tensor): If not in training mode, returns a dictionary containing the outputs of both one2many and one2one detections.
                If in training mode, returns a dictionary containing the outputs of one2many and one2one detections separately.
        """
        x_detach = [xi.detach() for xi in x]
        one2one = [
            torch.cat((self.one2one_cv2[i](x_detach[i]), self.one2one_cv3[i](x_detach[i])), 1) for i in range(self.nl)
        ]
        for i in range(self.nl):
            x[i] = torch.cat((self.cv2[i](x[i]), self.cv3[i](x[i])), 1)
        if self.training:  # Training path
            return {"one2many": x, "one2one": one2one}

        y = self._inference(one2one)
        y = self.postprocess(y.permute(0, 2, 1), self.max_det, self.nc)
        return y if self.export else (y, {"one2many": x, "one2one": one2one})

    def _inference(self, x):
        """Decode predicted bounding boxes and class probabilities based on multiple-level feature maps."""
        # Inference path
        shape = x[0].shape  # BCHW
        x_cat = torch.cat([xi.view(shape[0], self.no, -1) for xi in x], 2)
        if self.dynamic or self.shape != shape:
            self.anchors, self.strides = (x.transpose(0, 1) for x in make_anchors(x, self.stride, 0.5))
            self.shape = shape

        if self.export and self.format in {"saved_model", "pb", "tflite", "edgetpu", "tfjs"}:  # avoid TF FlexSplitV ops
            box = x_cat[:, : self.reg_max * 4]
            cls = x_cat[:, self.reg_max * 4 :]
        else:
            box, cls = x_cat.split((self.reg_max * 4, self.nc), 1)

        if self.export and self.format in {"tflite", "edgetpu"}:
            # Precompute normalization factor to increase numerical stability
            # See https://github.com/ultralytics/ultralytics/issues/7371
            grid_h = shape[2]
            grid_w = shape[3]
            grid_size = torch.tensor([grid_w, grid_h, grid_w, grid_h], device=box.device).reshape(1, 4, 1)
            norm = self.strides / (self.stride[0] * grid_size)
            dbox = self.decode_bboxes(self.dfl(box) * norm, self.anchors.unsqueeze(0) * norm[:, :2])
        else:
            dbox = self.decode_bboxes(self.dfl(box), self.anchors.unsqueeze(0)) * self.strides

        return torch.cat((dbox, cls.sigmoid()), 1)

    def bias_init(self):
        """Initialize Detect() biases, WARNING: requires stride availability."""
        m = self  # self.model[-1]  # Detect() module
        # cf = torch.bincount(torch.tensor(np.concatenate(dataset.labels, 0)[:, 0]).long(), minlength=nc) + 1
        # ncf = math.log(0.6 / (m.nc - 0.999999)) if cf is None else torch.log(cf / cf.sum())  # nominal class frequency
        for a, b, s in zip(m.cv2, m.cv3, m.stride):  # from
            a[-1].bias.data[:] = 1.0  # box
            b[-1].bias.data[: m.nc] = math.log(5 / m.nc / (640 / s) ** 2)  # cls (.01 objects, 80 classes, 640 img)
        if self.end2end:
            for a, b, s in zip(m.one2one_cv2, m.one2one_cv3, m.stride):  # from
                a[-1].bias.data[:] = 1.0  # box
                b[-1].bias.data[: m.nc] = math.log(5 / m.nc / (640 / s) ** 2)  # cls (.01 objects, 80 classes, 640 img)

    def decode_bboxes(self, bboxes, anchors):
        """Decode bounding boxes."""
        return dist2bbox(bboxes, anchors, xywh=not self.end2end, dim=1)

    @staticmethod
    def postprocess(preds: torch.Tensor, max_det: int, nc: int = 80):
        """
        Post-processes YOLO model predictions.
        Args:
            preds (torch.Tensor): Raw predictions with shape (batch_size, num_anchors, 4 + nc) with last dimension
                format [x, y, w, h, class_probs].
            max_det (int): Maximum detections per image.
            nc (int, optional): Number of classes. Default: 80.
        Returns:
            (torch.Tensor): Processed predictions with shape (batch_size, min(max_det, num_anchors), 6) and last
                dimension format [x, y, w, h, max_class_prob, class_index].
        """
        batch_size, anchors, _ = preds.shape  # i.e. shape(16,8400,84)
        boxes, scores = preds.split([4, nc], dim=-1)
        index = scores.amax(dim=-1).topk(min(max_det, anchors))[1].unsqueeze(-1)
        boxes = boxes.gather(dim=1, index=index.repeat(1, 1, 4))
        scores = scores.gather(dim=1, index=index.repeat(1, 1, nc))
        scores, index = scores.flatten(1).topk(min(max_det, anchors))
        i = torch.arange(batch_size)[..., None]  # batch indices
        return torch.cat([boxes[i, index // nc], scores[..., None], (index % nc)[..., None].float()], dim=-1)


if __name__ == "__main__":
    # Generate sample feature maps for three pyramid levels
    image1 = torch.rand(1, 64, 32, 32)
    image2 = torch.rand(1, 64, 16, 16)
    image3 = torch.rand(1, 64, 8, 8)

    image = [image1, image2, image3]
    channel = (64, 64, 64)
    # Model
    head = DynamicDCNv3Head(nc=80, ch=channel)

    out = head(image)
    print(out)
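One detail of `dcnv3_core_pytorch` that is easy to miss is its coordinate convention: reference points sit at pixel centers (index + 0.5), are normalized by the feature map size, and are then mapped to the range expected by `F.grid_sample` through `2 * sampling_locations - 1`. The sketch below is a small standalone check of that convention on a toy tensor. It is an illustration only, not part of the head, and the variable names are chosen for this example.

```python
import torch
import torch.nn.functional as F

H, W = 4, 6
feat = torch.arange(H * W, dtype=torch.float32).reshape(1, 1, H, W)

# Pixel centers in normalized [0, 1] coordinates: (index + 0.5) / size.
ys = (torch.arange(H, dtype=torch.float32) + 0.5) / H
xs = (torch.arange(W, dtype=torch.float32) + 0.5) / W
ref_y = ys[:, None].expand(H, W)
ref_x = xs[None, :].expand(H, W)

# grid_sample expects (x, y) order in [-1, 1], hence the 2 * loc - 1 mapping.
locations = torch.stack((ref_x, ref_y), dim=-1)       # (H, W, 2)
grid = (2 * locations - 1).unsqueeze(0)               # (1, H, W, 2)
sampled = F.grid_sample(feat, grid, mode="bilinear",
                        padding_mode="zeros", align_corners=False)

# Zero offsets: every sample lands exactly on its own pixel.
print(torch.allclose(sampled, feat, atol=1e-4))

# An offset of half a pixel in x equals 0.5 / W in normalized units and
# averages each pixel with its right-hand neighbor.
shifted = locations.clone()
shifted[..., 0] += 0.5 / W
half = F.grid_sample(feat, (2 * shifted - 1).unsqueeze(0), mode="bilinear",
                     padding_mode="zeros", align_corners=False)
print(half[0, 0, 0])
```

This is why the reference implementation divides the predicted offsets by `spatial_norm` (the padded width and height): offsets are predicted in pixels, while the sampling locations live in normalized coordinates.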


4. How to Add DynamicDCNv3Head

4.1 Modification 1

First, create a new .py file under the 'ultralytics/nn' directory and paste the code above into it. The file name is up to you.


4.2 Modification 2

Second, create a new .py file named '__init__.py' in that directory, then import our detection head inside it, as shown in the figure below.
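The re-export pattern used in this step can be tried in isolation. The sketch below builds a throwaway package in a temporary directory and imports the head through its `__init__.py`. The package name `my_heads` and the file name `dynamic_dcnv3_head.py` are hypothetical placeholders, and the class body is a stub standing in for the real code from section 3. Use whatever names you chose in Modification 1.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Throwaway package that mimics the layout described in this step.
# "my_heads" and "dynamic_dcnv3_head.py" are hypothetical names.
root = Path(tempfile.mkdtemp())
pkg = root / "my_heads"
pkg.mkdir()

# Stub standing in for the file that holds the detection head code.
(pkg / "dynamic_dcnv3_head.py").write_text(
    "class DynamicDCNv3Head:\n"
    "    pass\n"
)

# __init__.py re-exports the head so it can be imported from the package.
(pkg / "__init__.py").write_text(
    "from .dynamic_dcnv3_head import DynamicDCNv3Head\n"
)

sys.path.insert(0, str(root))
my_heads = importlib.import_module("my_heads")
print(my_heads.DynamicDCNv3Head.__name__)
```

The only line that matters for the real project is the relative import written into `__init__.py`, adjusted to your own file name.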


4.3 Modification 3

Third, open 'ultralytics/nn/tasks.py' and import and register our module there.


4.4 Modification 4

Fourth, still in 'ultralytics/nn/tasks.py', find the code shown below and add the detection head to it. A quick way to locate it is to press Ctrl+F and search for Detect.


4.5 Modification 5

Add the head as marked by the red box.


4.6 Modification 6

Same as above.


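4.7 Modification 7

This one is slightly different. We need to add an extra branch:

else:
    return 'detect'

Why is this step different? Because the module name m is automatically converted to lowercase while this code runs, so a plain else is the most convenient fix. If segmentation or other task types come up in later tutorials, I will provide the corresponding modifications then.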
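The effect of the extra branch can be sketched with a simplified stand-in for the task-guessing logic. The function `guess_task` below is written for this example and only mirrors the general shape described in this step: it lowercases the module name of the last head entry and matches it against known names. The actual function in your Ultralytics version may look different.

```python
def guess_task(cfg, fallback=True):
    """Guess the task from the last head entry: [from, repeats, module, args].

    Simplified stand-in for illustration, not the Ultralytics source.
    """
    m = cfg["head"][-1][-2].lower()  # module name, lowercased
    if m in {"classify", "classifier", "cls", "fc"}:
        return "classify"
    elif "detect" in m:
        return "detect"
    elif m == "segment":
        return "segment"
    elif m == "pose":
        return "pose"
    elif m == "obb":
        return "obb"
    elif fallback:
        # The branch added in this step: unknown head names count as detection.
        return "detect"
    return None


cfg = {"head": [[[16, 19, 22], 1, "DynamicDCNv3Head", ["nc"]]]}
without_branch = guess_task(cfg, fallback=False)
with_branch = guess_task(cfg, fallback=True)
print(without_branch, with_branch)
```

The lowercased name `dynamicdcnv3head` does not contain `detect`, so without the fallback no task is recognized. With it, the custom head is treated as a detection head.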



4.8 Modification 8

Same as above.


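4.9 Modification 9

Open 'ultralytics/engine/validator.py', find 'class BaseValidator:', and in its '__call__' method locate the line
self.args.half = self.device.type != 'cpu'  # force FP16 val during training
Directly below it, add self.args.half = False


That completes the modifications. You can now copy the YAML file below and run it.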


5. YAML File for the DynamicDCNv3Head Detection Head

Training summary for this version: YOLO11-DynamicDCNv3Head summary: 498 layers, 2,416,931 parameters, 2,416,915 gradients, 7.2 GFLOPs

# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs

# YOLO11n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 2, C3k2, [256, False, 0.25]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 2, C3k2, [512, False, 0.25]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 2, C3k2, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 2, C3k2, [1024, True]]
  - [-1, 1, SPPF, [1024, 5]] # 9
  - [-1, 2, C2PSA, [1024]] # 10

# YOLO11n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [256, False]] # 13

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [256, False]] # 16 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 13], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [256, False]] # 19 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 10], 1, Concat, [1]] # cat head P5
  - [-1, 2, C3k2, [256, True]] # 22 (P5/32-large)

  - [[16, 19, 22], 1, DynamicDCNv3Head, [nc]] # Detect(P3, P4, P5)
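Note that all three layers feeding the head (16, 19, 22) output 256 channels in this YAML. The head code applies every `DyConv` block to all pyramid levels, so the levels must share one channel count, and that count must be divisible by 16 (for `nn.GroupNorm(num_groups=16, ...)`) and by 4 (the default `group` of `DCNv3_pytorch`). The sketch below checks these constraints for each scale. It assumes that channels are scaled as `make_divisible(min(c, max_channels) * width, 8)`, which is my understanding of how Ultralytics applies the width multiplier. The helper functions are reimplemented here for illustration.

```python
import math


def make_divisible(x, divisor=8):
    """Round x up to the nearest multiple of divisor."""
    return math.ceil(x / divisor) * divisor


def scaled_channels(c, width, max_channels):
    """Channel count after width scaling (assumed rule, see the text above)."""
    return make_divisible(min(c, max_channels) * width, 8)


scales = {  # [depth, width, max_channels] copied from the YAML
    "n": (0.50, 0.25, 1024),
    "s": (0.50, 0.50, 1024),
    "m": (0.50, 1.00, 512),
    "l": (1.00, 1.00, 512),
    "x": (1.00, 1.50, 512),
}
head_inputs = (256, 256, 256)  # C3k2 outputs at layers 16, 19 and 22

channels_by_scale = {}
for name, (_, width, max_ch) in scales.items():
    ch = [scaled_channels(c, width, max_ch) for c in head_inputs]
    channels_by_scale[name] = ch
    same = len(set(ch)) == 1                # DyConv blocks see every level
    gn_ok = all(c % 16 == 0 for c in ch)    # GroupNorm with 16 groups
    dcn_ok = all(c % 4 == 0 for c in ch)    # DCNv3 with group=4
    print(name, ch, same, gn_ok, dcn_ok)
```

If you change the head channels in the YAML, rerunning a check like this before training can catch a mismatch early.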


6. Successful Run Record

Finally, here are screenshots of a successful run.


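7. Summary

That concludes the main content of this article. I recommend my column on effective YOLOv11 improvements. It is newly opened and currently has an average quality score of 98. Going forward I will reproduce papers from the latest top conferences and also fill in some older improvement mechanisms. If this article helped you, subscribe to the column and follow along for more updates.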