1. Introduction
Vision Transformers (ViTs) have transformed computer vision through their representation learning capabilities. However, their quadratic computational complexity in the token sequence length poses substantial challenges for deployment on resource-constrained devices. This paper addresses two important gaps: the lack of a unified survey that categorizes token compression methods, and the limited evaluation of these methods on small transformer architectures.
2. A Taxonomy of Token Compression
Token compression techniques can be categorized by their underlying strategy and their deployment requirements.
2.1 Pruning-Based Methods
Pruning methods remove unimportant tokens based on importance scores. DynamicViT and SPViT employ learnable predictors to estimate token importance, while EViT and ATS rely on scoring heuristics.
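For illustration, the following is a minimal sketch of a learnable importance predictor in the spirit of this family; it is not the exact DynamicViT or SPViT implementation, and the module name TokenPruning and its hyperparameters are illustrative assumptions. A small MLP scores each token and only the top-scoring fraction is kept.

import torch
import torch.nn as nn

class TokenPruning(nn.Module):
    """Illustrative pruning module: score tokens with a small MLP, keep the top-k."""

    def __init__(self, dim, keep_ratio=0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Lightweight learnable predictor of per-token importance
        self.score_net = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, x):
        # x: [B, N, C] patch tokens (a class token, if any, is handled outside)
        B, N, C = x.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.score_net(x).squeeze(-1)             # [B, N]
        keep_idx = torch.topk(scores, k, dim=-1).indices   # [B, k]
        return x.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, C))

tokens = torch.randn(2, 196, 384)
print(TokenPruning(384)(tokens).shape)                     # torch.Size([2, 137, 384])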
2.2 Merging-Based Methods
Merging techniques combine multiple tokens into representative units. ToMe and PiToMe apply hard merging strategies, while SiT and Sinkhorn use soft, weighted-averaging schemes.
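As a toy illustration of the hard versus soft distinction (the fixed grouping below is a hypothetical assignment, not the matching procedure used by ToMe or SiT): hard merging keeps a single representative token per group, while soft merging replaces each group with a weighted average.

import torch

def hard_merge(x, groups):
    # x: [N, C]; groups: list of index tensors, one per merged group
    # Hard merging: keep one representative token (here the first) per group
    return torch.stack([x[g[0]] for g in groups])

def soft_merge(x, groups, weights):
    # Soft merging: weighted average of the tokens in each group
    return torch.stack([(weights[i].unsqueeze(-1) * x[g]).sum(0) / weights[i].sum()
                        for i, g in enumerate(groups)])

tokens = torch.randn(6, 4)
groups = [torch.tensor([0, 1]), torch.tensor([2, 3, 4]), torch.tensor([5])]
weights = [torch.ones(len(g)) for g in groups]   # uniform weights for illustration
print(hard_merge(tokens, groups).shape, soft_merge(tokens, groups, weights).shape)  # both [3, 4]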
2.3 Hybrid Methods
Hybrid methods such as ToFu and DiffRate combine pruning and merging strategies to achieve stronger compression while preserving model performance.
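A hybrid pipeline can be sketched as follows; this is a schematic combination of the two steps, not the specific ToFu or DiffRate algorithm, and the function hybrid_compress and its defaults are assumptions for illustration. It first prunes clearly unimportant tokens by score, then greedily merges the most similar survivors.

import torch

def hybrid_compress(x, scores, prune_ratio=0.3, merge_pairs=2):
    # x: [N, C] tokens; scores: [N] importance scores (e.g. from a learned predictor)
    N, C = x.shape
    # Step 1: prune the lowest-scoring tokens
    keep = torch.topk(scores, N - int(N * prune_ratio)).indices
    x = x[keep]
    # Step 2: merge the most similar remaining pairs by averaging
    for _ in range(merge_pairs):
        x_norm = x / x.norm(dim=-1, keepdim=True).clamp(min=1e-6)
        sim = x_norm @ x_norm.t()
        sim.fill_diagonal_(-1.0)                      # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.shape[1])
        x[i] = (x[i] + x[j]) / 2                      # merge token j into token i
        mask = torch.ones(x.shape[0], dtype=torch.bool)
        mask[j] = False
        x = x[mask]
    return x

tokens, scores = torch.randn(16, 8), torch.rand(16)
print(hybrid_compress(tokens, scores).shape)          # torch.Size([10, 8])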
3. Technical Framework
3.1 Mathematical Formulation
The token compression problem can be formulated as optimizing the trade-off between computational efficiency and model performance. Given input tokens $X = \{x_1, x_2, ..., x_N\}$, the goal is to produce a compressed token set $X' = \{x'_1, x'_2, ..., x'_M\}$ with $M < N$ while minimizing the degradation in task performance.
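One common way to make this trade-off explicit (individual methods formalize the objective differently; the notation below is illustrative) is as a constrained problem over a compression operator $g$ that maps $X$ to $X'$:

$$\min_{g} \; \mathbb{E}_{(X,y)}\big[\mathcal{L}\big(f(g(X)),\, y\big)\big] \quad \text{subject to} \quad |g(X)| = M,\; M < N,$$

where $f$ denotes the transformer, $\mathcal{L}$ the task loss, and $y$ the target label. Pruning corresponds to $g$ selecting a subset of $X$, while merging allows $g$ to form new tokens as combinations of the originals.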
Self-attention in standard ViTs has complexity $O(N^2 d)$, where $N$ is the sequence length and $d$ is the embedding dimension. Token compression reduces this to $O(M^2 d)$ or better.
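As a back-of-the-envelope check, the snippet below estimates the per-layer attention cost for a ViT-B-like setting ($N = 197$ tokens, $d = 768$) before and after halving the token count. It counts only the $QK^\top$ and attention-value matrix multiplications and ignores projections and MLPs, so the figures are indicative rather than full-model FLOPs.

def attention_flops(n_tokens, dim):
    # Multiply-accumulates for QK^T and for the attention-weighted value sum
    return 2 * n_tokens**2 * dim

N, d = 197, 768            # ViT-B: 14x14 patches + class token, embedding dim 768
M = N // 2                 # keep roughly half the tokens

before, after = attention_flops(N, d), attention_flops(M, d)
print(f"{before/1e6:.1f} MFLOPs -> {after/1e6:.1f} MFLOPs ({before/after:.1f}x reduction)")
# Roughly a 4x reduction in attention cost, consistent with O(N^2 d) -> O(M^2 d)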
3.2 Implementation Details
Token compression modules can be inserted at different depths of the transformer architecture. Early compression yields the largest computational savings but risks discarding important information, whereas late compression preserves accuracy at the cost of smaller efficiency gains.
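The insertion depth can be exposed as a simple hyperparameter. The sketch below is illustrative only: CompressedViT and its norm-based placeholder compressor are assumptions standing in for any pruning or merging module, and the encoder blocks are generic PyTorch layers rather than a real ViT.

import torch
import torch.nn as nn

class CompressedViT(nn.Module):
    """Toy encoder stack with a compression step at a configurable depth."""

    def __init__(self, dim=192, depth=12, heads=3, compress_at=3, keep_ratio=0.5):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth))
        self.compress_at = compress_at      # small index = early, large index = late
        self.keep_ratio = keep_ratio

    def compress(self, x):
        # Placeholder compressor: keep the highest-norm tokens (stand-in for a
        # learned pruning/merging module)
        B, N, C = x.shape
        k = max(1, int(N * self.keep_ratio))
        idx = torch.topk(x.norm(dim=-1), k, dim=-1).indices
        return x.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))

    def forward(self, x):
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i == self.compress_at:
                x = self.compress(x)        # all later blocks see fewer tokens
        return x

tokens = torch.randn(2, 197, 192)
print(CompressedViT(compress_at=3)(tokens).shape)   # torch.Size([2, 98, 192])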
4. Experimental Evaluation
4.1 Performance on Standard ViTs
On standard ViT architectures (ViT-B, ViT-L), token compression methods achieve 30-50% FLOPs reductions with only minor accuracy drops (typically <1% on ImageNet). Dynamic methods such as SPViT show a better accuracy-efficiency trade-off than static approaches.
4.2 Performance on Compact ViTs
When applied to compact ViTs (AutoFormer, ElasticViT), token compression methods are noticeably less effective. These already-compressed architectures have optimized token representations, which makes further compression difficult without significant accuracy degradation.
4.3 Edge Deployment Benchmarks
Evaluations on edge devices show that token compression can reduce inference latency by 25-40% and memory usage by 30-50%, making ViTs considerably more practical for real-time applications on mobile and embedded systems.
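Latency figures of this kind are typically obtained by timing the model on the target device at both token budgets. The sketch below is a minimal CPU wall-clock timing loop under assumed shapes and a generic encoder layer; real edge benchmarks would use the deployed runtime and hardware rather than this simplified setup.

import time
import torch
import torch.nn as nn

def time_forward(model, x, iters=20):
    # Average wall-clock latency of a forward pass (eval mode, no grad)
    model.eval()
    with torch.no_grad():
        model(x)                              # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
    return (time.perf_counter() - start) / iters

layer = nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True)
full = torch.randn(1, 197, 384)               # uncompressed token sequence
half = torch.randn(1, 98, 384)                # after ~50% token compression
t_full, t_half = time_forward(layer, full), time_forward(layer, half)
print(f"full: {t_full*1e3:.2f} ms, compressed: {t_half*1e3:.2f} ms")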
5. Code Implementation
Below is a simplified Python sketch of token merging in the spirit of ToMe. For clarity it selects the most representative tokens by average similarity and averages the remaining tokens into them; the full ToMe method instead performs bipartite soft matching between alternating token partitions:
import torch
import torch.nn as nn

class TokenMerging(nn.Module):
    def __init__(self, dim, reduction_ratio=0.5):
        super().__init__()
        self.dim = dim
        self.reduction_ratio = reduction_ratio

    def forward(self, x):
        # x: [B, N, C] token embeddings
        B, N, C = x.shape
        M = max(1, int(N * self.reduction_ratio))

        # Cosine similarity between all pairs of tokens: [B, N, N]
        x_norm = x / x.norm(dim=-1, keepdim=True).clamp(min=1e-6)
        similarity = torch.matmul(x_norm, x_norm.transpose(-1, -2))

        # Keep the M tokens with the highest mean similarity to the others
        keep_idx = torch.topk(similarity.mean(dim=-1), M, dim=-1).indices  # [B, M]

        # Assign every token to its most similar kept token
        sim_to_kept = similarity.gather(2, keep_idx.unsqueeze(1).expand(-1, N, -1))  # [B, N, M]
        assign = sim_to_kept.argmax(dim=-1)                                # [B, N]

        # Merge by averaging all tokens assigned to the same kept token
        one_hot = nn.functional.one_hot(assign, M).transpose(1, 2).to(x.dtype)  # [B, M, N]
        counts = one_hot.sum(dim=-1, keepdim=True).clamp(min=1)
        merged = torch.matmul(one_hot, x) / counts                         # [B, M, C]
        return merged
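A quick shape check of the module above (input sizes are illustrative):

x = torch.randn(2, 196, 768)          # 14x14 patches from a 224x224 image, ViT-B width
merger = TokenMerging(dim=768, reduction_ratio=0.5)
print(merger(x).shape)                # torch.Size([2, 98, 768])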
6. Future Directions
Token compression techniques show promise for a range of edge AI applications, including real-time video analytics, autonomous driving systems, and mobile vision. Future research should focus on adaptive compression ratios that adjust to input complexity and hardware constraints. Integration with neural architecture search (NAS) could yield compression strategies tailored to specific deployment scenarios.
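As a rough illustration of input-adaptive compression, the helper below maps a per-image complexity proxy to a keep ratio; the proxy (token feature spread) and the function adaptive_keep_ratio are purely assumptions for the sketch, not a method proposed in this paper.

import torch

def adaptive_keep_ratio(x, low=0.4, high=0.9):
    # x: [B, N, C]. Use the spread of token features as a crude complexity proxy:
    # "busier" inputs keep more tokens, simpler ones keep fewer.
    complexity = x.flatten(1).std(dim=1)                                   # [B]
    norm = (complexity - complexity.min()) / (complexity.max() - complexity.min() + 1e-6)
    return low + (high - low) * norm                                       # per-sample ratio

tokens = torch.randn(4, 196, 384)
print(adaptive_keep_ratio(tokens))                                         # one keep ratio per image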
7. References
- Dosovitskiy et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021.
- Wang et al. "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions." ICCV 2021.
- Liu et al. "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows." ICCV 2021.
- Rao et al. "DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification." NeurIPS 2021.
- Bolya et al. "Token Merging for Fast Stable Diffusion." CVPR 2023.
Review
This comprehensive survey of token compression for Vision Transformers is a significant contribution to the field of efficient deep learning. The authors address an important gap in the literature by evaluating these techniques not only on standard ViT architectures but also on variants designed for edge deployment. This dual evaluation reveals a key insight: while token compression methods achieve impressive efficiency gains on general-purpose ViTs (30-50% FLOPs reduction with minimal accuracy loss), their effectiveness diminishes when applied to already-compressed architectures. This finding is consistent with observations from other areas of model compression, where stacked optimization techniques often exhibit diminishing returns.
The taxonomy presented in Table I provides a valuable framework for understanding the landscape of token compression methods. Categorizing by compression mechanism (pruning, merging, hybrid) and reduction type (static, dynamic, hard, soft) gives researchers and practitioners a clear map for selecting an appropriate strategy for their specific requirements. Including training requirements is particularly useful for deployment scenarios where fine-tuning is not feasible.
From a technical standpoint, formulating token compression as an optimization problem balancing computational efficiency against model performance mirrors trade-offs explored elsewhere in vision. For example, progressive growing in StyleGAN and the attention mechanisms in DETR exhibit similar trade-offs between model complexity and performance. The reduction of the quadratic complexity from $O(N^2d)$ to $O(M^2d)$ parallels the efficiency gains achieved by sparse attention mechanisms, as seen in models such as Longformer and BigBird for natural language processing.
The experimental finding of reduced effectiveness on compact ViTs points to an important research direction. As observed in the original CycleGAN paper and subsequent work on efficient GANs, optimized architectures often have tightly coupled components, so further compression may require holistic redesign rather than the piecemeal application of existing techniques. This suggests that future work should focus on joint approaches in which token compression strategies are incorporated during architecture search rather than applied as post-hoc steps.
The implications for edge deployment are substantial. With the growing importance of on-device AI processing for applications ranging from autonomous vehicles to mobile healthcare, techniques that make transformer architectures practical on resource-constrained hardware are highly valuable. The reported 25-40% latency reductions and 30-50% memory savings can be the difference between a feasible and an infeasible deployment in many real-world scenarios.
Looking ahead, combining token compression with neural architecture search, as noted in the future directions section, is an exciting avenue. Much like the evolution of model compression in neural networks, where techniques such as NetAdapt and AMC demonstrated the benefits of hardware-aware optimization, we can expect increasing emphasis on end-to-end optimization of transformer architectures for specific deployment constraints. The emerging field of differentiable neural architecture search (DNAS) could provide the technical foundation for learning compression strategies directly from deployment objectives.