I have largely worked in computer vision, since extracting data from images and videos and making sense of it has always interested me more than NLP tasks. The result has been my interest in classical computer vision, deep learning, and CNNs.
The recent (not so recent) papers applying NLP mechanisms to vision tasks have now motivated me to at least understand them, with transformers, attention mechanisms, and MLPs showing improvements in classification tasks at the very least.
The takeaway for me from both these papers is the same. When we had smaller datasets, the CNN architecture dominated computer vision because it encoded an idea natural to images: a pixel matters most to the pixels next to it. That inductive bias we built into CNNs is useful for images, especially at the smaller dataset sizes of that time.

Both ViT and MLP-Mixer are comparatively general architectures. With current compute and datasets they learn features just like those of CNNs, and moreover perform better, since they can learn biases we could not have thought of and introduced by hand: attention and token/channel mixing are more general mechanisms, but they need more data and, as expected, scale better.

MLP-Mixer is also a lot faster than ViT, since it has neither the softmax nor the attention computation (which is quadratic in the number of tokens), both of which carry a real computational cost, so it delivers higher throughput for applications that need it.
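To make the token/channel mixing idea concrete, here is a minimal sketch of one MLP-Mixer block in PyTorch. The class name, dimensions, and hidden sizes are my own illustrative choices, not taken from the paper's official code; the point is just that both mixing steps are plain MLPs with no softmax or attention anywhere.

```python
# Illustrative sketch of one MLP-Mixer block; names and sizes are assumptions.
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_patches, channels, token_hidden=256, channel_hidden=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        # Token mixing: an MLP applied across the patch dimension, letting
        # spatial locations exchange information without attention.
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden),
            nn.GELU(),
            nn.Linear(token_hidden, num_patches),
        )
        self.norm2 = nn.LayerNorm(channels)
        # Channel mixing: an ordinary per-patch MLP across the channel dimension.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channel_hidden),
            nn.GELU(),
            nn.Linear(channel_hidden, channels),
        )

    def forward(self, x):                           # x: (batch, patches, channels)
        y = self.norm1(x).transpose(1, 2)           # (B, C, P): mix across patches
        x = x + self.token_mlp(y).transpose(1, 2)   # residual token mixing
        x = x + self.channel_mlp(self.norm2(x))     # residual channel mixing
        return x

# Usage: 196 patches (a 14x14 grid) with 512 channels, ViT-style patch embedding.
block = MixerBlock(num_patches=196, channels=512)
out = block(torch.randn(2, 196, 512))               # -> shape (2, 196, 512)
```

Note that the whole block is matrix multiplies and GELUs, which is exactly why its throughput compares favourably with attention-based blocks.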