In this presentation, I will argue that we can make progress in language/vision tasks if we represent images in structured ways, rather than just labeling objects, actions, or attributes. In particular, deploying structured representations from natural language processing is fruitful: I will discuss how visual dependency representations (VDRs), which borrow ideas from dependency parsing, can be used to capture how the objects in a scene interact with each other. VDRs are useful for tasks such as image retrieval and image description. Secondly, I will argue that much more fine-grained representations of actions are needed for most language/vision tasks. Again, ideas from NLP can be leveraged: I will introduce algorithms that use multimodal embeddings to perform verb sense disambiguation in a visual context.