Jointly Representing Images and Text: Dependency Graphs, Word Senses, and Multimodal Embeddings
In this presentation, I will argue that we can make progress in language/vision tasks if we represent images in structured ways, rather than just labeling objects, actions, or attributes. In particular, deploying structured representations from natural language processing is fruitful: I will discuss how visual dependency representations (VDRs), which borrow ideas for dependency parsing, can be used to capture how the objects in an scene interact with each other. VDRs are useful for tasks such as image retrieval or image description. Secondly, I will argue that much more fine-grained representations of actions are needed for most language/vision tasks. Again, ideas from NLP are be leveraged: I will introduce algorithms that use multimodal embeddings to perform verb sense disambiguation in a visual context.