As recently as January 2021, the challenge of "interpreting what is going on in a photograph" was considered "nowhere near solved." Today's guests, Junnan Li and Dongxu Li, changed that when they published and open-sourced BLIP, which delivered state-of-the-art performance on image captioning and other vision-language tasks.
BLIP became the #18 most-cited AI paper of 2022, and now Junnan and Dongxu are back with BLIP-2, this time showing how small models can harness the power of existing foundation models to perform multimodal tasks.
We talked to Junnan and Dongxu about their research and how they see the trend toward connector models shaping the future.
(05:50) Convergence of AI techniques
(07:33) Evolution of BLIP to BLIP-2
(08:12) How BLIP-2 unlocked multimodal functionality
(12:43) The size, training dynamics, and optimization function of BLIP
(20:15) Practical/Business applications of BLIP
(29:43) Efficiency of BLIP-2 compared to other models
(41:52) Two-stage pre-training
(47:11) Architecture of BLIP-2's connector model
(58:52) Language models as the executive function of the brain
(01:07:32) Vision for an ultimate multimodal system and democratized pre-training for models
(01:12:59) Useful AI tools in these researchers’ day-to-day
(01:14:56) Upcoming projects
Thank you, Omneky, for sponsoring The Cognitive Revolution. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with the click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.
@LiJunnan0409 (Junnan Li)
Join thousands of subscribers to our Substack: https://cognitiverevolution.substack.com
Episode transcript at cognitiverevolution.ai
- Original BLIP demo
- BLIP-2 demo
- BLIP is the #18 most highly-cited paper in AI
- Image captioning comparison tool
- Understanding images with AI - for use in language models and image generation
- Image Aesthetics - Product & Model Reviews