As recently as January 2021, the challenge of "interpreting what is going on in a photograph" was considered "nowhere near solved." Today's guests, Junnan Li and Dongxu Li, changed that when they published and open-sourced BLIP, which delivered state-of-the-art performance on image captioning and other vision-language tasks.
BLIP became the #18 most-cited AI paper of 2022, and now Junnan and Dongxu are back with BLIP-2, this time showing how small models can harness the power of existing foundation models to do multi-modal tasks.
We talked to Junnan and Dongxu about their research and how they see the trend toward connector models shaping the future.
The HR industry is at a crossroads. What will it take to construct the next generation of incredible businesses, and where can people leaders have the most business impact? Hosts Nolan Church and Kelli Dragovich have been through it all, the highs and the lows: IPOs, layoffs, executive turnover, board meetings, culture changes, and more. With a lineup of industry vets and experts, Nolan and Kelli break down the nitty-gritty details, trade-offs, and dynamics of constructing high-performing companies. Through unfiltered conversations that can only happen between seasoned practitioners, Kelli and Nolan dive deep into the kind of leadership-level strategy that often happens behind closed doors. Check out the first episode with Patty McCord, the architect of Netflix’s culture deck.
(05:50) Convergence of AI techniques
(07:33) Evolution of BLIP to BLIP-2
(08:12) How BLIP-2 unlocked multimodal functionality
(12:43) The size, training dynamics, and optimization function of BLIP
(20:15) Practical/Business applications of BLIP
(29:43) Efficiency of BLIP-2 compared to other models
(41:52) Two-stage pre-training
(47:11) Architecture of BLIP-2’s connector model
(58:52) Language models as the executive function of the brain
(01:07:32) Vision for an ultimate multimodal system and democratized pre-training for models
(01:12:59) Useful AI tools in these researchers’ day-to-day
(01:14:56) Upcoming projects
Thank you Omneky for sponsoring The Cognitive Revolution. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.
@LiJunnan0409 (Junnan Li)
Join thousands of subscribers to our Substack: https://cognitiverevolution.substack.com
Episode transcript at CognitiveRevolution.ai
- Original BLIP demo
- BLIP-2 demo
- BLIP is the #18 most highly-cited paper in AI
- Image captioning comparison tool
- Understanding images with AI - for use in language models and image generation
- Image Aesthetics - Product & Model Reviews