E6: The Computer Vision Revolution with Junnan Li and Dongxu Li of BLIP and BLIP2
1 hr 21 min

As recently as January 2021, the challenge of "interpreting what is going on in a photograph" was considered "nowhere near solved." Today's guests, Junnan Li and Dongxu Li, changed that when they published and open-sourced BLIP, which delivered state-of-the-art performance on image captioning and other vision-language tasks.

BLIP became the #18 most-cited AI paper of 2022, and now Junnan and Dongxu are back with BLIP-2, this time showing how small models can harness the power of existing foundation models to do multi-modal tasks.

We talked to Junnan and Dongxu about their research and how they see the trend toward connector models shaping the future.


(00:00) Preview

(01:17) Sponsor

(01:35) Intro

(05:50) Convergence of AI techniques

(07:33) Evolution of BLIP to BLIP-2

(08:12) How BLIP-2 unlocked multimodal functionality

(12:43) The size, training dynamics, and optimization function of BLIP

(20:15) Practical/Business applications of BLIP

(29:43) Efficiency of BLIP-2 compared to other models

(41:52) Two-stage pre-training

(47:11) Architecture of BLIP-2’s connector model

(58:52) Language models as the executive function of the brain

(01:07:32) Vision for an ultimate multimodal system and democratized pre-training for models

(01:12:59) Useful AI tools in these researchers’ day-to-day

(01:14:56) Upcoming projects


*Thank you Omneky for sponsoring The Cognitive Revolution. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.




@LiJunnan0409 (Junnan Li)

@DongxuLi_ (Dongxu Li)

@labenz (Nathan)

@eriktorenberg (Erik)


Join thousands of subscribers to our Substack: https://cognitiverevolution.substack.com



Episode transcript at CognitiveRevolution.ai




Show Notes:

- Original BLIP demo

- BLIP-2 demo

- BLIP is the #18 most highly cited AI paper of 2022

- Image captioning comparison tool

- Understanding images with AI - for use in language models and image generation

- Image Aesthetics - Product & Model Reviews
