AI Engineer World's Fair (2024)

Multimodality Track



Substrate Launch: the API for modular AI

Rob Cheung @perceptnet / Substrate
Watch it on YouTube | AI.Engineer Talk Details

This talk by Rob Cheung introduces Substrate, an API for modular AI. He explains how Substrate helps developers build better AI products by connecting multiple AI models together, making the resulting systems faster and easier to use.

⭐ Moondream: how does a tiny vision model slap so hard?

Vikhyat Korrapati @vikhyatk / M87 Labs
Watch it on YouTube | AI.Engineer Talk Details

This talk by Vikhyat Korrapati discusses the development of Moondream, a small, open-source vision language model that performs comparably to much larger models. He explains the technical approaches used to create Moondream, including the focus on synthetic data generation, and shares insights about the importance of community engagement, open-source development, and the potential for small, efficient models in the future of AI applications.

Read our Deep Dive on this talk as well.

The era of unbounded products: Designing for Multimodal I/O

Ben Hylak @benhylak / Dawn Analytics
Watch it on YouTube | AI.Engineer Talk Details

This talk by Ben Hylak, founder of Dawn, discusses the challenges and strategies for designing effective AI products, drawing parallels with other "unbounded" technologies like the Apple Vision Pro. He emphasizes the importance of adding structure, familiarity, and hierarchy to AI interfaces, and predicts future trends in AI product design, including more intuitive preset options and personalized user experiences.

State Space Models for Realtime Multimodal Intelligence

Karan Goel @krandiash / Cartesia
Watch it on YouTube | AI.Engineer Talk Details

This talk by Karan Goel, CEO of Cartesia, focuses on the development of real-time, streaming AI models using state space models (SSMs) as an alternative to traditional transformer architectures. He discusses the challenges of efficiently handling long context and multimodal data, emphasizes the importance of compression in AI systems, and introduces Cartesia's work on creating more efficient and capable AI models for various applications, including a demonstration of their voice generation model.

The Hierarchy of Needs for Training Dataset Development

Chang She @changshe / LanceDB
Noah Shpak @ShpakNoah / CharacterAI
Watch it on YouTube | AI.Engineer Talk Details

This talk, presented by Chang She from LanceDB and Noah Shpak from Character.AI, discusses the importance of training dataset development for large language models (LLMs) and introduces LanceDB, a database format designed for multimodal AI data. They emphasize the need for efficient data management, analysis, and retrieval in AI workflows, highlighting LanceDB's features that address challenges in handling large-scale, multimodal datasets for both pre-training and fine-tuning AI models.

The Multimodal Future of Education

Stefania Druga @stefania_druga / Google
Watch it on YouTube | AI.Engineer Talk Details

This talk by Stefania Druga discusses the future of education with multimodal AI, focusing on how AI can be used to enhance learning experiences for children and families. She presents her research on AI literacy tools like Cognimates, which allow children to program and train AI models, and demonstrates real-time interactive AI systems that can assist with science, math, and curiosity-driven learning, emphasizing the importance of making AI tinkering accessible to young learners.

How to build the world's fastest voice bot

Kwindla Kramer @kwindla / Daily
Watch it on YouTube | AI.Engineer Talk Details

This talk by Kwindla Kramer from Daily discusses the challenges and solutions in building real-time voice AI applications, focusing on architectural flexibility and low latency. He emphasizes the importance of integrating components such as audio processing, transcription, and LLM inference into a single compute container to achieve faster response times, and introduces Pipecat, an open-source framework for building multimodal AI applications.