What's AI Episode 25: Jerry Liu. From RAG Strategies to Gemini's Impact in Tech

The What's AI podcast episode 25 with Jerry Liu: LlamaIndex CEO and co-founder


In this insightful episode of the What's AI podcast, I, Louis-François Bouchard, had a fantastic conversation with Jerry Liu, the CEO of LlamaIndex. We explore the complexities and challenges of AI technology, focusing on Retrieval Augmented Generation (RAG), the importance of great documentation, and the potential of emerging multimodal models like Gemini.

Jerry shares his expert insights on handling diverse data formats, particularly the perpetual challenge of processing PDFs. He discusses how multimodal models might offer groundbreaking solutions to these longstanding issues. We also touch on the critical role of clear and comprehensive documentation in the tech industry. Drawing from his experience at LlamaIndex, Jerry emphasizes the importance of creating documentation that is not just informative but also serves as a developmental journey for developers, guiding them from basic concepts to advanced applications in AI.

Our conversation takes a deep dive into the world of RAG, where Jerry explains its simplicity, efficiency, and practicality, especially for enterprise applications. He contrasts RAG with fine-tuning models, helping us identify "when to use which". This discussion is particularly enlightening for those implementing and optimizing LLM-based applications.

We also explore the technical aspects of chunking strategies and data quality in RAG systems. Jerry's insights into embedding models, fine-tuning for specific domains, and optimizing retrieval processes are invaluable for anyone looking to implement or improve RAG systems in their projects.

My conversation with Jerry makes complex AI concepts accessible to a wide audience, whether you're deeply embedded in the field of AI or just starting to explore its possibilities.

Don't forget to leave a like or a 5-star review to support the podcast! If you find this episode as enlightening as I did, please share it with your friends and colleagues who are keen on staying updated with the latest developments in AI. Let's dive into this episode on Spotify, Apple Podcasts, or YouTube:

Full Transcript:

Jerry Liu: [00:00:00] I think Gemini is very exciting. It's a multimodal model. If you take a look at some of the demo videos, it interleaves text, image, video, reasoning, audio. One of the exciting aspects is the multimodality, and the other aspect seems to be that the latency is pretty fast. What's exciting to me is not necessarily whether it completely outperforms GPT-4. Right now, what I've seen is that it seems to have some incremental improvement, but not necessarily a huge step change.

If you think about it, RAG is basically prompt engineering, because you're basically figuring out a way to put context into the prompt. It's just a programmatic way of prompt engineering. If we think RAG will get better, we can see it as a way of prompting so that you actually get back some context.

Louis-François Bouchard: This is Louis-François for the What's AI podcast, and in this episode I receive Jerry Liu, co-founder and CEO of LlamaIndex. LlamaIndex is at the center of the [00:01:00] RAG industry, which stands for Retrieval Augmented Generation: basically, combining large language models with your own data. In this interview, we talk a lot about RAG and LLMs.

Jerry shares a lot of really applicable insights on how to improve your current RAG or LLM system. I'm sure you'll take a lot away from this episode. If you enjoy it, please don't forget to share it with friends and leave a five-star review or a like, depending on where you are listening to this episode.

Let's dive right into it. 

Jerry Liu: Hey everyone, my name is Jerry. I'm the co-founder and CEO of LlamaIndex. For those of you who don't know, LlamaIndex is a data framework and platform to help you build LLM applications over your data. I'm really excited to be here. Thanks for the invite.

Louis-François Bouchard: Of course. Thanks.

Thanks for being here. And before we get into LlamaIndex and all the cool topics, could you share a bit more about when and how you got into the space?

Jerry Liu: Yeah, so I've basically been in the AI space for most of my working career. I graduated in [00:02:00] 2017, and at the time I was playing around with the initial iterations of generative models.

It was around GANs, so generative adversarial networks, for those listeners who are aware. At the time, they were pretty basic, and I remember being really wowed by the fact that you could basically generate bedrooms, or just 64 by 64 renditions of bedrooms, using these very basic models.

And so I did some basic deep learning research to try to train some of our own GANs and extended it to a 3D setting. And that kind of got me into the AI space as a whole. It's very broad: there's the application, like ML engineering space, there's the data science aspects, there's pure ML research, and of course there's the MLOps tooling to support the practitioners.

So I've kind of dipped my toes in pretty much all of these spaces throughout my working experiences, but I didn't actually play around with LLMs until around this time last year, back in October of 2022, when I was just starting to dabble with the GPT-3 API to explore the [00:03:00] capabilities.

I understood how generative models worked from the theoretical, conceptual side of things, but I hadn't really explored the full application potential. And so at the time I was just trying to hack around on some applications using GPT-3. And that's when I discovered the need to have some tooling and abstractions to build LLM apps on my own data.

I was trying to figure out how to feed my own data into GPT-3, and that's basically what kicked off the whole inspiration for this project.

Louis-François Bouchard: And going back a little, why did you choose to go directly into industry after your degree, instead of a master's, a PhD, or AI research? Like productizing instead of research?

Jerry Liu: That's a good question. I don't think anyone's actually asked me that. And to be honest, I did consider going back to grad school for a master's or PhD. I'm a bit of a weird case: honestly, if you were really into research, a lot of people around that time started applying to these [00:04:00] residency programs and then going on to PhDs at different grad schools, whereas if you knew you didn't actually want to do research, you stayed mostly an ML engineer or worked in an applied field.

I liked both. So I actually applied to a residency program; I was an Uber AI resident from 2018 to 2019. That's actually how I met my current co-founder, Simon. So I did research for about a year, and then that extended to two years under Raquel, who now heads Waabi but was the head of research at Uber's R&D in Toronto, working on self-driving cars.

And so I did have a solid stint where I worked on deep learning research, and I was deciding whether or not I wanted to continue that in grad school. But at the same time, the other direction pulling me back is that I fundamentally really enjoy hacking on things and thinking about how to build products.

And I mean, you can see flavors of that in what we do today, but that also motivated me to join an MLOps startup after my research.

Louis-François Bouchard: I guess you love the more applied way of learning as well. [00:05:00] You basically learn to apply, I guess, just like I do. And so you mentioned GANs, and I'm sure you understand the basics well.

I have two sides to this question. First, for working with large language models and recent AI systems, what are your thoughts on the necessity or utility of understanding how transformers work? And the second part is, is it also important to understand the basics? You mentioned GANs, but also the other, more basic neural networks and the math behind the models.

Do you think this is necessary? Is it useful? Is it useless? 

Jerry Liu: I think it's kind of like understanding computer architecture. I don't think anybody really understands how a computer works under the hood unless you did a PhD in electrical engineering, right, and you directly work on the transistors.

I think people are going to understand these things at different levels of abstraction, and there is basically infinite depth in ML. I actually think even if you look at the [00:06:00] subset of PhD researchers in machine learning, I don't think all of them fully understand the theoretical math, right?

Unless you're directly working in theoretical ML. So I do think it's just impossible for everyone to understand everything. Realistically, though, I would say if you're, for instance, an AI engineer working on applied problems, you do need to have the user experience of playing around with stuff like ChatGPT, understand the overall application frameworks, for instance retrieval augmented generation and agents, and what types of application use cases are emerging from these models, even if you don't fully grok the models themselves. And I think you also need to learn some best practices.

We can talk about this too, but I do think most AI engineers these days probably should have an understanding of evaluation and how to really benchmark things against a dataset for stochastic systems, at least right now, and who knows what will happen when the models get better. And of course, if you really want to get deeper into [00:07:00] AI so that you can try fine-tuning models or building your own models, to learn that additional layer, then yeah, I think learning some of the basics around how these models work, the basic math and stats, like backpropagation, those things are all helpful.

Louis-François Bouchard: And I wanted to dive into that a bit later in the podcast, but while we are talking about this: what would be, right now or in the near future, the useful tech stack to learn to get into the field? For example, someone who is in another field, or currently a college student, and doesn't have a programming background or any background whatsoever in artificial intelligence or a related space, but they are super intrigued by LLMs and want to build an app or work in this field. What would be your recommendation on what they should learn, and how would you recommend they learn it?

Jerry Liu: Yeah, so I'm starting off with the assumption that this person at least kind of knows how to program.

Because if [00:08:00] you don't know how to program, there's a different flow. But if you at least somewhat know how to program, whether in Python or TypeScript or Java even, you can at least learn the basics. And so what are the basics? One is, choose a large language model. The easiest to start off with is probably OpenAI, so that you can just get an API key and start using it: feed in some input, get back some output. Allocate a budget for yourself so you don't blow the monthly limit or accidentally rack up a huge bill. And then the next step is to pick a data source, or basically pick a use case, actually. For a data source, figure out what you actually want to do.

Is it that you want to answer questions over your documents, over your class notes, over your database? Is it that you want to actually send messages, or build a personalized chatbot with conversation memory? From there, learn some of the basic concepts of RAG, so retrieval augmented generation, as well as agents, as [00:09:00] well as how these generative models work.

And for those, I would basically recommend a framework like LlamaIndex. We have a bunch of educational resources on this for both beginner and advanced users. And then for vector databases, or just storage systems in general, I think that stuff comes into play if you actually have more data requirements, right?

And obviously that's kind of at the core of what we do, but practically speaking, if you find yourself wanting to understand a large corpus of data, then you should pick a storage system like a vector database. These days there are a lot of vector databases out there, so you have a lot of choices.

You can probably pick one to start with. Or honestly, for LlamaIndex, you can just use our very simple in-memory one. It's not scalable, but it just works out of the box. So you can try that too, if you want.
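For readers who want to see what that starting point looks like in practice, here is a minimal sketch using LlamaIndex's default in-memory index. The folder path and question are placeholders, and the exact import paths may differ across LlamaIndex versions:

```python
# Minimal RAG over a local folder with LlamaIndex's default in-memory index.
# Assumes OPENAI_API_KEY is set and ./data contains a few text or PDF files.
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()  # load and parse the files
index = VectorStoreIndex.from_documents(documents)       # chunk, embed, and index them
query_engine = index.as_query_engine()                   # retrieval + answer synthesis
print(query_engine.query("What are the key points in my class notes?"))
```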

Louis-François Bouchard: I definitely want to talk about RAG, as you mentioned, but first, maybe it would be most useful to dive a bit more into LlamaIndex, just because I think it's [00:10:00] an amazing tool.

And I would like to demystify it a bit for people who don't know much about it, or who are just unsure whether they should be using LlamaIndex or LangChain or their own thing or whatever. So first, could you share, in your personal opinion, when should one use LlamaIndex and who should be using it?

Is it for programmers? Who is it for, and when should we use it?

Jerry Liu: Yeah, definitely. So LlamaIndex is a developer tool, so it's oriented towards programmers right now, at least, and we'll maybe see what to do in the future. At a very basic level, it's a Python package as well as a TypeScript package.

So you pip install or npm install the package. And the main goal that we have right now is to provide the tools to make it easier for developers to build LLM applications over your data. And so if you have some data [00:11:00] source and you want to build some sort of knowledge-augmented chatbot over it, so that you can ask questions of your data and get back answers,

you can use LlamaIndex. LlamaIndex is very broad right now, and so basically any LLM application you want to build, you can build with LlamaIndex. The main reason I say LLM applications over your data is that that's been a core focus of the company for the past few months.

So this includes stuff like RAG, like building some sort of chatbot over your data. This includes stuff like structured data extraction. This includes talking to your SQL database if you want to run structured analytics; these are pretty popular use cases within the enterprise.

If you want to do summarization, you can do that too. You can also build random cool stuff, like agent simulations, for instance, or AutoGPT-like experiences where you can have a conversation with this chatbot and maintain conversation [00:12:00] history. We haven't invested as much effort there in terms of first-class, higher-level abstractions, but you can certainly build it.

And so maybe one thing people think about is when to use LlamaIndex versus just calling the OpenAI API yourself or writing your own tooling. It really depends on how much time you have.

And I think time is a valuable commodity these days. It just requires more boilerplate for you to set up a lot of these abstractions yourself and make them robust, versus using some of our stuff. And we also have a lot of educational materials right within our modules to show when to use certain modules for which use cases, for both simple and advanced applications.

Louis-François Bouchard: Could you quickly share either the differences or your advantages compared to, for instance, LangChain, or building it from scratch as you said, and lastly, the recent OpenAI Assistants API? When should one [00:13:00] use LlamaIndex, and what's the particularity or advantage of using LlamaIndex instead of all of those?

Jerry Liu: Yeah, I think I touched briefly on building your own from scratch. And as to the differences between LlamaIndex and LangChain, that's a very popular question. You know, we're both frameworks; at the end of the day, you can basically build whatever you want in either framework.

I would say LangChain has invested a bit more broadly in a variety of different things. I think we've been pretty focused specifically on tooling and abstractions for building stuff around your data, and one of the most popular of those is probably RAG. The way we think about a lot of these abstractions is how they are extensions on top of RAG to provide even more advanced search and analytics over your data.

So this includes chatbots. When we think about agents, we typically think about them in the data analysis use case. And so we try to make our abstractions both very customizable for the advanced user, and also very out-of-the-box and easy to use for the more beginner user. And so that's kind of our main focus, [00:14:00]

and how we think about the differences. As for the Assistants API from OpenAI, yeah, it's good. I think they released a lot of features during Dev Day, which I'm sure you might ask me more about in just a bit, but on the Assistants API specifically, what it is, for listeners, is basically a hosted agent-like experience that's capable of built-in retrieval and code interpreter, as well as function calling for any tools that you pass in.

I actually think it's pretty complementary, because in the end, what we really want to focus on is really good search and retrieval experiences over your data. And we actually have an Assistants API wrapper that's an agent, and we demonstrate that you should really just use LlamaIndex RAG pipelines as tools within this Assistants API.

So you have the Assistants API do function calling on top of stuff that you built with LlamaIndex. So use LlamaIndex to, for instance, index your data, your diverse data corpuses. Their own retrieval API right now is quite basic; I would not [00:15:00] recommend it for anything more than a toy data use case.

And so, you know, use the capabilities that we have to offer over your data, plug it into the Assistants API, the agent, and then see what happens.
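To make the "RAG pipelines as tools" idea concrete, here is a rough sketch that wraps a LlamaIndex query engine as a tool and hands it to an OpenAI-backed agent. It is illustrative only: the folder path, tool name, and question are made up, the class names follow the LlamaIndex versions available around the time of this recording and may have moved since, and the generic OpenAIAgent stands in for the Assistants API wrapper Jerry mentions.

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.agent import OpenAIAgent
from llama_index.tools import QueryEngineTool, ToolMetadata

# Build a RAG pipeline over your own corpus, then expose it as a tool the agent can call.
index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./filings").load_data())
rag_tool = QueryEngineTool(
    query_engine=index.as_query_engine(similarity_top_k=3),
    metadata=ToolMetadata(
        name="filings_search",
        description="Answers questions about the indexed filings.",
    ),
)

agent = OpenAIAgent.from_tools([rag_tool], verbose=True)
print(agent.chat("Summarize the main risk factors mentioned in the filings."))
```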

Louis-François Bouchard: I assume another advantage is also that, yes, you can use OpenAI's APIs to do whatever, but they also crash sometimes, even often. And so if your query to the OpenAI API crashes, you can also ask another language model, either your own or Claude or whichever.

So I guess that's also quite beneficial about using something external to OpenAI, like LlamaIndex.

Jerry Liu: Yeah, exactly. So in general, there's the broader point about what's the point of LlamaIndex if OpenAI is releasing all this stuff. And just to address that, basically, as you said, the [00:16:00] space is very competitive.

Gemini just released as of the time of this recording, right? And then, of course, there's a lot of competition in the open-source space as well. And what we've seen from a lot of users is that they want choice and they want good trade-offs. They want to not have vendor lock-in, so that they can pick and choose, maybe plug in some open-source model alongside OpenAI for different use cases.

And frameworks allow you to do that very easily, and save you time in actually trying out these different abstractions.

Louis-François Bouchard: You mentioned Gemini, and I actually wanted to talk with you about that, because you recently posted something about Gemini, sharing insights and your thoughts. I would love to know, what are your thoughts on Gemini?

Do you think this is a big thing, or won't it even compete with GPT-4? I've seen that they haven't even released the bigger model, Turbo or something, I don't remember the name, but they haven't released their bigger model yet. So yeah, [00:17:00] what are your thoughts on Gemini?

Jerry Liu: Yeah. So just for some quick context, I know about as much as you; the tweet was literally just from reading the blog post.

But I think Gemini, in principle, from reading the blog post, is very exciting. It's a multimodal model, from what they released. If you take a look at some of the demo videos, it interleaves text, image, video, reasoning, audio. One of the exciting aspects is the multimodality, and the other aspect seems to be that the latency is pretty fast.

What's exciting to me is not necessarily whether it completely outperforms GPT-4. Right now, what I've seen is that it seems to have some incremental improvement, but not necessarily a huge step change, on the pure text-based reasoning aspects. What's exciting to me is

making multimodal stuff practical and easy to use. These days, when you build multimodal apps, you stitch together a bunch of very disparate components that are not end-to-end optimized. You stitch together an LLM, like a [00:18:00] GPT. You add a text-to-speech service, and then a speech-to-text service,

and then you use that to try to bounce conversations back and forth. A big issue in a lot of these applications, not just audio but also image and video, is latency and speed, and actually making sure you have a good end-to-end pipeline. So I think if Gemini can actually be a universal model that can process and crunch a lot of data very quickly, I think it has a lot of exciting potential for more advanced

RAG use cases, right? And I think that's something we're pretty excited about, but also, in general, agentic use cases too. So RAG use cases include stuff like being able to crunch charts and tables within a document, and being able to process websites and do structured data extraction better than just pure text processing.

We're also excited about, for instance, being able to do some sort of augmented captioning, like given a video, you can generate a bunch of [00:19:00] text and have a conversation, but also being able to index your own internal knowledge corpus.

And so the stuff you index is not just text data, but also image, video, and audio data, those things. I do think this will probably become a bigger piece in the future. So far, I think open-source models have already made a decent amount of advancements in the space, but there are still some gaps to make this fully applicable on the production side.

Louis-François Bouchard: And right now, how do you deal with multimodality, like images, tables, PDFs, everything? How do you deal with that at LlamaIndex?

Jerry Liu: So, pre multimodal models, the multimodal piece we're still exploring, actually. That's kind of a work in progress, and Gemini will help accelerate some of that too.

The main issue, by the way, with playing around with current models as of the time of this recording is that a lot of them are either a little slow to use or heavily rate-limited, like GPT-4. [00:20:00] So, generally speaking, processing a complex document is quite interesting to us, and we've actually posted a lot about this in the past.

How do you, for instance, process an embedded table within a PDF? If you have an SEC 10-K filing, an arXiv research paper, any sort of legal brief, you have a bunch of charts and tables, right? And you want to somehow index that. And so we have a lot of abstractions within LlamaIndex that basically allow you to hierarchically summarize and model this entire document, not just as a flat list of text chunks, but actually as a document graph.

So you have an entire graph of node objects that link to other node objects. And when you do retrieval, you don't do retrieval just on a flat top-k of the document text chunks; you actually do retrieval over this document graph. And this allows you to query and understand different object representations within this [00:21:00] data.

So, for instance, for tables, you don't necessarily need an image screenshot of it. You could use a text parser and then clean that up, or model it like a CSV or data frame. For charts, it's a little more tricky. There are a lot of OCR models right now for charts, and some of them are pretty interesting, but I think there's still a gap in really understanding some of the complex stuff.

And so maybe this is something that multimodal models will help with.
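As a rough illustration of the document graph idea, here is a conceptual sketch, not LlamaIndex's actual implementation: each node carries a short summary and links to child nodes such as sections or tables, and retrieval walks the graph by matching the query against summaries instead of doing a flat top-k over raw chunks. The score function is a placeholder for whatever embedding similarity you use.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    summary: str                      # short summary used for matching against the query
    content: str                      # raw text, or a table rendered as CSV/dataframe text
    children: List["Node"] = field(default_factory=list)

def retrieve(root: Node, query: str, score: Callable[[str, str], float]) -> Node:
    """Instead of a flat top-k over raw chunks, walk the graph by following
    whichever child's summary scores highest against the query."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: score(query, child.summary))
    return node
```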

Louis-François Bouchard: Yeah, I hope so. We are also facing lots of challenges with the different PDFs. I guess that's an ongoing problem for as long as PDFs exist, but yeah, really cool. I definitely hope Gemini will help with that, or just other multimodal models.

Would you have any tips for companies or individuals to ensure that they have good documentation and good communication skills?

Jerry Liu: I feel like our documentation has historically been not that great. A lot of this honestly came from our VP of developer [00:22:00] relations, Laurie. He basically came in and revamped the documentation. Actually, again, as of the time of this recording, we're going through another revamp to try to add more resources.

We've added more stuff and tried to rearrange some of the sections to make them a bit more clear. I think, in general, you have to think about the audience. And for us, again, this is in the context that the documentation has historically not been great, the audience we really tried to cater to is the vast majority of developers who are still onboarding onto the concepts, and so they want to see the documentation as an educational resource that supports them throughout their developer journey.

I, for instance, am not the target user of the first 70 percent of the documentation, because I already know the modules and the abstractions; I would go straight into the API reference as well as some of the module guides, to just copy and paste reference code.

But for the vast majority of users, we wanted to frame the documentation [00:23:00] such that it was a journey: from you're a beginner in RAG, you're just starting out, you go through the quickstart, to you're actually trying to build a full-stack RAG application, and what are the things you try to set up?

Oh, here are agents; how do you think about that, and how does that factor into this experience? So now you're trying to optimize the entire system; what are the strategies, tips, and tricks you can use to do that? And I think that really did help a lot, because it made the documentation a lot more accessible, so that you could just read through it top to bottom.

Louis-François Bouchard: So I assume it's just like, in my case, creating videos: you need a story, a good storyline. And even if you are just explaining something, it follows a story from an introduction through the development and so on. So it's kind of the same thing for better documentation.

Jerry Liu: Right? Exactly. I think that's a key principle.

Louis-François Bouchard: Nice. I think that's something I've never heard about documenting your code. It's really cool, and I hope to see more of that in the future from [00:24:00] other people.

Jerry Liu: Yeah. And for the audience, by the way, we're always improving it. I actually don't think the documentation is perfect.

There are definitely things that are missing or could be improved. So if you have feedback, please let us know.

Louis-François Bouchard: Perfect. And now, yeah, I'd love to dive more into retrieval augmented generation, since that's a big part of what you're doing. First, you already explained what it was, but could you share your insights on why it is so hyped and so popular?

Like, why does everyone want to do retrieval augmented generation instead of other alternatives, for example fine-tuning, where you retrain a powerful model on your personal data? Why is RAG so popular now?

Jerry Liu: I think that's a good question. I think, first, maybe I'll explain what retrieval augmented generation is.

and then I'll talk a [00:25:00] little bit about why it's probably the main enterprise use case these days, and there's probably a reason for that. So what is retrieval augmented generation, or RAG for short? RAG basically means that you fix the model. The model doesn't change; you're not training it anymore.

It's already pre-trained. You're just using the OpenAI API out of the box, or a Llama 2, or whatever. And then you take some data corpus, and RAG is just the way you combine the two. So the way it works is you load in some data from your data corpus. It could be PDFs, databases, CSV files...

okay, actually not CSV files, forget about that, but some sort of semi-structured or unstructured data. And then you want to index it, and typically you index it into a vector database. So you take some of these documents, chunk them up, embed them, and put them into a vector database. And then, now that the data is in your storage system, you can basically build this RAG pipeline where, given a user query, you first retrieve the relevant context from the vector database.

And [00:26:00] vector databases expose an endpoint where you can basically fetch the most similar documents given a user query. And then you take this text and you stuff it into the prompt. So, you know, just imagine a giant text string with some room in the middle. You take all that text and just dump the documents as plain text into the input prompt.

And this is the thing that you use to generate an answer. So the prompt template looks something like: here's some context, dump the context in; here's the question, dump the question in; and then you get the answer. And you can honestly replicate this experience by just opening up ChatGPT in your browser, taking some random web page or article that you see, copying and pasting the text into ChatGPT, and asking it to summarize stuff for you, right?
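Spelled out, that template is roughly the following; the wording and the placeholder values are illustrative, not LlamaIndex's exact default prompt:

```python
# Placeholders standing in for the top-k chunks fetched from the vector database
# and for the user's question.
retrieved_context = "...top-k document chunks, dumped as plain text..."
user_question = "What does the report say about revenue growth?"

prompt = f"""Context information is below.
---------------------
{retrieved_context}
---------------------
Given the context information, answer the following question.
Question: {user_question}
Answer:"""
print(prompt)
```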

For anyone using ChatGPT, people do this all the time. That's basically RAG, just automated in some sort of systematic setting. And so, why is this so popular? One, because it's a core [00:27:00] use case for search and retrieval; it's a very valuable use case.

We've talked to a lot of companies, and a lot of them see a lot of value in being able to extract insights from their unstructured data in a performant, fast, and cheap way. Two, it's really easy to set up. In LlamaIndex you can get something working, the basics, in five lines of code, right?

You don't need a training dataset. You don't need human annotators. You actually don't even need to wait 30 minutes; you can basically get the setup immediately. A third part is that other use cases that are cool and exciting work less well. So agents are cool and exciting; there's a lot of cool stuff around agents.

In fact, there was a lot of hype around agents back in April and May, but as people started playing around with them, they realized that there were a lot of use cases they weren't quite able to solve. And so, as people get more excited about agents, we're seeing people branch out from RAG into agent-like stuff, [00:28:00] but just in a slightly more limited fashion.

And as these models get better, we'll see step changes in the capabilities of agents. But for right now, RAG strikes a nice balance: it solves the use case, it's easy to set up, and it also works decently well.

Louis-François Bouchard: Yeah. That, and it also reduces hallucinations compared to fine-tuning.

This is just a quick interruption to remind you to like or leave a 5-star review depending on where you are watching this episode from. It helps the channel a lot and it's an amazing way to show me if you are enjoying this episode or not. Thank you for watching and I will let you enjoy the rest of the episode.

Could you share a bit more on when you would use fine-tuning of a model rather than RAG, or vice versa?

Jerry Liu: Right. So now the RAG versus fine-tuning debate. I think for the vast majority of developers, if you have some knowledge corpus that you want to understand, and you want to ask questions over it in a search or retrieval setting, [00:29:00] you should do RAG.

I think fine-tuning has the theoretical, conceptual capability to do everything that RAG does, because if you think about it, ChatGPT itself is trained over some corpus of data. You ask it about stuff, it'll be able to regurgitate things, and it can answer questions for you.

It's just that, practically speaking, there's no magical endpoint right now where you can just fine-tune something and it magically learns your data. Most fine-tuning tutorials, if you look on the web, and we actually have some fine-tuning abstractions ourselves, are either very incremental or for limited use cases.

So you can't just run some random process that will fine-tune in the background and automatically memorize every new piece of information that you give it. So that's one: the UX is just not there. Fine-tuning takes more time, it's harder to set up, and it also has more incremental use cases.

That said, I think fine-tuning will generally do things that RAG [00:30:00] can't do, because it can do some sort of overall training over the entire dataset. This includes better adherence to any sort of system prompts that you set up. So if you really want to, say, un-RLHF the model and then re-RLHF it with your own custom instructions, that would be a way to do it,

so that it acts more consistently according to your guidelines. You can get it to, for instance, output stuff in certain styles, those types of things. And we actually have some basic support for fine-tuning in LlamaIndex. For instance, we have tutorials on fine-tuning Llama 2 or other models for use cases like structured data extraction and text-to-SQL.

We have also used fine-tuning to fine-tune embeddings for better retrieval performance. That's actually another aspect that people oftentimes miss: you can, and probably should, fine-tune the embedding model once you really start to optimize these systems. So they're definitely complementary.

It's possible that fine-tuning replaces some parts of [00:31:00] RAG in the future; we just don't see that yet.

Louis-François Bouchard: Yeah, you can optimize both the indexing part of your data as well as the querying part. Could you share a bit more on the basics of both, how much you can improve them, but also how you can improve them?

Jerry Liu: So yeah, the way RAG works, again, is you index some data. And what does indexing mean? It means you take each document that you're putting into a vector database and you embed it, right? So you feed it to an embedding model and you get back an embedding. This embedding is basically the index on top of this data,

and this serves as the thing you use to do retrieval. And so for this embedding, you can fine-tune the model to output better embeddings. With a pre-trained model, like a pre-trained Hugging Face model, for instance, they're typically pre-trained on large amounts [00:32:00] of data, but not necessarily data specific to your domain.

So given the types of questions that you want to ask, if it's very domain-specific, that embedding might not actually be optimized so that the relevant context is retrieved for the question you want to ask. So you can actually try fine-tuning this embedding model, and there are a variety of ways you can do it.

One is, if you have the model weights, if it's just a Hugging Face model, you can basically download it and fine-tune it yourself, if you know how to write PyTorch. There are also services out there that will do this for you, and we have some abstractions that try to make it pretty easy for you to use these different services.

This includes... actually, wait, sorry, let me scrap that. I can't think of any off the top of my head that do pure embedding fine-tuning. I'm sure they exist; I just can't guarantee that the companies I'm going to name are going to be correct. But the other thing you can do, and this is also a conceptual thing that's pretty interesting, is you can basically take an existing embedding, right?

So you don't need to [00:33:00] fine-tune the base model. You can take an existing embedding, and it can be generated by a black box, like OpenAI's Ada. And then you fine-tune a transform on top of this embedding to basically transform this embedding representation into another embedding that better models your specific data.

It could be a linear transform, it could be a neural net, but you can basically fine-tune an additional adapter model on top of the base model. There are trade-offs and complexities depending on whether you fine-tune this on the document side versus the query side, but we actually have some basic capabilities in LlamaIndex to let you do that.

And we also encourage users to try doing that themselves, but it's a pretty interesting principle, because you can take a frozen model and just fine-tune some transform layer on top of it to adapt to your data domain a little bit more. On the LLM fine-tuning side, I think some core use cases we've seen that we found pretty interesting are: [00:34:00] you take a weaker model,

so, for instance, a cheaper, weaker model like a Llama 2, and you try fine-tuning it to better output stuff like structured outputs. In a lot of RAG use cases we've seen, a lot of users want to output stuff in JSON format, and the best model for outputting stuff in JSON format right now is GPT-4.

So if you're able to fine-tune the smaller models to better obey this type of task, then you can use a smaller, cheaper model. That's a use case that we're pretty interested in. And then some other ones include being able to generally distill the prompt and instruction following from a more powerful model, like GPT-4, into a smaller model.
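Here is a minimal sketch of the "adapter on top of a frozen embedding model" idea in PyTorch, assuming you already have (query, relevant-chunk) pairs and their embeddings from the black-box model; it is illustrative only, not LlamaIndex's fine-tuning module, and the dimensionality and loss are assumptions:

```python
import torch
import torch.nn as nn

EMBED_DIM = 1536  # e.g., the dimensionality of a black-box embedding model's output

# A linear adapter applied to query embeddings; the base embedding model stays frozen.
adapter = nn.Linear(EMBED_DIM, EMBED_DIM, bias=False)
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-4)

def train_step(query_emb: torch.Tensor, positive_doc_emb: torch.Tensor) -> float:
    """Pull the transformed query embedding toward the embedding of a chunk
    that is known to be relevant for that query."""
    transformed = adapter(query_emb)
    loss = 1 - nn.functional.cosine_similarity(transformed, positive_doc_emb, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```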

Louis-François Bouchard: And regarding the other sides of the RAG system, for example chunking: you need to divide your data, or do you? So what's the importance of chunk size, and also of the chunking [00:35:00] strategy that you want to take? How do you select which strategy to use,

and what size should each chunk have?

Jerry Liu: That's a good question. I think overall people don't think enough about the data pipeline for RAG; they use some very standard chunking strategy, and this turns out to be suboptimal, and that's actually why their RAG pipeline is failing. The few variables or parameters you have to think about when you build a chunking strategy are: one, the quality of your file parser.

So how well are you actually parsing out text from the PDF, let's just say a PDF as an example. There are different types of PDF parsers available; some are better than others, and some are better or worse at extracting more complex things like two-column formats, headers, tables, those things.

The next step is chunking. So, how do you actually split up the text? By the way, first of all, why do you actually need to chunk things? The reason is [00:36:00] to reduce a bigger document into smaller, bite-size pieces of context, so that, one, it doesn't overflow the context window when you do retrieval.

Two, it reduces the number of tokens that you use, and as a result reduces cost and also the time it takes to actually generate a response. But yeah, the way you chunk the documents does impact retrieval performance quite a bit. I think generally with flat chunking you see a U-shaped curve in terms of error rate.

So if the chunk sizes are too small, you don't actually return enough context so that you can properly answer the question at hand. If the chunk sizes are too big, you start running into attention, "lost in the middle" problems, as well as obvious increases in cost and latency.

Sometimes the relevant context is just lost in the middle of a very big chunk, and so the LLM doesn't quite understand what's going on. And so I do [00:37:00] think the world will probably move in the direction of bigger and bigger chunk sizes as these models get better, but for now, this is something you have to tune.

And the other part here, something we typically cover in the advanced retrieval section, is the whole idea that just splitting text by sentences, paragraphs, whatever, is pretty arbitrary. And so sometimes you just inadvertently split context down the middle.

You might have a relevant section, and because you pre-split it, the relevant section is now cut in two. And so unless both of those chunks are retrieved, you're not going to have all the context you need to answer a question. So because it's kind of arbitrary, I think there are definitely improvements to be made on how you better chunk and process these things, instead of just doing it by every sentence or paragraph or by a fixed chunk size.

That's something we're actively exploring. 
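To make the two knobs Jerry keeps coming back to concrete, here is a toy fixed-size splitter with overlap. Real systems split on sentence or section boundaries (LlamaIndex ships several splitters for that), so treat this purely as an illustration of chunk size and overlap:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Naive fixed-size chunking over whitespace tokens, with overlap so that
    content near a boundary still appears whole in at least one chunk."""
    tokens = text.split()
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

# Too small a chunk_size drops the surrounding context the LLM needs;
# too large wastes tokens and buries the relevant passage mid-prompt.
```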

Louis-François Bouchard: And what's the importance of data quality inside those chunks? Do you have any techniques to find irrelevant chunks, [00:38:00] or are you curating them? Because, for example, I assume a lot of companies will build RAG systems based on their internal documentation and whatever they have, but lots of things

are obsolete or are older versions, or if your chunking is a bit weird, a chunk may have just the last sentence embedded, which is completely useless. So how do you curate or improve the chunks? Do you have any insights on post-processing, or do you do anything to improve that?

Jerry Liu: Yeah, I think that's a good question. And this gets relatively deep, and honestly, these are some thoughts off the top of my head, but this is also something, just transparently, we're actually working on to try to understand ourselves. There are pieces on both the ingestion and the retrieval side that you can work on to try to improve the final quality of the context that's retrieved, so that your LLM has access to [00:39:00] the most precise, relevant information.

On the ingestion side, you do need to spend some time picking a good PDF parser. So, for instance, Unstructured is actually pretty good, I think, for being able to partition out tables and charts so you can do stuff with them later. And some PDF parsers are pretty bad, in that you can tell the text itself is just very messily formatted.

The other piece is that I would probably spend some basic effort setting up an evaluation benchmark and seeing which chunk sizes actually lead to the best generation performance; chunk size is a parameter that you can tune.

In terms of overall splitting strategies, there are different ones: sentence splitting, paragraph splitting. If you can, try to preserve sections so that contiguous sections aren't split in half. So, for instance, if you're splitting a Markdown file, it makes more sense to keep a contiguous [00:40:00] section versus just having some overlap between two sections.

We actually have a Markdown text splitter that tries to preserve that to the best of our ability. And then the other piece here is that there are more data improvements you can add on top of the chunks themselves, and there are a few that I'll talk about. One is metadata.

This is actually really important, because if you just have the raw text chunk, it's not contextualized; the embedding of that text chunk, or the LLM, will not really know where this text chunk falls in relation to anything else that's retrieved. Even adding stuff like the file name, a short summary,

or a higher-level, abstractive summary of what this thing is about, can help retrieval, metadata filtering, and a variety of other things. There's more stuff here too, but these are some of the aspects that users consider on the ingestion side.
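As a quick illustration of that metadata point, LlamaIndex lets you attach metadata to each document or node before indexing; the field names and values below are placeholders, not a required schema:

```python
from llama_index import Document

doc = Document(
    text="...raw chunk or section text...",
    metadata={
        "file_name": "2023_10k_filing.pdf",                   # where the text came from
        "section": "Risk Factors",                            # positional context in the source
        "summary": "One-line summary of the parent section",  # abstractive summary
    },
)
# The metadata travels with the chunk, can be included in the embedded text,
# and can be used for metadata filtering at query time.
```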

On the retrieval side, I would typically say you draw a lot [00:41:00] from traditional retrieval practices: how do you retrieve the most relevant context given the query? Part of this, again, is the data quality we just talked about. The other is the algorithm. Typically, for most production retrieval pipelines that we've seen, even independent of the LLM, you have some two-stage path where you first do some sort of top-k retrieval via embeddings,

and then the second stage is that you might want to do some re-ranking, and fusion between different retrieved results. And even before that, before you actually launch a query, you might want to do some sort of query rewriting or decomposition, to decompose a complex question into smaller ones.

A general reference advanced architecture, and I'm not sure if this is the most general form available, but I certainly have decent confidence it will probably do better than the basic stuff, is: start off with a query and decompose it into sub-queries, if you can, over different tools. And LlamaIndex does have a bunch of abstractions to help you do that. [00:42:00]

Decompose it into sub-queries. Now, for each underlying, smaller query, execute it against some sort of retriever. And you can have, for instance, multiple retrievers, one doing hybrid search, one doing keyword search; it's called ensemble retrieval. So you have a bunch of candidate chunks.

And that's your first-stage pass, so all of a sudden you have a bunch of chunks, all from different retrievers, and some of them might be duplicates. And then you do re-ranking and filtering; that's the second-stage piece, where you actually try to filter for the most relevant ones.

This typically uses a more powerful model than just a pure embedding model. Cohere, for instance, has a re-ranker. You can use the LLM itself to re-rank stuff. You can use a cross-encoder model; a variety of these are publicly available. And you finally get back the relevant context

that you can feed to the language model. So this is just an example architecture, right? But these are general practices that people should think about.
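A rough sketch of that reference architecture, with the decomposition, retrievers, re-ranker, and LLM passed in as placeholder callables (each one could be an LLM-based query decomposer, an embedding or keyword retriever, a cross-encoder, and so on):

```python
def answer(query: str, retrievers, decompose, rerank, llm, top_n: int = 5) -> str:
    """Two-stage retrieval: fan sub-queries out to several retrievers,
    dedupe the candidate chunks, rerank with a stronger model, then synthesize."""
    candidates = {}
    for sub_query in decompose(query):          # e.g., LLM-generated sub-questions
        for retriever in retrievers:            # e.g., vector search + keyword search
            for chunk in retriever(sub_query):
                candidates[chunk.id] = chunk    # dedupe by chunk id
    top_chunks = rerank(query, list(candidates.values()))[:top_n]  # cross-encoder or LLM reranker
    context = "\n\n".join(chunk.text for chunk in top_chunks)
    return llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```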

Louis-François Bouchard: Yeah, super interesting. [00:43:00] Would you have any insights on, you mentioned the OpenAI Dev Day, and now evaluation. I know that at Dev Day they mentioned that they were using Ragas to build

an evaluation pipeline. Is that the one you would also recommend, or how would you evaluate any new RAG system?

Jerry Liu: Yeah, Ragas is great. I actually should play around with it more in depth, but I know the project creators, and it's a great framework. We also have some of our own eval modules,

but they're very basic, and so we're looking at both improving them and also integrating more with these third-party providers. I can maybe talk conceptually about what evals consist of, because we also just did a DeepLearning.AI course with TruEra on building and evaluating advanced RAG applications, which all of you should check out if you haven't. In terms of evaluating RAG, just conceptually, there are a few key areas.

One is, you have this question [00:44:00] that's asked, and then you get this predicted response, and you also have the retrieved context, right? And if you have a ground-truth answer or ground-truth retrieved context, you can measure both generation as well as retrieval metrics, based on how close the predicted answer is to the reference answer, as well as how close the retrieved context is to the ground-truth context.

But that's a basic metric, a correctness metric, and there's also stuff beyond that. You can actually measure the predicted response relative to the context; this is called faithfulness. If the response isn't actually grounded in the context that's retrieved, that means the LLM is probably hallucinating.

You can also measure whether or not the response, one, adheres to the guidelines, like structured outputs, those things, but also whether the response answers the question, right? And this is relevance: does it actually answer the question? So there are both [00:45:00] retrieval as well as generation metrics that you can define,

and you can also define metrics in a label-free or with-label setting. And for retrieval metrics, you can just compute standard ranking metrics, so stuff like NDCG, MRR, that type of stuff. For comparing the quality of generated responses, what a lot of users do is they typically use an LLM.

So you use GPT-4 as a judge, and you can basically judge whether or not the quality of this response matches the quality of another response.
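On the retrieval side, the ranking metrics Jerry mentions are straightforward to compute yourself once you have a small labeled set of questions and their expected chunks; here is a sketch of hit rate and MRR over such a set (the chunk ids are made up):

```python
def hit_rate_and_mrr(results: list[tuple[list[str], str]]) -> tuple[float, float]:
    """results: for each eval question, (ranked retrieved chunk ids, expected chunk id)."""
    hits, reciprocal_ranks = 0, 0.0
    for retrieved_ids, expected_id in results:
        if expected_id in retrieved_ids:
            hits += 1
            reciprocal_ranks += 1.0 / (retrieved_ids.index(expected_id) + 1)
    n = len(results)
    return hits / n, reciprocal_ranks / n

# Two eval questions: the first misses, the second finds its chunk at rank 2.
print(hit_rate_and_mrr([(["c3", "c7"], "c9"), (["c1", "c4"], "c4")]))  # -> (0.5, 0.25)
```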

Louis-François Bouchard: Awesome. Do you think RAG is the future of memory-based language models, or will, say, GPT-7 be good enough to just answer all your questions?

I understand that there is some private data, so there will necessarily be something to do with this kind of [00:46:00] use case, but will it become super easy to quickly fine-tune a model? What are your thoughts on the future of LLMs? Basically, where I'm trying to go is: do you think LLMs will become powerful and easy enough to use that you will only require them,

or will they still be just a small part of a bigger product that makes it all work much better?

Jerry Liu: Yeah. I mean, I think there are a few points in here, and some of these are actually pretty interesting to think about. Some of these I don't actually have the right answers for; we can mostly speculate. And some of these are things that I obviously hope to be more true than others, because we're building a framework

that's kind of at the orchestration side. So, one is whether the current form factor of RAG will last. To be honest, probably not. In fact, you can already see this evolving. Okay, by the way, what is the form factor of RAG? By that I mean [00:47:00] specifically taking in a bunch of text, chunking it up using some naive text splitter, and then doing top-k retrieval, right?

Part of what we've invested in over this entire past few months is moving beyond that, because people already find that the basic stuff doesn't quite work as well as they want. So I think that, just even conceptually, in terms of the reference RAG architecture, the best practices around that will probably evolve. As to whether or not a lot of this retrieval will get absorbed into the model:

I think retrieval itself will probably be absorbed into the model in some way. You'll probably start to see, one, longer context windows, even infinite ones, where somehow the transformer architecture will leverage top-k lookup or a vector database under the hood.

This started with stuff like DeepMind's Retro last year, and stuff like the Memorizing Transformers paper. There will probably be interesting developments in the model space. I don't think, [00:48:00] however, just from my own gut intuitions about the ML research, that this will basically obviate the need to do any sort of indexing and storage of data on the developer side, because just being able to index arbitrary amounts of data is still generally a pretty hard problem,

and I think the amount of compute and the cost you need to actually solve this will still probably be quite high. I think there is a world where AI gets better, or gets much better very quickly. I think everyone agrees that AI will get better at longer context windows and at understanding things much more cheaply and quickly,

and also the fine-tuning aspect, being able to just fine-tune on arbitrary amounts of new data. I think the question is how compute-constrained we are, and how cost-constrained. And I think it's a very exciting future. I personally think I'm probably somewhere in the [00:49:00] middle, in between the people who think GPT-4 is the peak and the people who think AGI is going to happen in the next three months.

And part of this is also just that, practically speaking, I think most people are still having trouble accessing compute, to be honest. And also, just looking ahead at the future developments of these models, the extremes still seem a little bit far away.

Louis-François Bouchard: Yeah, I do agree. And maybe more specifically about a skill that is currently required to develop such LLM-based systems: one of the biggest skills, which I believe I'm not hearing a lot about anymore, is prompt engineering, which was extremely popular a few months ago.

And now I think it's maybe way less popular, but do you think prompt engineering is here to stay, or will [00:50:00] LLMs become better and better at understanding English or other languages so you can just easily use them? Is it still a promising skill to learn for any future LLM app, LLM developer, or even user?

Jerry Liu: Yeah, I think for that, it kind of depends what you mean by prompt engineering. if you define prompt engineering as literally fiddling around with like the F string and, and adding like some random brackets and trying to, you know, like add some, special character or like stop token or something so that you can generate things.

yeah, I think that will probably go away. But I actually think the need for prompting in general will not. Defining it in a very general sense, it will either stay the same or actually even go up as time goes on. What do I mean by this? If you think about it, RAG is basically prompt engineering, because you're figuring out a way to put context [00:51:00] into the prompt.

It's just a programmatic way of prompt engineering. If we think RAG will get better, we can still see it as a form of prompt engineering: a way of prompting so that you actually get back some context. Any higher-level abstraction you build on top of LLMs requires some user description of a task.

And so no matter how much the LLM-powered agent under the hood will solve stuff for you at a high level, you still need to tell it a task to solve, and it needs to be able to solve it for you. So I do think this idea of prompting will probably continue to be a need, even if you're abstracting away a lot of the lower-level need to fiddle around with a specific word.

Even with stuff like the Assistants API, the input interface is still English, right? The way you interact with it is still English. So the need for English as an input is still going to exist. [00:52:00]
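Jerry's point that RAG is "programmatic prompt engineering" can be summed up in a few lines: retrieval just decides what ends up inside the prompt template before the model call. The sketch below is illustrative only; `retrieve` and `llm_complete` are hypothetical placeholders for whatever retriever and LLM client you actually use.

```python
# Sketch of "RAG is programmatic prompt engineering": retrieval decides
# what context gets templated into the prompt before calling the model.
# retrieve() and llm_complete() are hypothetical placeholders.
from typing import Callable, List

PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

def rag_answer(question: str,
               retrieve: Callable[[str], List[str]],
               llm_complete: Callable[[str], str]) -> str:
    # Step 1: retrieval picks the context chunks.
    chunks = retrieve(question)
    # Step 2: the "prompt engineering" is just string formatting.
    prompt = PROMPT_TEMPLATE.format(context="\n---\n".join(chunks),
                                    question=question)
    # Step 3: the LLM answers grounded in the retrieved context.
    return llm_complete(prompt)
```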

Louis-François Bouchard: But, similar to the earlier question about transformers and the basics to know, do you think end users need to understand prompting and become better prompters, or should they not even look into that and not learn anything about it?

Jerry Liu: Yeah, okay. So I don't think the prompt engineering category itself will necessarily last as a differentiating career title, if that makes sense. But I do think the ability to use AI will continue to be a differentiator, because no matter how high-level these abstractions are, you need to figure out how to best take advantage of them.

To help you, basically. People who learn how to use ChatGPT effectively get, you know, a 30 percent efficiency boost versus people who don't know how to use it. So I do think that part will still matter, [00:53:00] and might actually become more important as these models get better.

Louis-François Bouchard: Yeah. I recently spoke with a product manager at Google, and she said the skill she is now looking for is communication, but mostly communication with language models, because you can do almost anything if you are good at using them. Just like back in the day, if you were good with Google, you could find how to code anything on Stack Overflow.

Now it's just like Stack Overflow on steroids. Yeah, exactly. And I have one last question for you, which is one I often ask, but now it's more personal to you: what do you use AI for? Do you use ChatGPT or GitHub Copilot or any generative AI? What do you use it for?

Jerry Liu: Yeah, I use ChatGPT a lot, actually.

I use Copilot. I've been meaning to try out some [00:54:00] other stuff like Cursor, basically coding assistant tools. On our own Discord we use Kapa AI for documentation Q&A, we use Mendable for our search, and we also use Dosu Bot on our GitHub. There are some kinks and rough edges, but they're good.

They've objectively saved us time in terms of manually going in and trying to answer issues or respond to questions, those types of things. I've been meaning to try out more tools. I actually used LlamaIndex itself to try to process a lease agreement and answer questions over it, so that was fun. But yeah, I'm looking forward to dogfooding a bit more.
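If you want to try what Jerry describes, a rough sketch with LlamaIndex looks like the snippet below. Exact import paths and defaults vary across LlamaIndex versions (older releases import from `llama_index` rather than `llama_index.core`), by default an LLM and embedding provider such as an OpenAI API key must be configured, and the folder path and question here are made-up examples.

```python
# Rough sketch: question answering over a single document with LlamaIndex.
# Import paths differ by LlamaIndex version; the path and question are
# hypothetical examples.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load the document(s), e.g. a PDF of a lease agreement placed in ./lease/.
documents = SimpleDirectoryReader("./lease").load_data()

# Build an in-memory vector index and a query engine over it.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Ask natural-language questions over the document.
response = query_engine.query(
    "When does the lease term end, and what is the monthly rent?"
)
print(response)
```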

Louis-François Bouchard: For coding, when or why do you use Copilot versus ChatGPT? When would you use each?

Jerry Liu: Oh, okay. Yeah. So, I mean, Copilot's just Copilot.

It's just there, right? If it's just [00:55:00] there, I'm just going to use it. In fact, it's there without me even needing to toggle anything; it just shows up and you just hit tab to accept it. I think that's actually a pretty important UX. I've started using it because sometimes I don't find answers on Google,

and sometimes the question I ask is somewhat complex and I need it to synthesize something from disparate sources of knowledge, and it actually does a pretty good job sometimes. Yeah, but I,

Louis-François Bouchard: What I meant is that, on my end, I prefer to use ChatGPT for coding over Copilot, so I just wonder why. I assume you are using Copilot because of the UX, as you said, and I think that's also how AI will become more and more popular with the general public: by integrating it so it's easy to use inside the applications people are already using. But have you tried using ChatGPT as a coding assistant?

And [00:56:00] if so, why do you prefer Copilot? Is it purely the UX, or are the results also better?

Jerry Liu: Oh, no, I mean, I do use ChatGPT to help with debugging, and I also use it to look up stuff that I forgot, like, for instance, a Pydantic method definition for a certain operation.

I actually do that. I think the reason I use Copilot is just that it's right there as I code in VS Code, and also I'm more familiar with Python. So when I was trying to learn or code a little bit in TypeScript, or even do some stuff with Streamlit, for instance, I used ChatGPT a bit more to learn something new. But for stuff I'm more familiar with, it's easier to integrate Copilot with my existing workflow.

Louis-François Bouchard: Yeah, that's a perfect answer. I completely agree: use ChatGPT if you are learning, and Copilot might be better for productivity and just coding faster. That makes perfect sense. Amazing. [00:57:00] Is there anything you'd like to share, or where can people learn more from you or LlamaIndex?

If you want to share anything about the company or yourself, please feel free.

Jerry Liu: For sure. Yeah, I think we covered most of the basics. The only two things I'll say are: one, we're investing a lot more not just in advanced RAG techniques on the core tech side, but also in full-stack AI application development.

So if you have suggestions or feedback, let us know. We've launched products like create-llama, which is like create-react-app but for AI engineers, and we've launched chat-llamaindex and SEC Insights, basically full-stack application demonstrations of how these AI UXs can be created. The other piece I'll say is, if you're an enterprise, we're working with enterprises, and we're always interested in seeing how you're adopting all of this and your different use cases.

We're hosting office hours, basically, so DM me and we'd love to chat and learn more about your use cases and pain points. [00:58:00] Super cool.

Louis-François Bouchard: I'm really excited to see all the future updates coming. I'm also using LlamaIndex, and we are going to use it more for an upcoming project at Towards AI.

So I'm super excited about that. But yeah, thank you very much for your time and your insights. It was super cool to dive a bit deeper into RAG and to have an expert on this topic to talk with. I highly appreciate you taking the time, and thanks again for this amazing discussion.

Jerry Liu: Thanks for having me.