2023: The best AI papers - A Review 🚀

A recap of the research progress and important news in AI in 2023!

Louis Bouchard

Dec 24, 2023 • 7 min read

What a year it was!

A lot happened in big tech as well as in the AI research community. OpenAI revealed GPT-4 alongside its image generation model DALL·E 3 and GPT-4 with vision, Meta introduced its LLM Llama 2, Google unveiled its own LLMs, starting with Bard and now Gemini, Stability AI released its video generation model Stable Video Diffusion, Elon Musk’s xAI announced its own LLM, Grok, and most recently Channel 1, an AI news outlet, announced an early 2024 launch with almost fully AI-automated, customizable news for each user. Yes, that was all in a single year in the industry. Let’s not forget the AI research community, which also witnessed amazing advancements that we will cover in this video.

On the flip side, this year also served up some drama. There was the entire OpenAI controversy, where the board fired CEO Sam Altman and then rehired him under pressure from employees and the AI community. Geoffrey Hinton, an AI pioneer known as one of the godfathers of AI, quit Google to warn humanity about the threats that AI might pose. Lastly, we witnessed controversy surrounding Google’s demonstration video of the Gemini model, and more…

In 2022, we witnessed the rise of the mighty ChatGPT. But folks, this year, AI isn’t just about words: it’s about sound, images, and a whole new universe of possibilities! Brace yourself, because AI is coming into your lives and is here to stay.

If, for any reason, you couldn’t keep up with all the AI news and updates this year, missed any essential information, or simply want a recap of what happened this year in AI, this video is for you. I also have a complete video explaining each advancement shared in this recap if you are curious to learn more.

Let’s get into it…

Watch the recap video:

In this video, we discuss the most important papers and news of the year. You can find a detailed list below. You can find an even more detailed list here, with links to a detailed video and article for each news item and research paper.

January

VALL-E is capable of imitating a person’s voice with just a 3-second sample. This new technology significantly advances voice synthesis, achieving unprecedented levels of realism and naturalness. This development could revolutionize how we produce and interact with digital media, blending text and voice generation to create entirely synthetic yet realistic human representations.
InstructPix2Pix, a creation by Tim Brooks and his team, edits images using text instructions. This AI model, trained with a dataset generated by GPT and Stable Diffusion, enables precise text-driven image modifications. It combines the understanding of text and image context, allowing fast and accurate edits.
MusicLM creates music from text descriptions using Transformer-based models, similar to the GPT approach. It transforms text descriptions into rich, AI-generated musical pieces. This innovation is further supported by MusicCaps, a dataset released alongside it with 5.5 thousand music-text pairs, which helps improve this model and future AI music generation.

February

GEN-1, the latest innovation from Runway, one of the teams behind Stable Diffusion, stylizes videos based on text or image prompts. It intelligently edits specific elements within a video, like altering a dog’s appearance, while preserving the video’s overall structure. GEN-1’s ability to combine structural understanding with content adaptation marks a significant advancement in AI-driven video editing.

March

PaLM-E blends image and text understanding with robotics, enabling a robot to execute tasks based on textual and visual commands. It integrates vision transformers and language models to interpret and react to its environment, demonstrating AI’s growing capability to understand and interact with the real world.
On March 14, 2023, OpenAI changed everything by releasing GPT-4. GPT-4 is known for being more reliable and creative than its predecessor, GPT-3.5, and can handle more nuanced instructions. It demonstrates improved performance in various areas, including coding assistance and standardized tests. However, GPT-4 retains some limitations from earlier versions, like hallucinations and a lack of logic or real intelligence.
Shortly after, Google officially released Bard, its answer to ChatGPT. Bard is built on Google’s large language model LaMDA and is designed to be a creative collaborator, generating ideas and providing assistance on various topics. It’s pretty much the same as ChatGPT, though it had internet access earlier, along with other features Google is constantly developing. A great alternative to try for sure.

April

Meta’s Segment Anything Model, or SAM, revolutionizes image segmentation with its prompt-based approach. Trained on a vast dataset, SAM can efficiently segment objects in images or videos using text or spatial prompts. This model streamlines segmentation tasks, making it highly effective for various applications, particularly in situations requiring quick adaptation to new objects without retraining. It’s an amazing first attempt at a generalist, or foundation, model for segmentation.
LLaVA is an innovative language vision model that utilizes GPT-4 for dataset generation. This model uniquely understands both visual and language instructions, combining the strengths of LLaMA for language processing and CLIP for image understanding. Through visual instruction tuning, LLaVA learns to answer questions about images without depending on captions.

May

NVIDIA’s Perfusion model offers superior control in image generation. It excels in accurately incorporating specific objects into new images, enhancing the fidelity of generated content. This model is a leap forward in creating personalized and contextually relevant visual content.
“Drag Your GAN” introduces a novel method for image editing, allowing users to realistically manipulate images by dragging points within them. This AI model, employing a GAN architecture, enhances image editing by enabling changes in object positions while maintaining image realism. This innovation simplifies complex editing tasks, making them more accessible and intuitive.
Geoffrey Hinton, a pioneer of deep learning often called the godfather of AI, announced his departure from Google after a decade. He expressed his concerns about the technology he helped create and wished to openly discuss these issues.

June

NVIDIA’s Neuralangelo builds on the Instant NeRF model to create 3D scenes with enhanced surface details and realism. It improves upon Instant NeRF’s limitations in texture and fine structure, making the generated 3D models more lifelike and detailed.
TryOnDiffusion enhances virtual try-on experiences. It uses advanced AI to realistically superimpose clothing items on a person’s image, addressing previous limitations in virtual try-ons. This model represents a significant improvement in creating accurate and lifelike representations of clothing on different body types, offering potential commercial applications in online shopping and fashion.

July

StyleGANEX, an advancement of NVIDIA’s StyleGAN model, enables more flexible face manipulation in images regardless of resolution. This innovation significantly enhances the ability to manipulate and generate faces across a variety of resolutions, streamlining the process and expanding the model’s applicability.
3D-LLM marks a significant advancement in AI by understanding our world in three dimensions and language. It processes 3D point clouds and text, offering a more comprehensive understanding of real-world environments and objects. It represents a leap towards more interactive and realistic applications, bridging the gap between digital and physical worlds.
Meta’s Llama 2, the successor to the initial LLaMA model, has been widely embraced, with over 30 million downloads of LLaMA-based models. It is an open-source alternative to the GPT models that anyone can access, with great capabilities, though not really comparable to GPT-4.

August

MetaGPT innovates by using large language models as agents in a structured workflow, reducing hallucination risks and improving task efficiency. This approach allows complex tasks to be managed with precision, paving the way for more advanced and automated AI systems.
MVDream takes text-to-3D model generation to a new level of realism and complexity. By understanding physical attributes from text input, it creates high-quality 3D models that accurately represent objects in the real world.

September

DALL·E 3, a big advancement over DALL·E 2, excels in transforming complex prompts into detailed images thanks to its improved image captioner. This innovation leads to more accurate and context-rich visualizations. However, the model still faces challenges in spatial awareness and text generation within images. You can try it now with ChatGPT Plus, and it’s worth it.

November

Distil-Whisper, a streamlined version of OpenAI’s Whisper, offers efficient audio transcription: it is 6 times faster and 49% smaller while staying within about 1% of Whisper’s word error rate. It is trained through knowledge distillation from Whisper, which significantly reduces its training data needs. It’s a substantial step in making voice-to-text conversion more accessible and practical for everyday use.
Stable Video Diffusion, a new model from Stability AI, extends Stable Diffusion to video generation. It generates realistic video sequences from text or image inputs using added temporal layers. While adept at short videos, it still faces challenges with longer sequences. This model is a notable step in AI-driven video creation.
Elon Musk announced “Grok,” an AI chatbot developed by his startup, xAI. This chatbot, designed for use with X (formerly known as Twitter), features a sarcastic sense of humor akin to Musk’s own. If you like Elon, you’ll like Grok. Grok is intended to rival other AI chatbots like ChatGPT and is described as having a bit of a rebellious streak, fewer guardrails, and a pro-free-speech stance.
At OpenAI’s first developer conference, major releases included GPT-4 Turbo, which supports a 128K-token context window for handling extensive text, and the Assistants API for building complex AI applications. Additionally, the conference unveiled custom GPTs, tailored versions of the model for individuals and organizations, along with a marketplace for sharing and monetizing them, forcing many startups to pivot or shut down.
The controversy at OpenAI began with the unexpected firing of CEO Sam Altman, followed by the departure of president and co-founder Greg Brockman. This sparked employee outrage, leading to a petition for Altman’s reinstatement. The pressure from employees and stakeholders resulted in Altman’s return and the subsequent replacement of most of the board, illustrating significant governance challenges within the organization but also an amazing team spirit.

December

Google DeepMind released Gemini to compete with GPT-4. Unfortunately, they only released a video for their best model and not the model itself. The video was criticized for misleading viewers about the model’s real-time capabilities. It was revealed that the video was not a live demonstration but was carefully produced using text prompts and still images, leading to questions about Google’s transparency in portraying its AI technologies.
Channel 1, an LA-based news station set to launch in 2024, plans to use AI-generated news anchors. It promises a personalized news experience that adapts to viewer preferences through automatic translation, news summaries, and other AI-based features. The station aims to revolutionize the news industry while committing to transparency and accuracy in AI-generated reporting.
