Introduction to NotebookLM’s AI Model
NotebookLM is an AI-powered tool developed by Google to enhance productivity and research by integrating generative AI capabilities with users’ personal notes and documents. It is built on the capabilities of Gemini 1.5 and is designed to act as an intelligent assistant for information retrieval, summarization, and contextual question-answering based on user-provided content.
We can classify AI models along several dimensions, such as:
- Their purpose (generative AI, predictive AI, descriptive AI)
- Their architecture (Transformer-based, Convolutional Neural Network, Recurrent Neural Network)
- The data types they use (image, text, audio, or a combination of these)
- Their training paradigm (supervised, unsupervised, reinforcement learning)
- Their parameter scale (how many parameters the model has)
- Their focus (specialized in a topic or general-purpose).
NotebookLM is a specialized generative transformer AI model.
Generative AI Models
Generative AI models are algorithms that use input data, such as images, text, or audio, to create new content. Depending on the purpose of the AI, different algorithms can be used. For text understanding and text generation, Transformer-based models are used in generative AI.
Training Paradigms
Generative AI algorithms are trained with either unsupervised learning or semi-supervised learning. The difference between supervised and unsupervised learning is that supervised learning uses a fully labeled dataset (people label the correct answer for each data point before the AI is trained), whereas unsupervised learning uses no labels. With supervised learning, the AI algorithm can check its accuracy by comparing its own outputs with the labels. With unsupervised learning, the AI algorithm learns the differences among data points by itself. The semi-supervised training paradigm uses a small amount of labeled data and a larger amount of unlabeled data, so it is a mix of supervised and unsupervised learning. [1]
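To make the distinction concrete, below is a minimal semi-supervised sketch using scikit-learn’s SelfTrainingClassifier. It is a generic illustration of the paradigm, not NotebookLM’s actual training setup.

```python
# Semi-supervised learning: a few labeled points, many unlabeled ones.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Pretend only ~5% of the data is labeled; mark the rest as unlabeled (-1).
rng = np.random.default_rng(0)
y_semi = y.copy()
y_semi[rng.random(len(y)) > 0.05] = -1

# The model first fits on the labeled 5%, then iteratively pseudo-labels
# the unlabeled points it is confident about and refits on them.
model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y_semi)
print(f"Accuracy against the true labels: {model.score(X, y):.2f}")
```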
NotebookLM follows a semi-supervised learning paradigm, like most LLMs (Large Language Models). It is initially pre-trained on an unlabeled dataset, then fine-tuned on a smaller labeled dataset for specific use cases, like answering questions, summarizing text, etc.
Discriminative vs. Generative Modeling
There are two modeling types to compare here: “Discriminative modeling is used to classify existing data points (e.g., images of cats and guinea pigs into respective categories). It mostly belongs to supervised machine learning tasks. Generative modeling tries to understand the dataset structure and generate similar examples (e.g., creating a realistic image of a guinea pig or a cat). It mostly belongs to unsupervised and semi-supervised machine learning tasks.” [2].
AI algorithms that use discriminative modeling are trained with the supervised training paradigm. So, if we want to build an AI to distinguish inputs from one another (e.g., photos of animals), the AI will learn how to discriminate them by figuring out their differences (e.g., telling animals apart by their physical features). It checks its accuracy during training, since the dataset is labeled with the correct answers.
AI algorithms using generative modeling (the type NotebookLM uses) are not trained with the supervised training paradigm, because they aim to generate an original output that fits the context of the input (so there is no single correct answer that could be labeled). During training, the model learns the structure of the data (e.g., text): grammar, syntax, word relationships, and how different concepts are connected. The AI builds a probability distribution based on this structure. Since the algorithm is pre-trained on huge amounts of data, it can match the input against what it has learned. Once it finds the context of the input in its learned representation, it samples from the distribution it has learned. This means the output can vary, but it will always follow the patterns and structure seen in the training data.

Figure 1: “A simple illustration of how one can use discriminative vs generative models. The former learns to distinguish between two classes, i.e., pictures of cats or dogs. The latter estimates the underlying distribution of a dataset (pictures of cats) and randomly generates realistic, yet synthetic, samples according to their estimated distribution.” [3]
For example, if your prompt is “What is the capital of Türkiye?”, the model has learned, from the structure of the language data it was trained on, that the answer “Ankara” fits the learned distribution of question-answer pairs. The model generates new text based on the structure it has learned from many other instances of questions about cities.
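As a toy illustration of “sampling from a learned distribution”, the sketch below turns hypothetical scores for candidate answers into probabilities and samples one. The scores are invented for illustration and are not taken from any real model.

```python
# Toy sampling from a softmax distribution over candidate next tokens.
import numpy as np

vocab = ["Ankara", "Istanbul", "Izmir"]
# Hypothetical logits the model might assign to each candidate given
# the context "What is the capital of Türkiye?" (invented numbers).
logits = np.array([5.0, 2.0, 0.5])

def sample(logits, temperature=1.0, rng=np.random.default_rng(0)):
    probs = np.exp(logits / temperature)
    probs /= probs.sum()                 # softmax -> probabilities
    return rng.choice(len(logits), p=probs)

print(vocab[sample(logits)])  # almost always "Ankara", but it can vary
```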
Transformer-based Models
NotebookLM uses a Transformer-based model so that it can generate content depending on the inputs (sources and user prompts). Here is how the Transformer-based architecture works:

Figure 2: Transformer-Based Architecture [2]
- Tokenization divides the prompt into pieces such as words, subwords, and special characters.
- Each token is converted into a multi-dimensional vector called an embedding. Each vector represents the semantics (linguistic features including meaning, synonyms, connotations, …) of a word. Words with similar meanings are mapped to nearby vectors.
- Positional encoding captures the order of the words in the prompt. This is represented as a vector too. The vectors from the embeddings and the positional encoding are combined into a single vector.
- The self-attention mechanism computes the contextual relationship between tokens; the computation resembles measuring the angles and distances between vectors. It also computes how much each token should attend to the others. For example, in the sentences “I give the pizza slices from John to Daniel, because he is hungry.” and “I give the pizza slices from John to Daniel, because he is full.”, the word “he” refers to Daniel in the first sentence and to John in the second. (A minimal attention sketch follows this list.)
- A feedforward network works independently on each token and allows the model to capture more complex patterns.
- The self-attention mechanism and feedforward network are applied again and again through stacked layers until the final outputs are generated.
- The softmax function turns the scores for the possible outputs into probabilities, and the most probable one is chosen [2].
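The sketch below implements a single self-attention head in NumPy, following the steps above. The random matrices stand in for learned weights and are purely illustrative.

```python
# Single-head self-attention over a toy sequence of token vectors.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d) token embeddings combined with positional encodings."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # token-to-token similarity
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax per row
    return w @ V   # each token becomes a context-aware mixture of the others

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))                  # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 8)
```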
There are two types of Transformer-based architecture: Decoder-only Transformers and Encoder-Decoder Transformers. Decoder-only Transformers are used for pure text generation; their aim is to learn patterns in text data and generate coherent outputs based on learned probabilities. Encoder-Decoder Transformers, in contrast, are used for contextual generation; their aim is to use context from specific sources (e.g., user-uploaded documents). They actively integrate knowledge from these sources, unlike typical generative models that rely solely on pre-trained knowledge. Since NotebookLM creates contextual outputs, we can say it follows the Encoder-Decoder approach (NotebookLM was initially built on the PaLM 2 model and was later upgraded to Gemini 1.5).
Both the Encoder and the Decoder have a layered structure. Encoders create a representation of the meaning of each token, and Decoders generate the output token by token based on the representations gathered from the Encoders. Each encoder layer combines self-attention with a feedforward network, and each decoder layer additionally attends to the encoder’s outputs (cross-attention) while generating.

Figure 3: Encoder-Decoder Transformer Model [4]
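Cross-attention is the piece that lets the decoder consult the encoder’s representation of the sources. The sketch below reuses the attention computation from the previous example, with random matrices again standing in for learned weights.

```python
# Cross-attention: decoder tokens query the encoder's output.
import numpy as np

def attention(queries_from, keys_values_from, Wq, Wk, Wv):
    Q = queries_from @ Wq
    K, V = keys_values_from @ Wk, keys_values_from @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
d = 8
encoder_out = rng.normal(size=(10, d))    # encoded source-document tokens
decoder_state = rng.normal(size=(3, d))   # output tokens generated so far
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
# Each of the 3 decoder tokens gathers context from the 10 source tokens.
print(attention(decoder_state, encoder_out, Wq, Wk, Wv).shape)  # (3, 8)
```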
How Is NotebookLM Different?
NotebookLM is an LLM, but unlike ChatGPT, it has a specialized focus based on the sources that users upload. Users can upload sources to get summaries, FAQs, timelines, and briefing notes about them. These sources can be Google Docs and Slides, PDF files, text files, directly pasted text, a web URL, or a YouTube video (NotebookLM doesn’t understand voice inputs; it understands YouTube videos from their transcription, if one exists). It can understand the images inside those documents. A notebook (session) can have at most 50 sources, and each source has to be less than 500,000 words. It can also create a podcast-like audio in which two people talk about the sources users provide.

Figure 4: Sources can be seen on the left section of the Notebook
NotebookLM temporarily stores the information it gathers from the sources and indexes it for efficient retrieval. If multiple sources are uploaded, the model links them contextually by identifying relationships. When the user enters a prompt, that prompt is processed with tokenization. Embeddings are created by combining the semantics of each token with the positional encodings, so that each token can be represented by a vector. The resulting vectors are then compared, by measuring how close they are (e.g., cosine similarity works well for text comparison [5]), against the previously created vectors of the tokens from the uploaded documents. Related parts of the documents are extracted based on this similarity. The prompt embeddings and the embeddings from the extracted parts of the documents are sent to the decoder to generate contextual output.
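Here is a minimal sketch of that comparison step, assuming toy three-dimensional embeddings (NotebookLM’s real embedding model and chunking scheme are not public):

```python
# Rank source chunks by cosine similarity to the prompt embedding.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical pre-computed embeddings for chunks of the uploaded sources.
chunks = {
    "chunk_1": np.array([0.9, 0.1, 0.0]),
    "chunk_2": np.array([0.1, 0.8, 0.3]),
    "chunk_3": np.array([0.2, 0.2, 0.9]),
}
prompt_vec = np.array([0.85, 0.15, 0.05])  # embedding of the user prompt

# The best-matching chunks are sent to the decoder alongside the prompt.
ranked = sorted(chunks, key=lambda c: cosine(prompt_vec, chunks[c]), reverse=True)
print(ranked)  # 'chunk_1' is the closest match
```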

Figure 5: Users can create FAQ, Study Guide, Table of Contents, Timeline, Briefing Docs, Summary, Summary Audio and more prompts with Suggested Questions
Summaries are created by filtering out details and identifying the key points in the sources. Key points can be identified by looking at the frequency of key terms, words related to the headings, and sentences that summarize the content. The underlying model is fine-tuned for specific tasks like summarizing, creating timelines, etc., so it knows how to apply the techniques explained above.
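As a rough illustration of the frequency heuristic (an assumption on our part; NotebookLM’s actual summarization pipeline is not public), the sketch below scores sentences by how frequent their words are and keeps the top one:

```python
# Frequency-based key-sentence extraction over a toy text.
import re
from collections import Counter

text = ("Transformers power modern language models. "
        "They rely on self-attention. "
        "The weather was nice yesterday. "
        "Self-attention lets models relate words across a sentence.")

sentences = re.split(r"(?<=\.)\s+", text)
# Count content words (crudely: longer than 3 characters).
freq = Counter(w for w in re.findall(r"\w+", text.lower()) if len(w) > 3)

def score(sentence):
    tokens = re.findall(r"\w+", sentence.lower())
    return sum(freq[t] for t in tokens) / len(tokens)

# The highest-scoring sentence acts as a one-line extractive summary.
print(max(sentences, key=score))
```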

Figure 6: You can ask questions related to sources

Figure 7: NotebookLM also provides and displays the references it draws on
Timelines are created by first detecting the time-related content in the sources. Events are then identified and cross-referenced with dates so that they can be matched. NotebookLM can also logically connect relative phrases like “shortly after”, “two months later”, and so on. For example, it can connect these two sentences: “Treaty of Lausanne was signed on 24th of July in 1923 and Türkiye gained its independence officially.” and “Three months later, the republic was declared.”. When NotebookLM sorts the events by date, it will place the second sentence after the first one because of the phrase “Three months later”.

Figure 8: A section from a Timeline
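Here is a toy sketch of this date detection and relative-phrase resolution, assuming a simple regex approach (the real system presumably uses learned models rather than hand-written rules):

```python
# Detect absolute dates, resolve relative phrases, then sort the events.
import re
from datetime import date, timedelta

sentences = [
    "Treaty of Lausanne was signed on 24 July 1923.",
    "Three months later, the republic was declared.",
]

MONTHS = {m: i + 1 for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"])}
WORD_NUMS = {"one": 1, "two": 2, "three": 3, "four": 4}

events, last_date = [], None
for s in sentences:
    absolute = re.search(r"(\d{1,2}) (\w+) (\d{4})", s)
    relative = re.search(r"(\w+) months? later", s, re.IGNORECASE)
    if absolute:
        day, month, year = absolute.groups()
        last_date = date(int(year), MONTHS[month], int(day))
    elif relative and last_date:
        months = WORD_NUMS.get(relative.group(1).lower(), 1)
        last_date += timedelta(days=30 * months)  # rough month offset
    events.append((last_date, s))

for d, s in sorted(events):
    print(d, "-", s)
```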
NotebookLM can produce podcast-like audio that summarizes the sources. To understand how NotebookLM does this, we need to understand how TTS (text-to-speech) technologies work. As usual for generative AIs, prompts and sources are parsed with tokenization and converted into a readable format (e.g., expanding abbreviations, writing numbers as words). The text is split into the smallest units of sound in the language. These units are matched with sounds pre-recorded by people. “The output from text analysis is passed into linguistic analysis for refining pronunciations, calculating the duration of words, deciphering the prosodic structure of utterance, and understanding grammatical information.” [6]. So, TTS configures the intonation, rhythm, stress, pauses, pitch, duration, and volume of each sound unit to make the speech more human-like. Then, deep learning and neural network models are used to adjust the naturalness of the speech. Finally, the speech is converted to a waveform so that it can be played as audio. [7]
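Here is a toy sketch of the text-normalization step described above, with tiny hand-made lookup tables (real TTS front ends use far richer rules and dictionaries):

```python
# Expand abbreviations and spell out digits before speech synthesis.
import re

ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def number_to_words(number_text):
    # Naive digit-by-digit reading, e.g. "42" -> "four two".
    return " ".join(DIGITS[int(d)] for d in number_text)

def normalize(text):
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    return re.sub(r"\d+", lambda m: number_to_words(m.group()), text)

print(normalize("Dr. Smith lives at 42 Main St."))
# -> Doctor Smith lives at four two Main Street.
```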
Click to listen to the sample audio below:
There are some limitations to NotebookLM’s podcast feature. So far, it only supports two people having a conversation about the sources. Users cannot control the length (max. 30 minutes), the voice options (it has to be a male and a female having a conversation), or audio editing settings. The podcast feature supports neither languages other than English nor background music [8]. Moreover, users cannot decide specifically what will be spoken in the audio; it will always summarize the sources.
Alternatives to NotebookLM
Alternative tools, especially open-source ones, can provide more flexibility and features. There are many alternatives out there, but we will look at a few of the ones we encountered.
Open Notebook
Open Notebook is an open-source tool that users have to set up on their local machines. It supports multiple sessions within a single notebook, allowing for multitasking across different workflows. The tool also includes podcast creation capabilities. Open Notebook is compatible with various AI models, including OpenAI, Anthropic, Vertex AI, OpenRouter, and Ollama, enabling users to select the AI model that fits their needs. You can visit its GitHub page from here.
Surf Sense
Surf Sense is an open-source tool that users need to set up on their local machines. It can create summaries for each source, similar to NotebookLM. The tool includes an extension that allows users to take snapshots of websites they are viewing and add those websites’ content to a selected notebook. Users can also search for specific content across their notebooks. Additionally, Surf Sense can create podcasts. To see its demo usage, click here.
Podcastfy
The tool focuses on creating podcasts. Users can configure podcast settings to personalize the output, including the conversation style (engaging, fast-paced, enthusiastic, etc.), the roles of participants (main summarizer, questioner/clarifier), the dialogue structure (the order of parts), creativity/temperature, the minimum characters per round, and the maximum number of rounds. It supports over 156 LLM models from Hugging Face and allows images to be used as sources. For more information, click here.
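To illustrate what such a configuration might look like, here is a hypothetical settings dictionary mirroring the options listed above; the key names are invented for illustration and may not match Podcastfy’s actual configuration schema.

```python
# Hypothetical podcast-generation settings (illustrative key names only).
conversation_config = {
    "conversation_style": ["engaging", "fast-paced", "enthusiastic"],
    "roles": {
        "person1": "main summarizer",
        "person2": "questioner/clarifier",
    },
    "dialogue_structure": ["introduction", "discussion", "conclusion"],
    "creativity": 0.7,            # temperature-like knob
    "min_chars_per_round": 200,
    "max_num_rounds": 10,
}
```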
References
[1]: https://blogs.nvidia.com/blog/supervised-unsupervised-learning/
[2]: https://www.altexsoft.com/blog/generative-ai/
[4]: https://kikaben.com/transformers-encoder-decoder/
[5]: https://www.sciencedirect.com/topics/computer-science/cosine-similarity
[6]: https://www.nvidia.com/en-us/glossary/text-to-speech/
[8]: https://notebooklm.in/advantages-and-limitations-of-notebooklm-podcast/
