Selecting Embedding Models
The Role of Embeddings in Generative AI
Our paper, Securing Vector Databases, introduces the concepts of embeddings in machine learning and AI implementations. Vector embeddings are mathematical representations of objects—typically words or other data points—in a continuous, multi-dimensional space. These embeddings are designed so that the position of each vector (a point in this space) reflects the semantic or contextual relationships between the objects they represent.
The selection of an appropriate embedding technique and embedding model should be informed by several factors, including available computational resources, dataset scale, security, privacy requirements, and the desired level of accuracy. Each method presents its own set of trade-offs, and the optimal choice will depend on the specific requirements of the task at hand.
Types of Embeddings Models
Embedding models generate embeddings, which are vector representations that capture the semantic meaning of data. These models transform input data (like text or images) into a format that machine learning algorithms can effectively process. There are many types of embedding models, ranging from traditional implementations to more advanced neural network models and transformer models.
Term Frequency-Inverse Document Frequency (TF-IDF) and Principal Component Analysis (PCA) are examples of traditional embedding methods. TF-IDF assesses the words within a document corpus and determines their relative importance for text mining or information retrieval tasks. PCA is a dimensionality reduction technique that preserves maximal variance while minimizing the feature set, thereby facilitating data visualization and noise reduction.
Figure 1 is a heatmap, created with the Python seaborn library, that visualizes TF-IDF scores. It conveys the following information:
- Terms (x-axis): These words were extracted from the documents and identified as important based on their TF-IDF scores.
- Documents (y-axis): Each row corresponds to a document from the sample data of Omar’s Document Collection, which includes information about different vulnerabilities.
- Color Intensity: Darker colors indicate higher TF-IDF scores, meaning the term is more important in that document. (SQL injection was the main topic in those documents.)
Figure 1: Heatmap of TF-IDF Scores
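Figure 1 appears as an image in the original; the following is a minimal sketch of how a similar heatmap could be produced with scikit-learn and seaborn. The sample documents are hypothetical stand-ins, not the actual contents of the document collection described above.

```python
# Minimal sketch: TF-IDF scores rendered as a heatmap (sample docs are hypothetical).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "SQL injection lets attackers execute arbitrary SQL statements.",
    "A SQL injection vulnerability was found in the login form.",
    "Cross-site scripting injects malicious scripts into trusted pages.",
]

vectorizer = TfidfVectorizer(stop_words="english")
scores = vectorizer.fit_transform(docs)

# Rows = documents, columns = terms; darker cells = higher TF-IDF weight.
df = pd.DataFrame(
    scores.toarray(),
    index=[f"Doc {i + 1}" for i in range(len(docs))],
    columns=vectorizer.get_feature_names_out(),
)
sns.heatmap(df, cmap="Blues", annot=True, fmt=".2f")
plt.title("TF-IDF Scores per Document")
plt.tight_layout()
plt.show()
```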
Neural network embedding models, such as Word2Vec and FastText, are more sophisticated than traditional methods. These models represent words or phrases in a fixed-size vector space in which semantically similar words are positioned closer together. For example, Word2Vec uses a shallow neural network to map words into a continuous vector space where semantic similarity is reflected in vector proximity. FastText, developed by Facebook's AI Research team, builds upon Word2Vec by incorporating subword information. The FastText GitHub repository has since been archived because newer, more powerful models now exist.
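As an illustration, here is a minimal sketch of training Word2Vec on a toy corpus with the gensim library; the corpus and hyperparameters are illustrative, and real applications rely on much larger corpora or pretrained vectors.

```python
# Minimal sketch: training Word2Vec on a toy corpus with gensim.
from gensim.models import Word2Vec

sentences = [
    ["sql", "injection", "vulnerability", "database"],
    ["cross", "site", "scripting", "vulnerability", "browser"],
    ["buffer", "overflow", "vulnerability", "memory"],
]

# vector_size controls embedding dimensionality; window is the context size.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["vulnerability"][:5])            # first 5 dimensions of the vector
print(model.wv.similarity("sql", "injection"))  # cosine similarity of two words
```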
Advanced transformer models generate context-dependent embeddings. These models use attention mechanisms that weigh the relevance of different words in a sentence based on their context, producing context-aware embeddings. Models such as those from Cohere and OpenAI, along with others featured in the Massive Text Embedding Benchmark (MTEB), produce more nuanced and flexible representations that are particularly suited for complex language understanding tasks.
Commercial Tools
Several commercial embedding tools are available. In addition to ChatGPT, OpenAI provides embedding models for different applications. Cohere is a company that specializes in creating embedding models and other models for AI applications.
Cohere's Embed service supports embedding models that can generate vector representations of text or classify text according to various criteria. When coupled with classification tools such as their Classify endpoint, these embeddings become even more powerful. They can be applied to a wide range of classification and analytical tasks.
Figure 2: An Example of Embeddings Created Using Cohere’s Playground
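Beyond the Playground shown in Figure 2, embeddings can be generated programmatically. The following is a hedged sketch using Cohere's Python SDK; the model name and parameters are assumptions based on Cohere's public documentation, so consult their current docs for exact usage.

```python
# Hedged sketch of calling Cohere's embed endpoint; model name and
# parameters are assumptions based on Cohere's public documentation.
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder API key

response = co.embed(
    texts=["SQL injection is a common web vulnerability."],
    model="embed-english-v3.0",
    input_type="search_document",
)
print(len(response.embeddings[0]))  # dimensionality of the returned vector
```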
Hugging Face Embedding Models
Hugging Face offers a wide range of embedding models covering text, image, audio, and multimodal data. These models can be fine-tuned on custom data to generate task-specific embeddings. However, some features require logging in, and the hosted platform is less flexible than fully self-managed open-source options.
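For example, a minimal sketch using the sentence-transformers library (the model choice is illustrative):

```python
# Minimal sketch: generating text embeddings with a Hugging Face model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small general-purpose model
embeddings = model.encode([
    "SQL injection is a web vulnerability.",
    "Cross-site scripting targets the browser.",
])
print(embeddings.shape)  # (2, 384): two fixed-length, 384-dimensional vectors
```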
How Vector Embeddings Work
In vector embeddings, each object (e.g., a word, phrase, sentence, or image) is converted into a vector of numbers. This vector is a fixed length and typically has dozens or hundreds of dimensions, depending on the complexity of the model.
Figure 3: Embeddings Projector from the TensorFlow Project
An example of a graphic representation of high-dimensional embeddings
The key idea behind embeddings is that objects with similar meanings or functions will have similar vector representations. For example, in the context of word embeddings, the vectors for "CVE" and "vulnerability" would be closer to each other in the embedding space than "knight" and "server-side."
Vector embeddings are usually learned through machine learning models. For example, word embeddings may be trained on large text corpora to predict a word from its context (e.g., its surrounding words). Through this process, the model learns to position words with similar meanings close to each other in the vector space.
External Use vs. Internal Use
Large language models (LLMs) and small language models (SLMs) internally convert their inputs into vector embeddings to process and generate responses. These embeddings are typically confined to the model's internal computations and are not exposed for external use. The key distinction with dedicated embedding models lies in their purpose and utility: embedding models are specifically designed to produce embeddings that are useful outside the internal operations of a single LLM.
Embedding models generate vector representations that capture the semantic essence of the input data, making them highly valuable for a variety of external applications. For instance, these embeddings can be used in semantic search, recommendation systems, clustering, classification, and cross-modal retrieval.
In contrast, the embeddings within an LLM are transient and optimized for the model's immediate task of understanding and generating language, without the necessity for external applicability. Embedding models, therefore, serve as a bridge between raw data and various downstream applications, enabling interoperability and facilitating tasks that require a deep understanding of semantic relationships.
Understanding Cosine vs. Euclidean Distance in Embeddings
Figure 4 demonstrates how semantically similar inputs result in embeddings that are closer in vector space. As the sentences become more similar to the first, the cosine similarities of their embeddings gradually increase. Although it has an entirely different structure than the first sentence, the last sentence has the most similar semantics. Therefore, its embedding is closest to the first sentence in vector space.
Figure 4: Code Example of Semantic Similarity and Proximity
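Figure 4 appears as an image in the source; the following is a hedged sketch of comparable code using the sentence-transformers library. The sentences and model are illustrative, and exact scores depend on the model used.

```python
# Hedged sketch of the comparison Figure 4 illustrates: semantically closer
# sentences yield higher cosine similarity, regardless of surface structure.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
reference = "The firewall blocked the suspicious traffic."
candidates = [
    "The weather is nice today.",
    "A firewall inspects network packets.",
    "Suspicious network traffic was stopped by the firewall.",
]

ref_emb = model.encode(reference)
for sentence in candidates:
    score = util.cos_sim(ref_emb, model.encode(sentence)).item()
    print(f"{score:.3f}  {sentence}")
```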
Two popular methods for measuring the similarity or distance between embeddings are cosine similarity (or distance) and Euclidean distance. Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space. It is calculated using the dot product of the two vectors divided by the product of their magnitudes:
cosine_similarity(A, B) = (A · B) / (||A|| * ||B||)
Where:
A · B is the dot product of vectors A and B
||A|| and ||B|| are the magnitudes (lengths) of vectors A and B
Cosine distance is simply 1 minus the cosine similarity:
cosine_distance(A, B) = 1 - cosine_similarity(A, B)
Cosine similarity ranges from -1 to 1, where 1 indicates identical direction, 0 indicates orthogonality, and -1 indicates opposite directions. Cosine distance ranges from 0 to 2.
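The formulas above translate directly into a few lines of NumPy; a minimal sketch:

```python
# Minimal sketch: cosine similarity and cosine distance with NumPy.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - cosine_similarity(a, b)

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

print(cosine_similarity(a, b))  # ~1.0: identical direction; magnitude is ignored
print(cosine_distance(a, b))    # ~0.0
```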
Direction-sensitive measures evaluate the orientation of vectors by focusing on the angles between them, disregarding their magnitudes. For example, let’s say you have two arrows on a graph, each pointing in a certain direction with a certain length. Direction-sensitive measures pay attention only to which way the arrows are pointing, not how long they are, and focus on the angle between the arrows to see how their directions compare. These measures ignore length, even if one arrow is much longer than the other.
If both arrows are pointing straight up, they have the same direction, even if one is longer. If one arrow points up and the other points to the right, the angle between them shows that their directions are different. So, when we use direction-sensitive measures, we're interested in the orientation of the arrows (which way they point) and not their magnitude (how long they are).
Figure 5: 3D Visualization of the Cosine Similarity of Vector Embeddings with PCA
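As a hedged sketch of the preprocessing behind a visualization like Figure 5, high-dimensional embeddings can be reduced to three dimensions with PCA before plotting; the embeddings below are random placeholders rather than real model output.

```python
# Hedged sketch: reducing high-dimensional embeddings to 3D with PCA
# for visualization; the embeddings below are random placeholders.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(seed=42)
embeddings = rng.normal(size=(20, 384))  # 20 stand-in 384-dimensional embeddings

coords = PCA(n_components=3).fit_transform(embeddings)
print(coords.shape)  # (20, 3): ready for a 3D scatter plot
```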
Euclidean distance is the ordinary straight-line distance between two points in Euclidean space. It is calculated using the Pythagorean formula:
Euclidean_distance(A, B) = sqrt(sum((A_i - B_i)^2))
Where A_i and B_i are the i-th components of vectors A and B
Euclidean distance is always non-negative and can range from 0 to infinity. It is a magnitude-sensitive measure that considers both the direction and the magnitude of the vectors.
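A matching NumPy sketch; note that vectors pointing the same way can still be far apart by this measure:

```python
# Minimal sketch: Euclidean distance with NumPy.
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.sqrt(np.sum((a - b) ** 2)))  # same as np.linalg.norm(a - b)

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

print(euclidean_distance(a, b))  # ~3.74: nonzero because magnitudes differ,
                                 # even though cosine distance is ~0
```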
When to Use Cosine vs. Euclidean Distance
Use cosine similarity when you are more interested in the orientation (direction) of the vectors than in their magnitudes. You can also use it when working with text documents of varying lengths because cosine similarity normalizes for document length.
Use Euclidean distance when the magnitude of the vectors is important for your analysis and when you are working in lower-dimensional spaces where Euclidean distance is more intuitive.
There's no one-size-fits-all solution, and sometimes it is worth experimenting with both measures to see which one performs better for your particular problem.
Selecting and Securing Embedding Models
Selecting the right embedding model depends on several factors, including the use case, computational requirements, security, and privacy. Embedding models are key to effective retrieval-augmented generation (RAG). They enable the system to understand and match the semantics of queries with relevant documents, enhancing response generation in tasks like question answering and conversational AI systems.
Embedding models significantly improve the accuracy and relevance of search results by capturing the essence of words or phrases in a continuous vector space. But there can be challenges. Word embeddings often suffer from data sparsity, where infrequent words or phrases are poorly represented. Over time, the meanings of words can change, a phenomenon known as semantic drift. This can cause embeddings to become outdated, reducing their effectiveness in capturing current semantic relationships.
The following are some considerations when selecting embedding models:
- Choose an embedding model that is well-suited for your specific task. For instance, if you're working on sentiment analysis, models that capture contextual nuance, such as OpenAI's or Cohere's embedding models, might be more appropriate than traditional word embeddings like Word2Vec.
- Select the appropriate model for domain-specific tasks. For specialized fields like networking and cybersecurity, or for medical and legal texts, domain-specific embeddings that are trained on relevant data can provide better performance than general-purpose models.
- If your application deals with niche vocabulary or requires specific nuances, training your own embeddings might be necessary.
- If your application needs to handle multiple languages, consider models that support multilingual embeddings.
- Higher-dimensional embeddings can capture more information but at the cost of increased computational resources. Selecting the optimal dimensionality involves balancing performance with efficiency.
- High-dimensional embeddings can also lead to increased computational complexity, making them less practical for real-time applications.
- Consider the licensing terms for each model. Open-source models are generally free to use, while proprietary models may offer better performance, but at a cost.
Sensitive Data and External Services
Embedding models often require sending data to external servers for processing, raising privacy concerns. When you send data to an external service for embedding, that data leaves your controlled environment. Some model providers may use your data to further train or improve their own models. This means your data could indirectly contribute to the model's knowledge base. If your data contains confidential or personally identifiable information (PII), sending it to an external service could violate privacy regulations or company policies.
Further, a security breach of the service provider could result in the compromise of your sensitive data. Data is valuable, so model service providers are inclined to retain it for as long as feasible. A compromise of the service provider years in the future could still expose the sensitive data you once sent for embedding: a painful reminder that data shared with an external embedding service remains at risk long after it leaves your hands.
Note: It's important to understand the difference between embedding models and AI models that are used for inference and the security concerns for each. If an embedding model is trained on sensitive data (like source code or confidential company documents), it can inadvertently memorize and expose that data through the embedding process. This can lead to leakage when those embeddings are used or shared, potentially allowing unauthorized access to sensitive information. AI models used for inference can also introduce risks if they use embeddings that contain sensitive information. If an adversary can exploit these models, they might extract sensitive data through techniques such as model inversion attacks.
Embedding Inversion Attacks
Embeddings are essentially a machine representation of the original data, meaning they can be as sensitive as the data used to create them. This equivalence makes them a prime target for data theft and privacy attacks. The purpose of encryption is to make it difficult to obtain information about an input; the purpose of embeddings is exactly the opposite—to provide as much information about the input as possible!
Embeddings can be vulnerable to inversion attacks, where attackers can reconstruct the original data from the embeddings. (See Understanding Privacy Risks of Embeddings Induced by Large Language Models for an example.) This introduces a significant risk, especially for sensitive data like source code, intellectual property, or personal information.
In Text Embeddings Reveal (Almost) As Much As Text, Morris et al. explain how they trained a model to recover up to 92% of the original text from embeddings. Notably, they were able to recover PII from embedded clinical notes.
Figure 6 demonstrates the usage of ielabgroup’s Hugging Face instance of Morris’s Vec2Text model. The tokens were encoded with the sentence-transformers/gtr-t5-base model. The encoding was reversed with a model specifically trained to reverse encodings from gtr-t5-base back to their original tokens.
In this example code, we embedded some information about Justin, including his name, a password, the company he works for, and his favorite color. The output is garbled because some structural information about word order was lost when the sentence was encoded to an embedding and decoded back. However, the final output retains enough of the original information that an attacker could quickly learn Justin's name, guess his password, identify his employer, and see his favorite color from the reversed embedding. This example shows that embeddings can reveal enough information to be useful to a bad actor and harmful to your organization.
Figure 6: Using a Vec2Text Model to Reverse Encodings
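Figure 6 is shown as an image; the sketch below is loosely based on the public vec2text README and is an assumption about that library's API rather than the exact code used in the figure. The input sentence is hypothetical.

```python
# Hedged sketch of embedding inversion with vec2text, adapted from the
# library's README; APIs and model names may have changed.
import torch
import vec2text
from transformers import AutoModel, AutoTokenizer

encoder = AutoModel.from_pretrained("sentence-transformers/gtr-t5-base").encoder
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/gtr-t5-base")
corrector = vec2text.load_pretrained_corrector("gtr-base")

text = ["Justin's favorite color is blue."]  # hypothetical sensitive input
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state
    # Mean-pool token states into one sentence embedding (GTR-style pooling).
    mask = inputs["attention_mask"].unsqueeze(-1)
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Attempt to reconstruct the original text from the embedding alone.
recovered = vec2text.invert_embeddings(
    embeddings=embeddings, corrector=corrector, num_steps=20
)
print(recovered)
```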
Because they are designed to encode as much information as possible from their original inputs, you must treat the sensitivity of embeddings as equivalent to the data from which they are derived.