
TexShape: Information Theoretic Sentence Embedding for Language Models

Abstract

With the exponential growth in data volume and the emergence of data-intensive applications, particularly in machine learning, concerns about resource utilization, privacy, and fairness have become paramount. This paper focuses on textual data and addresses the challenge of encoding sentences into optimized representations through the lens of information theory. In particular, empirical estimates of mutual information are used, leveraging the Donsker-Varadhan representation of the Kullback-Leibler divergence. The approach trains an information-theoretic sentence embedding, called TexShape, for (task-based) data compression or for filtering out sensitive information, enhancing privacy and fairness. The study employs a benchmark language model for initial text representation, complemented by neural networks for information-theoretic compression and mutual information estimation. Experiments demonstrate significant gains in preserving maximal targeted information and minimal sensitive information at aggressive compression ratios, as measured by the predictive accuracy of downstream models trained on the compressed data.

Introduction

The domain of machine learning, particularly transformers and language models, has witnessed explosive growth. Technologies such as ChatGPT, Bard, LLaMA, and others have made significant strides in natural language processing. However, the connection between theoretical aspects of statistics and machine learning remains poorly understood; in particular, we know very little about the links between information-theoretic quantities and concepts within generative ML. Since its inception, information theory has formed the basis on which data processing systems, including communication, signal processing, and compression, are benchmarked. This paper takes an early step in this direction for large language models, with the goal of developing optimized and succinct text representations for ML through information-theoretic tools. The two fields are natural complements: information theory provides insights into the compression and representation of data, while ML models generally benefit from compressed representations that improve performance at lower complexity.

Preliminaries and Problem Statement

The core problem is to design an encoder $T_\Theta : \mathcal{X} \to \mathcal{Z}$ such that, when applied to each sentence $X \in \mathcal{X}$, the encoded sentence satisfies certain information-theoretic properties. The encoder's trainable parameters $\Theta$ are optimized as follows:

$$\max_\Theta \; \gamma\, \mathcal{I}(T_\Theta(\mathbf{X}); \mathbf{X}) + \sum_{i=1}^{N_l} \lambda_i\, \mathcal{I}(T_\Theta(\mathbf{X}); L_i(\mathbf{X})) - \sum_{j=1}^{N_s} \mu_j\, \mathcal{I}(T_\Theta(\mathbf{X}); S_j(\mathbf{X}))$$

where $L_i(\mathbf{X})$ and $S_j(\mathbf{X})$ denote the $i$-th relevant (public) feature and the $j$-th irrelevant (sensitive) feature of $\mathbf{X}$, respectively. The weights $\{\gamma, \lambda_1, \ldots, \lambda_{N_l}, \mu_1, \ldots, \mu_{N_s}\}$ take non-negative values and are tuned to balance the competing goals of the optimization problem, while the cardinality of $\mathcal{Z}$ determines the encoder's compression level. Mutual information (MI) is estimated using the Donsker-Varadhan representation of the KL divergence:

$$\mathcal{I}(\mathbf{X}; \mathbf{Z}) = \sup_{F:\Omega \to \mathbb{R}} \; E_{P_{\mathbf{X},\mathbf{Z}}}[F(\mathbf{X}, \mathbf{Z})] - \log\!\left(E_{P_{\mathbf{X}} P_{\mathbf{Z}}}\!\left[e^{F(\mathbf{X}, \mathbf{Z})}\right]\right)$$

This formulation allows for three distinct operating modes: task-agnostic information-theoretic compression (when $\lambda_i = \mu_j = 0$ for all $i, j$), task-oriented filtering of sensitive information (when $\lambda_i = 0$), and a privacy-utility trade-off (when $\gamma = 0$).
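The Donsker-Varadhan bound above can be evaluated empirically from samples: plug critic values on joint samples and on product-of-marginals samples into the two expectation terms, and the result is a lower bound on $\mathcal{I}(\mathbf{X}; \mathbf{Z})$ for any critic $F$. A minimal numerical sketch (illustrative only, not the paper's code; the 1-D Gaussian setup and the hand-picked critic $F(x, z) = 0.5\,xz$ are assumptions for demonstration):

```python
import numpy as np

def dv_lower_bound(f_joint, f_product):
    """Empirical DV bound: f_joint holds critic values F(x, z) on samples
    from the joint P_{X,Z}; f_product holds critic values on independent
    pairs drawn from the product of marginals P_X * P_Z."""
    return f_joint.mean() - np.log(np.exp(f_product).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
z = x + 0.5 * rng.normal(size=10_000)   # z correlated with x
z_ind = z[rng.permutation(z.size)]      # shuffling breaks the dependence

# Sanity check: any constant critic yields exactly 0, a valid lower
# bound since MI is non-negative.
assert abs(dv_lower_bound(np.full(8, 3.0), np.full(8, 3.0))) < 1e-12

# A hand-picked correlated critic gives a strictly positive estimate,
# below the true MI; maximizing over critics tightens the bound.
estimate = dv_lower_bound(0.5 * x * z, 0.5 * x * z_ind)
```

In TexShape the critic is a trainable neural network and the supremum is approached by gradient ascent, rather than fixed by hand as here.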

Designing TexShape

The TexShape framework includes two main components: (1) the sentence encoder, a non-trainable off-the-shelf sentence embedding (pre-trained MPNet) fed into a trainable deep neural network (DNN) that projects the original 768-dimensional embedding into a lower-dimensional space with the desired information-theoretic properties; and (2) the information-theoretic evaluator, consisting of one MI estimator per term in the optimization objective. MI is estimated empirically via the Donsker-Varadhan formulation, where the supremum is taken over functions modeled by a neural network with trainable weights. Optimization proceeds by SGD: at each step, a batch of samples from a public dataset is used to estimate the expectations and their gradients, and the maximum converged value of the bound serves as the MI estimate. Training alternates between sentence-encoder epochs and MI-estimator iterations; the encoder is initialized randomly and updated once per epoch to improve its information-theoretic properties.
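The alternating schedule can be sketched in miniature. The toy below (a hedged illustration under strong simplifications, not the paper's implementation) runs the task-agnostic compression mode with a scalar "encoder" $z = \theta x + \text{noise}$ and a two-parameter critic $F(x, z) = a\,xz + b$; both are updated by finite-difference gradient ascent on the DV bound so the sketch stays dependency-free, whereas the paper uses DNNs trained by SGD with backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dv_bound(params, x, z, perm):
    """Empirical Donsker-Varadhan lower bound on I(X; Z); `perm` fixes
    the shuffle that emulates samples from the product of marginals."""
    a, b = params
    joint = (a * x * z + b).mean()
    marginal = np.log(np.exp(a * x * z[perm] + b).mean())
    return joint - marginal

def fd_grad(fn, p, eps=1e-4):
    """Central finite-difference gradient of fn at p."""
    g = np.zeros_like(p)
    for i in range(p.size):
        d = np.zeros_like(p)
        d[i] = eps
        g[i] = (fn(p + d) - fn(p - d)) / (2 * eps)
    return g

theta = np.array([0.1])             # "encoder" parameter
critic = np.array([0.0, 0.0])       # critic parameters (a, b)
x = rng.normal(size=512)            # stand-in for sentence embeddings
noise = 0.5 * rng.normal(size=512)
perm = rng.permutation(512)

for epoch in range(30):
    z = theta[0] * x + noise
    # Inner iterations: fit the critic to the current encoder output.
    for _ in range(20):
        critic += 0.1 * fd_grad(lambda p: dv_bound(p, x, z, perm), critic)
    # One encoder update per epoch: increase the estimated I(T(X); X).
    enc_obj = lambda t: dv_bound(critic, x, t[0] * x + noise, perm)
    theta += 0.05 * fd_grad(enc_obj, theta)

mi_estimate = dv_bound(critic, x, theta[0] * x + noise, perm)
```

The inner loop approaches the DV supremum for the current encoder, and the outer step then moves the encoder to raise the estimated MI, mirroring the epoch/iteration separation described above.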

Simulation Results

Experiments are conducted on three public text datasets: Dataset A (Stanford Sentiment Treebank / GLUE SST2) with 67,349 movie reviews and binary sentiment labels; Dataset B (Corona-NLP) with Covid-related tweets annotated with sentiment and location labels; and Dataset C (MultiNLI) with 160,698 premise-hypothesis sentence pairs labeled for logical relationships. The TexShape encoder uses a pre-trained MPNet model producing 768-dimensional vectors, with a trainable DNN (768 inputs, two hidden layers with 512 and 256 nodes, ReLU activation) and an output layer whose cardinality determines compression level. For the privacy-utility trade-off, TexShape trained with $\mu = 0.4$ achieves AUROC of approximately 0.51 for the private label (near random chance), while preserving public label AUROC comparable to the original embedding. For information-theoretic compression on Dataset C, TexShape embeddings at size 128 achieve Label 1 accuracy of 61.0% and Label 2 accuracy of 95.3%, outperforming random embeddings (51.1% and 90.8%) at the same compression level. For fairness, TexShape reduces bias from 0.5832 to 0.1939 on Dataset A and from 0.5552 to 0.0711 on Dataset B, at the cost of only a small decrease in AUROC.
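At the shape level, the trainable projection head described above (768 inputs, 512- and 256-unit ReLU hidden layers, output width setting the compression level) can be sketched as follows; the initialization scale and functional style are assumptions for illustration, not details from the released code:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(out_dim, sizes=(768, 512, 256)):
    """Build a forward pass for the TexShape-style projection head:
    ReLU hidden layers of 512 and 256 units, linear output of out_dim."""
    dims = list(sizes) + [out_dim]
    weights = [rng.normal(0.0, 0.02, (m, n)) for m, n in zip(dims, dims[1:])]
    biases = [np.zeros(n) for n in dims[1:]]

    def forward(x):
        for W, b in zip(weights[:-1], biases[:-1]):
            x = np.maximum(x @ W + b, 0.0)    # ReLU hidden layers
        return x @ weights[-1] + biases[-1]   # linear projection to out_dim
    return forward

encoder = make_encoder(out_dim=128)           # size-128 setting from Dataset C
batch = rng.normal(size=(4, 768))             # four MPNet-sized embeddings
compressed = encoder(batch)                   # shape (4, 128)
```

In the actual system the weights would be trained against the MI objective rather than left at their random initialization.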

Conclusion

This paper integrates fundamental concepts from information theory and machine learning and proposes a semantic approach for data processing in language models. The applications include lossy compression, privacy-preserving data sharing, and training unbiased models. The proposed design objective, based on a weighted linear combination of data processing goals such as resource utilization, task-specific utility, privacy, and fairness, stands as a foundation for further exploration. Future directions include expanding the linear combination to more general forms, optimizing the weights as hyperparameters informed by theory, and bridging design choices with desired performance metrics.
