The Future of Text Compression: FineZip Pushing the Limits of Large Language Models


Although the connection between language modeling and data compression has been recognized for some time, current Large Language Models (LLMs) see little practical use in text compression because their processing times are prohibitively long. For example, LLMZip, a recent compression system based on the LLaMA3-8B model, requires 9.5 days to compress just 10 MB of data.
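To make that connection concrete, the sketch below shows one way an LLM can act as a lossless compressor: each token is replaced by its rank under the model’s next-token prediction, and the resulting stream of mostly small integers is handed to a standard entropy coder. The model choice (GPT-2) and the rank-coding scheme are illustrative stand-ins here, not the exact LLMZip or FineZip pipeline.
```python
# A minimal sketch of LLM-based lossless compression via token-rank coding.
# Model and scale are illustrative (GPT-2 here; LLaMA-class models in practice).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def next_token_order(prefix_ids: list[int]) -> torch.Tensor:
    """Vocabulary indices sorted from most to least likely after the prefix."""
    with torch.no_grad():
        logits = model(torch.tensor([prefix_ids])).logits[0, -1]
    return torch.argsort(logits, descending=True)

def encode(text: str) -> list[int]:
    """Replace each token by its rank under the model's prediction.
    Well-predicted tokens get tiny ranks, which an entropy coder
    (arithmetic coding, or even zlib) can store very cheaply."""
    ids = tok(text).input_ids
    ranks = [ids[0]]  # first token has no context; store its id directly
    for t in range(1, len(ids)):
        order = next_token_order(ids[:t])
        ranks.append(int((order == ids[t]).nonzero().item()))
    return ranks

def decode(ranks: list[int]) -> str:
    """Replay the model over the growing prefix and pick the token at each rank."""
    ids = [ranks[0]]
    for r in ranks[1:]:
        ids.append(int(next_token_order(ids)[r]))
    return tok.decode(ids)

if __name__ == "__main__":
    sample = "Language models assign high probability to likely continuations."
    ranks = encode(sample)
    assert decode(ranks) == sample  # lossless round trip
    print(ranks)                    # mostly small integers
```
Because decoding must re-run the model once per token, inference cost dominates, which is exactly why compressors built this way have so far been impractically slow.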
In a new paper, “FineZip: Pushing the Limits of Large Language Models for Practical Lossless Text Compression,” a research team from UC Berkeley and NYU introduces FineZip, a novel LLM-based compression system designed to significantly reduce compression time. By incorporating techniques like online memorization and dynamic context adaptation, FineZip marks an important step toward the practical use of LLMs in lossless text compression.
FineZip’s architecture blends both online and offline elements. Its “online” component employs data-specific fine-tuning to efficiently memorize the content being compressed, while the “offline” portion uses pre-trained LLMs that remain static across different datasets. This dual approach enables the…
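The “online” side can be pictured as a brief, parameter-efficient fine-tuning pass over the very file that is about to be compressed, as in the sketch below; the PEFT/LoRA setup, the hyperparameters, and the GPT-2 stand-in are assumptions made for illustration rather than FineZip’s exact recipe.
```python
# Illustrative sketch of the "online" step: briefly train small adapters on the
# data being compressed so the model memorizes it, while the pre-trained
# ("offline") weights stay frozen. LoRA setup, hyperparameters, and the GPT-2
# stand-in are assumptions, not FineZip's exact recipe.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")

# Small trainable LoRA adapters on top of the frozen pre-trained model.
model = get_peft_model(base, LoraConfig(r=8, lora_alpha=16,
                                        target_modules=["c_attn"],
                                        task_type="CAUSAL_LM"))

def online_memorize(text_to_compress: str, steps: int = 50, window: int = 256):
    """Run a few language-modeling steps on the file itself before compressing it."""
    ids = tok(text_to_compress).input_ids
    chunks = [ids[i:i + window] for i in range(0, max(len(ids) - 1, 1), window)]
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    model.train()
    for step in range(steps):
        chunk = torch.tensor([chunks[step % len(chunks)]])
        loss = model(chunk, labels=chunk).loss  # next-token prediction loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    model.eval()  # the adapted model then ranks/encodes tokens as sketched above
```
A complete system also has to give the decompressor access to the same adapted model, for instance by shipping the small adapter weights alongside the compressed stream or by reproducing the adaptation deterministically on the decompression side; that bookkeeping is omitted from this sketch.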