• AbouBenAdhem@lemmy.world
    20 days ago

    The basic idea behind the researchers’ data compression algorithm is that if an LLM knows what a user will be writing, it does not need to transmit any data, but can simply generate what the user wants them to transmit on the other end

    Great… but if that’s the case, maybe the user should reconsider the usefulness of transmitting that data in the first place.
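
    For what it’s worth, the mechanism is easy to picture. Here is a toy sketch of the quoted idea; the deterministic model_generate, the seed-plus-correction scheme, and all the names are illustrative assumptions, not the paper’s actual protocol:

    ```python
    # Toy illustration of "generate it on the other end": if both sides run the
    # same deterministic model, the sender ships a short seed plus only the part
    # the model fails to predict, and the receiver regenerates the rest.
    # model_generate() is a made-up stand-in for a shared LLM.

    def model_generate(seed: str, n: int) -> str:
        """Deterministically extend `seed` to n characters (here: by repetition)."""
        out = seed
        while len(out) < n:
            out += seed
        return out[:n]

    message = model_generate("na ", 48) + "batman"  # mostly predictable, small surprise at the end

    # Sender: find how far the shared model's output agrees with the real message.
    seed = message[:3]
    predicted = model_generate(seed, len(message))
    agree = next((i for i, (a, b) in enumerate(zip(predicted, message)) if a != b),
                 len(message))
    payload = (seed, agree, message[agree:])  # seed + trusted length + correction

    # Receiver: regenerate the predictable part, then append the correction.
    recovered = model_generate(payload[0], payload[1]) + payload[2]
    assert recovered == message
    print(f"shipped {len(payload[0]) + len(payload[2])} chars instead of {len(message)}")
    ```

    A real system would use an actual LLM and a proper entropy coder rather than a literal seed-plus-correction split, but the shape is the same: a shared model plus a small stream of whatever it could not predict.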

  • andallthat@lemmy.world
    23 days ago

    I tried reading the paper; there is a free preprint version on arXiv. This page (from the article linked by OP) also links to the code they used and the data they tried compressing.

    While most of the theory is above my head, the basic intuition is that compression improves if you have some level of “understanding” or higher-level context of the data you are compressing. And LLMs are generally better at doing that than numeric algorithms.

    As an example, if you recognize a sequence of letters as the first chapter of the book Moby-Dick, you can probably transmit that information more efficiently than a compression algorithm can. “The first chapter of Moby-Dick”; there … I just did it.
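
    Here is a rough sketch of that intuition in code. The bigram counter stands in for the LLM, and the rank-then-zlib trick is just something I picked to make the idea concrete; it is not what the paper actually does:

    ```python
    # Both ends share the same predictive model, so the sender transmits each
    # symbol's *rank* under the model's prediction instead of the symbol itself.
    # Predictable text becomes mostly rank 0, which a generic coder (zlib here)
    # then squeezes very well. A toy bigram counter stands in for the LLM; it is
    # "shared" by training it on a common corpus known to both sides.
    import zlib
    from collections import Counter, defaultdict

    def train_bigram(corpus):
        counts = defaultdict(Counter)
        for a, b in zip(corpus, corpus[1:]):
            counts[a][b] += 1
        return counts

    def ranked(model, prev, alphabet):
        # Most likely next characters first, then the rest in a fixed order.
        seen = [c for c, _ in model[prev].most_common()]
        return seen + [c for c in alphabet if c not in seen]

    def encode(text, model, alphabet):
        ranks = [alphabet.index(text[0])]              # first char sent literally
        for prev, cur in zip(text, text[1:]):
            ranks.append(ranked(model, prev, alphabet).index(cur))
        return zlib.compress(bytes(ranks))             # ranks are small integers

    def decode(blob, model, alphabet):
        ranks = zlib.decompress(blob)
        out = [alphabet[ranks[0]]]
        for r in ranks[1:]:
            out.append(ranked(model, out[-1], alphabet)[r])
        return "".join(out)

    corpus = "the cat sat on the mat and the dog sat on the rug. " * 40
    message = "the dog sat on the rug and the cat sat on the mat. the dog and the cat sat on the mat. "
    alphabet = sorted(set(corpus + message))
    model = train_bigram(corpus)                       # known to sender *and* receiver

    blob = encode(message, model, alphabet)
    assert decode(blob, model, alphabet) == message
    print(len(message), "chars ->", len(blob), "bytes",
          "(plain zlib:", len(zlib.compress(message.encode())), "bytes)")
    ```

    The better the shared model predicts the next symbol, the more of that rank stream is just zeros, and “mostly zeros” costs almost nothing to transmit.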

  • skip0110@lemm.ee
    23 days ago

    This is not new knowledge and predates the current LLM fad.

    See the Hutter Prize, which has had “machine learning”-based compressors leading the ranking for some time: http://prize.hutter1.net/

    It’s important to note that, when applied to compression, the model does produce a code (a.k.a. an encoding) that exactly reproduces the input, i.e. the compression is lossless. But on a different input, the same model is unlikely to produce an equally impressive compression ratio.
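
    A quick way to see that input dependence, with zlib standing in for any fixed model (just a sketch, unrelated to the actual Hutter Prize entries):

    ```python
    # The same fixed compressor looks impressive on data that matches its
    # assumptions and achieves essentially nothing on data that does not.
    # zlib stands in here for any fixed, pre-chosen model.
    import os
    import zlib

    text = ("the quick brown fox jumps over the lazy dog. " * 200).encode()
    noise = os.urandom(len(text))  # incompressible by construction

    for name, data in [("repetitive English text", text), ("random bytes", noise)]:
        ratio = len(zlib.compress(data)) / len(data)
        print(f"{name}: compressed to {ratio:.3f} of original size")
    ```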

  • Alphane Moon@lemmy.world
    23 days ago

    I found the article to be rather confusing.

    One thing to point out is that the video codec used in this research, H264 (for which results weren’t published, for some reason), is not at all state of the art.

    H265 is far newer and they are already working on H266. There are also other much higher quality codecs such as AV1. For what it’s worth, they do reference H265, but I don’t have access to the source research paper, so it’s difficult to say what they are comparing against.

    The performance relative to FLAC is interesting though.

    • InvertedParallax@lemm.ee
      23 days ago

      VVC is H266. The spec is ready; it’s just not in a lot of hardware, or even in decent software, yet, and that often takes a few years. The reference implementation encodes at something like 1 fps or less, but reference software is usually slow as hell in favor of correctness and code comprehension.

      AV1 isn’t much better than HEVC (H265); it’s just open and royalty-free, and Google is pushing it like crazy.

      It has, IIRC, one major feature over HEVC: non-square subpictures. Beyond that, it has some extensions for animation and slideshows, basically.