Or an alternative title: "Man uses astrology in desperate attempt to convince himself he's the smarter sibling"


My sister and I are both pretty avid readers, but she handily outreads me by any metric you'd care to measure (words, pages, books, time, etc.). To make myself feel better, I'm always giving her shit for reading what — in my humble opinion — sounds like hot garbage. Not that there's anything wrong with reading garbage, but how many romance novels about faeries and witches can you read before you run up against the fundamental limits of the genre?1

To put a finer point on it, here's the description for a book she read recently:

Emilia and her twin sister Vittoria are streghe - witches who live secretly among humans, avoiding notice and persecution. One night, Vittoria misses dinner service at the family's renowned Sicilian restaurant. Emilia soon finds the body of her beloved twin...desecrated beyond belief. Devastated, Emilia sets out to find her sister's killer and to seek vengeance at any cost—even if it means using dark magic that's been long forbidden.

and one more, from a totally unrelated series:

Many centuries ago, desperate to save her dying sister, Dianna made a deal with Kaden, a monster far worse than any nightmare. Locked in servitude to him, she is forced to hunt down an ancient relic held by her most dangerous enemies: an army led by Samkiel, the World Ender.

And they always have titles with "Discount Game of Thrones" vibes, like:

A Fate    so Dark   and Delicate
A Bond    so Fierce and Fragile
A Promise so Bold   and Broken
A Tongue  so Sweet  and Deadly

or

Dawn    of Chaos   and Fury
Tempest of Wrath   and Vengeance
Storm   of Secrets and Sorrow
Rain    of Shadows and Endings

and (these are all real, by the way)

The Wrath  of the Fallen
The Dawn   of the Cursed Queen
The Throne of Broken Gods
The Book   of Azrael

And it got me thinking. Instead of just telling my sister she reads trash, could I mathematically prove she reads trash? And the answer is "no, of course not." That's not something you can prove, especially in the mathematical QED sense.

But here's what I can do: taking inspiration from this great analysis of the vocabularies of hip hop artists, I can attempt to calculate how repetitive + derivative the titles of these books are.

To be clear, I'm not working with the contents of the books, just the titles. Like I said, judging books by their covers, literally.2

Making a Dataset

The idea is that there are metrics I can look at to determine how varied and novel the book titles are. I've settled on entropy and word frequency, as those seem sensible enough. But in isolation, these metrics will just yield numbers, meaningless and abstract without some baseline to compare them to. Implicit in me calling my sister's books "trash" is the idea that mine are "not trash", that I read sophisticated, elegant, elevated, and complex works of high art. It's me, I'm the baseline.

So I need two lists before I can Do Science: one with the titles of all the (masterpiece, classic) books I've read, and another with the titles of all the (uninspired, drivel) slag my sister consumes.

Conveniently, I already have a list of all the books my sister has read in the last eight years,3 and a self-hosted calibre-web instance with most of my books.

This is a good start! From there, I did some basic data cleaning:

  • Removing book series suffixes - The idea is that we want to see how repetitive the titles themselves are; we don't need the name of the series muddying that up (there's a rough sed sketch of this cleanup just after this list).
    • Rain of Shadows and Endings (The Legacy Book 1) becomes Rain of Shadows and Endings
    • Mostly Harmless (Hitchhiker's Guide to the Galaxy Book 5) becomes Mostly Harmless
  • Removing subtitles - Mostly just because they're usually wordy, and I want the focus to be on the number of unique titles in the set. Whether or not these subtitles are "part of the title" is a philosophical question I'm not going to tackle here.
    • God, Human, Animal, Machine: Technology, Metaphor, and the Search for Meaning becomes God, Human, Animal, Machine
    • How Not to Be Wrong: The Power of Mathematical Thinking becomes How Not to Be Wrong
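
For the curious, the cleanup amounts to something like the one-liner below. This is just a sketch: the patterns and the titles.txt / titles_clean.txt filenames are illustrative, and the suffix pattern assumes the series tag always ends in "Book N".

# Sketch: strip "(Series Name Book N)" suffixes, then ": Subtitle" tails
# (assumes one title per line; not necessarily the exact patterns used)
sed -E 's/ \([^)]*Book [0-9]+\)$//; s/: .*//' titles.txt > titles_clean.txt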

And then I normalized the lengths of the corpuses in different ways for each analysis.

Analysis #1: Entropy + Binary Compression

My first test is how well the book titles compress using generic methods. The idea is that a more diverse set of titles looks "more random" and compresses poorly compared to a repetitive one. I decided to use the ent tool for this, which tests the randomness of a sequence in a few different ways (check the README for the nitty gritties). We don't need to make the corpuses the same length for this test, as it's mostly4 length-independent.
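
As an aside, the headline "bits per byte" figure that ent reports is the Shannon entropy of the file's byte frequencies. Here's a rough shell sketch that recomputes it, purely as a sanity check (ent does the real work below, and the two should essentially agree):

# Sanity-check sketch: Shannon entropy over byte frequencies
od -An -tu1 -v brandon.txt |
  tr -s ' ' '\n' | grep -v '^$' |
  sort -n | uniq -c |
  awk '{ n += $1; count[$2] = $1 }
       END { for (b in count) { p = count[b] / n; H -= p * log(p) / log(2) }
             printf "%.6f bits per byte\n", H }'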

Starting with myself:

ent brandon.txt

yields

Entropy = 4.849870 bits per byte.

Optimum compression would reduce the size
of this 2252 byte file by 39 percent.

Chi square distribution for 2252 samples is 27196.41, and randomly
would exceed this value less than 0.01 percent of the times.

Arithmetic mean value of data bytes is 89.1545 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is 0.063961 (totally uncorrelated = 0.0).

And now my sister:

ent sister.txt

yields

Entropy = 4.760669 bits per byte.

Optimum compression would reduce the size
of this 5282 byte file by 40 percent.

Chi square distribution for 5282 samples is 66711.87, and randomly
would exceed this value less than 0.01 percent of the times.

Arithmetic mean value of data bytes is 87.9574 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is 0.024103 (totally uncorrelated = 0.0).
Corpus          Entropy (bits per byte)    Optimal compression (% reduction)    Serial Correlation
Brandon (me)    4.849870*                  39%*                                 0.063961
Sister          4.760669                   40%                                  0.024103*

Starred values indicate "more random". My list of book titles has more entropy and compresses less optimally, but my sister's book titles have letters that are harder to guess given the previous letter.

We ignore the Chi square distribution, as it's essentially just a yes/no test for randomness, and neither of these files is random. And we ignore the arithmetic mean because it's literally just an average over byte values, which for truly random data would be (0 + 255) / 2 = 127.5; here it basically just measures "who has more lowercase letters", which isn't particularly interesting.

I don't really understand why the "serial correlation" metric looks the way it does, but I'll take the narrow win.

Winner: Me (barely, 2-1)

Analysis #2: LLM-Based Text Compression

Our book titles are made of sequences of mostly English words. These tend to have more structure than your average sequence of bytes, so I thought it'd be interesting to try a more language-based compression approach. Enter ts_zip by the inimitable Fabrice Bellard, which uses LLMs to compress sequences more efficiently than conventional compressors. I've been looking for an excuse to try this out, and I finally found one.

For this test, we do need similar-length corpuses if we want to look at the absolute compressed size. Since my sister's list had far more books (and bytes overall) than mine, I removed the chronologically oldest titles until the two files were almost exactly the same number of characters:

wc --total=never -c brandon.txt sister_trunc.txt
2252 brandon.txt
2251 sister_trunc.txt

I opted to truncate her list until it had a smaller file size than mine (i.e., removing one more entry rather than one fewer), to give her an infinitesimal edge and to preempt claims that I'm cooking the books.
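
If you're curious, the truncation amounts to something like this loop. It's a sketch, and it assumes sister.txt is ordered oldest-first, so dropping the first line drops the oldest title:

# Sketch: drop the oldest (first) title until sister's file is no larger than mine
cp sister.txt sister_trunc.txt
while [ $(wc -c < sister_trunc.txt) -gt $(wc -c < brandon.txt) ]; do
  tail -n +2 sister_trunc.txt > tmp.txt && mv tmp.txt sister_trunc.txt
done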

./ts_zip c sister_trunc.txt sister_trunc.bin
max_memory=0.33 GB, speed=256 bytes/s (79 tokens/s)
input=2251 bytes, output=435 bytes, ratio=1.546 bpb
./ts_zip c brandon.txt brandon.bin
max_memory=0.33 GB, speed=274 bytes/s (76 tokens/s)
input=2252 bytes, output=423 bytes, ratio=1.503 bpb

Surprisingly, my sister wins this round! I figured that the simpler titles built from short, common words would compress better, but apparently not: my titles compress by ~81.2% versus her ~80.7%. It's close, but that's to be expected.

My only hunch for why this might be the case is that in ts_zip, the role of the LLM is predicting sequences of tokens. So, while all the individual words of a title like, sigh — A Tongue so Sweet and Deadly — are common, that specific (and nauseating) sequence is hopefully less so. And unusual words like "Azrael" probably don't compress well, comparatively. Or maybe I just read books that are more popular and better represented in the model's (RWKV 169M v4) training data, who knows!

Winner: Sister

Analysis #3: Number of Unique Words

This technique is most similar to the one from that analysis of hip hop lyrics I mentioned earlier. We'll normalize the data, then count how many unique words appear in it. To make it fair, we need to have the same number of words:

wc --total=never -w brandon.txt sister.txt
370 brandon.txt
943 sister.txt

So we'll look at the first 370 words of each with a bit of Linux shell-fu:

sed "s/'//g" sister.txt |
  grep -oE '\w+' |
  head -n 370 |
  tr '[:upper:]' '[:lower:]' |
  sort -u |
  wc -l
# Prints 209

sed "s/'//g" brandon.txt | ...
# Prints 245

This incantation basically says "remove apostrophes, put each word on a new line, grab the first 370 of them, make it all lowercase, sort it and remove duplicates, then count up the total".

My 370 book title words contain 17.2% more unique words than my sister's 370 (245 vs. 209), which is unsurprising given that my initial observation was how all the titles were A <blank> of <blankness> and <blankings> or some variation of that.

Winner: Me

Conclusion

So what can we take away from all this? That you can judge a book by its cover? That some books are empirically worse than others based on nothing more than the title?? That I have far too much time on my hands???

Definitely that last one, not so sure about the others.

  1. I read a lot of sci-fi that could probably be summed up the same way, so consider my tongue planted firmly in cheek here.

  2. Technically, I'm judging them by their titles, not their covers. I contemplated trying to get the cover art for all the books and doing a visual analysis of them to make the joke hit better, but then I realized that's more effort than I'm willing to invest.

  3. It's because of an igotyouapresent website I made for her, but that's a different story.

  4. Except for the absolute value of Chi square, which we ignore. Theoretically, a longer dataset has more opportunities for compression, but the relative results came back the same way regardless.