
Fast N-gram Tool

One of our goals at WordTree Foundation is to make it easier for anyone with an interest in literary archeology to dive in and test hypotheses. If that’s you, the ngrams command-line tool is something you can use to compare n-grams within books to see if small portions of any two (English) books overlap.

The basic idea is this: if two books (or authors) were inspired by a common source, or if one book (or author) was inspired by the other, then there are likely to be some unusual words or phrases that “stand out,” statistically speaking.

The ngrams tool is a very fast n-gram counter. It’s written in Rust, a compiled, memory-safe language well suited to this kind of high-throughput text processing. It takes ASCII-encoded text as input, cleans it up, iterates over windows of size N (you choose N; 4 is a reasonable window size), and then counts the N-grams.
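To make the clean, window, and count steps concrete, here is a minimal sketch in Rust. It is not the ngrams tool's actual code: the function names `clean` and `count_ngrams` are hypothetical, and it assumes word-level n-grams with a simple cleaning rule (lowercase everything, split on any character that is not an ASCII letter or digit). The real tool may clean and tokenize differently.

```rust
use std::collections::HashMap;

/// Assumed cleaning rule (not taken from the ngrams tool): lowercase the
/// text and split on every character that is not an ASCII letter or digit.
fn clean(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_ascii_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_ascii_lowercase())
        .collect()
}

/// Slide a window of `n` words over the cleaned text and count how often
/// each n-gram occurs. Returns an empty map if `n` is 0 or the text has
/// fewer than `n` words.
fn count_ngrams(text: &str, n: usize) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    if n == 0 {
        // `windows(0)` would panic, so handle it explicitly.
        return counts;
    }
    let words = clean(text);
    for window in words.windows(n) {
        *counts.entry(window.join(" ")).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = count_ngrams("It came to pass. And it came to pass!", 3);
    // Sort for stable output; HashMap iteration order is unspecified.
    let mut entries: Vec<_> = counts.iter().collect();
    entries.sort();
    for (ngram, count) in entries {
        println!("{}\t{}", count, ngram);
    }
}
```

In this example the 3-grams "it came to" and "came to pass" each occur twice, and every other window occurs once. Storing each n-gram as a joined `String` keeps the sketch short; a memory-conscious implementation would likely use a more compact key.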

Then, using client/server mode, you can query for n-grams that match between two or more books.
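The kind of question such a query answers can be illustrated in-process, without any client or server. The sketch below is only an illustration under stated assumptions: `ngram_set` and `shared_ngrams` are hypothetical names, the n-grams are word-level, and the tokenization (lowercase, split on non-alphanumeric ASCII) is assumed. It does not reproduce the tool's client/server protocol or its query syntax.

```rust
use std::collections::HashSet;

/// Collect the distinct word-level n-grams of a text.
/// The tokenization here is an assumption made for this sketch.
fn ngram_set(text: &str, n: usize) -> HashSet<String> {
    if n == 0 {
        // `windows(0)` would panic, so return an empty set instead.
        return HashSet::new();
    }
    let words: Vec<String> = text
        .split(|c: char| !c.is_ascii_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_ascii_lowercase())
        .collect();
    words.windows(n).map(|w| w.join(" ")).collect()
}

/// Return the n-grams that appear in both texts, sorted alphabetically.
fn shared_ngrams(book_a: &str, book_b: &str, n: usize) -> Vec<String> {
    let a = ngram_set(book_a, n);
    let b = ngram_set(book_b, n);
    let mut shared: Vec<String> = a.intersection(&b).cloned().collect();
    shared.sort();
    shared
}

fn main() {
    let a = "And it came to pass that the people gathered.";
    let b = "Now it came to pass in those days.";
    for ngram in shared_ngrams(a, b, 3) {
        println!("{}", ngram);
    }
}
```

With N = 3 the two sample sentences share "came to pass" and "it came to". A phrase this common would match across many books, which is why matches are usually weighed against a large baseline to find the ones that are statistically unusual.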

If you’re curious, this tool is similar to the TextGrams.jl tool we previously published, but uses less memory and therefore can load a larger baseline into memory on a single machine.

See the ngram-tools GitHub page for usage details.

Published 25 Jul 2019

The WordTree Foundation studies the relationships between books, with a special interest in LDS scripture such as the Book of Mormon.
WordTree Foundation on GitHub