Before LLMs, How Did Programs Tell Real Language from Gibberish?

Natural Language Identification in CrypTool-2's Caesar Cipher Brute Force


Abstract

Before large language models (LLMs) existed, cryptanalysis tools faced a very practical problem: brute-forcing a cipher produces a pile of candidate plaintexts — how does the program figure out which one is actual language and which is garbage? This article digs into the source code of CrypTool-2, an open-source cryptography education platform, to dissect the four natural language identification methods it uses in Caesar cipher brute force: Index of Coincidence (IoC), Shannon Entropy, N-gram log-probability scoring, and dictionary matching. For each method, we walk through the math, the data structures, and the actual code.



1. The Problem: The "Last Mile" of Brute Force

The Caesar cipher has a key space of just 25 (shift values 1–25), so brute-forcing the computation itself is trivial. The real challenge is:

Given 25 decryption results, how does the program automatically pick the one that "looks like real language"?

For example, brute-forcing the ciphertext KHOOR ZRUOG:

| Shift | Decrypted Result | Real Language? |
|-------|------------------|----------------|
| 1 | JGNNQ YQTNF | No |
| 2 | IFMMP XPSME | No |
| 3 | HELLO WORLD | Yes |
| ... | ... | No |

A human spots the answer immediately. But to a program, JGNNQ and HELLO are both just five ASCII characters — it needs a quantitative criterion to distinguish natural language from random noise.

CrypTool-2 offers four different approaches, each with its own angle.


2. Method 1: Index of Coincidence (IoC)

2.1 The Math

The Index of Coincidence was introduced by William Friedman (1922), based on a simple observation:

Letter distributions in natural language are highly uneven; in random text, they tend toward uniform.

In English, E accounts for about 12.7% of all letters, while Z is just 0.074%. If you randomly pick two letters from a text, the probability they match is much higher in natural language than in random text. IoC measures exactly this probability:

$$\text{IoC} = \frac{\sum_{i=1}^{c} n_i(n_i - 1)}{N(N-1)}$$

Where $n_i$ is the count of the $i$-th letter, $N$ is the total text length, and $c$ is the alphabet size.

  • English text: IoC ≈ 0.0661
  • German text: IoC ≈ 0.0762
  • Uniform random distribution: IoC ≈ 1/26 ≈ 0.0385
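
A quick worked check (my example, not from the article's source): for the 4-letter text "AABC", $n_A = 2$, $n_B = n_C = 1$, $N = 4$:

$$\text{IoC} = \frac{2 \cdot 1 + 1 \cdot 0 + 1 \cdot 0}{4 \cdot 3} = \frac{2}{12} \approx 0.167$$

With texts this short the statistic is noisy; the reference values above only emerge over hundreds of letters.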

2.2 CrypTool-2 Implementation

Source file: CrypPlugins/CostFunction/CostFunction.cs, lines 384–416.

csharp
public double calculateFastIndexOfCoincidence(byte[] text, int bytesToUse)
{
    if (bytesToUse > text.Length)
    {
        bytesToUse = text.Length;
    }

    double[] n = new double[256];  // 256-length counter array covering all ASCII
    //count all ASCII symbols
    int counter = 0;
    foreach (byte b in text)
    {
        n[b]++;
        counter++;
        if (counter == bytesToUse)
        {
            break;
        }
    }

    double coindex = 0;
    //sum them
    for (int i = 0; i < n.Length; i++)
    {
        coindex = coindex + n[i] * (n[i] - 1);  // Σ n_i * (n_i - 1)
    }

    coindex = coindex / (bytesToUse);             // divide by N
    coindex = coindex / (bytesToUse - 1);         // divide by (N-1)

    return coindex;
}

Relation operator: LargerThen (higher values = more likely natural language).

See CostFunctionControl.GetRelationOperator() (same file, lines 567–580):

csharp
case CostFunctionSettings.CostFunctionType.IOC:
    return RelationOperator.LargerThen;


2.3 Limitations

IoC runs in O(N) and needs no pre-trained data — it's very fast. But it can only distinguish "uniform distribution" from "non-uniform distribution," which is pretty coarse. It might not even tell two different natural languages apart. So IoC works best as a rough filter: it quickly eliminates obvious gibberish, but struggles when two candidates both look "somewhat language-like." In cryptanalysis, its more common use is the Friedman test — estimating the key length of polyalphabetic substitution ciphers.


3. Method 2: Shannon Entropy

3.1 The Math

Shannon (1948) defined information entropy to measure the "information density" or "uncertainty" of a text:

$$H = -\sum_{i=1}^{c} p_i \log_2 p_i$$

Where $p_i = n_i / N$ is the probability of the $i$-th symbol.

  • Natural language: uneven letter distribution → low entropy (German ≈ 4.06 bits)
  • Random text: 26 letters with equal probability → high entropy ($\log_2 26 \approx 4.70$ bits)
  • Single character repeated: e.g., "AAAA..." → entropy = 0
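
A quick worked check (my example, not from the article's source): for "AAAB", $p_A = 3/4$ and $p_B = 1/4$, so

$$H = -\left(\tfrac{3}{4}\log_2\tfrac{3}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) \approx 0.811 \text{ bits}$$

which is lower than the 1 bit of the evenly spread "AABB" — skew lowers entropy.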

3.2 CrypTool-2 Implementation

Source file: CrypPlugins/CostFunction/CostFunction.cs, lines 422–497.

CrypTool-2 uses a precomputation optimization here — it builds an xlogx lookup table upfront to avoid calling Math.Log repeatedly at runtime:

csharp
// Precomputation phase: build xlogx table (lines 422–431)
private void prepareEntropy(int size)
{
    xlogx = new float[size + 1];
    //precomputations for fast entropy calculation
    xlogx[0] = 0.0f;
    for (int i = 1; i <= size; i++)
    {
        // xlogx[i] = -i * log2(i/size)
        xlogx[i] = (float)(-1.0f * i * Math.Log(i / (double)size) / Math.Log(2.0));
    }
}
csharp
// Calculation phase (lines 450–497): switches implementation based on EntropySelection
public double calculateEntropy(byte[] text, int bytesToUse)
{
    switch (settings.EntropySelection)
    {
        case 0:  // C++/CLI native implementation for high performance
            return NativeCryptography.Crypto.calculateEntropy(text, bytesToUse);
        case 1:  // C# managed implementation (with xlogx lookup optimization)
            // ... Mutex-protected precomputation check ...
            int[] n = new int[256];
            for (int counter = 0; counter < bytesToUse; counter++)
            {
                n[text[counter]]++;              // count character frequencies
            }

            float entropy = 0;
            for (int i = 0; i < 256; i++)
            {
                entropy += xlogx[n[i]];          // table lookup, no runtime log calls
            }
            return entropy / (double)bytesToUse;
        default:
            return NativeCryptography.Crypto.calculateEntropy(text, bytesToUse);
    }
}

Users can switch between the C++/CLI native implementation and the C# managed implementation via the EntropySelection setting.

Relation operator: LessThen (lower values = more likely natural language).
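
Why does summing table entries give the entropy? Because $\text{xlogx}[n_i]/N = -(n_i/N)\log_2(n_i/N) = -p_i \log_2 p_i$ when the table was built with size equal to the number of bytes scored, which appears to be the intended use. A standalone check (my own sketch, not CT2 code):

csharp
using System;

class XlogxDemo
{
    static void Main()
    {
        byte[] text = System.Text.Encoding.ASCII.GetBytes("HELLO WORLD");
        int size = text.Length;

        // Build the lookup table: xlogx[i] = -i * log2(i / size)
        double[] xlogx = new double[size + 1];
        for (int i = 1; i <= size; i++)
        {
            xlogx[i] = -1.0 * i * Math.Log(i / (double)size, 2);
        }

        // Count byte frequencies
        int[] n = new int[256];
        foreach (byte b in text) { n[b]++; }

        // Table-based entropy: sum lookups, divide by N
        double viaTable = 0;
        for (int i = 0; i < 256; i++) { viaTable += xlogx[n[i]]; }
        viaTable /= size;

        // Direct formula: H = -Σ p_i log2 p_i
        double direct = 0;
        for (int i = 0; i < 256; i++)
        {
            if (n[i] == 0) { continue; }
            double p = n[i] / (double)size;
            direct -= p * Math.Log(p, 2);
        }

        Console.WriteLine($"table: {viaTable:F6}, direct: {direct:F6}");  // identical values
    }
}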

3.3 IoC vs. Entropy: Are They Redundant?

IoC and Entropy essentially measure the same thing — how uneven the letter distribution is. Higher IoC means lower Entropy; they're roughly monotonically related. But they differ in edge-case behavior and numerical sensitivity, so CrypTool-2 keeps both as separate options.


4. Method 3: N-gram Log-Probability Scoring (The Core Method)

4.1 From Letter Frequencies to Letter Combination Frequencies

IoC and Entropy only look at how often each individual letter appears — they don't care about ordering. So ETAOIN SHRDLU (the most common English letters arranged together) and HELLO WORLD (a valid English sentence) might score about the same, because their single-letter frequency distributions are too similar.

To tell "plausible-looking gibberish" from actual language, you need to look at letter combination patterns. In English, TH appears very frequently while QZ almost never does; TION is a common 4-letter sequence, XKZQ is not. N-gram statistics are designed to capture exactly these letter sequence patterns.

4.2 Data Generation: From Corpus to Probability Table

CrypTool-2 uses LanguageStatisticsGenerator (source: Util/LanguageStatisticsGenerator/Program.cs) to generate N-gram frequency data from large text corpora. It supports two data sources:

  1. Project Gutenberg: ZIP files of public domain books
  2. Wikipedia XML Dump: XML exports of Wikipedia

The source selection is straightforward (lines 71–78):

csharp
if (path.ToUpper().EndsWith("XML"))
{
    ReadInWikipediaXML(path);
}
else
{
    ReadInGutenbergZips(path);
}

The generation pipeline (using 5-grams as an example):

text
Wikipedia/Gutenberg corpus
    ↓ sliding window scan
    ↓ 8 worker threads (WORKERS = 8) + ConcurrentQueue + Lock merge
5D frequency array uint[26,26,26,26,26]
    ↓ probability conversion
    ↓ Math.Log(count == 0 ? 1/sum : count/sum)
log-probability table float[26,26,26,26,26]
    ↓ serialization
CTLS format GZip compressed file (.gz)

The key probability conversion code (Program.cs, lines 542–545):

csharp
private static IEnumerable<float> CalculateLogs(Array freq, ulong sum)
{
    return freq.Cast<uint>().Select(value =>
        (float)Math.Log(value == 0 ? 1.0 / sum : value / (double)sum));
}

Why logarithms instead of raw probabilities? Two reasons. First, raw probabilities are tiny — a 5-gram probability might be on the order of 10^-8, and multiplying many of them together underflows to zero in double precision. Taking the log turns multiplication into addition, which is numerically stable. Second, N-grams that never appeared in the corpus (count = 0) would require log(0), which is undefined, so the code substitutes 1/sum — an extremely small probability that becomes a large negative number after the log, effectively penalizing unseen combinations.
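
A quick standalone illustration of the underflow problem (my own sketch, not CT2 code):

csharp
using System;

class LogProbDemo
{
    static void Main()
    {
        // 200 five-gram probabilities of ~1e-8 each, as a long candidate text would produce
        double rawProduct = 1.0;
        double logSum = 0.0;
        for (int i = 0; i < 200; i++)
        {
            rawProduct *= 1e-8;        // underflows to exactly 0.0 after ~40 factors
            logSum += Math.Log(1e-8);  // stays finite: 200 * ln(1e-8) ≈ -3684
        }
        Console.WriteLine(rawProduct); // 0
        Console.WriteLine(logSum);     // -3684.13...
    }
}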

4.3 File Format: The CTLS Binary Protocol

Source file: LibSource/LanguageStatisticsLib/LanguageStatisticsFile.cs

Header structure:

text
[Magic: 4 bytes "CTLS"] [language code: length-prefixed string, BinaryWriter.Write(string)] [gram length: int32]
[alphabet: length-prefixed string, BinaryWriter.Write(string)] [frequency data: float[] contiguous block]

Magic number definition (line 31):

csharp
public const uint FileFormatMagicNumber = 'C' + ('T' << 8) + ('L' << 16) + ('S' << 24);

The entire file is GZip-compressed. At load time, Buffer.BlockCopy copies the frequency data into a multidimensional array in one shot (lines 78–79), making subsequent lookups simple array indexing:

csharp
byte[] frequencyData = br.ReadBytes(sizeof(float) * frequencyEntries);
Buffer.BlockCopy(frequencyData, 0, frequencyArray, 0, frequencyData.Length);

For English 4-grams, the array has 26^4 = 456,976 floats, roughly 1.7 MB uncompressed. For 5-grams, that's 26^5 ≈ 11.9 million floats, about 45 MB.
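
For intuition about what that Buffer.BlockCopy sets up: C# rectangular arrays are stored row-major, so entry [a,b,c,d] sits at the flat offset ((a·26 + b)·26 + c)·26 + d. A small standalone sketch of the equivalence (my own, with a made-up value — not CT2 code):

csharp
using System;

class IndexDemo
{
    static void Main()
    {
        const int A = 26;
        float[,,,] freq = new float[A, A, A, A];   // ~1.7 MB of floats, as for 4-grams
        freq[7, 4, 11, 11] = -8.2f;                // pretend log P("HELL"); H=7, E=4, L=11

        // Rectangular arrays are row-major: flat offset = ((a*26 + b)*26 + c)*26 + d
        float[] flat = new float[A * A * A * A];
        Buffer.BlockCopy(freq, 0, flat, 0, flat.Length * sizeof(float));

        int offset = ((7 * A + 4) * A + 11) * A + 11;
        Console.WriteLine(flat[offset]);           // -8.2
    }
}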

4.4 Scoring Algorithm: Sliding Window Summation

Using Tetragrams (4-grams) as an example, source: LibSource/LanguageStatisticsLib/Grams/Tetragrams.cs, lines 44–84:

csharp
public override double CalculateCost(int[] text)
{
    int end = text.Length - 3;
    if (end <= 0)
    {
        return 0;
    }

    double value = 0;
    int alphabetLength = Alphabet.Length;

    for (int i = 0; i < end; i++)
    {
        int a = text[i];
        int b = text[i + 1];
        int c = text[i + 2];
        int d = text[i + 3];

        // addLetterIndicies: optional alphabet reduction mapping
        if (addLetterIndicies != null)
        {
            a += addLetterIndicies[a];
            b += addLetterIndicies[b];
            c += addLetterIndicies[c];
            d += addLetterIndicies[d];
        }

        // bounds check: skip characters outside the alphabet
        if (a >= alphabetLength ||
            b >= alphabetLength ||
            c >= alphabetLength ||
            d >= alphabetLength ||
            a < 0 || b < 0 || c < 0 || d < 0)
        {
            continue;
        }
        value += Frequencies[a, b, c, d];  // O(1) multidimensional array lookup
    }
    return value / end;  // average log-probability per N-gram
}

It's just a sliding window: scan through the text, grab 4 consecutive letters each time, look up their log-probability in the frequency table, sum everything up, divide by the number of windows.

For "HELLO":

text
H-E-L-L → Frequencies['H','E','L','L']  = log(P("HELL"))  ≈ -8.2  (common combination)
E-L-L-O → Frequencies['E','L','L','O']  = log(P("ELLO"))  ≈ -9.1  (fairly common)
Average ≈ -8.65

For random gibberish "XKZQ":

text
X-K-Z-Q → Frequencies['X','K','Z','Q']  = log(1/sum)      ≈ -15.3 (extremely rare)
Average ≈ -15.3

Higher values (closer to 0) mean the text looks more like natural language. The gap is pretty obvious.

4.5 Call Chain in CrypTool-2

When the CostFunction plugin is set to NGramsLog2 mode, the call chain looks like:

text
CostFunction.CalculateCost(byte[] text)
  → text → UTF8 decode → ToUpper() → string
  → LanguageStatistics.MapTextIntoNumberSpace(text, alphabet)
    → map each character to its index in the alphabet → int[]
  → Grams.CalculateCost(int[] numbers)
    → sliding window, frequency table lookup, sum, average
  → return average log-probability

Source: CostFunction.cs, lines 222–224:

csharp
case CostFunctionSettings.CostFunctionType.NGramsLog2:
    return Grams.CalculateCost(LanguageStatistics.MapTextIntoNumberSpace(
        Encoding.UTF8.GetString(text).ToUpper(),
        LanguageStatistics.Alphabets[LanguageStatistics.LanguageCode(settings.Language)]));

MapTextIntoNumberSpace (LanguageStatistics.cs, lines 312–322) maps each character to its alphabet index; characters not in the alphabet come back as -1 from IndexOf, which the bounds check in CalculateCost then skips:

csharp
public static int[] MapTextIntoNumberSpace(string text, string alphabet)
{
    int[] numbers = new int[text.Length];
    int position = 0;
    foreach (char c in text)
    {
        numbers[position] = alphabet.IndexOf(c);
        position++;
    }
    return numbers;
}

4.6 Choosing the N-gram Size

CrypTool-2 supports 1-gram through 5-gram via CostFunctionSettings.NGramSize, with the LanguageStatisticsLib library also supporting 6-grams. The default is 5-gram (Pentagrams).

| N-gram Size | Array Dimensions | Storage (26-letter) | Granularity | Typical Use |
|---|---|---|---|---|
| 1-gram | 26 | 104 B | Single-letter frequency | Equivalent to IoC/Entropy |
| 2-gram | 26^2 = 676 | 2.6 KB | Letter pairs | Quick rough filtering |
| 3-gram | 26^3 = 17,576 | 68 KB | Short patterns | Common in Enigma analysis |
| 4-gram | 26^4 ≈ 460K | 1.7 MB | Medium patterns | General analysis |
| 5-gram | 26^5 ≈ 11.9M | 45 MB | Long patterns | CT2 default, highest precision |
| 6-gram | 26^6 ≈ 309M | 1.2 GB | Very long patterns | Highest precision, if memory allows |

Larger N means better accuracy, but storage grows exponentially. 5-gram strikes a practical balance between precision and memory.

The factory method in LanguageStatistics.cs (lines 180–198) reflects this design:

csharp
public static Grams CreateGrams(int languageId, string languageStatisticsDirectory,
                                 GramsType gramsType, bool useSpaces)
{
    switch (gramsType)
    {
        case GramsType.Unigrams:     return new Unigrams(...);
        case GramsType.Bigrams:      return new Bigrams(...);
        case GramsType.Trigrams:     return new Trigrams(...);
        case GramsType.Tetragrams:   return new Tetragrams(...);
        case GramsType.Pentragrams:  // CT2's default ngram size
        default:                     return new Pentagrams(...);
        case GramsType.Hexagrams:    return new Hexagrams(...);
    }
}

4.7 Multi-Language Support

LanguageStatistics.cs defines alphabets for 15 languages (lines 142–159):

csharp
public static Dictionary<string, string> Alphabets = new Dictionary<string, string>()
{
    {"en", "ABCDEFGHIJKLMNOPQRSTUVWXYZ" },                      // English: 26 letters
    {"de", "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜß" },                  // German: 30 letters (with umlauts)
    {"fr", "ABCDEFGHIJKLMNOPQRSTUVWXYZ" },                      // French
    {"es", "ABCDEFGHIJKLMNOPQRSTUVWXYZÑ" },                     // Spanish: 27 letters
    {"it", "ABCDEFGHIJKLMNOPQRSTUVWXYZ" },                      // Italian
    {"hu", "ABCDEFGHIJKLMNOPQRSTUVWXYZ" },                      // Hungarian
    {"ru", "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ" },               // Russian: 33 Cyrillic letters
    {"cs", "ABCDEFGHIJKLMNOPQRSTUVWXYZ" },                      // Czech
    {"la", "ABCDEFGHIJKLMNOPQRSTUVWXYZ" },                      // Latin
    {"el", "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ" },                        // Greek: 24 letters
    {"nl", "ABCDEFGHIJKLMNOPQRSTUVWXYZ"},                       // Dutch
    {"sv", "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ"},                    // Swedish: 29 letters
    {"pt", "ABCDEFGHIJKLMNOPQRSTUVWXYZ"},                       // Portuguese
    {"pl", "AĄBCĆDEĘFGHIJKLŁMNŃOÓPQRSŚTUVWXYZŹŻ"},              // Polish: 35 letters
    {"tr", "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ" }                    // Turkish: 29 letters
};

Each language has its own N-gram frequency files (e.g., en-5gram-nocs.bin, de-4gram-nocs-sp.bin), so scoring can target a specific language.

The code also hard-codes Unigram frequencies for each language (lines 118–140), with comments citing Wikipedia and practicalcryptography.com as data sources.


5. Method 4: Dictionary Matching

5.1 How It Works

The first three methods all produce statistical scores — they compute a number and pick the best-scoring candidate. Dictionary matching takes a more straightforward approach: how many real words can you find in the decrypted output? If you find enough, it's probably the right plaintext.

CrypTool-2's Caesar brute-force template (Templates/Cryptanalysis/Classic/Caesar_ExhaustiveKeySearch.xml) uses this workflow:

text
Ciphertext → [Caesar component: iterate shift values 1–25]
              → each candidate plaintext → [Contains component]
                                                  ↑
                                [Dictionary component: load language dictionary]

Contains component: matched word count ≥ threshold → Result = true

5.2 Dictionary Data Structure: WordTree (Trie)

Source file: LibSource/LanguageStatisticsLib/WordTree.cs

The dictionary is stored as a Trie (prefix tree), serialized into a custom binary format (magic number CT2DIC), and GZip-compressed into .dic files. Each language has its own dictionary file (e.g., Dictionary_en.dic, Dictionary_de.dic).

The deserialization logic (lines 39–101) rebuilds the tree using a stack:

csharp
public static WordTree Deserialize(BinaryReader reader)
{
    WordTree tree = new WordTree();

    // 1. Validate magic number
    string magicNo = new string(reader.ReadChars(6));
    if (magicNo != "CT2DIC")
    {
        throw new Exception("File does not start with the expected magic number for word tree.");
    }

    // 2. Read language code and alphabet (0-terminated strings)
    // ...

    // 3. Rebuild the trie using a stack
    Stack<Node> stack = new Stack<Node>();
    stack.Push(tree);
    int symbol;
    while ((symbol = reader.Read()) != -1)
    {
        char readChar = (char)symbol;
        if (readChar == Node.WordEndSymbol)      // word-end marker
        {
            stack.Peek().WordEndsHere = true;
            tree.StoredWords++;
            continue;
        }
        if (readChar == Node.TerminationSymbol)  // node termination marker
        {
            stack.Pop();
            continue;
        }
        Node newNode = new Node() { Value = readChar };
        stack.Peek().ChildNodes.Add(newNode);
        stack.Push(newNode);
    }
    return tree;
}
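
The WordTree only stores structure; membership testing is a walk from the root. A minimal sketch of how a lookup over this node shape might look (my own hypothetical helper — it assumes ChildNodes is a List<Node>, which is what the Add call above suggests):

csharp
// Hypothetical lookup against the deserialized trie: follow one child per character.
public static bool ContainsWord(Node root, string word)
{
    Node current = root;
    foreach (char c in word)
    {
        // Find the child carrying this character; null means the prefix is unknown
        Node next = current.ChildNodes.Find(n => n.Value == c);
        if (next == null)
        {
            return false;
        }
        current = next;
    }
    return current.WordEndsHere;  // only count it if a dictionary word ends exactly here
}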

5.3 Matching Strategies

The Contains component (source: CrypPlugins/Contains/Contains.cs) supports two matching strategies:

Strategy A: Aho-Corasick (default)

Aho-Corasick is a classic multi-pattern string matching algorithm (Aho & Corasick, 1975) that can search for all dictionary words simultaneously in O(N + M + Z) time (N = text length, M = total pattern length, Z = number of matches). It's a perfect fit for "search for tens of thousands of dictionary words in a text at once."

Implementation: CrypPlugins/Contains/Aho-Corasick/StringSearch.cs, core interface:

csharp
public class StringSearch : IStringSearchAlgorithm
{
    public bool ContainsAny(string text);             // check if text contains any keyword
    public StringSearchResult[] FindAll(string text); // find all matches
    public StringSearchResult FindFirst(string text); // find first match
}
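
A short usage sketch against this interface (my own; dictionaryWords, candidatePlaintexts, and hitThreshold are stand-in variables):

csharp
// Build the automaton once from the dictionary, reuse it for every candidate
StringSearch search = new StringSearch(dictionaryWords);        // string[] of dictionary words

foreach (string candidate in candidatePlaintexts)
{
    StringSearchResult[] hits = search.FindAll(candidate);      // all dictionary words found
    if (hits.Length >= hitThreshold)
    {
        Console.WriteLine($"Probable plaintext: {candidate}");  // enough real words → accept
    }
}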

The TreeNode inner class implements the Aho-Corasick automaton state nodes:

  • _failure: failure link pointer (built via BFS)
  • _results: list of keywords matched at this node
  • _transHash: transition function (character → child node mapping)

Strategy B: Hashtable word-by-word lookup

Dictionary words are loaded into a hash table. The input text is tokenized by delimiter and each token is looked up individually. Better suited when the input is already tokenized.

Search structure initialization (Contains.cs, lines 161–215):

csharp
private void SetSearchStructure()
{
    if (settings.Search == ContainsSettings.SearchType.AhoCorasick)
    {
        stringSearch = new StringSearch(
            dictionaryInputString.Split(settings.DelimiterDictionary[0]));
    }
    else if (settings.Search == ContainsSettings.SearchType.Hashtable)
    {
        hashTable = new Hashtable();
        string[] theWords = dictionaryInputString.Split(settings.DelimiterDictionary[0]);
        foreach (string item in theWords)
        {
            if (!hashTable.ContainsKey(item))
            {
                hashTable.Add(item, null);
            }
        }
    }
}

5.4 Decision Logic

Contains.cs, lines 337–338:

csharp
HitCount = listReturn.Count;           // number of dictionary words matched
Result = (HitCount >= settings.Hits);  // above threshold = "real language"

5.5 Limitations

Dictionary matching gives high-confidence results — finding actual words is deterministic evidence, unlike the fuzzy scores from statistical methods. But the tradeoffs are clear: every language needs a pre-built dictionary, and abbreviations, proper nouns, and spelling variants can cause misses. Aho-Corasick runs in O(N), so speed isn't an issue. This method works best for small key spaces (like Caesar's 25 shifts), where you can afford to check every candidate against the dictionary.


6. Bonus: CaesarAnalysisHelper's Heuristic Frequency Analysis

CrypTool-2 also has a component specifically for Caesar ciphers called CaesarAnalysisHelper (source: CrypPlugins/CaesarAnalysisHelper/CaesarAnalysisHelper.cs). It takes a completely different approach — instead of trial-decrypting and scoring, it infers the key directly from the ciphertext's statistical properties.

The corresponding template is Templates/Cryptanalysis/Classic/Caesar_Analysis_Using-character-frequencies.xml.

6.1 Algorithm Steps

Step 1: Single-letter frequency voting (CryptoAnalysis method, lines 83–101)

csharp
Dictionary<int, int> KeyList = new Dictionary<int, int>();
int Counter = 0;
foreach (int i in CountChars(frequencyList.ToLower()))
{
    if (Counter < 5)
    {
        if (!KeyList.ContainsKey(i))
        {
            KeyList.Add(i, 5 - Counter);  // weighted voting: rank 1 gets weight 5, ...
        }
        else
        {
            KeyList[i] += 5 - Counter;
        }
        Counter++;
    }
}

The idea is straightforward: if the most frequent letter in the ciphertext is H, and the most frequent letter in English is E (configured via settings.FrequentChar), then the key is probably H - E = 3.
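
The inference itself boils down to one subtraction. A minimal standalone sketch (my own, not the CT2 code) that guesses the key from the single most frequent ciphertext letter, assuming English plaintext:

csharp
using System;
using System.Linq;

class FrequencyGuess
{
    static void Main()
    {
        // "DEFENSEDEFENSEDEFENSE" shifted by 3; E dominates the plaintext
        string ciphertext = "GHIHQVHGHIHQVHGHIHQVH";

        // Count A–Z occurrences
        int[] counts = new int[26];
        foreach (char c in ciphertext) { counts[c - 'A']++; }

        // Most frequent ciphertext letter, assumed to be the image of 'E'
        int topIndex = Array.IndexOf(counts, counts.Max());
        int key = ((topIndex - ('E' - 'A')) % 26 + 26) % 26;

        Console.WriteLine($"Guessed key: {key}");  // 3 — if the statistics cooperate
    }
}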

The CountChars method (lines 135–188) parses the frequency list and computes the offset between each high-frequency letter and the target language's most frequent letter.

Step 2: Bigram frequency voting (CountBigrams method, lines 190–257)

csharp
// Hard-coded German high-frequency bigrams
string[] Bigrams = new[] { "er", "en", "ch", "de" };

The key observation: Caesar encryption preserves the distance between the two letters within a bigram. For example, in "er", r - e = 13. After encryption it might become "hu" (u - h = 13). By matching on distances, candidates can be identified quickly:

csharp
// Line 232: matching based on the invariant character distance
} while (!(CurrentBigramm[1] - CurrentBigramm[0] == s[1] - s[0]));

Once a match is found, the key is inferred via s[0] - CurrentBigramm[0] (line 236).
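
A standalone sketch of the invariant itself (mine, not the CT2 code): if a frequent ciphertext bigram has the same internal distance as a known high-frequency plaintext bigram, the offset between their first letters is a key candidate:

csharp
using System;

class BigramGuess
{
    static void Main()
    {
        string known = "er";     // high-frequency plaintext bigram
        string observed = "hu";  // frequent bigram observed in the ciphertext

        // Caesar shifts preserve the letter distance within a bigram (mod 26)
        int knownDist = ((known[1] - known[0]) % 26 + 26) % 26;
        int observedDist = ((observed[1] - observed[0]) % 26 + 26) % 26;

        if (knownDist == observedDist)
        {
            // Plausible match: the offset between first letters is the key
            int key = ((observed[0] - known[0]) % 26 + 26) % 26;
            Console.WriteLine($"Candidate key: {key}");  // 3
        }
    }
}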

Step 3: Combined voting (lines 120–132)

csharp
IOrderedEnumerable<int> items = (from k in KeyList.Keys
                                 orderby KeyList[k] descending
                                 select k);
List<int> ResultList = new List<int>();
foreach (int i in items)
{
    ResultList.Add(i);
}
if (ResultList.Count > 0)
{
    key = ResultList[0];  // highest score = most likely key
}

6.2 Assessment

This method skips trial decryption entirely and guesses the key directly. The upside: no need to iterate through all candidates. The downside: if the ciphertext is too short, the frequency statistics won't be stable enough and the guess may be off.

Also worth noting: CountBigrams hard-codes German bigrams ("er", "en", "ch", "de"), so this component is biased toward German. For other languages, you'd need to swap in the corresponding high-frequency bigrams.


7. Overall Architecture

Putting it all together, CrypTool-2's language identification system has roughly four layers:

text
┌─────────────────────────────────────────────────┐
│            Application Layer (Analyzers)        │
│  Caesar Brute Force / Vigenere / Enigma / ...   │
├─────────────────────────────────────────────────┤
│          Evaluation Interface (IControlCost)    │
│  CostFunction (IOC | Entropy | NGramsLog2|RegEx)│
│  RelationOperator: LargerThen / LessThen        │
├─────────────────────────────────────────────────┤
│       Language Statistics Engine                │
│       (LanguageStatisticsLib)                   │
│  Grams family (1–6 gram) │ WordTree(Trie dict)  │
│  LanguageStatistics      │ CalculateIoC         │
├─────────────────────────────────────────────────┤
│                 Data Layer                      │
│  CTLS format N-gram frequency files (.bin.gz)   │
│  CT2DIC format dictionary files (.dic)          │
│  Hard-coded Unigram frequencies (15 languages)  │
└─────────────────────────────────────────────────┘

Interface Design

IControlCost (CrypPluginBase/Control/IControlCost.cs) is the public contract for the cost function system:

csharp
public enum RelationOperator
{
    LessThen, LargerThen
}

public interface IControlCost : IControl
{
    RelationOperator GetRelationOperator();  // higher is better, or lower is better?
    double CalculateCost(byte[] text);       // compute the cost value
    int GetBytesToUse();
    int GetBytesOffset();
}

Analyzers program against the IControlCost interface, not the concrete scoring algorithm. To switch scoring methods, you just change the CostFunctionSettings.CostFunctionType enum value — no analyzer code needs to change.
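
For illustration, here is a hypothetical analyzer loop against this interface (my sketch, not CT2 code) — GetRelationOperator keeps the comparison direction out of the analyzer entirely:

csharp
// Hypothetical candidate selection under any IControlCost implementation.
public static byte[] FindBestCandidate(IControlCost cost, byte[][] candidates)
{
    bool higherIsBetter = cost.GetRelationOperator() == RelationOperator.LargerThen;
    byte[] best = null;
    double bestScore = higherIsBetter ? double.MinValue : double.MaxValue;

    foreach (byte[] candidate in candidates)
    {
        double score = cost.CalculateCost(candidate);
        // The relation operator decides which direction counts as "better"
        if (higherIsBetter ? score > bestScore : score < bestScore)
        {
            bestScore = score;
            best = candidate;
        }
    }
    return best;
}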

Cost Function Type Enum

csharp
// CostFunctionSettings.cs, lines 27–33
public enum CostFunctionType
{
    IOC = 0,        // Index of Coincidence
    Entropy = 1,    // Shannon Entropy
    NGramsLog2 = 2, // N-gram log-probability (the core method)
    RegEx = 3       // Regular expression matching
}

8. Comparison

| Method | Mathematical Basis | Time Complexity | Extra Data Required | Discrimination | Best Use Case |
|---|---|---|---|---|---|
| IoC | Discrete probability | O(N) | None | Low | Quick filtering; key length estimation |
| Entropy | Information theory | O(N) | None | Low | Quick filtering |
| N-gram | Statistical language model | O(N) | Large N-gram frequency table | High | Go-to for all cryptanalysis scenarios |
| Dictionary | Exact string matching | O(N+M+Z) | Language dictionary | Very high | Small key space exhaustive search |
| Frequency inference | Frequency analysis heuristic | O(N) | Target language frequency prior | Medium | Quick Caesar / monoalphabetic cracking |

Which One to Use in Practice?

For Caesar ciphers (key space of just 25), CrypTool-2 provides two ready-made templates:

  1. Dictionary matching (Caesar_ExhaustiveKeySearch.xml): try all 25 shift values, run each result through Aho-Corasick dictionary detection, and stop when enough words are found. Deterministic results.
  2. Direct frequency inference (Caesar_Analysis_Using-character-frequencies.xml): use CaesarAnalysisHelper to compute the key from ciphertext frequencies directly, without iterating through candidates. Fastest.

For more complex ciphers (Vigenere, Enigma, Playfair, etc.), where the key space is too large to enumerate, N-gram scoring becomes the fitness function for search algorithms like hill climbing, genetic algorithms, and simulated annealing — guiding the search toward "more language-like" results.
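
To make that concrete, here is a heavily simplified hill-climbing skeleton (my own sketch, not CT2's analyzer code; the decrypt delegate and the int[] key representation are stand-ins):

csharp
// Hypothetical hill climber: the n-gram score is the fitness function.
public static int[] HillClimb(Grams grams, int[] ciphertext,
                              Func<int[], int[], int[]> decrypt,
                              int[] startKey, int iterations)
{
    Random rng = new Random();
    int[] key = (int[])startKey.Clone();
    double bestScore = grams.CalculateCost(decrypt(key, ciphertext));

    for (int i = 0; i < iterations; i++)
    {
        // Neighbor key: swap two positions (a typical substitution-cipher move)
        int[] neighbor = (int[])key.Clone();
        int a = rng.Next(neighbor.Length), b = rng.Next(neighbor.Length);
        (neighbor[a], neighbor[b]) = (neighbor[b], neighbor[a]);

        double score = grams.CalculateCost(decrypt(neighbor, ciphertext));
        if (score > bestScore)  // higher average log-probability = more language-like
        {
            key = neighbor;
            bestScore = score;
        }
    }
    return key;
}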


9. Comparison with LLMs

| Dimension | Traditional Statistical Methods (CrypTool-2) | LLM Approach |
|---|---|---|
| Basis for judgment | Letter combination frequency patterns | Deep semantic understanding |
| Data requirements | MB-scale frequency tables + dictionaries | GB-scale model parameters |
| Computation speed | Microseconds (per evaluation) | Milliseconds to seconds |
| Semantic understanding | None (pure statistics) | Yes (understands meaning) |
| Explainability | Fully transparent | Black box |
| Multi-language extension | Needs per-language frequency data | Covered by training data |
| Offline capability | Fully offline | Usually requires inference service |

Traditional methods don't need to "understand" what the language says — they just need to catch its statistical fingerprint. This line of thinking goes all the way back to Friedman (1922) and Shannon (1948) and is still the foundation of cryptanalysis today.

Looked at differently, the N-gram method is really just a Markov chain language model — it assumes the probability of the current letter depends only on the previous N-1 letters. Ravi and Knight (2008) showed that even low-order character N-gram models can effectively solve substitution ciphers. This is mathematically the same lineage as modern NLP language models, just several orders of magnitude simpler. CrypTool-2's 5-gram scoring is, at the end of the day, a 4th-order Markov chain LM — crude, but it gets the job done for cryptanalysis.
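
In symbols (the standard n-gram factorization; my notation, not from the CT2 source), a 5-gram model approximates

$$\log P(x_1 \dots x_n) \approx \sum_{i=5}^{n} \log P(x_i \mid x_{i-4}\, x_{i-3}\, x_{i-2}\, x_{i-1})$$

CT2's CalculateCost averages the joint window terms $\log P(x_i \dots x_{i+4})$ instead — a different normalization that ranks candidates in much the same way.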


10. Conclusion

Looking back, what CrypTool-2 does here answers a pretty basic question: natural language isn't random. Its letter distributions, letter combinations, and vocabulary all follow patterns. Capture those patterns, and you can tell real language from gibberish.

Each method has its own granularity: IoC and Entropy just check whether single-letter frequencies are "skewed" enough; N-gram scoring checks whether letter sequences match the habits of a language; dictionary matching goes straight for the kill — finding actual words is hard evidence.

None of these methods are obsolete. They run in microseconds, produce deterministic results, need no GPU, and can be called repeatedly inside a search loop. When you need to evaluate billions of candidate keys, N-gram scoring is still the standard approach.


Appendix: Key Source File Index

| File Path | Function |
|---|---|
| CrypPlugins/CostFunction/CostFunction.cs | Unified entry point for all four cost functions (IoC/Entropy/NGram/RegEx) |
| CrypPlugins/CostFunction/CostFunctionSettings.cs | Cost function configuration (type/language/N-gram size) |
| CrypPluginBase/Control/IControlCost.cs | Cost function interface (RelationOperator/CalculateCost) |
| LibSource/LanguageStatisticsLib/Grams/Grams.cs | N-gram scoring abstract base class (template method pattern) |
| LibSource/LanguageStatisticsLib/Grams/Pentagrams.cs | 5-gram scoring implementation (CT2 default) |
| LibSource/LanguageStatisticsLib/Grams/Tetragrams.cs | 4-gram scoring implementation |
| LibSource/LanguageStatisticsLib/LanguageStatistics.cs | Language statistics facade (15 language alphabets + Unigram frequencies) |
| LibSource/LanguageStatisticsLib/LanguageStatisticsFile.cs | CTLS file format parser (Magic: CTLS + GZip + BlockCopy) |
| LibSource/LanguageStatisticsLib/WordTree.cs | Trie dictionary data structure (Magic: CT2DIC) |
| CrypPlugins/CaesarAnalysisHelper/CaesarAnalysisHelper.cs | Heuristic frequency inference for key recovery |
| CrypPlugins/Contains/Contains.cs | Aho-Corasick / Hashtable dictionary matching |
| CrypPlugins/Contains/Aho-Corasick/StringSearch.cs | Aho-Corasick automaton implementation |
| Util/LanguageStatisticsGenerator/Program.cs | N-gram frequency data generator (Gutenberg/Wikipedia) |
| Templates/Cryptanalysis/Classic/Caesar_ExhaustiveKeySearch.xml | Caesar exhaustive search template (dictionary matching) |
| Templates/Cryptanalysis/Classic/Caesar_Analysis_Using-character-frequencies.xml | Caesar frequency analysis template |

References

Aho, A. V., & Corasick, M. J. (1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6), 333–340. https://doi.org/10.1145/360825.360855

Friedman, W. F. (1922). The index of coincidence and its applications in cryptography (Riverbank Publication No. 22). Riverbank Laboratories.

Kopal, N. (2018). Solving classical ciphers with CrypTool 2. In Proceedings of the 1st International Conference on Historical Cryptology (HistoCrypt 2018) (Vol. 149, pp. 29–38). Linköping University Electronic Press.

Kopal, N., & Esslinger, B. (2022). New ciphers and cryptanalysis components in CrypTool 2. In Proceedings of the International Conference on Historical Cryptology (HistoCrypt 2022) (pp. 127–136). Linköping University Electronic Press.

Ravi, S., & Knight, K. (2008). Attacking decipherment problems optimally with low-order n-gram models. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (pp. 812–819). Association for Computational Linguistics.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
