Before LLMs, How Did Programs Tell Real Language from Gibberish?

Natural Language Identification in CrypTool-2's Caesar Cipher Brute Force


Abstract

Before large language models (LLMs) existed, cryptanalysis tools faced a very practical problem: brute-forcing a cipher produces a pile of candidate plaintexts — how does the program figure out which one is actual language and which is garbage? This article digs into the source code of CrypTool-2, an open-source cryptography education platform, to dissect the four natural language identification methods it uses in Caesar cipher brute force: Index of Coincidence (IoC), Shannon Entropy, N-gram log-probability scoring, and dictionary matching. For each method, we walk through the math, the data structures, and the actual code.



1. The Problem: The "Last Mile" of Brute Force

The Caesar cipher has a key space of just 25 (shift values 1–25), so brute-forcing the computation itself is trivial. The real challenge is:

Given 25 decryption results, how does the program automatically pick the one that "looks like real language"?

For example, brute-forcing the ciphertext KHOOR ZRUOG:

| Shift | Decrypted Result | Real Language? |
|-------|------------------|----------------|
| 1 | JGNNQ YQTNF | No |
| 2 | IFMMP XPSME | No |
| 3 | HELLO WORLD | Yes |
| ... | ... | No |

A human spots the answer immediately. But to a program, JGNNQ and HELLO are both just five ASCII characters — it needs a quantitative criterion to distinguish natural language from random noise.

CrypTool-2 offers four different approaches, each with its own angle.


2. Method 1: Index of Coincidence (IoC)

2.1 The Math

The Index of Coincidence was introduced by William Friedman (1922), based on a simple observation:

Letter distributions in natural language are highly uneven; in random text, they tend toward uniform.

In English, E accounts for about 12.7% of all letters, while Z is just 0.074%. If you randomly pick two letters from a text, the probability they match is much higher in natural language than in random text. IoC measures exactly this probability:

$$\text{IoC} = \frac{\sum_{i=1}^{c} n_i(n_i - 1)}{N(N-1)}$$

Where $n_i$ is the count of the $i$-th letter, $N$ is the total text length, and $c$ is the alphabet size.

  • English text: IoC ≈ 0.0661
  • German text: IoC ≈ 0.0762
  • Uniform random distribution: IoC ≈ 1/26 ≈ 0.0385
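
A quick worked check (my example, not from the article's source): for the 4-letter text "AABC", $n_A = 2$, $n_B = n_C = 1$, $N = 4$:

$$\text{IoC} = \frac{2 \cdot 1 + 1 \cdot 0 + 1 \cdot 0}{4 \cdot 3} = \frac{2}{12} \approx 0.167$$

With texts this short the statistic is noisy; the reference values above only emerge over hundreds of letters.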

2.2 CrypTool-2 Implementation

Source file: CrypPlugins/CostFunction/CostFunction.cs, lines 384–416.

csharp
public double calculateFastIndexOfCoincidence(byte[] text, int bytesToUse)
{
    if (bytesToUse > text.Length)
    {
        bytesToUse = text.Length;
    }

    double[] n = new double[256];  // 256-length counter array covering all ASCII
    //count all ASCII symbols
    int counter = 0;
    foreach (byte b in text)
    {
        n[b]++;
        counter++;
        if (counter == bytesToUse)
        {
            break;
        }
    }

    double coindex = 0;
    //sum them
    for (int i = 0; i < n.Length; i++)
    {
        coindex = coindex + n[i] * (n[i] - 1);  // Σ n_i * (n_i - 1)
    }

    coindex = coindex / (bytesToUse);             // divide by N
    coindex = coindex / (bytesToUse - 1);         // divide by (N-1)

    return coindex;
}

Relation operator: LargerThen (higher values = more likely natural language).

See CostFunctionControl.GetRelationOperator() (same file, lines 567–580):

csharp
case CostFunctionSettings.CostFunctionType.IOC:
    return RelationOperator.LargerThen;


2.3 Limitations

IoC runs in O(N) and needs no pre-trained data — it's very fast. But it can only distinguish "uniform distribution" from "non-uniform distribution," which is pretty coarse. It might not even tell two different natural languages apart. So IoC works best as a rough filter: it quickly eliminates obvious gibberish, but struggles when two candidates both look "somewhat language-like." In cryptanalysis, its more common use is the Friedman test — estimating the key length of polyalphabetic substitution ciphers.


3. Method 2: Shannon Entropy

3.1 The Math

Shannon (1948) defined information entropy to measure the "information density" or "uncertainty" of a text:

$$H = -\sum_{i=1}^{c} p_i \log_2 p_i$$

Where $p_i = n_i / N$ is the probability of the $i$-th symbol.

  • Natural language: uneven letter distribution → low entropy (German ≈ 4.06 bits)
  • Random text: 26 letters with equal probability → high entropy ($\log_2 26 \approx 4.70$ bits)
  • Single character repeated: e.g., "AAAA..." → entropy = 0
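
A quick worked check (my example, not from the article's source): for "AAAB", $p_A = 3/4$ and $p_B = 1/4$, so

$$H = -\left(\tfrac{3}{4}\log_2\tfrac{3}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) \approx 0.811 \text{ bits}$$

which is lower than the 1 bit of the evenly spread "AABB" — skew lowers entropy.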

3.2 CrypTool-2 Implementation

Source file: CrypPlugins/CostFunction/CostFunction.cs, lines 422–497.

CrypTool-2 uses a precomputation optimization here — it builds an xlogx lookup table upfront to avoid calling Math.Log repeatedly at runtime:

csharp
// Precomputation phase: build xlogx table (lines 422–431)
private void prepareEntropy(int size)
{
    xlogx = new float[size + 1];
    //precomputations for fast entropy calculation
    xlogx[0] = 0.0f;
    for (int i = 1; i <= size; i++)
    {
        // xlogx[i] = -i * log2(i/size)
        xlogx[i] = (float)(-1.0f * i * Math.Log(i / (double)size) / Math.Log(2.0));
    }
}
csharp
// Calculation phase (lines 450–497): switches implementation based on EntropySelection
public double calculateEntropy(byte[] text, int bytesToUse)
{
    switch (settings.EntropySelection)
    {
        case 0:  // C++/CLI native implementation for high performance
            return NativeCryptography.Crypto.calculateEntropy(text, bytesToUse);
        case 1:  // C# managed implementation (with xlogx lookup optimization)
            // ... Mutex-protected precomputation check ...
            int[] n = new int[256];
            for (int counter = 0; counter < bytesToUse; counter++)
            {
                n[text[counter]]++;              // count character frequencies
            }

            float entropy = 0;
            for (int i = 0; i < 256; i++)
            {
                entropy += xlogx[n[i]];          // table lookup, no runtime log calls
            }
            return entropy / (double)bytesToUse;
        default:
            return NativeCryptography.Crypto.calculateEntropy(text, bytesToUse);
    }
}

Users can switch between the C++/CLI native implementation and the C# managed implementation via the EntropySelection setting.

Relation operator: LessThen (lower values = more likely natural language).
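
Why does summing table entries give the entropy? Because $\text{xlogx}[n_i]/N = -(n_i/N)\log_2(n_i/N) = -p_i \log_2 p_i$ when the table was built with size equal to the number of bytes scored, which appears to be the intended use. A standalone check (my own sketch, not CT2 code):

csharp
using System;

class XlogxDemo
{
    static void Main()
    {
        byte[] text = System.Text.Encoding.ASCII.GetBytes("HELLO WORLD");
        int size = text.Length;

        // Build the lookup table: xlogx[i] = -i * log2(i / size)
        double[] xlogx = new double[size + 1];
        for (int i = 1; i <= size; i++)
        {
            xlogx[i] = -1.0 * i * Math.Log(i / (double)size, 2);
        }

        // Count byte frequencies
        int[] n = new int[256];
        foreach (byte b in text) { n[b]++; }

        // Table-based entropy: sum lookups, divide by N
        double viaTable = 0;
        for (int i = 0; i < 256; i++) { viaTable += xlogx[n[i]]; }
        viaTable /= size;

        // Direct formula: H = -Σ p_i log2 p_i
        double direct = 0;
        for (int i = 0; i < 256; i++)
        {
            if (n[i] == 0) { continue; }
            double p = n[i] / (double)size;
            direct -= p * Math.Log(p, 2);
        }

        Console.WriteLine($"table: {viaTable:F6}, direct: {direct:F6}");  // identical values
    }
}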

3.3 IoC vs. Entropy: Are They Redundant?

IoC and Entropy essentially measure the same thing — how uneven the letter distribution is. Higher IoC means lower Entropy; they're roughly monotonically related. But they differ in edge-case behavior and numerical sensitivity, so CrypTool-2 keeps both as separate options.


4. Method 3: N-gram Log-Probability Scoring (The Core Method)

4.1 From Letter Frequencies to Letter Combination Frequencies

IoC and Entropy only look at how often each individual letter appears — they don't care about ordering. So ETAOIN SHRDLU (the most common English letters arranged together) and HELLO WORLD (a valid English sentence) might score about the same, because their single-letter frequency distributions are too similar.

To tell "plausible-looking gibberish" from actual language, you need to look at letter combination patterns. In English, TH appears very frequently while QZ almost never does; TION is a common 4-letter sequence, XKZQ is not. N-gram statistics are designed to capture exactly these letter sequence patterns.

4.2 Data Generation: From Corpus to Probability Table

CrypTool-2 uses LanguageStatisticsGenerator (source: Util/LanguageStatisticsGenerator/Program.cs) to generate N-gram frequency data from large text corpora. It supports two data sources:

  1. Project Gutenberg: ZIP files of public domain books
  2. Wikipedia XML Dump: XML exports of Wikipedia

The source selection is straightforward (lines 71–78):

csharp
if (path.ToUpper().EndsWith("XML"))
{
    ReadInWikipediaXML(path);
}
else
{
    ReadInGutenbergZips(path);
}

The generation pipeline (using 5-grams as an example):

text
Wikipedia/Gutenberg corpus
    ↓ sliding window scan
    ↓ 8 worker threads (WORKERS = 8) + ConcurrentQueue + Lock merge
5D frequency array uint[26,26,26,26,26]
    ↓ probability conversion
    ↓ Math.Log(count == 0 ? 1/sum : count/sum)
log-probability table float[26,26,26,26,26]
    ↓ serialization
CTLS format GZip compressed file (.gz)

The key probability conversion code (Program.cs, lines 542–545):

csharp
private static IEnumerable<float> CalculateLogs(Array freq, ulong sum)
{
    return freq.Cast<uint>().Select(value =>
        (float)Math.Log(value == 0 ? 1.0 / sum : value / (double)sum));
}

Why logarithms instead of raw probabilities? Two reasons. First, raw probabilities are tiny — a 5-gram probability might be on the order of 10^-8, and multiplying many of them together underflows to zero in double precision. Taking the log turns multiplication into addition, which is numerically stable. Second, N-grams that never appeared in the corpus (count = 0) would require log(0), which is undefined, so the code substitutes 1/sum — an extremely small probability that becomes a large negative number after the log, effectively penalizing unseen combinations.
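
A quick standalone illustration of the underflow problem (my own sketch, not CT2 code):

csharp
using System;

class LogProbDemo
{
    static void Main()
    {
        // 200 five-gram probabilities of ~1e-8 each, as a long candidate text would produce
        double rawProduct = 1.0;
        double logSum = 0.0;
        for (int i = 0; i < 200; i++)
        {
            rawProduct *= 1e-8;        // underflows to exactly 0.0 after ~40 factors
            logSum += Math.Log(1e-8);  // stays finite: 200 * ln(1e-8) ≈ -3684
        }
        Console.WriteLine(rawProduct); // 0
        Console.WriteLine(logSum);     // -3684.13...
    }
}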

4.3 File Format: The CTLS Binary Protocol

Source file: LibSource/LanguageStatisticsLib/LanguageStatisticsFile.cs

Header structure:

text
[Magic: 4 bytes "CTLS"] [language code: length-prefixed string, BinaryWriter.Write(string)] [gram length: int32]
[alphabet: length-prefixed string, BinaryWriter.Write(string)] [frequency data: float[] contiguous block]

Magic number definition (line 31):

csharp
public const uint FileFormatMagicNumber = 'C' + ('T' << 8) + ('L' << 16) + ('S' << 24);

The entire file is GZip-compressed. At load time, Buffer.BlockCopy copies the frequency data into a multidimensional array in one shot (lines 78–79), making subsequent lookups simple array indexing:

csharp
byte[] frequencyData = br.ReadBytes(sizeof(float) * frequencyEntries);
Buffer.BlockCopy(frequencyData, 0, frequencyArray, 0, frequencyData.Length);

For English 4-grams, the array has 26^4 = 456,976 floats, roughly 1.7 MB uncompressed. For 5-grams, that's 26^5 ≈ 11.9 million floats, about 45 MB.
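
For intuition about what that Buffer.BlockCopy sets up: C# rectangular arrays are stored row-major, so entry [a,b,c,d] sits at the flat offset ((a·26 + b)·26 + c)·26 + d. A small standalone sketch of the equivalence (my own, with a made-up value — not CT2 code):

csharp
using System;

class IndexDemo
{
    static void Main()
    {
        const int A = 26;
        float[,,,] freq = new float[A, A, A, A];   // ~1.7 MB of floats, as for 4-grams
        freq[7, 4, 11, 11] = -8.2f;                // pretend log P("HELL"); H=7, E=4, L=11

        // Rectangular arrays are row-major: flat offset = ((a*26 + b)*26 + c)*26 + d
        float[] flat = new float[A * A * A * A];
        Buffer.BlockCopy(freq, 0, flat, 0, flat.Length * sizeof(float));

        int offset = ((7 * A + 4) * A + 11) * A + 11;
        Console.WriteLine(flat[offset]);           // -8.2
    }
}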

4.4 Scoring Algorithm: Sliding Window Summation

Using Tetragrams (4-grams) as an example, source: LibSource/LanguageStatisticsLib/Grams/Tetragrams.cs, lines 44–84:

csharp
public override double CalculateCost(int[] text)
{
    int end = text.Length - 3;
    if (end <= 0)
    {
        return 0;
    }

    double value = 0;
    int alphabetLength = Alphabet.Length;

    for (int i = 0; i < end; i++)
    {
        int a = text[i];
        int b = text[i + 1];
        int c = text[i + 2];
        int d = text[i + 3];

        // addLetterIndicies: optional alphabet reduction mapping
        if (addLetterIndicies != null)
        {
            a += addLetterIndicies[a];
            b += addLetterIndicies[b];
            c += addLetterIndicies[c];
            d += addLetterIndicies[d];
        }

        // bounds check: skip characters outside the alphabet
        if (a >= alphabetLength ||
            b >= alphabetLength ||
            c >= alphabetLength ||
            d >= alphabetLength ||
            a < 0 || b < 0 || c < 0 || d < 0)
        {
            continue;
        }
        value += Frequencies[a, b, c, d];  // O(1) multidimensional array lookup
    }
    return value / end;  // average log-probability per N-gram
}

It's just a sliding window: scan through the text, grab 4 consecutive letters each time, look up their log-probability in the frequency table, sum everything up, divide by the number of windows.

For "HELLO":

text
H-E-L-L → Frequencies['H','E','L','L']  = log(P("HELL"))  ≈ -8.2  (common combination)
E-L-L-O → Frequencies['E','L','L','O']  = log(P("ELLO"))  ≈ -9.1  (fairly common)
Average ≈ -8.65

For random gibberish "XKZQ":

text
X-K-Z-Q → Frequencies['X','K','Z','Q']  = log(1/sum)      ≈ -15.3 (extremely rare)
Average ≈ -15.3

Higher values (closer to 0) mean the text looks more like natural language. The gap is pretty obvious.

4.5 Call Chain in CrypTool-2

When the CostFunction plugin is set to NGramsLog2 mode, the call chain looks like:

text
CostFunction.CalculateCost(byte[] text)
  → text → UTF8 decode → ToUpper() → string
  → LanguageStatistics.MapTextIntoNumberSpace(text, alphabet)
    → map each character to its index in the alphabet → int[]
  → Grams.CalculateCost(int[] numbers)
    → sliding window, frequency table lookup, sum, average
  → return average log-probability

Source: CostFunction.cs, lines 222–224:

csharp
case CostFunctionSettings.CostFunctionType.NGramsLog2:
    return Grams.CalculateCost(LanguageStatistics.MapTextIntoNumberSpace(
        Encoding.UTF8.GetString(text).ToUpper(),
        LanguageStatistics.Alphabets[LanguageStatistics.LanguageCode(settings.Language)]));

MapTextIntoNumberSpace (LanguageStatistics.cs, lines 312–322) maps each character to its alphabet index; characters not in the alphabet come back as -1 from IndexOf, which the bounds check in CalculateCost then skips:

csharp
public static int[] MapTextIntoNumberSpace(string text, string alphabet)
{
    int[] numbers = new int[text.Length];
    int position = 0;
    foreach (char c in text)
    {
        numbers[position] = alphabet.IndexOf(c);
        position++;
    }
    return numbers;
}

4.6 Choosing the N-gram Size

CrypTool-2 supports 1-gram through 5-gram via CostFunctionSettings.NGramSize, with the LanguageStatisticsLib library also supporting 6-grams. The default is 5-gram (Pentagrams).

| N-gram Size | Array Dimensions | Storage (26-letter) | Granularity | Typical Use |
|---|---|---|---|---|
| 1-gram | 26 | 104 B | Single-letter frequency | Equivalent to IoC/Entropy |
| 2-gram | 26^2 = 676 | 2.6 KB | Letter pairs | Quick rough filtering |
| 3-gram | 26^3 = 17,576 | 68 KB | Short patterns | Common in Enigma analysis |
| 4-gram | 26^4 ≈ 460K | 1.7 MB | Medium patterns | General analysis |
| 5-gram | 26^5 ≈ 11.9M | 45 MB | Long patterns | CT2 default, highest precision |
| 6-gram | 26^6 ≈ 309M | 1.2 GB | Very long patterns | Highest precision, if memory allows |

Larger N means better accuracy, but storage grows exponentially. 5-gram strikes a practical balance between precision and memory.

The factory method in LanguageStatistics.cs (lines 180–198) reflects this design:

csharp
public static Grams CreateGrams(int languageId, string languageStatisticsDirectory,
                                 GramsType gramsType, bool useSpaces)
{
    switch (gramsType)
    {
        case GramsType.Unigrams:     return new Unigrams(...);
        case GramsType.Bigrams:      return new Bigrams(...);
        case GramsType.Trigrams:     return new Trigrams(...);
        case GramsType.Tetragrams:   return new Tetragrams(...);
        case GramsType.Pentragrams:  // CT2's default ngram size
        default:                     return new Pentagrams(...);
        case GramsType.Hexagrams:    return new Hexagrams(...);
    }
}

4.7 Multi-Language Support

LanguageStatistics.cs defines alphabets for 15 languages (lines 142–159):

csharp
public static Dictionary<string, string> Alphabets = new Dictionary<string, string>()
{
    {"en", "ABCDEFGHIJKLMNOPQRSTUVWXYZ" },                      // English: 26 letters
    {"de", "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜß" },                  // German: 30 letters (with umlauts)
    {"fr", "ABCDEFGHIJKLMNOPQRSTUVWXYZ" },                      // French
    {"es", "ABCDEFGHIJKLMNOPQRSTUVWXYZÑ" },                     // Spanish: 27 letters
    {"it", "ABCDEFGHIJKLMNOPQRSTUVWXYZ" },                      // Italian
    {"hu", "ABCDEFGHIJKLMNOPQRSTUVWXYZ" },                      // Hungarian
    {"ru", "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ" },               // Russian: 33 Cyrillic letters
    {"cs", "ABCDEFGHIJKLMNOPQRSTUVWXYZ" },                      // Czech
    {"la", "ABCDEFGHIJKLMNOPQRSTUVWXYZ" },                      // Latin
    {"el", "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ" },                        // Greek: 24 letters
    {"nl", "ABCDEFGHIJKLMNOPQRSTUVWXYZ"},                       // Dutch
    {"sv", "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ"},                    // Swedish: 29 letters
    {"pt", "ABCDEFGHIJKLMNOPQRSTUVWXYZ"},                       // Portuguese
    {"pl", "AĄBCĆDEĘFGHIJKLŁMNŃOÓPQRSŚTUVWXYZŹŻ"},              // Polish: 35 letters
    {"tr", "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ" }                    // Turkish: 29 letters
};

Each language has its own N-gram frequency files (e.g., en-5gram-nocs.bin, de-4gram-nocs-sp.bin), so scoring can target a specific language.

The code also hard-codes Unigram frequencies for each language (lines 118–140), with comments citing Wikipedia and practicalcryptography.com as data sources.


5. Method 4: Dictionary Matching

5.1 How It Works

The first three methods all produce statistical scores — they compute a number and pick the best-scoring candidate. Dictionary matching takes a more straightforward approach: how many real words can you find in the decrypted output? If you find enough, it's probably the right plaintext.

CrypTool-2's Caesar brute-force template (Templates/Cryptanalysis/Classic/Caesar_ExhaustiveKeySearch.xml) uses this workflow:

text
Ciphertext → [Caesar component: iterate shift values 1–25]
              → each candidate plaintext → [Contains component]
                                                  ↑
                                [Dictionary component: load language dictionary]

Contains component: matched word count ≥ threshold → Result = true

5.2 Dictionary Data Structure: WordTree (Trie)

Source file: LibSource/LanguageStatisticsLib/WordTree.cs

The dictionary is stored as a Trie (prefix tree), serialized into a custom binary format (magic number CT2DIC), and GZip-compressed into .dic files. Each language has its own dictionary file (e.g., Dictionary_en.dic, Dictionary_de.dic).

The deserialization logic (lines 39–101) rebuilds the tree using a stack:

csharp
public static WordTree Deserialize(BinaryReader reader)
{
    WordTree tree = new WordTree();

    // 1. Validate magic number
    string magicNo = new string(reader.ReadChars(6));
    if (magicNo != "CT2DIC")
    {
        throw new Exception("File does not start with the expected magic number for word tree.");
    }

    // 2. Read language code and alphabet (0-terminated strings)
    // ...

    // 3. Rebuild the trie using a stack
    Stack<Node> stack = new Stack<Node>();
    stack.Push(tree);
    int symbol;
    while ((symbol = reader.Read()) != -1)
    {
        char readChar = (char)symbol;
        if (readChar == Node.WordEndSymbol)      // word-end marker
        {
            stack.Peek().WordEndsHere = true;
            tree.StoredWords++;
            continue;
        }
        if (readChar == Node.TerminationSymbol)  // node termination marker
        {
            stack.Pop();
            continue;
        }
        Node newNode = new Node() { Value = readChar };
        stack.Peek().ChildNodes.Add(newNode);
        stack.Push(newNode);
    }
    return tree;
}
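
The WordTree only stores structure; membership testing is a walk from the root. A minimal sketch of how a lookup over this node shape might look (my own hypothetical helper — it assumes ChildNodes is a List<Node>, which is what the Add call above suggests):

csharp
// Hypothetical lookup against the deserialized trie: follow one child per character.
public static bool ContainsWord(Node root, string word)
{
    Node current = root;
    foreach (char c in word)
    {
        // Find the child carrying this character; null means the prefix is unknown
        Node next = current.ChildNodes.Find(n => n.Value == c);
        if (next == null)
        {
            return false;
        }
        current = next;
    }
    return current.WordEndsHere;  // only count it if a dictionary word ends exactly here
}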

5.3 Matching Strategies

The Contains component (source: CrypPlugins/Contains/Contains.cs) supports two matching strategies:

Strategy A: Aho-Corasick (default)

Aho-Corasick is a classic multi-pattern string matching algorithm (Aho & Corasick, 1975) that can search for all dictionary words simultaneously in O(N + M + Z) time (N = text length, M = total pattern length, Z = number of matches). It's a perfect fit for "search for tens of thousands of dictionary words in a text at once."

Implementation: CrypPlugins/Contains/Aho-Corasick/StringSearch.cs, core interface:

csharp
public class StringSearch : IStringSearchAlgorithm
{
    public bool ContainsAny(string text);             // check if text contains any keyword
    public StringSearchResult[] FindAll(string text); // find all matches
    public StringSearchResult FindFirst(string text); // find first match
}
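
A short usage sketch against this interface (my own; dictionaryWords, candidatePlaintexts, and hitThreshold are stand-in variables):

csharp
// Build the automaton once from the dictionary, reuse it for every candidate
StringSearch search = new StringSearch(dictionaryWords);        // string[] of dictionary words

foreach (string candidate in candidatePlaintexts)
{
    StringSearchResult[] hits = search.FindAll(candidate);      // all dictionary words found
    if (hits.Length >= hitThreshold)
    {
        Console.WriteLine($"Probable plaintext: {candidate}");  // enough real words → accept
    }
}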

The TreeNode inner class implements the Aho-Corasick automaton state nodes:

  • _failure: failure link pointer (built via BFS)
  • _results: list of keywords matched at this node
  • _transHash: transition function (character → child node mapping)

Strategy B: Hashtable word-by-word lookup

Dictionary words are loaded into a hash table. The input text is tokenized by delimiter and each token is looked up individually. Better suited when the input is already tokenized.

Search structure initialization (Contains.cs, lines 161–215):

csharp
private void SetSearchStructure()
{
    if (settings.Search == ContainsSettings.SearchType.AhoCorasick)
    {
        stringSearch = new StringSearch(
            dictionaryInputString.Split(settings.DelimiterDictionary[0]));
    }
    else if (settings.Search == ContainsSettings.SearchType.Hashtable)
    {
        hashTable = new Hashtable();
        string[] theWords = dictionaryInputString.Split(settings.DelimiterDictionary[0]);
        foreach (string item in theWords)
        {
            if (!hashTable.ContainsKey(item))
            {
                hashTable.Add(item, null);
            }
        }
    }
}

5.4 Decision Logic

Contains.cs, lines 337–338:

csharp
HitCount = listReturn.Count;           // number of dictionary words matched
Result = (HitCount >= settings.Hits);  // above threshold = "real language"

5.5 Limitations

Dictionary matching gives high-confidence results — finding actual words is deterministic evidence, unlike the fuzzy scores from statistical methods. But the tradeoffs are clear: every language needs a pre-built dictionary, and abbreviations, proper nouns, and spelling variants can cause misses. Aho-Corasick runs in O(N), so speed isn't an issue. This method works best for small key spaces (like Caesar's 25 shifts), where you can afford to check every candidate against the dictionary.


6. Bonus: CaesarAnalysisHelper's Heuristic Frequency Analysis

CrypTool-2 also has a component specifically for Caesar ciphers called CaesarAnalysisHelper (source: CrypPlugins/CaesarAnalysisHelper/CaesarAnalysisHelper.cs). It takes a completely different approach — instead of trial-decrypting and scoring, it infers the key directly from the ciphertext's statistical properties.

The corresponding template is Templates/Cryptanalysis/Classic/Caesar_Analysis_Using-character-frequencies.xml.

6.1 Algorithm Steps

Step 1: Single-letter frequency voting (CryptoAnalysis method, lines 83–101)

csharp
Dictionary<int, int> KeyList = new Dictionary<int, int>();
int Counter = 0;
foreach (int i in CountChars(frequencyList.ToLower()))
{
    if (Counter < 5)
    {
        if (!KeyList.ContainsKey(i))
        {
            KeyList.Add(i, 5 - Counter);  // weighted voting: rank 1 gets weight 5, ...
        }
        else
        {
            KeyList[i] += 5 - Counter;
        }
        Counter++;
    }
}

The idea is straightforward: if the most frequent letter in the ciphertext is H, and the most frequent letter in English is E (configured via settings.FrequentChar), then the key is probably H - E = 3.
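
The inference itself boils down to one subtraction. A minimal standalone sketch (my own, not the CT2 code) that guesses the key from the single most frequent ciphertext letter, assuming English plaintext:

csharp
using System;
using System.Linq;

class FrequencyGuess
{
    static void Main()
    {
        // "DEFENSEDEFENSEDEFENSE" shifted by 3; E dominates the plaintext
        string ciphertext = "GHIHQVHGHIHQVHGHIHQVH";

        // Count A–Z occurrences
        int[] counts = new int[26];
        foreach (char c in ciphertext) { counts[c - 'A']++; }

        // Most frequent ciphertext letter, assumed to be the image of 'E'
        int topIndex = Array.IndexOf(counts, counts.Max());
        int key = ((topIndex - ('E' - 'A')) % 26 + 26) % 26;

        Console.WriteLine($"Guessed key: {key}");  // 3 — if the statistics cooperate
    }
}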

The CountChars method (lines 135–188) parses the frequency list and computes the offset between each high-frequency letter and the target language's most frequent letter.

Step 2: Bigram frequency voting (CountBigrams method, lines 190–257)

csharp
// Hard-coded German high-frequency bigrams
string[] Bigrams = new[] { "er", "en", "ch", "de" };

The key observation: Caesar encryption preserves the distance between the two letters within a bigram. For example, in "er", r - e = 13. After encryption it might become "hu" (u - h = 13). By matching on distances, candidates can be identified quickly:

csharp
// Line 232: matching based on the invariant character distance
} while (!(CurrentBigramm[1] - CurrentBigramm[0] == s[1] - s[0]));

Once a match is found, the key is inferred via s[0] - CurrentBigramm[0] (line 236).
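
A standalone sketch of the invariant itself (mine, not the CT2 code): if a frequent ciphertext bigram has the same internal distance as a known high-frequency plaintext bigram, the offset between their first letters is a key candidate:

csharp
using System;

class BigramGuess
{
    static void Main()
    {
        string known = "er";     // high-frequency plaintext bigram
        string observed = "hu";  // frequent bigram observed in the ciphertext

        // Caesar shifts preserve the letter distance within a bigram (mod 26)
        int knownDist = ((known[1] - known[0]) % 26 + 26) % 26;
        int observedDist = ((observed[1] - observed[0]) % 26 + 26) % 26;

        if (knownDist == observedDist)
        {
            // Plausible match: the offset between first letters is the key
            int key = ((observed[0] - known[0]) % 26 + 26) % 26;
            Console.WriteLine($"Candidate key: {key}");  // 3
        }
    }
}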

Step 3: Combined voting (lines 120–132)

csharp
IOrderedEnumerable<int> items = (from k in KeyList.Keys
                                 orderby KeyList[k] descending
                                 select k);
List<int> ResultList = new List<int>();
foreach (int i in items)
{
    ResultList.Add(i);
}
if (ResultList.Count > 0)
{
    key = ResultList[0];  // highest score = most likely key
}

6.2 Assessment

This method skips trial decryption entirely and guesses the key directly. The upside: no need to iterate through all candidates. The downside: if the ciphertext is too short, the frequency statistics won't be stable enough and the guess may be off.

Also worth noting: CountBigrams hard-codes German bigrams ("er", "en", "ch", "de"), so this component is biased toward German. For other languages, you'd need to swap in the corresponding high-frequency bigrams.


7. Overall Architecture

Putting it all together, CrypTool-2's language identification system has roughly four layers:

text
┌─────────────────────────────────────────────────┐
│            Application Layer (Analyzers)        │
│  Caesar Brute Force / Vigenere / Enigma / ...   │
├─────────────────────────────────────────────────┤
│          Evaluation Interface (IControlCost)    │
│  CostFunction (IOC | Entropy | NGramsLog2|RegEx)│
│  RelationOperator: LargerThen / LessThen        │
├─────────────────────────────────────────────────┤
│       Language Statistics Engine                │
│       (LanguageStatisticsLib)                   │
│  Grams family (1–6 gram) │ WordTree(Trie dict)  │
│  LanguageStatistics      │ CalculateIoC         │
├─────────────────────────────────────────────────┤
│                 Data Layer                      │
│  CTLS format N-gram frequency files (.bin.gz)   │
│  CT2DIC format dictionary files (.dic)          │
│  Hard-coded Unigram frequencies (15 languages)  │
└─────────────────────────────────────────────────┘

Interface Design

IControlCost (CrypPluginBase/Control/IControlCost.cs) is the public contract for the cost function system:

csharp
public enum RelationOperator
{
    LessThen, LargerThen
}

public interface IControlCost : IControl
{
    RelationOperator GetRelationOperator();  // higher is better, or lower is better?
    double CalculateCost(byte[] text);       // compute the cost value
    int GetBytesToUse();
    int GetBytesOffset();
}

Analyzers program against the IControlCost interface, not the concrete scoring algorithm. To switch scoring methods, you just change the CostFunctionSettings.CostFunctionType enum value — no analyzer code needs to change.
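
For illustration, here is a hypothetical analyzer loop against this interface (my sketch, not CT2 code) — GetRelationOperator keeps the comparison direction out of the analyzer entirely:

csharp
// Hypothetical candidate selection under any IControlCost implementation.
public static byte[] FindBestCandidate(IControlCost cost, byte[][] candidates)
{
    bool higherIsBetter = cost.GetRelationOperator() == RelationOperator.LargerThen;
    byte[] best = null;
    double bestScore = higherIsBetter ? double.MinValue : double.MaxValue;

    foreach (byte[] candidate in candidates)
    {
        double score = cost.CalculateCost(candidate);
        // The relation operator decides which direction counts as "better"
        if (higherIsBetter ? score > bestScore : score < bestScore)
        {
            bestScore = score;
            best = candidate;
        }
    }
    return best;
}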

Cost Function Type Enum

csharp
// CostFunctionSettings.cs, lines 27–33
public enum CostFunctionType
{
    IOC = 0,        // Index of Coincidence
    Entropy = 1,    // Shannon Entropy
    NGramsLog2 = 2, // N-gram log-probability (the core method)
    RegEx = 3       // Regular expression matching
}

8. Comparison

| Method | Mathematical Basis | Time Complexity | Extra Data Required | Discrimination | Best Use Case |
|---|---|---|---|---|---|
| IoC | Discrete probability | O(N) | None | Low | Quick filtering; key length estimation |
| Entropy | Information theory | O(N) | None | Low | Quick filtering |
| N-gram | Statistical language model | O(N) | Large N-gram frequency table | High | Go-to for all cryptanalysis scenarios |
| Dictionary | Exact string matching | O(N+M+Z) | Language dictionary | Very high | Small key space exhaustive search |
| Frequency inference | Frequency analysis heuristic | O(N) | Target language frequency prior | Medium | Quick Caesar / monoalphabetic cracking |

Which One to Use in Practice?

For Caesar ciphers (key space of just 25), CrypTool-2 provides two ready-made templates:

  1. Dictionary matching (Caesar_ExhaustiveKeySearch.xml): try all 25 shift values, run each result through Aho-Corasick dictionary detection, and stop when enough words are found. Deterministic results.
  2. Direct frequency inference (Caesar_Analysis_Using-character-frequencies.xml): use CaesarAnalysisHelper to compute the key from ciphertext frequencies directly, without iterating through candidates. Fastest.

For more complex ciphers (Vigenere, Enigma, Playfair, etc.), where the key space is too large to enumerate, N-gram scoring becomes the fitness function for search algorithms like hill climbing, genetic algorithms, and simulated annealing — guiding the search toward "more language-like" results.
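
To make that concrete, here is a heavily simplified hill-climbing skeleton (my own sketch, not CT2's analyzer code; the decrypt delegate and the int[] key representation are stand-ins):

csharp
// Hypothetical hill climber: the n-gram score is the fitness function.
public static int[] HillClimb(Grams grams, int[] ciphertext,
                              Func<int[], int[], int[]> decrypt,
                              int[] startKey, int iterations)
{
    Random rng = new Random();
    int[] key = (int[])startKey.Clone();
    double bestScore = grams.CalculateCost(decrypt(key, ciphertext));

    for (int i = 0; i < iterations; i++)
    {
        // Neighbor key: swap two positions (a typical substitution-cipher move)
        int[] neighbor = (int[])key.Clone();
        int a = rng.Next(neighbor.Length), b = rng.Next(neighbor.Length);
        (neighbor[a], neighbor[b]) = (neighbor[b], neighbor[a]);

        double score = grams.CalculateCost(decrypt(neighbor, ciphertext));
        if (score > bestScore)  // higher average log-probability = more language-like
        {
            key = neighbor;
            bestScore = score;
        }
    }
    return key;
}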


9. Comparison with LLMs

| Dimension | Traditional Statistical Methods (CrypTool-2) | LLM Approach |
|---|---|---|
| Basis for judgment | Letter combination frequency patterns | Deep semantic understanding |
| Data requirements | MB-scale frequency tables + dictionaries | GB-scale model parameters |
| Computation speed | Microseconds (per evaluation) | Milliseconds to seconds |
| Semantic understanding | None (pure statistics) | Yes (understands meaning) |
| Explainability | Fully transparent | Black box |
| Multi-language extension | Needs per-language frequency data | Covered by training data |
| Offline capability | Fully offline | Usually requires inference service |

Traditional methods don't need to "understand" what the language says — they just need to catch its statistical fingerprint. This line of thinking goes all the way back to Friedman (1922) and Shannon (1948) and is still the foundation of cryptanalysis today.

Looked at differently, the N-gram method is really just a Markov chain language model — it assumes the probability of the current letter depends only on the previous N-1 letters. Ravi and Knight (2008) showed that even low-order character N-gram models can effectively solve substitution ciphers. This is mathematically the same lineage as modern NLP language models, just several orders of magnitude simpler. CrypTool-2's 5-gram scoring is, at the end of the day, a 4th-order Markov chain LM — crude, but it gets the job done for cryptanalysis.
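
In symbols (the standard n-gram factorization; my notation, not from the CT2 source), a 5-gram model approximates

$$\log P(x_1 \dots x_n) \approx \sum_{i=5}^{n} \log P(x_i \mid x_{i-4}\, x_{i-3}\, x_{i-2}\, x_{i-1})$$

CT2's CalculateCost averages the joint window terms $\log P(x_i \dots x_{i+4})$ instead — a different normalization that ranks candidates in much the same way.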


10. Conclusion

Looking back, what CrypTool-2 does here answers a pretty basic question: natural language isn't random. Its letter distributions, letter combinations, and vocabulary all follow patterns. Capture those patterns, and you can tell real language from gibberish.

Each method has its own granularity: IoC and Entropy just check whether single-letter frequencies are "skewed" enough; N-gram scoring checks whether letter sequences match the habits of a language; dictionary matching goes straight for the kill — finding actual words is hard evidence.

None of these methods are obsolete. They run in microseconds, produce deterministic results, need no GPU, and can be called repeatedly inside a search loop. When you need to evaluate billions of candidate keys, N-gram scoring is still the standard approach.


Appendix: Key Source File Index

| File Path | Function |
|---|---|
| CrypPlugins/CostFunction/CostFunction.cs | Unified entry point for all four cost functions (IoC/Entropy/NGram/RegEx) |
| CrypPlugins/CostFunction/CostFunctionSettings.cs | Cost function configuration (type/language/N-gram size) |
| CrypPluginBase/Control/IControlCost.cs | Cost function interface (RelationOperator/CalculateCost) |
| LibSource/LanguageStatisticsLib/Grams/Grams.cs | N-gram scoring abstract base class (template method pattern) |
| LibSource/LanguageStatisticsLib/Grams/Pentagrams.cs | 5-gram scoring implementation (CT2 default) |
| LibSource/LanguageStatisticsLib/Grams/Tetragrams.cs | 4-gram scoring implementation |
| LibSource/LanguageStatisticsLib/LanguageStatistics.cs | Language statistics facade (15 language alphabets + Unigram frequencies) |
| LibSource/LanguageStatisticsLib/LanguageStatisticsFile.cs | CTLS file format parser (Magic: CTLS + GZip + BlockCopy) |
| LibSource/LanguageStatisticsLib/WordTree.cs | Trie dictionary data structure (Magic: CT2DIC) |
| CrypPlugins/CaesarAnalysisHelper/CaesarAnalysisHelper.cs | Heuristic frequency inference for key recovery |
| CrypPlugins/Contains/Contains.cs | Aho-Corasick / Hashtable dictionary matching |
| CrypPlugins/Contains/Aho-Corasick/StringSearch.cs | Aho-Corasick automaton implementation |
| Util/LanguageStatisticsGenerator/Program.cs | N-gram frequency data generator (Gutenberg/Wikipedia) |
| Templates/Cryptanalysis/Classic/Caesar_ExhaustiveKeySearch.xml | Caesar exhaustive search template (dictionary matching) |
| Templates/Cryptanalysis/Classic/Caesar_Analysis_Using-character-frequencies.xml | Caesar frequency analysis template |

References

Aho, A. V., & Corasick, M. J. (1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6), 333–340. https://doi.org/10.1145/360825.360855

Friedman, W. F. (1922). The index of coincidence and its applications in cryptography (Riverbank Publication No. 22). Riverbank Laboratories.

Kopal, N. (2018). Solving classical ciphers with CrypTool 2. In Proceedings of the 1st International Conference on Historical Cryptology (HistoCrypt 2018) (Vol. 149, pp. 29–38). Linköping University Electronic Press.

Kopal, N., & Esslinger, B. (2022). New ciphers and cryptanalysis components in CrypTool 2. In Proceedings of the International Conference on Historical Cryptology (HistoCrypt 2022) (pp. 127–136). Linköping University Electronic Press.

Ravi, S., & Knight, K. (2008). Attacking decipherment problems optimally with low-order n-gram models. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (pp. 812–819). Association for Computational Linguistics.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
