Home AGI Automatic template extraction

Automatic template extraction

History / Edit / PDF / EPUB / BIB /
Created: July 14, 2016 / Updated: December 21, 2025 / Status: in progress / Readability: technical / 3 min read (~508 words)
artificial-general-intelligence

To extract patterns, group them by starting character, then test how many have the same following character
Grammar induction
Compression
- Compression can be a tool for automatic template extraction, however we would most likely want to priority semantics of the extracted template over better compression
Diff/match/patch
Fragment extraction, then wildcard pattern generation
Lexer-like that will replace a whole sequence if it is already in the grammar instead of doing character by character replacement like sequitur

Extract textual templates from any language (basically tries to find repetitions/patterns)
Min/max length (characters)
Discovery of syntax
Hierarchical/meta extraction

<...> is a placeholder (can be replaced/is variable)


public function getClassification()
{
    return $this->classification;
}

public function setClassification($classification)
{
    $this->classification = $classification;

    return $this;
}


public function get()
{
    return $this->;
}

public function set()
{
    $this-> = ;

    return $this;
}


class SomeClass extends Model {
    protected $table = 'some_table';

    protected $fillable = ['field_1', 'field_2'];
}


class  extends Model {
    protected $table = '';

    protected $fillable = [];
}


if ($someCondition) {
    // Do something
}


if ()

Create a dictionary of all seen characters
Create a dictionary of characters -> index
Define some sort of relative threshold for which to ignore patterns
You have a single string, you want to extract patterns out of it
You have two strings, you want to extract patterns out of them

How to extract simple constructs such as if/elseif/else/while/do/for/foreach?
How to compress aaaabbbb into an expanding aCb -> aaCbb -> aaaCbbb -> aaaabbbb vs AB -> aaaaB -> aaaabbbb
- aaaabbbb -> aaaCbbb -> aaDbb -> aEb -> F
  - C := ab
  - D := aCb
  - E := aDb
  - F := aEb
  - C := aCb
    -> This is a context-free grammar
Do we want to prioritize short rules such as S -> Sa such that they can be repeated many times, or rules that contains a lot of symbols such as S -> aSa
- Probably want to minimize the number of rules/productions
- Probably want to minimize the rule length
From [^1]
- p1: no pair of adjacent symbols appears more than once in the grammar;
- p2: every rule is used more than once.
How can we prefer public function <>(<>) {<>} over } public function <>(<>) {?
- If we refer to an explicit grammar, we can give more weight to the first one because it is likely a construct/production in the grammar, while the second one is the concatenation of two productions