Automatic template extraction

History / Edit / PDF / EPUB / BIB
Created: July 13, 2016 / Updated: May 16, 2018 / Status: in progress / 5 min read (~919 words)

  • To extract patterns, group them by starting character, then test how many have the same following character
  • Grammar induction
  • Compression
    • Compression can be a tool for automatic template extraction, however we would most likely want to priority semantics of the extracted template over better compression
  • Diff/match/patch
  • Fragment extraction, then wildcard pattern generation
  • Lexer-like that will replace a whole sequence if it is already in the grammar instead of doing character by character replacement like sequitur

  • Extract textual templates from any language (basically tries to find repetitions/patterns)
  • Min/max length (characters)
  • Discovery of syntax
  • Hierarchical/meta extraction

<...> is a placeholder (can be replaced/is variable)


public function getClassification()
{
    return $this->classification;
}

public function setClassification($classification)
{
    $this->classification = $classification;

    return $this;
}

public function get()
{
    return $this->;
}

public function set()
{
    $this-> = ;

    return $this;
}

class SomeClass extends Model {
    protected $table = 'some_table';

    protected $fillable = ['field_1', 'field_2'];
}

class  extends Model {
    protected $table = '';

    protected $fillable = [];
}


if ($someCondition) {
    // Do something
}

if ()
    

  • Create a dictionary of all seen characters
  • Create a dictionary of characters -> index
  • Define some sort of relative threshold for which to ignore patterns
  • You have a single string, you want to extract patterns out of it
  • You have two strings, you want to extract patterns out of them

  • How to extract simple constructs such as if/elseif/else/while/do/for/foreach?
  • How to compress aaaabbbb into an expanding aCb -> aaCbb -> aaaCbbb -> aaaabbbb vs AB -> aaaaB -> aaaabbbb

    • aaaabbbb -> aaaCbbb -> aaDbb -> aEb -> F

      • C := ab
      • D := aCb
      • E := aDb
      • F := aEb

      • C := aCb
        -> This is a context-free grammar
  • Do we want to prioritize short rules such as S -> Sa such that they can be repeated many times, or rules that contains a lot of symbols such as S -> aSa
    • Probably want to minimize the number of rules/productions
    • Probably want to minimize the rule length
  • From [^1]
    • p1: no pair of adjacent symbols appears more than once in the grammar;
    • p2: every rule is used more than once.
  • How can we prefer public function <>(<>) {<>} over } public function <>(<>) {?
    • If we refer to an explicit grammar, we can give more weight to the first one because it is likely a construct/production in the grammar, while the second one is the concatenation of two productions