Autonomous Learning

Pipeline script: ai_autolearn.py

Step order:

1. crawl_web_sources.py - crawls allowlisted web sources

2. self_learn_from_db.py - collects high-quality pairs from MySQL

3. validate_training_data.py - validates the dataset

4. build_qa_text.py - builds the raw training text

5. prepare_dataset.py - tokenizes the train/val splits

6. train.py - trains the candidate checkpoint

7. ai_autodeploy.py - safely promotes candidate -> prod
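
The step order above can be sketched as a plain sequential runner. This is only an illustration: it assumes each step is a standalone script started with the current Python interpreter and that a non-zero exit code stops the pipeline. The real ai_autolearn.py may pass arguments or handle failures differently. The names run_script and run_pipeline are hypothetical.

```python
import subprocess
import sys

# Step scripts in the order listed above.
STEPS = [
    "crawl_web_sources.py",
    "self_learn_from_db.py",
    "validate_training_data.py",
    "build_qa_text.py",
    "prepare_dataset.py",
    "train.py",
    "ai_autodeploy.py",
]


def run_script(script):
    """Run one step with the current interpreter and return its exit code."""
    return subprocess.run([sys.executable, script]).returncode


def run_pipeline(steps=STEPS, runner=run_script):
    """Run steps in order and stop at the first failure (assumed policy).

    Returns the list of steps that finished with exit code 0.
    """
    done = []
    for script in steps:
        if runner(script) != 0:
            break
        done.append(script)
    return done
```

The runner parameter exists only so the ordering logic can be exercised without launching the real scripts.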

Running


python ai_autolearn.py

Web crawling env

  • ENOAI_WEB_SOURCES - allowlist of start URLs (comma-separated)
  • ENOAI_WEB_MAX_PAGES
  • ENOAI_WEB_MAX_CHARS_PER_PAGE
  • ENOAI_WEB_TIMEOUT
  • ENOAI_WEB_USER_AGENT
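
A minimal sketch of how these variables might be read. The default values, the dictionary keys, and the assumption that the timeout is in seconds are illustrative only and are not taken from crawl_web_sources.py. The load_crawl_config helper is hypothetical.

```python
import os


def load_crawl_config(env=None):
    """Read crawler settings from the environment.

    All defaults below are placeholders, not the project's real defaults.
    """
    env = os.environ if env is None else env
    raw_sources = env.get("ENOAI_WEB_SOURCES", "")
    return {
        # Comma-separated start URLs; blank entries are dropped.
        "sources": [u.strip() for u in raw_sources.split(",") if u.strip()],
        "max_pages": int(env.get("ENOAI_WEB_MAX_PAGES", "50")),
        "max_chars_per_page": int(env.get("ENOAI_WEB_MAX_CHARS_PER_PAGE", "20000")),
        # Assumed to be seconds.
        "timeout": float(env.get("ENOAI_WEB_TIMEOUT", "10")),
        "user_agent": env.get("ENOAI_WEB_USER_AGENT", "example-crawler/0.1"),
    }
```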

Crawler:

  • respects robots.txt
  • stays on the same host
  • saves to data/raw/web_crawl_knowledge.txt
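
The first two crawler rules can be sketched with the standard library. This assumes "same host" means the host of the start URL and that the robots.txt content has already been downloaded. The helper names are hypothetical and the real crawler may implement these checks differently.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def same_host(start_url, candidate_url):
    """True when both URLs share the same host (hostname is lowercased by urlparse)."""
    a = urlparse(start_url).hostname
    b = urlparse(candidate_url).hostname
    return a is not None and a == b


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def should_visit(start_url, candidate_url, robots_txt, user_agent):
    """Apply both rules: stay on the start host and respect robots.txt."""
    return same_host(start_url, candidate_url) and allowed_by_robots(
        robots_txt, user_agent, candidate_url
    )
```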

Self-learning env (DB pairs)

  • ENOAI_SELF_LEARN_WINDOW_DAYS
  • ENOAI_SELF_LEARN_MIN_ROWS
  • ENOAI_SELF_LEARN_MAX_PAIRS
  • ENOAI_SELF_LEARN_MIN_ASSISTANT_LEN
  • ENOAI_SELF_LEARN_MAX_ASSISTANT_LEN
  • ENOAI_SELF_LEARN_ALLOWLIST_TERMS
  • ENOAI_SELF_LEARN_BLOCKLIST_TERMS
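
One plausible reading of the length and term filters, shown as a sketch. It assumes the length limits count characters of the assistant reply, that both term lists are comma-separated and matched case-insensitively as substrings, and that an empty allowlist means no restriction. None of this is confirmed by self_learn_from_db.py. The helpers parse_terms and keep_pair are hypothetical.

```python
def parse_terms(csv_value):
    """Split a comma-separated term list into lowercase terms."""
    return [t.strip().lower() for t in csv_value.split(",") if t.strip()]


def keep_pair(user_text, assistant_text, min_len, max_len, allow_terms, block_terms):
    """Decide whether a (user, assistant) pair passes the filters."""
    # Length bounds apply to the assistant reply (assumed to be in characters).
    if not (min_len <= len(assistant_text) <= max_len):
        return False
    haystack = f"{user_text}\n{assistant_text}".lower()
    # Any blocklisted term rejects the pair.
    if any(term in haystack for term in block_terms):
        return False
    # Assumption: an empty allowlist means "accept any topic".
    return not allow_terms or any(term in haystack for term in allow_terms)
```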

Quality weighting

  • ENOAI_SELF_LEARN_WEIGHT_EDITED
  • ENOAI_SELF_LEARN_WEIGHT_GOOD
  • ENOAI_SELF_LEARN_WEIGHT_APPROVED
  • ENOAI_SELF_LEARN_WEIGHT_DEFAULT
  • ENOAI_SELF_LEARN_MAX_WEIGHT_DUPES
  • ENOAI_SELF_LEARN_DECAY_HALFLIFE_DAYS
  • ENOAI_SELF_LEARN_DECAY_MIN_FACTOR
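
The variable names suggest a per-label base weight, an exponential age decay with a floor, and a cap on how many times a pair is duplicated. The sketch below shows that interpretation only. The formula, the rounding, and the minimum of one copy are assumptions, and the function names are hypothetical.

```python
def decay_factor(age_days, halflife_days, min_factor):
    """Exponential half-life decay, floored at min_factor."""
    factor = 0.5 ** (age_days / halflife_days)
    return max(min_factor, factor)


def duplicate_count(base_weight, age_days, halflife_days, min_factor, max_dupes):
    """Turn a decayed weight into a number of copies in the training text.

    Assumption: weight is realized by repeating the pair, at least once
    and at most max_dupes times.
    """
    weighted = base_weight * decay_factor(age_days, halflife_days, min_factor)
    return max(1, min(max_dupes, round(weighted)))
```

With a 30-day half-life, a pair that is 30 days old keeps half of its base weight, and a very old pair never drops below the minimum factor.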

Safe deploy gate

Script: ai_autodeploy.py

Candidate checkpoint:

  • enoa_gpt/checkpoints/enoagpt_tiny_final.pt

Production checkpoint:

  • enoa_gpt/checkpoints/enoagpt_tiny_prod.pt

Promotion happens only when:

  • mini eval score >= ENOAI_AUTODEPLOY_MIN_SCORE
  • score gain >= ENOAI_AUTODEPLOY_MIN_GAIN
  • (optional) the DB A/B eval gate passes
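
The conditions above combine as a simple conjunction. A sketch, assuming "score gain" means candidate score minus production score and that a disabled DB gate is represented by None. The should_promote function and its parameters are hypothetical and not part of ai_autodeploy.py.

```python
def should_promote(candidate_score, prod_score, min_score, min_gain, db_gate_passed=None):
    """Return True only when every enabled gate passes.

    db_gate_passed: None when the DB A/B gate is disabled,
    otherwise the boolean result of that gate.
    """
    # Gate 1: absolute mini eval score.
    if candidate_score < min_score:
        return False
    # Gate 2: improvement over the current production checkpoint (assumed definition).
    if candidate_score - prod_score < min_gain:
        return False
    # Gate 3 (optional): DB A/B evaluation.
    return db_gate_passed is None or db_gate_passed
```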

DB A/B env:

  • ENOAI_AUTODEPLOY_DB_EVAL
  • ENOAI_AUTODEPLOY_DB_EVAL_MIN_SAMPLES
  • ENOAI_AUTODEPLOY_DB_EVAL_MIN_DELTA
  • ENOAI_DB_EVAL_WINDOW_DAYS
  • ENOAI_DB_EVAL_MAX_SAMPLES
  • ENOAI_DB_EVAL_MIN_ASSISTANT_LEN
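
A sketch of how the minimum-sample and minimum-delta settings might interact. It assumes each model gets a per-sample score on the same DB samples, that the delta is the difference of mean scores, and that too few samples blocks promotion. The actual scoring and the behavior on small samples are not documented here, and db_eval_gate is a hypothetical name.

```python
from statistics import mean


def db_eval_gate(candidate_scores, prod_scores, min_samples, min_delta):
    """Return True when the candidate beats prod by min_delta on enough samples."""
    n = min(len(candidate_scores), len(prod_scores))
    # Assumption: too few samples fails the gate (conservative choice).
    if n == 0 or n < min_samples:
        return False
    delta = mean(candidate_scores[:n]) - mean(prod_scores[:n])
    return delta >= min_delta
```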

Report:

  • enoa_gpt/checkpoints/autodeploy_last.json