# MolForge AI Drug Discovery — bioRxiv v0.1 Manuscript Outline

**Target submission**: bioRxiv pre-print (CC-BY-NC-ND or CC-BY)
**Status**: outline (first version, in silico only, without R1 wet-lab results)
**Lead author**: Heonjeong Cho (조헌정) (AgentAI Co., Ltd.)
**Date**: 2026-05-22 KST
**Estimated submission**: after R1 results arrive and the data are supplemented → ETA 2026-06-30

---

## Title

**"MolForge: A Production-Scale AI Drug Discovery Platform with Mamba State-Space Generators, Multi-Objective Pareto Selection, and 8-Axis Objective GOLD Filter for Kinase Inhibitor Discovery"**

## Abstract (250 words target)

- **Background**: Hit rates for generated drug candidates average about 3.7% (CACHE Challenge LRRK2 WDR, JCIM 2024). We present in silico evidence that an 8-axis multi-criteria filter could raise the hit rate.
- **Methods**: A 24/7 production pipeline for five kinase targets (TYK2, TNIK, CDK4, CDK6, EGFR) that combines Saturn Mamba SSM (Nat MI 2026), REINVENT4, Boltz-2 cofold, Chai-1, Protenix, AEV-PLIG v3a (Nat CC 2025), an in-house QSAR ensemble, AiZynthFinder retrosynthesis, GNINA CNN, PoseBusters, and Mondrian conformal prediction (Mondrian-CP).
- **Results**: 977,415 cumulative in silico compounds; 71.6% novel (Tanimoto < 0.4 vs. ChEMBL actives); 1 of the 73 consensus_top_20 compounds passes the 5-axis GOLD filter (MF-TNIK-6f036c); 99.5% of the 1000-compound Saturn pool passes PoseBusters; GNINA CNN vs. AEV-PLIG gives |ρ| < 0.2 for 3 of 4 targets, indicating a largely orthogonal physics signal; 13 TNIK candidates average Tanimoto 0.153 against Insilico's Rentosertib (TNIK, Phase IIa, Nat Med 2025-06), i.e. an IP-independent chemotype.
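- **Conclusions**: The 8-axis objective GOLD filter can automatically select a single lead candidate, with a 1.4% pass rate over the 73-compound universe. Once the R1 wet-lab anchor is available, we recommend ordering the R2 portfolio using the final 8- to 9-axis GOLD filter.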

## 1. Introduction

- 1.1 AI-driven drug discovery 현황 (Isomorphic, Recursion, Insilico, Atomwise, Schrödinger 2026 동향)
- 1.2 Hit-rate 현실: CACHE 3.7% benchmark, single-axis ranking 한계
- 1.3 8-axis multi-criteria filter 가설: orthogonal evidence 결합으로 hit-rate 향상
- 1.4 Open questions: in silico 한정 검증 → wet-lab anchor 필요성

## 2. Methods

### 2.1 Rationale for the 5-target kinase selection
- TYK2 (CHEMBL2148, IBD/psoriasis, Deucravacitinib precedent)
- TNIK (CHEMBL4317, IPF/Wnt, Insilico Phase IIa validation)
- CDK4/6 (CHEMBL331/CHEMBL263, breast cancer, palbociclib precedent)
- EGFR (CHEMBL203, NSCLC, multiple FDA approvals)

### 2.2 Generation Stack
- Saturn Mamba SSM config (bucket_size, budget, beam_enum)
- REINVENT4 + paper-aligned configuration
- 24/7 production cycle (sequential daemon over the 5 targets)

### 2.3 Scoring Stack
- Boltz-2 cofold N=10 ensemble (σ measurement)
- Chai-1 + Protenix consensus
- AEV-PLIG v3a (Nat CC 2025 backbone, paper-validated TYK2)
- QSAR ensemble (Morgan FP + RF + XGBoost, 5-fold scaffold split CV)
- MolFormer-XL frozen / LoRA fine-tune comparison
- Mondrian Conformal Prediction (activity-bin re-definition)
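
The Mondrian conformal step in this list can be illustrated with a minimal split-conformal sketch in plain Python. It is not the pipeline's implementation: the bin labels, the toy calibration triples, and `alpha=0.2` are assumptions chosen for illustration. Each activity bin gets its own interval half-width, taken as the finite-sample quantile of absolute residuals on a held-out calibration set.

```python
import math
from collections import defaultdict

def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile of nonconformity scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-based rank of the quantile
    if k > n:
        # Too few calibration points for this alpha: no finite guarantee.
        return math.inf
    return sorted(scores)[k - 1]

def fit_mondrian(calibration, alpha=0.2):
    """calibration: iterable of (bin_label, y_true, y_pred) triples.

    Returns one interval half-width per bin (the Mondrian partition).
    """
    residuals = defaultdict(list)
    for bin_label, y_true, y_pred in calibration:
        residuals[bin_label].append(abs(y_true - y_pred))
    return {b: conformal_quantile(r, alpha) for b, r in residuals.items()}

def predict_interval(y_pred, bin_label, half_widths):
    """Symmetric prediction interval using the half-width of the given bin."""
    q = half_widths[bin_label]
    return (y_pred - q, y_pred + q)

# Hypothetical calibration data: (activity bin, measured pIC50, predicted pIC50).
TOY_CALIBRATION = [
    ("low", 6.1, 6.0), ("low", 5.8, 6.0), ("low", 6.3, 6.0),
    ("low", 5.6, 6.0), ("low", 6.5, 6.0),
    ("high", 8.2, 8.0), ("high", 7.6, 8.0), ("high", 8.6, 8.0),
    ("high", 7.2, 8.0), ("high", 9.0, 8.0),
]

half_widths = fit_mondrian(TOY_CALIBRATION, alpha=0.2)
interval = predict_interval(7.0, "high", half_widths)
```

Fitting one quantile per bin is what separates the Mondrian variant from plain split conformal prediction: the coverage target is then meant to hold within each bin rather than only on average.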

### 2.4 8-axis GOLD Filter Methodology
- Axis 1-3: Consensus, QSAR, AEV (deep learning ensemble)
- Axis 4: AiZynthFinder synth_gate (USPTO + ZINC stock)
- Axis 5: GNINA CNN (physics-aware, orthogonal to deep learning)
- Axis 6: 5-target self-selectivity (multi-task RF)
- Axis 7: Boltz N=9 ensemble σ (uncertainty discriminability)
- Axis 8: PoseBusters mol mode (chemical validity)
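
One way to read the axes above is as a conjunction of independent pass/fail gates. The sketch below assumes exactly that. The field names, the thresholds, and the toy candidates are all hypothetical and do not come from the pipeline; only five of the eight axes are shown to keep the example short.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Candidate = Dict[str, Any]

@dataclass(frozen=True)
class Axis:
    name: str
    passes: Callable[[Candidate], bool]

# Hypothetical gates; every threshold here is an illustrative assumption.
AXES: List[Axis] = [
    Axis("consensus", lambda c: c["consensus_pic50"] >= 7.0),
    Axis("synth_gate", lambda c: c["aizynth_solved"]),
    Axis("gnina_cnn", lambda c: c["gnina_cnn_score"] >= 0.5),
    Axis("ensemble_sigma", lambda c: c["boltz_sigma"] <= 0.3),
    Axis("posebusters", lambda c: c["posebusters_pass"]),
]

def failed_axes(candidate: Candidate, axes: List[Axis] = AXES) -> List[str]:
    """Names of the axes this candidate fails, in axis order."""
    return [a.name for a in axes if not a.passes(candidate)]

def gold_filter(candidates: List[Candidate], axes: List[Axis] = AXES) -> List[Candidate]:
    """Keep only candidates that pass every axis."""
    return [c for c in candidates if not failed_axes(c, axes)]

# Made-up candidates for illustration only.
TOY_CANDIDATES: List[Candidate] = [
    {"id": "cand-A", "consensus_pic50": 7.8, "aizynth_solved": True,
     "gnina_cnn_score": 0.71, "boltz_sigma": 0.21, "posebusters_pass": True},
    {"id": "cand-B", "consensus_pic50": 8.1, "aizynth_solved": False,
     "gnina_cnn_score": 0.80, "boltz_sigma": 0.18, "posebusters_pass": True},
    {"id": "cand-C", "consensus_pic50": 6.2, "aizynth_solved": True,
     "gnina_cnn_score": 0.35, "boltz_sigma": 0.45, "posebusters_pass": True},
]

survivors = gold_filter(TOY_CANDIDATES)
pass_rate = len(survivors) / len(TOY_CANDIDATES)
```

Recording which axes a candidate fails, rather than only whether it passed, makes it easier to see which gate dominates attrition in the funnel.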

### 2.5 IP / Novelty Analysis
- Murcko + Generic CSK scaffold diversity
- Tanimoto similarity vs ChEMBL active (proprietary protection)
- Patent landscape comparison (Insilico Rentosertib, Schrödinger pipeline)
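
The Tanimoto-based novelty criterion can be sketched as follows. Fingerprints are represented as sets of on-bit indices, which is an assumption made to keep the example dependency-free; in practice they would come from a cheminformatics toolkit. The 0.4 cutoff follows the threshold quoted in this outline, and the toy fingerprints are invented.

```python
from typing import FrozenSet, Iterable

Fingerprint = FrozenSet[int]

def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    if not a and not b:
        return 0.0  # convention for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def max_similarity(query: Fingerprint, references: Iterable[Fingerprint]) -> float:
    """Highest similarity of the query to any reference compound."""
    return max((tanimoto(query, r) for r in references), default=0.0)

def is_novel(query: Fingerprint, references: Iterable[Fingerprint],
             threshold: float = 0.4) -> bool:
    """Novel if the nearest reference is below the similarity threshold."""
    return max_similarity(query, references) < threshold

# Invented fingerprints standing in for known actives.
KNOWN_ACTIVES = [frozenset({1, 2, 3}), frozenset({10, 11, 12, 13})]

close_analog = frozenset({2, 3, 4})        # shares 2 bits with the first active
new_chemotype = frozenset({3, 20, 21, 22, 23})

novel_flags = [is_novel(fp, KNOWN_ACTIVES) for fp in (close_analog, new_chemotype)]
```

Novelty is judged against the nearest neighbor rather than the mean similarity, so a single close analog in the reference set is enough to mark a compound as not novel.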

## 3. Results

### 3.1 Universe scale + novelty
- 977,415 cumulative compounds (5-target distribution)
- 700,198 novel (Tanimoto<0.4) = 71.6%
- 440,475 strict novel (<0.3) = 45.1%

### 3.2 Synthesis accessibility
- 73 consensus_top_20: 25/73 AiZynth solved (34%)
- 1000 Saturn pool partial: 144/300 progress (48%)
- SAscore < 4.5: 98.6% pass

### 3.3 Physics orthogonality (evidence for internal hit-rate note #5)
- GNINA CNN vs AEV-PLIG Pearson per target:
  - TYK2 0.181 / TNIK -0.152 / CDK4 -0.119 / CDK6 0.435
- 3 of 4 targets have |ρ| < 0.2 (CDK6 is the exception at 0.435), consistent with a **largely orthogonal physics signal**
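
The per-target check above can be reproduced with a short sketch: a plain-Python Pearson correlation, plus a count of targets whose reported coefficient falls below the 0.2 cutoff. The coefficients are the four values listed in this section; the function names are illustrative and not part of the pipeline.

```python
import math
from typing import Dict, List, Sequence

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)

def weakly_correlated(r_by_target: Dict[str, float], cutoff: float = 0.2) -> List[str]:
    """Targets whose absolute correlation is below the cutoff, sorted by name."""
    return sorted(t for t, r in r_by_target.items() if abs(r) < cutoff)

# Coefficients as reported in this section (GNINA CNN vs AEV-PLIG).
REPORTED_R = {"TYK2": 0.181, "TNIK": -0.152, "CDK4": -0.119, "CDK6": 0.435}

orthogonal_targets = weakly_correlated(REPORTED_R)
```

A low linear correlation only shows that the two scores are not redundant; it does not by itself show that either score is predictive of measured activity.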

### 3.4 Calibration
- Mondrian heuristic 0.42x → formal split conformal mean coverage 0.827
- Boltz N=9 ensemble σ 0.279 pIC50 (~3.8× paper σ baseline)
- PoseBusters mol mode 99.5% (1000) / 100% (73)
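
The two calibration quantities in this subsection, the ensemble σ and the empirical coverage, can be computed as in the sketch below. The nine predictions and the intervals are placeholders rather than pipeline output, and using the sample standard deviation (n − 1 denominator) is an assumption about how σ is defined.

```python
import statistics
from typing import List, Sequence, Tuple

def ensemble_summary(predictions: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated predictions."""
    return statistics.mean(predictions), statistics.stdev(predictions)

def empirical_coverage(intervals: Sequence[Tuple[float, float]],
                       y_true: Sequence[float]) -> float:
    """Fraction of true values that fall inside their prediction interval."""
    if len(intervals) != len(y_true) or not y_true:
        raise ValueError("need equal-length, non-empty inputs")
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, y_true))
    return hits / len(y_true)

# Placeholder pIC50 predictions from N=9 hypothetical ensemble members.
TOY_ENSEMBLE: List[float] = [7.1, 7.4, 6.9, 7.3, 7.0, 7.6, 7.2, 6.8, 7.5]
mean_pic50, sigma = ensemble_summary(TOY_ENSEMBLE)

# Placeholder intervals and measured values for the coverage check.
TOY_INTERVALS = [(5.0, 7.0), (6.0, 8.0), (4.0, 5.0)]
TOY_MEASURED = [6.0, 8.5, 4.5]
coverage = empirical_coverage(TOY_INTERVALS, TOY_MEASURED)
```

Comparing the empirical coverage with the nominal target level (1 − α) is the usual way to judge whether the intervals are calibrated.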

### 3.5 IP novelty vs Insilico Rentosertib
- 13 TNIK candidates: avg Tanimoto 0.153, same Murcko 0/13
- Dual advantage: **market-validated target + IP independence**

### 3.6 5-axis GOLD final filter result
- 73 consensus_top_20 → **MF-TNIK-6f036c single pass** (1.4% rate)
- 1000-compound Saturn pool → **0 passes** (disclosed limitation: the pool is out of distribution for the multi-task model)

## 4. Discussion

### 4.1 Hypothesis: multi-axis filtering improves hit rate
- Single-axis ranking risks correlated failures (Boltz N=10 paper)
- Combining 8 axes of orthogonal evidence is hypothesized to improve generalization (CACHE 3.7% → our calibrated expectation of 15-35%, pending wet-lab validation)

### 4.2 Multi-task OOD limitation
- The Saturn pool contains chemotypes outside the ChEMBL distribution → the multi-task model generalizes poorly to it
- → next: Saturn data augmentation + active learning loop

### 4.3 Deferred items / limitations
- In silico only (no R1 wet-lab anchor)
- TYK2 ABFE started from Boltz-2 cofold poses is broken (consistent with Boltz-ABFE, JCTC 2025)
- In-house 518-kinase kinome panel not yet built
- LoRA fine-tune multi-task interference (mean R 0.565 < frozen 0.666)

### 4.4 Update plan once R1 wet-lab results arrive
- Pearson R_live measurement
- Recalibration of the 8- to 9-axis GOLD filter
- Place the R2 portfolio order (q=10)

## 5. Conclusion

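The 8-axis objective GOLD filter provides in silico evidence for hit-rate improvement.
**MF-TNIK-6f036c is the only one of the 73 candidates to pass the 5-axis GOLD filter** and is selected as the lead candidate.
The next step is to order the R2 portfolio once a CRO wet-lab anchor is secured.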

## 6. Data + Code Availability

- Public website: https://www.molforgeai.com
- Evidence JSON: https://www.molforgeai.com/data/_*.json
- R2 candidates: https://www.molforgeai.com/r2-candidates
- Insilico comparison: https://www.molforgeai.com/insilico-comparison

## 7. References

- Boltz-2 (bioRxiv 2025)
- AEV-PLIG (Nature Comm Chem 2025)
- Saturn (Nature MI 2026, arXiv 2405.17066)
- CACHE Challenge #1 (JCIM 2024)
- PoseBusters (Chem Sci 2024)
- MAPIE (sklearn-contrib)
- Insilico Rentosertib (Nature Medicine 2025-06)
- Boltz-ABFE JCTC 2025

## TODO before submission

- [ ] Integrate R1 wet-lab result data (ETA 5/15-6/15)
- [ ] Five original figures (universe, novelty, orthogonality, IP comparison, GOLD funnel)
- [ ] Supplementary: raw data for the 8 axes + scripts
- [ ] Author affiliation + ORCID (Heonjeong Cho, Jewoo Yom)
- [ ] Funding statement (AgentAI Co., Ltd. internal)
- [ ] Conflict of interest (AgentAI Co., Ltd. founders)
- [ ] Ethics: in silico only (no human subjects)
- [ ] Acknowledgments
