4. Evaluation Protocol
Experimental Design for Verifying Semantic Emergence
4.1 Model Selection
To avoid architecture- or training-specific bias, models are selected according to the following criteria:
- Independent training pipelines: models must be trained by different organizations, or with demonstrably distinct datasets and optimization procedures.
- Heterogeneous architectures (when possible): priority is given to models that differ in parameterization strategy, alignment pipeline, and decoding defaults.
- Black-box accessibility: no access to internal weights, embeddings, or training data is assumed.
Let the resulting model set be $\mathcal{M} = \{M_1, M_2, \dots, M_k\}$.
4.2 Prompt Construction and Controls
4.2.1 Compressed Semantic Representation
The compressed semantic representation $\hat{S}$ must satisfy:
- Explicit definition of core concepts $C$
- Explicit relations $R$
- Explicit constraints $\Phi$
- No dialogue history or contextual priming
This ensures that any observed structure is induced by semantic content alone.
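As a concrete illustration, the representation above can be held as a small immutable record that renders itself into a context-free prompt. This is a minimal sketch; the class and field names are hypothetical, and relations are assumed to be (head, relation, tail) triples, which the text does not mandate.

```python
# Hypothetical container for the compressed semantic representation S-hat.
# Concepts C, relations R, and constraints Phi are assumed to be strings;
# relations are modeled as (head, relation, tail) triples for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class CompressedRepresentation:
    concepts: frozenset[str]                     # core concepts C
    relations: frozenset[tuple[str, str, str]]   # relations R as triples
    constraints: frozenset[str]                  # constraints Phi

    def to_prompt(self) -> str:
        """Render S-hat as a standalone prompt: no dialogue history,
        no contextual priming, semantic content only."""
        return "\n".join([
            "Concepts: " + ", ".join(sorted(self.concepts)),
            "Relations: " + "; ".join(f"{h} {r} {t}"
                                      for h, r, t in sorted(self.relations)),
            "Constraints: " + "; ".join(sorted(self.constraints)),
        ])
```

Freezing the dataclass makes $\hat{S}$ hashable and tamper-evident, which helps when the same representation must be replayed verbatim across models and replicates.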
4.2.2 Decoding Control
To reduce stochastic variance, all model calls use:
- Fixed temperature (e.g., $T = 0.2$)
- Fixed max token length
- No tool use or external memory
Randomness is controlled solely via replicate runs $r$.
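The decoding controls above can be pinned in a single immutable configuration, with the replicate index as the only source of variation. A minimal sketch, assuming the field names and the seed scheme (they are illustrative, not prescribed by the protocol):

```python
# Fixed decoding controls, held constant across all model calls.
# Field names are illustrative; only the values mirror the protocol.
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodingConfig:
    temperature: float = 0.2     # fixed temperature
    max_tokens: int = 1024       # fixed max token length (assumed value)
    tools_enabled: bool = False  # no tool use or external memory


def seed_for_replicate(base_seed: int, r: int) -> int:
    """Deterministic per-replicate seed: replicate runs are the
    sole admitted source of randomness."""
    return base_seed + r
```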
4.3 Transferability Evaluation
4.3.1 Transformation Set
Define a finite set of admissible transformations $\mathcal{T} = \{T_1, T_2, \dots, T_m\}$.
Each transformation must preserve semantic intent while altering surface realization.
Typical classes include:
- Linguistic paraphrase
- Domain transfer (e.g., philosophy → software systems)
- Task reframing (analysis → classification → explanation)
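The three transformation classes can be sketched as prompt-to-prompt functions collected into $\mathcal{T}$. The concrete rewrites below are placeholders for illustration only; real transformations would need independent validation that semantic intent is preserved.

```python
# Illustrative transformation set T: each T_j alters surface realization
# while (by assumption) preserving semantic intent. The specific wrapper
# texts are placeholders, not the protocol's actual transformations.
def paraphrase(prompt: str) -> str:
    """Linguistic paraphrase."""
    return "Restate in your own words, then address:\n" + prompt


def domain_transfer(prompt: str) -> str:
    """Domain transfer (e.g., philosophy -> software systems)."""
    return "Recast the following in terms of software systems:\n" + prompt


def task_reframe(prompt: str) -> str:
    """Task reframing (analysis -> classification)."""
    return "Classify, rather than analyze, the following:\n" + prompt


TRANSFORMATIONS = [paraphrase, domain_transfer, task_reframe]  # T = {T_1, ..., T_m}
```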
4.3.2 Scoring
For each $M_i \in \mathcal{M}$ and $T_j \in \mathcal{T}$:
- Extract structure $S_{i,j}$
- Compute equivalence score $\text{Equiv}(S_{i,j}, S_{\text{ref}})$
The transferability score is aggregated over all $(i, j)$ pairs using a robust statistic (median or trimmed mean).
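Both robust statistics named above are easy to make explicit. A minimal sketch using the standard library (the trim fraction of 0.1 is an assumed default, not specified by the protocol):

```python
# Robust aggregation of per-(model, transformation) equivalence scores.
from statistics import median


def trimmed_mean(scores: list[float], trim: float = 0.1) -> float:
    """Mean after discarding the lowest and highest `trim` fraction,
    reducing sensitivity to outlier model/transformation pairs."""
    k = int(len(scores) * trim)
    middle = sorted(scores)[k:len(scores) - k or None]
    return sum(middle) / len(middle)


def transferability(scores: list[float], robust: str = "median") -> float:
    """Aggregate Equiv(S_ij, S_ref) scores with a robust statistic."""
    return median(scores) if robust == "median" else trimmed_mean(scores)
```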
4.4 Cross-Model Reproducibility Evaluation
Each model receives only $\hat{S}$ (no transformations, no history).
For each $M_i$:
- Run $r$ stochastic replicates
- Extract $S_i^1, \dots, S_i^r$
- Compute equivalence against $S_{\text{ref}}$
The reproducibility score is the aggregate across all models.
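The replicate loop above can be sketched as follows. `model` and `equiv` are assumed interfaces standing in for a black-box model call and the equivalence score (with $S_{\text{ref}}$ pre-bound); neither is a real API.

```python
# Cross-model reproducibility sketch: each model sees only the S-hat
# prompt, runs r replicates, and equivalence scores are aggregated
# across all models. Interfaces below are assumptions, not a real API.
from statistics import median
from typing import Callable


def reproducibility(
    models: list[Callable[[str, int], str]],  # M_i(prompt, seed) -> structure
    s_hat_prompt: str,
    equiv: Callable[[str], float],            # Equiv(., S_ref), pre-bound
    r: int,
) -> float:
    scores = []
    for model in models:
        for rep in range(r):                  # r stochastic replicates
            structure = model(s_hat_prompt, rep)
            scores.append(equiv(structure))   # compare against S_ref
    return median(scores)                     # robust aggregate
```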
4.5 Compressibility Evaluation
Compressibility is evaluated via two orthogonal measures:
- Size reduction ratio: $\text{SR} = \frac{|\hat{S}|}{|D|}$, with $\text{SR} \ll 1$
- Adequacy preservation: $\text{AP} = \text{Equiv}(S_{\hat{S}}, S_{\text{ref}})$
The compressibility score is a joint function of SR and AP.
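The text leaves the joint function unspecified; one simple choice is a product that rewards small SR and high AP simultaneously. A sketch under that assumption:

```python
# Joint compressibility score. The multiplicative combination below is
# one possible joint function of SR and AP; the protocol does not fix it.
def compressibility(s_hat_len: int, d_len: int, ap: float) -> float:
    """Combine size reduction ratio SR = |S-hat| / |D| (smaller is
    better) with adequacy preservation AP (larger is better)."""
    sr = s_hat_len / d_len       # SR << 1 is desired
    return (1.0 - sr) * ap       # high only when compact AND adequate
```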
4.6 Threshold Selection and Decision Rule
A semantic structure is accepted as emergent if $\text{Transferability} \ge \theta \;\wedge\; \text{Reproducibility} \ge \theta \;\wedge\; \text{Compressibility} \ge \theta$.
Threshold $\theta$ is chosen a priori (e.g., $\theta = 0.8$) and held constant across experiments.
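The conjunctive decision rule reduces to a one-line predicate:

```python
# Conjunctive decision rule: all three criteria must meet the
# pre-registered threshold theta for emergence to be accepted.
THETA = 0.8  # chosen a priori, held constant across experiments


def emergent(transferability: float, reproducibility: float,
             compressibility: float, theta: float = THETA) -> bool:
    """True iff every criterion score is at least theta."""
    return min(transferability, reproducibility, compressibility) >= theta
```

Using `min` makes the conjunction explicit: the weakest criterion alone determines the outcome, matching the rule that failing any single criterion rejects emergence.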
4.7 Reporting and Reproducibility
Each experiment reports:
- Model identities (or anonymized IDs if required)
- All transformation functions $\mathcal{T}$
- Full score distributions (not just means)
- Random seeds or replicate counts
- Failure cases and diagnostic notes
No qualitative judgments are permitted without accompanying quantitative scores.
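The required report fields map naturally onto a structured record, which also makes the "full distributions, not just means" requirement checkable. A sketch with illustrative field names:

```python
# Minimal per-experiment report covering the required fields.
# Field names are illustrative; the protocol fixes only the content.
from dataclasses import dataclass, field


@dataclass
class ExperimentReport:
    model_ids: list[str]              # identities or anonymized IDs
    transformations: list[str]        # all transformation functions in T
    score_distribution: list[float]   # full distribution, not just means
    seeds: list[int]                  # random seeds used
    replicate_count: int              # r
    failure_notes: list[str] = field(default_factory=list)  # diagnostics
```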
4.8 Interpretation Guidelines
- Passing all criteria ⇒ semantic emergence supported
- Failing any single criterion ⇒ emergence rejected
- Partial passes ⇒ inconclusive; refinement required
Importantly, rejection does not imply error or incompetence on the part of a model; it indicates only that the structure does not meet the emergence criteria.
4.9 Summary Statement
Semantic emergence is not inferred from impressiveness, but demonstrated through constraint-respecting reproducibility.