✦ Agentic Reasoning for Tree Search

Learning the ARTS of Search for Automated Discovery

Using a reasoning agent, instead of heuristics, to search the space of hypotheses.

Gurusha Juneja¹, Arnav Kumar Jain², Deepak Nathani¹, William Yang Wang¹, Xin Eric Wang¹

¹University of California, Santa Barbara · ²Université de Montréal & Mila – Quebec AI Institute

The scientist, the reasoning model that steers the search, inspects each node's logs and failures to reason about why it scored low, then proposes diverse hypotheses through verbalized sampling. When the search history outgrows the context window, ARTS folds it into the model's weights with test-time training, so a 4B model matches an o3-scale model at about a fifth of the inference cost.

A single search step of ARTS.

Abstract

Scientific discovery can be formulated as an iterative search process over the space of hypotheses and experiments. Contemporary methods navigate this space using heuristics such as MCTS. These algorithms conflate the merit of a hypothesis with the quality of its experimental execution. A promising hypothesis with only a preliminary execution is therefore ranked below a modest hypothesis whose execution is refined. Moreover, prior methods prune the search logs as the search progresses because the accumulated history outgrows the context window. We propose Agentic Reasoning for Tree Search (ARTS), where we deploy a reasoning language model to navigate this space. The model inspects prior execution logs, diagnoses whether earlier failures arose from faulty implementations or bad hypotheses, and selects the hypothesis to build on next. To mitigate challenges with context length, ARTS uses test-time training to instill knowledge of the search tree in the model weights. Across 22 tasks from MLGym and MLEBench, we show that ARTS outperforms leading algorithms, with a relative improvement of over 15.3% in the normalized score. With test-time training, we show that a Qwen3-4B agent can match the performance of closed-source frontier models like Gemini-3 Pro and GPT o3-reasoning with up to 5× lower inference cost. We further observe that on partially observable RL tasks, the test-time-trained Qwen3-4B scientist surpasses ARTS with the o3 scientist by rediscovering the human-best recurrent-memory solution that heuristic methods prune away.

Results

16 / 22

ML tasks where ARTS beats prior agents

+15.3%

relative improvement in normalized score

ARTS* 4B ≈ o3

a test-time-trained 4B model nearly matches GPT o3

1/5th cost

ARTS* reaches it at about a fifth of o3's inference cost

Aggregate normalized scores: mean, median, IQM and optimality gap.

**Reliable metrics.** ARTS leads on mean, median and IQM, with the smallest optimality gap (reliable metrics of Agarwal et al., 95% bootstrap CIs).
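For readers unfamiliar with the two less common aggregates, here is a minimal sketch of how IQM and optimality gap are usually computed from a list of normalized scores. The function names, the target of 1.0, and the sample scores are illustrative assumptions, not values or code from the paper.

```python
from statistics import mean


def interquartile_mean(scores):
    """Mean of the middle 50% of scores (bottom and top 25% discarded)."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))  # number of values trimmed from each end
    return mean(ordered[cut:len(ordered) - cut])


def optimality_gap(scores, target=1.0):
    """Average shortfall below `target`; scores above it contribute zero."""
    return mean(max(target - s, 0.0) for s in scores)


# Hypothetical normalized scores for eight runs, one of them a lucky outlier.
runs = [0.2, 0.4, 0.4, 0.5, 0.5, 0.6, 0.6, 3.0]

print("mean:", mean(runs))                # pulled upward by the outlier
print("IQM:", interquartile_mean(runs))   # reflects the typical run
print("optimality gap:", optimality_gap(runs))
```

Because the IQM ignores the extreme quarters of the distribution, a single unusually good run cannot carry the aggregate, which is why a lead on IQM says more about typical behavior than a lead on the mean.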

Per-task normalized score across all evaluated tasks.

**Performance progression with time.** Normalized score over wall-clock hours on each task. ARTS often trails early but pulls ahead, because it explores diverse hypotheses instead of committing to one too soon.

🧭

Rediscovering solutions heuristics throw away

On the partially observable MetaMaze task, the test-time-trained Qwen3-4B scientist rediscovers the human-best recurrent-memory solution that MCTS-style methods prune early — outperforming even ARTS with an o3 scientist.

How it works

Four ideas behind ARTS

🔍

Inspect and reason about failures

ARTS reads each node's logs and reasons about why it scored low — separating a faulty implementation from a genuinely weak hypothesis — then reasons about which node and which hypothesis are most worth pursuing next, instead of greedily following the score.
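The contrast with greedy selection can be sketched in a few lines. This is an illustration only: the `Node` fields, the prompt wording, and the two sample nodes are hypothetical, and the call to the reasoning model itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    hypothesis: str
    score: float
    log_tail: str  # last lines of the execution log (hypothetical field)


def greedy_pick(nodes):
    """Score-only baseline: always expand the best-scoring node."""
    return max(nodes, key=lambda n: n.score)


def build_selection_prompt(nodes):
    """Ask a reasoning model to diagnose each node before choosing one."""
    lines = [
        "For each node, decide whether its low score comes from a faulty",
        "implementation or from a weak hypothesis. Then name the node to expand.",
        "",
    ]
    for n in nodes:
        lines.append(f"[{n.node_id}] hypothesis: {n.hypothesis}")
        lines.append(f"  score: {n.score:.3f}")
        lines.append(f"  log tail: {n.log_tail}")
    return "\n".join(lines)


# Hypothetical nodes: n2 scores low, but its log points to an implementation bug.
nodes = [
    Node("n1", "tune learning rate", 0.62, "converged after 40 epochs"),
    Node("n2", "add recurrent memory", 0.31, "shape mismatch; fell back to zero state"),
]

print("greedy choice:", greedy_pick(nodes).node_id)
print(build_selection_prompt(nodes))
```

The greedy rule never sees the log, so it cannot tell that the lower-scoring node failed for a fixable reason. The prompt puts that evidence in front of the model before it commits.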

📋

Audit the baseline to learn its structure

Before it changes anything, ARTS audits the baseline code — its architecture, training loop and data pipeline — so it understands where each hypothesis can actually be applied and how to implement it correctly.
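One cheap way to give a model a structural view of the baseline is a static pass over the source before any reasoning happens. The sketch below uses Python's standard `ast` module. It is an assumption about how such an audit could be seeded, not a description of what ARTS does, and the `BASELINE` script is made up.

```python
import ast

# Hypothetical baseline script; it is only parsed, never executed.
BASELINE = '''
import torch.nn as nn

class Policy(nn.Module):
    def forward(self, obs):
        return obs

def load_data(path):
    return path

def train(model, data, epochs=10):
    for _ in range(epochs):
        pass
'''


def audit(source):
    """List top-level classes and functions, and which functions contain loops."""
    tree = ast.parse(source)
    summary = {"classes": [], "functions": [], "loops_in": []}
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            summary["classes"].append(node.name)
        elif isinstance(node, ast.FunctionDef):
            summary["functions"].append(node.name)
            # A loop inside a function is a rough hint of a training loop.
            if any(isinstance(child, (ast.For, ast.While)) for child in ast.walk(node)):
                summary["loops_in"].append(node.name)
    return summary


print(audit(BASELINE))
```

A summary like this could be placed at the top of the scientist's context, so that hypotheses are phrased in terms of components that actually exist in the code.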

🎲

Diverse hypotheses via verbalized sampling

Reasoning models driving search tend to suffer diversity collapse. ARTS uses verbalized sampling to elicit a varied set of candidate hypotheses at each expansion, keeping the search broad while staying grounded in prior evidence.
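In rough terms, verbalized sampling asks the model for several candidates together with a stated probability for each, then samples from that distribution instead of taking the single most likely answer. The sketch below shows the idea with a hard-coded reply. The prompt wording, the JSON schema, and the example hypotheses are assumptions, not the prompts used by ARTS.

```python
import json
import random


def verbalized_sampling_prompt(task, k=5):
    """Ask for k candidates, each with a verbalized probability."""
    return (
        f"Task: {task}\n"
        f"Propose {k} distinct hypotheses. Reply with a JSON list of objects "
        'with keys "hypothesis" and "probability".'
    )


def parse_candidates(reply):
    """Parse the JSON reply and renormalize the stated probabilities."""
    items = json.loads(reply)
    total = sum(float(item["probability"]) for item in items)
    if total <= 0:
        raise ValueError("stated probabilities must sum to a positive value")
    return [(item["hypothesis"], float(item["probability"]) / total) for item in items]


def sample_hypothesis(candidates, rng):
    """Draw one hypothesis in proportion to its normalized probability."""
    texts = [text for text, _ in candidates]
    weights = [weight for _, weight in candidates]
    return rng.choices(texts, weights=weights, k=1)[0]


# Hypothetical model reply, hard-coded so the sketch runs without an API.
reply = json.dumps([
    {"hypothesis": "add a recurrent memory module", "probability": 0.2},
    {"hypothesis": "tune the learning-rate schedule", "probability": 0.5},
    {"hypothesis": "augment the training data", "probability": 0.3},
])

candidates = parse_candidates(reply)
rng = random.Random(0)  # seeded for repeatability
print(verbalized_sampling_prompt("improve validation accuracy", k=3))
print(sample_hypothesis(candidates, rng))
```

Sampling from the stated distribution means lower-probability ideas still get expanded some of the time, which is what keeps the search from collapsing onto one direction.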

⚙️

Test-time training to remember the tree

As history outgrows the context window, prior methods prune it and lose information. ARTS instead bakes the search tree into the model's weights with test-time training, letting a small Qwen3-4B scientist rival frontier models at a fraction of the cost.
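This page does not spell out the training objective, so the following only sketches the data-preparation side under stated assumptions: each search node becomes one prompt/completion record, and the records are written as JSON Lines for a fine-tuning job to consume. The node fields, the record format, and the sample tree are all hypothetical.

```python
import json


def node_to_example(node):
    """Turn one search node into a prompt/completion training record."""
    prompt = (
        f"Hypothesis: {node['hypothesis']}\n"
        f"Parent: {node['parent']}\n"
        "What happened when this was tried?"
    )
    completion = f"Score {node['score']:.3f}. {node['outcome']}"
    return {"prompt": prompt, "completion": completion}


def tree_to_jsonl(nodes):
    """Serialize the whole tree, one JSON record per line."""
    return "\n".join(json.dumps(node_to_example(n)) for n in nodes)


# Hypothetical search history.
history = [
    {"hypothesis": "baseline", "parent": None, "score": 0.40,
     "outcome": "Runs cleanly; serves as the reference."},
    {"hypothesis": "add recurrent memory", "parent": "baseline", "score": 0.31,
     "outcome": "Shape mismatch in the hidden state; implementation fault."},
]

print(tree_to_jsonl(history))
```

Once the history lives in records like these, nothing has to be pruned from the prompt: older nodes can leave the context window after the model has been updated on them.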

Key findings

What the search learns to do

efficiency

ARTS gives the largest gains on tasks where each hypothesis is a costly run and committing to the wrong direction early is expensive.

diversity

Verbalized sampling keeps candidate hypotheses diverse across an expansion, so the search escapes local optima instead of collapsing onto one idea.

reliability

Gains hold on the interquartile mean, not just the average — improvement comes from typical runs, not a few lucky seeds.

scale

Test-time training lets a 4B open model match closed frontier models, and the advantage grows as the search history gets longer.

Search trees

ARTS vs. AIRA on real tasks

Every node is one validated experiment.

Simplified for brevity — not the verbatim trees. For the full annotated search trees, see Appendix A–C of the paper.

Cite

BibTeX

@article{juneja2026arts,
  title   = {Learning the ARTS of Search for Automated Discovery},
  author  = {Juneja, Gurusha and Jain, Arnav Kumar and Nathani, Deepak and Wang, William Yang and Wang, Xin Eric},
  journal = {Preprint},
  year    = {2026}
}