Google TabFM: Ensembling Wins Even for Tables
On June 30, 2026, Google Research released TabFM, a zero-shot foundation model for tabular data. It predicts on a brand-new table in one forward pass. No training, no tuning, no feature engineering. On the TabArena benchmark it beats heavily tuned gradient-boosted trees.
That headline is the news. The detail worth your attention is quieter. TabFM's top result does not come from the single-pass model. It comes from a 32-way ensemble of that model. Even for a state-of-the-art tabular network, ensembling still wins.
If you have spent time with tabular data, that will feel familiar. This article looks at why, and asks whether the same intuition transfers when the models are large language models instead of trees.
Time to read: 7-9 minutes.
What is TabFM?
TabFM reframes tabular prediction as in-context learning. You hand it the labeled rows and the rows you want answered as a single input. It reads the labeled portion as context and predicts the rest. There is no gradient step on your data.
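The calling pattern can be sketched in a few lines. This is not TabFM's API: `predict_in_context` is a hypothetical name, the rows are made up, and a 1-nearest-neighbour lookup stands in for the transformer. It only shows the shape of the interaction. Labeled rows and query rows go in together, predictions come out, and nothing is fitted in between.

```python
from math import dist

def predict_in_context(context_rows, context_labels, query_rows):
    """Toy stand-in for an in-context predictor (hypothetical, not TabFM).

    Each query row receives the label of its closest labeled row. There is
    no fit step and no stored state: labeled rows are used only as context.
    """
    predictions = []
    for query in query_rows:
        # Index of the labeled row closest to this query (Euclidean distance).
        nearest = min(range(len(context_rows)),
                      key=lambda i: dist(context_rows[i], query))
        predictions.append(context_labels[nearest])
    return predictions

# Made-up rows: (age, monthly_spend) with a churn label.
context_rows = [(25, 20.0), (31, 35.0), (58, 80.0), (62, 95.0)]
context_labels = ["churn", "churn", "stay", "stay"]
query_rows = [(29, 30.0), (60, 90.0)]

predictions = predict_in_context(context_rows, context_labels, query_rows)
print(predictions)
```

A real in-context model replaces the distance lookup with learned attention over the context rows, but the contract is the same: one call, no training loop.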
Its architecture is a three-stage hybrid-attention pipeline, per the model card:
- Column attention embeds each cell and attends across features to find relationships between columns.
- Row compression summarizes each row into a dense vector, which cuts the cost of the next stage.
- In-context transformer attends over the compressed rows, treating your training rows as context for the test rows.
It was trained on hundreds of millions of synthetic datasets generated from structural causal models. Synthetic data was the choice because diverse, high-quality open tabular data is scarce. The weights ship under a non-commercial license; the code is Apache-2.0 on GitHub.
How much does ensembling add inside TabFM?
Google ships two presets. Plain TabFM is the single out-of-the-box forward pass. TabFM-Ensemble combines 32 members built from cross features and SVD features, blends them with a non-negative least-squares solver, and adds Platt scaling for classification.
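Platt scaling is the least familiar piece of that recipe, so here is a minimal sketch of it in its simplest form: fit a two-parameter sigmoid that maps raw scores to probabilities by minimizing log-loss on held-out data. The scores and labels are made up, the function names are illustrative, and applying the calibration to a single blended score is an assumption. How TabFM-Ensemble wires it in is not described here.

```python
from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fit_platt(scores, labels, lr=0.1, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log-loss."""
    a, b = 1.0, 0.0  # start from the uncalibrated mapping sigmoid(score)
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # d(log-loss)/d(logit) for one row
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def mean_log_loss(probs, labels):
    return -sum(y * log(p) + (1 - y) * log(1 - p)
                for p, y in zip(probs, labels)) / len(labels)

# Made-up held-out scores and binary labels (deliberately not separable).
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
labels = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(scores, labels)
raw = [sigmoid(s) for s in scores]
calibrated = [sigmoid(a * s + b) for s in scores]
print(f"a={a:.3f} b={b:.3f}")
print(f"log-loss raw={mean_log_loss(raw, labels):.4f} "
      f"calibrated={mean_log_loss(calibrated, labels):.4f}")
```

Calibration does not change the ranking of rows when the fitted slope is positive. It only rescales the scores so they behave like probabilities.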
On TabArena, which spans 51 datasets (38 classification, 13 regression) ranging from 700 to 150,000 rows, TabFM-Ensemble ranks first and plain TabFM ranks second. That ordering holds on both classification and regression.
Google published the ranking but not the size of the gap. TabFM's repository includes the per-fold result files, so we computed the split ourselves across all 51 datasets and their folds.
| Task | Folds where ensemble wins | Datasets where ensemble wins | Median error reduction |
|---|---|---|---|
| Classification | 63.5% (377 of 594) | 29 of 38 | 0.73% |
| Regression | 76.6% (170 of 222) | 10 of 13 | 0.35% |
Read the win rates and the median error reduction together, because they tell different stories. The ensemble wins often. It also wins by very little: the median error reduction per dataset sits under 1% for both task types.
That is the honest shape of ensembling on tables. It is a reliable small edge, not a miracle. You pay 32x the inference for a fraction of a percent, and it is still worth it at the top of a leaderboard where fractions decide rank.
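For readers who want to run the same kind of split on their own results, here is a small sketch. The per-fold records are invented for illustration and are not the TabFM result files. The aggregation (mean error across folds, then relative reduction per dataset) is one reasonable choice and an assumption about method. The last lines only re-derive the table's percentages from its own counts.

```python
from statistics import median

# Hypothetical per-fold errors (lower is better). NOT the TabFM result files.
# Each record: (dataset, fold, single_model_error, ensemble_error)
fold_results = [
    ("bank", 0, 0.1200, 0.1190),
    ("bank", 1, 0.1180, 0.1185),
    ("bank", 2, 0.1210, 0.1195),
    ("wine", 0, 0.2500, 0.2480),
    ("wine", 1, 0.2450, 0.2440),
    ("wine", 2, 0.2520, 0.2530),
]

def fold_win_rate(results):
    """Share of folds where the ensemble's error is strictly lower."""
    wins = sum(1 for _, _, single, ens in results if ens < single)
    return wins / len(results)

def per_dataset_reduction(results):
    """Relative error reduction per dataset, from mean error across folds."""
    by_dataset = {}
    for name, _, single, ens in results:
        by_dataset.setdefault(name, []).append((single, ens))
    reductions = {}
    for name, pairs in by_dataset.items():
        mean_single = sum(s for s, _ in pairs) / len(pairs)
        mean_ens = sum(e for _, e in pairs) / len(pairs)
        reductions[name] = (mean_single - mean_ens) / mean_single
    return reductions

win_rate = fold_win_rate(fold_results)
reductions = per_dataset_reduction(fold_results)
dataset_wins = sum(1 for r in reductions.values() if r > 0)
median_reduction = median(reductions.values())

print(f"fold win rate: {win_rate:.1%}")
print(f"dataset wins: {dataset_wins} of {len(reductions)}")
print(f"median error reduction: {median_reduction:.2%}")

# Sanity check of the table's percentages from its own counts.
classification_pct = round(100 * 377 / 594, 1)
regression_pct = round(100 * 170 / 222, 1)
print(classification_pct, regression_pct)
```

Even in this toy data the pattern from the table shows up: the ensemble wins most folds while the per-dataset reduction stays well under 1%.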
Why does ensembling help tabular data at all?
The intuition predates foundation models by decades. Tabular machine learning has always been ensemble-first. Random forests average many decorrelated trees. Gradient boosting adds weak learners in sequence, each one correcting the errors of those before it. The winning entry in most tabular competitions is a blend, not a single estimator.
The reason is the bias-variance tradeoff. A single model makes errors that come partly from signal it missed (bias) and partly from noise it fit (variance). Averaging several models whose mistakes are not identical cancels much of the noisy part while keeping the shared signal. It does nothing for mistakes every member shares. The more decorrelated the members, the more error cancels.
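The cancellation can be made quantitative with a standard result. If each of m members has error variance sigma2 and every pair of members has error correlation rho, the variance of their averaged error is rho * sigma2 + (1 - rho) * sigma2 / m. The sketch below evaluates that formula for a 32-way average. Equal variances and a single shared correlation are simplifying assumptions, not a description of TabFM's members.

```python
def ensemble_variance(sigma2, rho, m):
    """Variance of the mean of m errors, each with variance sigma2 and
    pairwise correlation rho: rho * sigma2 + (1 - rho) * sigma2 / m."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / m

# Compare a single model (variance 1.0) with a 32-way average.
for rho in (0.0, 0.5, 0.9):
    averaged = ensemble_variance(1.0, rho, 32)
    print(f"rho={rho:.1f}  single=1.000  32-way={averaged:.3f}")
```

The first term, rho * sigma2, is a floor that no number of members removes. With highly correlated members almost all of the variance sits in that floor, which is why adding members helps less and less.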
TabFM-Ensemble engineers that decorrelation on purpose. Cross features and SVD features give each member a different view of the same table, so their errors point in different directions. The non-negative least-squares (NNLS) blend then gives the most weight to the members whose disagreements improve the combined prediction. It is the same recipe a Kaggle grandmaster would recognize, wrapped around a neural network instead of trees.
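A non-negative least-squares blend is small enough to write out. The sketch below solves it with cyclic coordinate descent in plain Python on made-up predictions from three hypothetical members. It is not TabFM's solver, and in practice `scipy.optimize.nnls` is the usual tool. The targets are built as 0.7 of member A plus 0.3 of member B, so the blend should recover those weights and give member C nothing.

```python
def nnls_blend(member_preds, targets, sweeps=5000):
    """Non-negative least squares by cyclic coordinate descent.

    member_preds: one list of predictions per member, all the same length.
    Returns one weight >= 0 per member, chosen to minimize the squared
    error of the weighted sum of member predictions against targets.
    """
    weights = [0.0] * len(member_preds)
    residual = list(targets)  # targets minus current blend (blend starts at 0)
    norms = [sum(p * p for p in preds) for preds in member_preds]
    for _ in range(sweeps):
        for j, preds in enumerate(member_preds):
            if norms[j] == 0.0:
                continue  # an all-zero member cannot contribute
            # Best unconstrained update for member j, then clip at zero.
            step = sum(p * r for p, r in zip(preds, residual)) / norms[j]
            new_weight = max(0.0, weights[j] + step)
            delta = new_weight - weights[j]
            if delta != 0.0:
                residual = [r - delta * p for r, p in zip(residual, preds)]
                weights[j] = new_weight
    return weights

# Made-up validation predictions from three hypothetical members.
member_a = [0.1, 0.4, 0.8, 0.9, 0.3]
member_b = [0.2, 0.5, 0.6, 0.7, 0.1]
member_c = [0.9, 0.1, 0.5, 0.2, 0.6]
targets = [0.7 * a + 0.3 * b for a, b in zip(member_a, member_b)]

weights = nnls_blend([member_a, member_b, member_c], targets)
print([round(w, 4) for w in weights])
```

The non-negativity constraint is what keeps a blend from using one member to subtract out another, which tends to overfit when members are highly correlated.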
Does the ensemble intuition transfer to LLMs?
This is the question a data scientist should ask next. If averaging decorrelated predictors helps trees and helps TabFM, does it help large language models too?
It does, but you have to be precise about what you are ensembling. TabFM ensembles the predictor. It runs the same task 32 ways and blends the numbers. Running three frontier LLMs to output a churn probability and averaging them is the naive version of that, and it is usually a bad idea. LLMs are weaker than TabFM or gradient boosting at raw tabular prediction, and averaging weak predictors does not fix a bias problem.
The transfer works at a different layer. Ensemble the reasoning, not the prediction. When the task is analyzing a dataset rather than scoring a row, the value of multiple models compounds.
What does Collaborative Synthesis over a table look like?
Collaborative Synthesis in AI Crucible runs several models over one prompt and has an arbiter merge their outputs into one document. It is the LLM analogue of a blend, applied to analysis instead of scores.
Point it at a tabular problem. Give Gemini 3.1 Pro, Claude Sonnet 5, and Qwen3.7-Max the schema, summary statistics, and a data dictionary for a credit-risk table, and ask each to reason about it. The three models rarely notice the same things:
- One flags a leakage risk where a feature encodes the label indirectly.
- One proposes an interaction term the others missed.
- One catches that a class imbalance will wreck naive accuracy.
Each model's analysis carries its own blind spots, exactly like a single decision tree. The arbiter synthesizes them into one review that keeps the shared conclusions and surfaces the disagreements. The decorrelated-errors logic that makes TabFM-Ensemble beat plain TabFM is doing the same work here, one level up the stack.
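The merge step can be pictured with a toy sketch. Nothing here is AI Crucible's API: the model names, the findings, and the `synthesize` function are hypothetical, and exact string matching stands in for the semantic merging a real arbiter model would do. It only shows the bookkeeping of keeping what several reviewers share and surfacing what only one of them raised.

```python
# Hypothetical findings, as if three models had each reviewed the same table.
# In a real workflow these would come from model calls; here they are fixed.
reviews = {
    "model_a": {"income has many missing values",
                "possible label leakage in days_past_due"},
    "model_b": {"income has many missing values",
                "try interaction: income x loan_amount"},
    "model_c": {"income has many missing values",
                "class imbalance will distort accuracy"},
}

def synthesize(reviews, quorum=2):
    """Split findings into shared (raised by >= quorum models) and minority."""
    raised_by = {}
    for model, findings in reviews.items():
        for finding in findings:
            raised_by.setdefault(finding, []).append(model)
    shared = {f: sorted(ms) for f, ms in raised_by.items() if len(ms) >= quorum}
    minority = {f: sorted(ms) for f, ms in raised_by.items() if len(ms) < quorum}
    return shared, minority

shared, minority = synthesize(reviews)
print("Shared conclusions:")
for finding in sorted(shared):
    print(f"  {finding} ({len(shared[finding])} models)")
print("Raised by one model only:")
for finding in sorted(minority):
    print(f"  {finding} ({minority[finding][0]})")
```

The minority list is the interesting part. Those are the findings a single-model review had a two-in-three chance of missing.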
The difference is magnitude. TabFM's ensemble buys a sub-1% edge because its members are already near-optimal and highly correlated. Three different labs' models reasoning about a messy dataset are far less correlated, so the synthesis can catch issues a single model would ship. The gains are larger and noisier than the tidy leaderboard delta.
When is an ensemble not worth it?
Ensembling is a tool, not a default. Skip it when the economics or the statistics do not support it.
- Correlated members. Three checkpoints of the same base model make the same mistakes. Averaging them cancels little. Diversity is the whole point.
- A bias problem, not a variance problem. If every model is wrong the same way, no blend rescues you. Fix the features or the model class first.
- Cost-sensitive inference. A 32x or 3x inference bill for a fraction of a percent rarely pays off outside a leaderboard or a high-stakes decision.
- Simple, separable data. A single well-tuned gradient-boosted model is often enough. Reach for the ensemble when the last percent matters.
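The first item on that list is cheap to check before paying for an ensemble. A minimal sketch: measure how correlated the members' errors are on a validation set. The predictions below are made up, and the helper names are illustrative. Two hypothetical checkpoints of one model are compared with two members from different model families.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def mean_error_correlation(member_preds, targets):
    """Average pairwise correlation between members' validation errors."""
    errors = [[p - t for p, t in zip(preds, targets)] for preds in member_preds]
    pairs = [(i, j) for i in range(len(errors))
             for j in range(i + 1, len(errors))]
    return sum(pearson(errors[i], errors[j]) for i, j in pairs) / len(pairs)

# Made-up validation targets and predictions.
targets = [1.0, 2.0, 3.0, 4.0, 5.0]
checkpoint_1 = [1.20, 1.90, 3.30, 3.80, 5.10]
checkpoint_2 = [1.25, 1.95, 3.30, 3.75, 5.10]  # same model, later checkpoint
other_family = [0.90, 2.20, 2.80, 4.10, 5.30]  # a different model class

same_family = mean_error_correlation([checkpoint_1, checkpoint_2], targets)
diverse = mean_error_correlation([checkpoint_1, other_family], targets)
print(f"checkpoints of one model: {same_family:.2f}")
print(f"different model families: {diverse:.2f}")
```

Error correlation close to 1 means the members mostly repeat each other and averaging them buys little. Low or negative correlation is where a blend has room to help.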
What should data scientists take from TabFM?
TabFM is a genuine shift. A zero-shot network beating tuned gradient boosting on tables was not obvious a year ago. But its own results restate a rule you already trusted: a diverse ensemble usually beats its best single member, even when that member is a frontier model.
The rule scales past trees and past tables. It holds for TabFM's 32 blended passes, and it holds for a panel of LLMs reasoning about your dataset. What changes is the layer you apply it at and the size of the payoff.
If you already reach for a blend on tabular problems, your instinct transfers to LLM workflows. Ensemble the analysts, keep the members decorrelated, and pay the extra compute only when the last percent is worth it.
Sources and further reading
- Introducing TabFM — Google Research blog
- TabFM 1.0.0 model card — Hugging Face
- google-research/tabfm — GitHub
- Google AI introduces TabFM — MarkTechPost
- Collaborative Synthesis strategy and the seven ensemble strategies in AI Crucible
Ensemble win-rate and error-reduction figures were computed from the per-fold result files published in the TabFM repository, not stated by Google.