Michael English

Ireland Quantum 100 · Technical Brief

Calibration routines for 100-qubit machines

A 100-qubit superconducting machine is, on most days, a calibration problem wearing a physics costume. The chip itself behaves; the wiring, the flux landscape, the readout chain, and the temporal drift do not. If you cannot keep gate fidelities inside a tight envelope across all 100 qubits without burning your entire shift on tune-up, you do not have a 100-qubit machine — you have a 100-qubit demo. This is the part of the work that gets the least public attention and the most engineering hours, so it is worth being honest about what good calibration actually looks like at this scale.

Why calibration gets harder, not just bigger, at 100 qubits

At 5 qubits you can calibrate everything by hand in an afternoon. At 27 qubits you write scripts. At 100 qubits the topology stops being a courtesy and starts being a constraint. On a heavy-hex lattice each data qubit couples to two or three neighbours, each coupling has its own ZZ residual, and each two-qubit gate has its own optimal pulse shape that drifts on its own schedule. The combinatorics are not the headline problem — the headline problem is that calibration steps interact. Re-tuning a frequency to dodge a TLS (two-level system) defect on qubit 47 can push qubit 46 closer to a collision, which then degrades the CZ on the 46–52 edge you fixed last Tuesday.

So the design goal is not "calibrate everything"; it is "calibrate the minimum set, in the right order, often enough that the machine stays inside spec, and detect drift before users see it." Everything below follows from that.
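One way to encode "the minimum set, in the right order" is an explicit dependency graph between calibration steps, so a stale result re-triggers only its downstream dependents. The step names and edges here are illustrative, not a prescription:

```python
from graphlib import TopologicalSorter

# Hypothetical step graph: each step lists the steps it depends on.
# A stale step invalidates everything downstream of it.
DEPS = {
    "rabi_amp": [],
    "drag":     ["rabi_amp"],
    "fine_amp": ["drag"],
    "readout":  ["rabi_amp"],
    "cz_pulse": ["fine_amp", "readout"],
    "rb_check": ["cz_pulse"],
}

def steps_to_rerun(stale):
    """Return the stale steps plus all transitive dependents, in an order
    that respects the dependency graph."""
    dirty = set(stale)
    changed = True
    while changed:
        changed = False
        for step, deps in DEPS.items():
            if step not in dirty and dirty.intersection(deps):
                dirty.add(step)
                changed = True
    # Topologically sort just the dirty subgraph.
    subgraph = {s: set(DEPS[s]) & dirty for s in dirty}
    return list(TopologicalSorter(subgraph).static_order())
```

The payoff is that re-tuning one qubit's DRAG coefficient re-runs three steps, not a hundred.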

The calibration hierarchy

A working calibration stack on a transmon device has roughly four layers, and they need to run on different cadences:

- Bring-up: full characterisation after a cooldown. Frequencies, coherence times, readout, and every pulse parameter, from scratch.
- Daily tune-up: re-fitting single-qubit and two-qubit parameters around their previous solutions.
- Hourly drift tracking: fast, targeted measurements that detect which qubits or edges have moved, without touching the rest.
- Continuous benchmarking: randomised-benchmarking and readout numbers feeding a dashboard, so drift is visible before users see it.

The mistake people make is treating these as one workflow. They are four workflows, with different SLAs and different consumers. The hourly loop must never block a user job. The daily tune-up must never block the hourly loop. Bring-up is the only one allowed to take the machine offline.
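One way to keep that SLA split honest is to encode it. The cadences and field names below are illustrative assumptions, not a real scheduler:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layer:
    name: str
    cadence_s: int          # nominal repeat interval (placeholder values)
    blocks_user_jobs: bool  # may this layer take the machine offline?

STACK = [
    Layer("bring-up",                30 * 24 * 3600, True),
    Layer("daily tune-up",           24 * 3600,      False),
    Layer("hourly drift tracking",   3600,           False),
    Layer("continuous benchmarking", 600,            False),
]

def schedulable(layer: Layer, machine_reserved_for_users: bool) -> bool:
    """Only bring-up is ever allowed to block user jobs."""
    return not (machine_reserved_for_users and layer.blocks_user_jobs)
```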

Single-qubit calibration that actually scales

For single-qubit gates the routine is well understood: Rabi to set amplitude, then DRAG to suppress leakage to the |2⟩ state, then a fine-amplitude error-amplification sequence of repeated pulses (often called fine_amp), cross-checked with an AllXY diagnostic, to squeeze out residual over- or under-rotation. The theory has been settled for well over a decade. What changes at 100 qubits is the bookkeeping.

The practical pattern is to run these in parallel batches grouped by frequency separation: qubits whose drive frequencies are separated by enough megahertz that crosstalk on the drive line is negligible can be calibrated simultaneously. On a heavy-hex layout you typically end up with parallel batches of three or four qubits each, which turns a sequential 100-qubit calibration into something closer to 25 batched operations. Then you measure crosstalk explicitly: drive qubit i, look at the rotation induced on qubit j, build a crosstalk matrix, and either compensate in software or accept it as an error budget item.
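The frequency-separation batching can be sketched as a greedy grouping. The 80 MHz separation is a placeholder, not a measured number:

```python
def parallel_groups(freqs_mhz, min_sep_mhz=80.0):
    """Greedily batch qubits so that any two qubits calibrated
    simultaneously are separated by at least min_sep_mhz on the drive
    line. min_sep_mhz stands in for whatever your crosstalk data justify."""
    groups = []  # each group maps qubit -> frequency
    for q in sorted(freqs_mhz, key=freqs_mhz.get):
        f = freqs_mhz[q]
        for g in groups:
            # Join the first group this qubit is compatible with.
            if all(abs(f - other) >= min_sep_mhz for other in g.values()):
                g[q] = f
                break
        else:
            groups.append({q: f})
    return [sorted(g) for g in groups]
```

A real implementation would also respect chip adjacency, since drive crosstalk is strongest between physical neighbours.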

The unglamorous truth: most of the engineering effort at this layer is in the data pipeline — fitting routines that handle bad fits gracefully, a database that knows which calibration belongs to which cooldown, and an alerting system that catches the qubit which has silently drifted out of spec since 03:00.
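A sketch of the kind of guard that belongs between the fitter and the calibration database; thresholds and messages are illustrative:

```python
def accept_fit(new_value, prior_value, r_squared, rel_tol=0.2, r2_min=0.9):
    """Gate a calibration fit before it reaches the database: reject poor
    fits and implausible jumps, falling back to the prior value with a
    flag the alerting system can pick up. Thresholds are placeholders."""
    if r_squared < r2_min:
        return prior_value, "ALERT: poor fit, keeping prior value"
    if prior_value and abs(new_value - prior_value) / abs(prior_value) > rel_tol:
        return prior_value, "ALERT: implausible jump, keeping prior value"
    return new_value, "ok"
```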

Two-qubit gates and the ZZ problem

Two-qubit gate calibration is where the work lives. On fixed-frequency transmons with tunable couplers (the architecture most relevant to a sovereign machine targeting near-term utility), the CZ or CR gate has more parameters than its single-qubit cousin and a much more uncomfortable error landscape. You are calibrating amplitude, duration, phase on both qubits, and — critically — the residual ZZ interaction during and after the gate.

ZZ is the gift that keeps on giving. Even idle qubits experience an always-on ZZ coupling to their neighbours, which dephases them whenever a neighbour is in |1⟩. Tunable couplers exist precisely so you can null this at idle, but the null point drifts, and it drifts differently from the gate operating point. So the calibration sequence is: find the coupler bias that nulls idle ZZ, find the coupler trajectory that performs a clean CZ, characterise the leakage to non-computational states, and then verify with a Hamiltonian tomography or a focused randomised benchmarking sequence on that edge.
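The idle-ZZ nulling step can be sketched as a bracketed search over coupler bias. The measurement function here is a stand-in for the real paired-Ramsey experiment (the qubit's frequency shift with its neighbour in |1⟩ versus |0⟩); the linear toy model in the test is purely illustrative:

```python
def find_zz_null(measure_zz, lo, hi, tol_khz=1.0):
    """Bisect the coupler bias until the measured idle ZZ (kHz) is below
    tol_khz, assuming the null is bracketed by (lo, hi) and ZZ crosses
    zero exactly once in that window."""
    f_lo, f_hi = measure_zz(lo), measure_zz(hi)
    assert f_lo * f_hi < 0, "null point must be bracketed"
    while True:
        mid = 0.5 * (lo + hi)
        f_mid = measure_zz(mid)
        if abs(f_mid) < tol_khz:
            return mid
        if f_lo * f_mid < 0:
            hi = mid            # null lies in the lower half
        else:
            lo, f_lo = mid, f_mid
```

In production you would warm-start from yesterday's null and search a narrow window, since each measure_zz call costs real machine time.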

Across 100 qubits with a heavy-hex topology you have on the order of 140 two-qubit edges. You cannot afford to calibrate every edge from scratch every day. The realistic strategy is full calibration on cooldown, then a fast drift-tracking sequence per edge that re-fits a small number of parameters around the previous solution. If the drift exceeds a threshold, escalate to a full re-cal for that edge only.
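The per-edge escalation rule fits in a few lines; the thresholds are placeholders for whatever your error budget justifies:

```python
def triage_edges(drift, fast_tol=0.002, full_tol=0.01):
    """Classify each two-qubit edge by how far its tracked parameter has
    drifted from the previous solution: small drift gets the fast re-fit,
    large drift escalates to a full re-calibration of that edge only."""
    plan = {"ok": [], "fast_recal": [], "full_recal": []}
    for edge, d in drift.items():
        if abs(d) >= full_tol:
            plan["full_recal"].append(edge)
        elif abs(d) >= fast_tol:
            plan["fast_recal"].append(edge)
        else:
            plan["ok"].append(edge)
    return plan
```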

Benchmarking: randomised benchmarking, XEB, and when to reach for gate set tomography

You need a number to put on the gate. There are three serious options and they answer different questions.

Randomised benchmarking (RB) gives you an average gate fidelity by running random Clifford sequences of varying depth and fitting the decay. It is cheap, it is robust to state-preparation and measurement (SPAM) errors, and it is the right tool for daily tracking. Interleaved RB lets you isolate the fidelity of a specific gate. Simultaneous RB run on neighbouring qubits in parallel reveals addressing crosstalk. For 100-qubit calibration this is the workhorse.
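A minimal sketch of the daily RB fit, assuming clean single-qubit data with the SPAM floor pinned at 0.5; a real pipeline fits all three parameters of A·pᵐ + B and propagates uncertainties:

```python
import numpy as np

def rb_fidelity(depths, survival, spam_floor=0.5):
    """Fit survival = A * p**m + B with B pinned at the single-qubit
    depolarised floor (0.5), then convert the decay constant p to average
    gate fidelity via the standard RB relation r = (1 - p) * (d - 1) / d."""
    y = np.asarray(survival, dtype=float) - spam_floor
    slope, _ = np.polyfit(depths, np.log(y), 1)  # log-linear fit of the decay
    p = float(np.exp(slope))
    d = 2  # single-qubit Hilbert-space dimension
    return 1.0 - (1.0 - p) * (d - 1) / d
```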

Cross-entropy benchmarking (XEB) was popularised by the Google supremacy work. It is more sensitive to coherent errors than RB and scales naturally to multi-qubit blocks, but the analysis is heavier and the interpretation is less crisp for engineering decisions. Use it for system-level validation, not daily tune-up.
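The linear-XEB estimator itself is a one-liner; the cost lives in the classical simulation that produces the ideal output probabilities, which this sketch takes as an input:

```python
import numpy as np

def linear_xeb_fidelity(ideal_probs, samples, n_qubits):
    """Linear cross-entropy fidelity: F = 2**n * <p_ideal(x)> - 1,
    averaged over the measured bitstrings x. F is near 1 for a perfect
    device and near 0 for a fully depolarised one. ideal_probs maps each
    bitstring to its simulated ideal probability."""
    mean_p = float(np.mean([ideal_probs[x] for x in samples]))
    return (2 ** n_qubits) * mean_p - 1.0
```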

Gate set tomography (GST) is the diagnostic microscope. It returns the full process matrices for your gates, including coherent errors that RB averages over. It is also expensive — sequence counts grow quickly, and the analysis takes serious compute. The right way to use GST on a 100-qubit machine is targeted: when RB on a specific edge shows a fidelity worse than its T1/T2 budget predicts, run GST on that one- or two-qubit subsystem to find out whether the error is a phase miscalibration, a leakage channel, or a stochastic Pauli channel, and fix the underlying pulse. Do not run GST on the full chip. You will not finish.

Drift, automation, and the path to error correction

A 100-qubit machine that is not automated is a 100-qubit machine that spends most of its calendar in calibration. The target is a closed-loop system where the hourly drift tracker triggers a targeted re-calibration on a handful of qubits or edges, the daily tune-up runs without human intervention, and the benchmarking dashboard tells you the state of the machine in numbers a user job will see.

This matters more, not less, as the surface code roadmap comes into view. Surface-code error correction assumes that physical gate errors sit below a threshold (commonly cited around the 1% mark for the standard surface code, lower in practice if you want a useful logical qubit without enormous overhead). Hitting that threshold is a calibration story before it is an architecture story. Every 0.1% you leave on the table because a CZ has a slow phase drift you did not catch translates directly into more physical qubits per logical qubit, which translates into a smaller usable machine. For more on how this fits into the broader build plan, see our Ireland Quantum 100 programme.
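The qubit-overhead arithmetic can be made concrete with the commonly used scaling heuristic p_L ≈ A·(p/p_th)^((d+1)/2). Every constant below is an illustrative assumption; real overheads depend on the decoder, the noise model, and the layout:

```python
def distance_for_target(p_phys, p_target, p_th=0.01, prefactor=0.1):
    """Smallest odd surface-code distance d whose estimated logical error
    prefactor * (p_phys / p_th) ** ((d + 1) / 2) is below p_target, plus
    a rough physical-qubit count of 2*d*d - 1 per logical qubit."""
    ratio = p_phys / p_th
    assert ratio < 1.0, "physical error must sit below threshold"
    d = 3
    while prefactor * ratio ** ((d + 1) / 2) > p_target:
        d += 2  # code distance stays odd
    return d, 2 * d * d - 1
```

Under these toy constants, shaving the physical error rate shrinks the required distance, and with it the qubits-per-logical-qubit bill, which is the point of the paragraph above.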

For climate workloads specifically — variational chemistry for carbon capture catalysts, battery electrolyte simulation, the kind of jobs we are prioritising — the dominant error mode visible to the user is decoherence during the long ansatz circuits. That is a calibration-discipline problem more than a physics problem. The qubits are good enough; the question is whether you are keeping them at their good-enough operating point twenty-three hours out of twenty-four. More on that workload mix in our notes on climate workloads on Ireland Quantum 100.

Where to start this week

If you are standing up calibration infrastructure on any superconducting device — five qubits or a hundred — pick the smallest useful loop and close it before adding scope. Get a single qubit's Rabi, DRAG, and AllXY running automatically against a database, with a plot you can look at on Monday morning. Then add T1 and T2 tracking. Then add interleaved randomised benchmarking. Resist the temptation to script the full hierarchy on day one; you will get the order wrong and rebuild it anyway. The teams that ship working 100-qubit calibration are the teams that shipped working 5-qubit calibration first, and were ruthless about not letting the codebase rot between cooldowns.
