
Replenishment Readiness Playbook: Size-Level Masking, the 50/50 Allocation Matrix, and the Shared-Ownership Operating Model

An operator-grade playbook for apparel planning leaders deploying AI-assisted size-level replenishment. Covers the 8-point masking diagnostic, the 50/50 allocation decision matrix by fleet profile, the stable-vs-noisy signal hygiene table, the test-and-learn iteration cadence, and the shared-ownership RACI that makes size-level governance stick.

What this playbook is

Most apparel planning teams have an allocation policy and a replenishment policy. Very few have a written, operator-grade playbook that connects them at the size level — the layer where full-price revenue is actually made or lost. This is that playbook.

It is designed for Planning Directors, VP Merch, and Allocation leads who are either (a) deploying AI-assisted replenishment for the first time and finding it surfaces inefficiencies they did not know they had, or (b) running size-level replenishment already and looking for a rigorous diagnostic, decision, and governance framework to codify it.

Everything below is operational. No abstractions, no reference architecture diagrams — frameworks you apply to your next receipt plan, decision matrices you use on your next allocation call, and a governance model you can stand up inside a single planning cycle.

Why size-level is the unit of analysis

Product-level KPIs are the default reporting layer for almost every apparel planning team. They are also where size-level demand problems go to hide. A style with a 70% sell-through, a sub-20% markdown rate, and a clean weekly sales curve can still be materially broken — if the core sizes stocked out on day two and the 70% sell-through was the composite of fast-moving core depletion followed by slow-moving fringe residual at full price.

The business cost of not surfacing this is threefold. First, next season's buy repeats the same size curve and reproduces the same ceiling. Second, replenishment decisions trained on the product-level signal over-replenish the residual fringe sizes and under-replenish the (already-empty) core. Third, the markdown provisioning at season end is calibrated to the wrong base case, so the finance partner is repeatedly surprised.

Every framework below sits on this foundation: size-level is the resolution at which apparel planning decisions are actually correct or incorrect. Everything aggregated up from there is a summary view, not a decision input.


Framework 1 — The 8-point size-level masking diagnostic

Run this diagnostic against any style flagged as a "healthy" performer. Each tell is independent; two or more firing on the same style indicates the product-level sell-through is masking a size-level problem.

Tell 1 — Day-2 or day-3 core-size stockout. A core size (typically the S/M/L that should carry 60–80% of category demand) reaches zero on-hand within 72 hours of first receipt. On a correctly-curved buy, core sizes should sell through across roughly the same window as the full range, not ahead of it.

Tell 2 — Residual concentration in top and bottom sizes. At end of selling window, 80%+ of remaining units are in XS/S or XL/XXL. Fringe-size residual on a nominally successful style is the single most reliable size-curve error signal.

Tell 3 — Sell-through slope discontinuity. The daily sell-through rate drops sharply after core sizes deplete, and the subsequent tail is linear rather than exponential. A correctly-curved buy produces a smoother decay curve.

Tell 4 — Viewed-availability delta (e-commerce). PDP impressions on OOS sizes (tracked via the size selector interaction) meaningfully exceed PDP impressions on in-stock sizes for the style. This is demand that cannot be inferred from sales data.

Tell 5 — Back-in-stock email signups concentrated by size. When 40%+ of back-in-stock signups for a style are on a single size, the stockout timing on that size likely happened inside the peak demand window.

Tell 6 — Returns disproportionately on fringe sizes. If return rate by size is 2–3x higher on XS or XXL than on M or L, the style is disproportionately sold to customers whose actual size was already out — fringe-size sales are catch-all substitutions.

Tell 7 — Store-cluster variance on sell-through. The same style performs materially differently across store clusters in a way that does not match the cluster's overall category performance. Size demand varies by cluster in ways a flat curve cannot capture.

Tell 8 — Wholesale account reorder frequency. If a wholesale account reorders the same style 2+ times within a selling window, the initial PO underestimated size-level demand at their customer base.

Scoring. Two tells firing → investigate. Three or more → size curve was structurally wrong; revise before the next buy. Five or more → the style's true revenue ceiling was 15–30% above observed; calibrate lost-sales estimate accordingly.
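The scoring tiers above can be sketched as a simple lookup. This is an illustrative sketch, not a shipped implementation; the tell names and the upstream logic that sets each flag are assumptions you would replace with your own data pipeline.

```python
# Hypothetical sketch of the 8-point masking diagnostic scoring.
# `tells` maps each tell to a boolean flag computed upstream
# (stockout timing, residual mix, returns data, etc.).

def score_masking_diagnostic(tells: dict) -> str:
    """Map the count of firing tells to the playbook's action tiers."""
    firing = sum(tells.values())
    if firing >= 5:
        return "revise curve; calibrate lost-sales estimate (+15-30% ceiling)"
    if firing >= 3:
        return "size curve structurally wrong; revise before next buy"
    if firing >= 2:
        return "investigate"
    return "no action"

style_tells = {
    "core_stockout_72h": True,       # Tell 1
    "fringe_residual_80pct": True,   # Tell 2
    "slope_discontinuity": False,    # Tell 3
    "viewed_availability_delta": True,  # Tell 4
    "bis_signup_concentration": False,  # Tell 5
    "fringe_return_rate": False,     # Tell 6
    "cluster_variance": False,       # Tell 7
    "wholesale_reorders": False,     # Tell 8
}
print(score_masking_diagnostic(style_tells))
# → "size curve structurally wrong; revise before next buy"
```

Running the scorer weekly over the flagged-style list keeps the diagnostic mechanical rather than ad hoc.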

The most dangerous product-level KPI is "healthy sell-through with low markdown rate." It is also the most common signature of size-level masking. A buy planner who treats this KPI as ratification of the size curve will reproduce the error on the next PO.


Framework 2 — The 50/50 allocation decision matrix

The question of how much inventory to push to stores up front versus hold for replenishment is usually answered by habit — each brand has "the way we do it" and rarely revisits the decision. The correct answer varies with fleet size, product continuity, and transfer cost.

The decision matrix

| Business profile | Recommended holdback | Primary driver |
|---|---|---|
| Emerging DTC, 1–10 doors, seasonal-heavy | 15–25% | Short selling window; low inter-door transfer cost; upfront allocation risk is naturally contained by small fleet |
| Mid-market DTC + wholesale, 10–30 accounts | 25–35% | Mixed channel dynamics; wholesale accounts reorder rather than replenish, so DTC holdback does more work |
| Mid-market omni with 10–50 stores | 30–40% | Continuity and seasonal mix; meaningful transfer friction at scale; replenishment horizon is 2–3 weeks |
| Multi-store chain, 50–150 stores, mixed continuity | 35–50% | Transfer cost is non-trivial; AI-assisted size-level replenishment generates meaningful upside on the back half |
| Multi-store chain, 150+ stores, continuity-heavy | 40–60% | High transfer cost on wrong initial allocation; deep continuity inventory supports longer replenishment runway |

Why the range, not a single number

The matrix gives ranges rather than single numbers because three inputs move the correct answer inside each profile:

Product continuity. A core denim program with 3-year lifecycle supports more holdback than a fashion drop with a 10-week selling window. Continuity product rewards flexibility; seasonal product punishes delay.

Transfer cost structure. A fleet with twice-weekly DC-to-store replenishment and no inter-store transfer requirement supports more holdback than a fleet that depends on costly store-to-store rebalancing.

AI-assisted vs. manual replenishment. Manual replenishment struggles past a 25–30% holdback because the granularity of the allocation decision exceeds what planners can process. AI-assisted replenishment is the precondition for the upper end of each range.
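The three adjusters can be combined into a simple interpolation inside each profile's range. The profile keys, the 0–1 adjuster scores, and the equal weighting are illustrative assumptions; only the ranges and the manual-replenishment ceiling come from the matrix above.

```python
# Sketch of the 50/50 decision matrix as a lookup with in-range
# interpolation. Adjuster scores and weights are assumptions.

HOLDBACK_RANGES = {
    "emerging_dtc": (0.15, 0.25),
    "midmarket_wholesale": (0.25, 0.35),
    "midmarket_omni": (0.30, 0.40),
    "chain_50_150": (0.35, 0.50),
    "chain_150_plus": (0.40, 0.60),
}

def holdback_pct(profile: str, continuity: float,
                 low_transfer_friction: float, ai_assisted: bool) -> float:
    """Interpolate within the profile's range.

    continuity and low_transfer_friction are 0.0-1.0 scores; higher
    values push toward the top of the range. Manual replenishment is
    capped at 30%, per the text's 25-30% ceiling.
    """
    lo, hi = HOLDBACK_RANGES[profile]
    position = (continuity + low_transfer_friction) / 2
    pct = lo + (hi - lo) * position
    if not ai_assisted:
        pct = min(pct, 0.30)
    return round(pct, 3)

# 75-door omni chain, continuity-heavy, cheap DC-to-store runs, AI-assisted
print(holdback_pct("midmarket_omni", 0.8, 0.9, ai_assisted=True))  # → 0.385
```

The cap on manual replenishment encodes the point above: the upper half of each range is only reachable once AI-assisted replenishment can process the decision granularity.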

Worked example

A 75-door mid-market omni chain historically allocates 80% of each seasonal buy to stores up front, handles imbalances with inter-store transfers, and ends each season with 8–12% of units in residual transit or rebalancing windows. Moving to a 55% holdback:

  • Initial allocation decision: 55% of the buy ships to the top two store tiers by category sell-through rate, weighted to door-level size demand profile.
  • Held 45% releases weekly against size-level stockout signals from the prior seven days, prioritising full-price sell-through doors.
  • Inter-store transfers drop by 60–75% because the initial allocation is deliberately conservative and the back half is released against real signal.
  • Full-price sell-through rises by 3–6 percentage points in the categories where the holdback applies, largely from closing the size-level stockout window faster on high-demand sizes.

The transfer-cost savings alone typically cover the operational investment in AI-assisted size-level replenishment within two seasons.
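A back-of-envelope version of that payback claim, with placeholder cost figures (every number below is an assumption to be replaced with your own transfer spend and tooling cost):

```python
# Hypothetical payback arithmetic for the worked example.
# All monetary figures are placeholder assumptions.

seasonal_transfer_cost = 180_000  # assumed per-season inter-store transfer spend
transfer_reduction = 0.65         # midpoint of the 60-75% drop in the example
tool_cost_per_season = 200_000    # assumed AI replenishment operating cost

savings_per_season = seasonal_transfer_cost * transfer_reduction
seasons_to_payback = tool_cost_per_season / savings_per_season
print(f"savings/season: {savings_per_season:,.0f}")   # → savings/season: 117,000
print(f"payback in {seasons_to_payback:.1f} seasons") # → payback in 1.7 seasons
```

Under these assumptions the transfer savings alone pay back inside two seasons, before counting the full-price sell-through lift.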

Set the holdback at the receipt-plan stage, not the allocation stage. If the receipt calendar assumes an 85% upfront allocation and the business needs 55%, there is no DC capacity to hold the back half when it lands. Holdback is a receipt-plan decision with downstream allocation consequences.


Framework 3 — The signal hygiene table

More demand inputs do not produce better replenishment decisions. The failure mode of a "comprehensive" replenishment stack is that noisy signals override stable ones at the precise moments when the stable signals are most valuable. Build the replenishment engine against the stable set; keep the noisy set as context for human review, not as algorithmic input.

Stable signals — use these as replenishment inputs

| Signal | What it tells you | Retention rule |
|---|---|---|
| Full-price sell-through by size | The actual rate at which the style is selling at margin | Weekly recompute; trailing 4–8 weeks depending on velocity |
| Size-level stockout timing | The hidden-ceiling signal — when did core sizes go to zero | Per-style, from receipt forward; drives lost-sales estimation |
| Weeks of supply by size by door | Triggers replenishment; drives priority ranking under DC constraint | Daily recompute if data feeds allow, else weekly |
| E-commerce viewed availability | OOS-size impression count vs. in-stock impression count | Weekly; retained as lost-demand proxy for size-curve revision |
| Back-in-stock email signups by size | Clean demand signal for SKUs where the customer self-declared intent | Rolling; used qualitatively to validate the stockout-timing signal |
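A minimal sketch of a replenishment trigger built only on the stable set — trailing full-price sell-through velocity and on-hand by size by door. The 3.0-week target WOS is an illustrative threshold, not a recommendation.

```python
# Weeks-of-supply trigger using only stable signals.
# target_wos is an assumed threshold; tune per category velocity.

def weeks_of_supply(on_hand: int, trailing_weekly_units: float) -> float:
    """WOS by size by door; infinite when there is no trailing velocity."""
    if trailing_weekly_units <= 0:
        return float("inf")
    return on_hand / trailing_weekly_units

def needs_replenishment(on_hand: int, trailing_weekly_units: float,
                        target_wos: float = 3.0) -> bool:
    return weeks_of_supply(on_hand, trailing_weekly_units) < target_wos

# Size M at a door selling 6 units/week with 10 units on hand
print(needs_replenishment(10, 6.0))  # → True (WOS ≈ 1.7 < 3.0)
```

Note what is absent: no weather input, no promo curve, no social spike. Those stay out of the trigger by design.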

Noisy signals — exclude from algorithmic replenishment

| Signal | Why it's noisy | How to use it |
|---|---|---|
| Climate / weather data | High variance at style level; correlates weakly with demand except at category extremes | Human-in-the-loop context for category-level calls only |
| Promotional response curves | Confounds price and demand; introduces artifacts into sell-through velocity | Separate promo-impact model; do not feed raw into replenishment trigger |
| Return reason codes | Useful only when signal is categorical and large (e.g., fit defect) | Quarterly review, not weekly input |
| Social / influencer spike data | One-off, not base-rate; risk of over-replenishing a transient spike | Qualitative flag for buyer awareness only |
| Competitive pricing | Off-equilibrium by design; creates false demand signals | Strategic review, not operational input |

The discipline

The hardest part is not building the table; it is refusing to override the stable set when a noisy signal fires. A cold snap does not license an emergency re-weight of outerwear size curves. An influencer spike does not justify pulling forward replenishment for a fringe size. The stable signals compound into correct decisions over a selling window; the noisy ones introduce errors that take a full cycle to unwind.


Framework 4 — The test-and-learn iteration cadence

When size-level masking is exposed and lost-sales estimates begin to appear, the instinct is to rebuild the size curve comprehensively. This is exactly the wrong move. Rebuilt curves over-correct predictably — the lost-sales signal in one direction obscures the residual risk in another, and the new curve creates its own markdown problem a season later.

The cycle

Step 1 — Isolate a single parameter. Pick one change: size-curve weighting on a category, or replenishment trigger threshold on a store tier, or holdback percentage on a product group. Not all three. If a multi-parameter change produces a good result, you will not know which lever caused it; if it produces a bad result, you will not know which to revert.

Step 2 — Cluster the test. Apply the change to a defined cluster — a store tier, a product group, or a channel — that is large enough for the signal to clear noise (typically 15–25% of the relevant fleet) and small enough that the rest of the fleet functions as a live control.

Step 3 — Track three KPIs. Sell-through rate. Estimated lost sales (derived from stockout timing, WOS decay, and viewed-availability delta — not a direct measurement). Stock coverage (WOS and inventory-to-sales ratio). Three KPIs; no more; reviewed on a fixed cadence.

Step 4 — Run for four to six weeks. Shorter windows are dominated by noise; longer windows delay the next iteration past the selling window. Four weeks for fast-moving categories; six for continuity.

Step 5 — Evaluate against the control. The change is a success if the three KPIs move favourably on the test cluster relative to the control, net of seasonality. The change is a failure if any of the three moves adversely beyond a defined threshold — revert immediately.

Step 6 — Iterate, do not escalate. A successful change is extended to an adjacent cluster, not to the full fleet. The second iteration may surface an interaction effect that the first did not; the third typically confirms the pattern. Full-fleet rollout is the fourth cycle, not the first.
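The evaluation logic in steps 3–5 can be sketched as a test-vs-control comparison. The KPI names, the sign convention (all deltas expressed so that positive movement is favourable), and the 2pp revert threshold are illustrative assumptions.

```python
# Sketch of the step-5 evaluation: net test-cluster KPI movement
# against the control cluster to strip seasonality, then apply the
# revert rule. Threshold and KPI names are assumptions.

REVERT_THRESHOLD = -2.0  # pp of adverse net movement on any KPI

def evaluate_cycle(test_delta: dict, control_delta: dict) -> str:
    """Each dict maps KPI -> change in pp over the 4-6 week window,
    expressed so that positive = favourable (e.g. lost sales as a
    reduction). Subtracting the control delta nets out seasonality."""
    net = {k: test_delta[k] - control_delta[k] for k in test_delta}
    if any(v < REVERT_THRESHOLD for v in net.values()):
        return "revert"
    if all(v > 0 for v in net.values()):
        return "extend to adjacent cluster"
    return "hold and re-run"

result = evaluate_cycle(
    test_delta={"sell_through": 4.1, "lost_sales": 1.2, "stock_coverage": 0.5},
    control_delta={"sell_through": 1.0, "lost_sales": 0.8, "stock_coverage": 0.3},
)
print(result)  # → extend to adjacent cluster
```

Keeping the decision rule this mechanical is the point: the weekly review debates the parameters, not the verdict.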

Why the asymmetry matters

Under-correction is a small persistent tax on full-price sell-through. Over-correction creates its own residual problem that takes a full selling window to unwind — because the correction-of-the-correction cannot be applied until the selling window of the first correction closes.

The asymmetry argues for small, tight, frequent iterations rather than large, infrequent rebuilds. Teams that run the cycle well improve the size curve faster than teams that attempt comprehensive revisions — because every iteration compounds, and the iteration speed, not the amplitude, is the governing input.


Framework 5 — The shared-ownership RACI

AI-assisted size-level replenishment almost always surfaces an inefficiency the organization did not have visibility into before. The instinct is to assign the fix to whichever team runs the tool. The fix rarely works — because the root cause spans functions.

The six size-curve governance decisions

| Decision | Responsible | Accountable | Consulted | Informed |
|---|---|---|---|---|
| Size curve construction (per category) | Planning | VP Merch | Buying, E-commerce | Allocation, Supply Chain |
| Holdback % (per product group) | Planning | VP Merch | Supply Chain, Allocation | Buying, Finance |
| Replenishment trigger parameters | Supply Chain | Planning Director | Allocation, IT | Buying, Merch |
| Store-cluster definition | Allocation | Planning Director | Retail Ops, Merch | Supply Chain |
| Size-level signal review cadence | Planning | Planning Director | Buying, Supply Chain, Allocation, E-commerce | Finance, Merch |
| Lost-sales estimation methodology | Planning | VP Merch | Finance, E-commerce | Buying, Supply Chain |

The weekly convening cadence

The RACI makes decisions clear; the cadence makes them happen. A single weekly size-level performance review is the working model:

  • Length: 45 minutes, fixed agenda
  • Attendance: planning lead, buying lead, supply chain lead, allocation lead, e-commerce analyst
  • Inputs: three-KPI scorecard (sell-through, lost sales, stock coverage) at the cluster level, plus the 8-point masking diagnostic for any style flagged in the prior week
  • Outputs: parameter changes logged with owner, cluster, and review date; decisions recorded against the RACI

The governance is lightweight on purpose. The failure mode to avoid is a heavyweight committee that meets monthly and lags the selling window. Four 45-minute weekly meetings beat one 3-hour monthly meeting on this problem every time.

Why the structure matters

Assigning ownership to a single function — even the function that runs the AI tool — leaves the other contributing causes in place. The buy that produced the wrong initial curve, the receipt plan that held too little flexibility, the trigger parameters that compounded the stockout, the allocation logic that did not reflect cluster-level size demand — each of these sits in a different function. A single-function owner can fix one of them; shared ownership across a disciplined cadence can fix all of them simultaneously, which is the only way the size-level inefficiency actually closes.


How to deploy this playbook in 90 days

Weeks 1–2 — Baseline. Run the 8-point masking diagnostic on the top 20% of styles by revenue from the last completed selling window. Calibrate lost-sales estimates. Identify the three categories where size-level masking is most severe.

Weeks 3–4 — Decision matrix. Apply the 50/50 decision matrix to the next receipt cycle. Set holdback by product group; log the decision; communicate to buying and supply chain.

Weeks 5–6 — Signal hygiene. Audit current replenishment engine inputs against the stable-vs-noisy table. Disable noisy inputs; document the exclusions. Confirm the stable five are flowing cleanly.

Weeks 7–10 — First test cycle. Pick one parameter, one cluster, four weeks, three KPIs. Run. Evaluate. Log.

Weeks 11–12 — Governance. Stand up the weekly 45-minute size-level review. Populate the RACI with names, not roles. Run the first two meetings. Adjust the agenda on the third.

By day 90: baseline masking diagnostic complete, first holdback revision live, signal hygiene audited, first test-and-learn cycle evaluated, governance cadence running. The subsequent 90 days compound the cycle.


How RetailNorthstar implements this playbook

RetailNorthstar is built around the primitives this playbook describes — size-level sell-through as the reporting unit, holdback as a first-class receipt-plan input, the stable signal set as the replenishment input, the shared KPI scorecard as the governance surface. The specific mechanics:

  • Size curves stored by category, channel, and demographic segment with trailing-sell-through recomputation and masking flags on styles whose product-level KPIs mask size-level stockouts.
  • Receipt plans with configurable holdback that flexes by product profile; the receipt calendar accepts the 50/50 decision and surfaces the DC capacity implications.
  • Replenishment triggers that consume the stable signal set only — full-price sell-through, size-level stockouts, WOS, viewed availability — with weekly recomputation against trailing velocity.
  • Size-level performance dashboard with the three governance KPIs visible to all five functions in the RACI, against cluster-level segmentation.
  • Test-and-learn instrumentation to apply parameter changes to defined clusters and evaluate against live controls over configurable windows.

The goal is not to replace the planning judgment; it is to give the judgment a resolution, a cadence, and a shared decision surface that match the size-level reality of the business.

Synthesized from a panel of senior apparel planning leaders across premium, fast-fashion, and athletic DTC. Frameworks and benchmarks are operator-derived and context-dependent; apply with judgement against your own fleet profile, transfer cost structure, and product continuity mix.


RetailNorthstar Research Team
