In aerospace and defense, many late deliveries are not complete surprises. The signals were there: a supplier running single-site, a sole-source relationship with no qualified alternate, a geography with increasing geopolitical friction. Sourcing teams see those signals in isolation, in scattered spreadsheets and quarterly reviews, but rarely in a system that aggregates them into a coherent risk picture before the problem lands on the production line.
The default approach (tracking OTD and quality, then escalating when something breaks) is reactive by design. It treats every supplier as a known quantity until it isn't. In a supply chain where lead times are measured in months and re-sourcing is measured in years, reactive is expensive.
Standard supplier scorecards measure delivery, quality, and responsiveness. Those dimensions matter, but they are backward-looking by nature. A supplier with 97% OTD and a missing ITAR registration is not a safe supplier for defense work. A supplier with strong quality metrics and a single manufacturing site in a geopolitically exposed region carries asymmetric downside risk that the performance score doesn't capture.
The key architectural decision was to split the engine into two explicit layers: a Performance layer that scores what the supplier has done, and a Risk and Resilience layer that scores what could go wrong. Both layers contribute to the composite score, but they are computed independently and displayed separately. A sourcing manager can see in a single view whether a supplier's composite is being held up by strong performance despite high risk, or vice versa, and act accordingly.
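A minimal sketch of the two-layer split, assuming each layer is a weighted average of 0–100 dimension scores and the composite is a weighted blend of the two layer scores. The dimension names, weights, and the `performance_weight` parameter are illustrative assumptions, not the engine's actual tables.

```python
from typing import Dict


def layer_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of 0-100 dimension scores; weights are normalized here."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def composite_score(performance: float, risk: float, performance_weight: float = 0.5) -> float:
    """Blend the two independently computed layer scores (the weight is an assumption)."""
    return performance * performance_weight + risk * (1.0 - performance_weight)


# Hypothetical dimension names and scores, for illustration only.
performance_dims = {"delivery": 90, "quality": 80, "lead_time": 70, "responsiveness": 60}
risk_dims = {"single_source": 40, "geographic": 50, "site_concentration": 60,
             "financial": 70, "capacity": 80}

# Equal weights within each layer keep the arithmetic easy to follow.
perf = layer_score(performance_dims, {name: 1 for name in performance_dims})
risk = layer_score(risk_dims, {name: 1 for name in risk_dims})
overall = composite_score(perf, risk)
```

Because `perf` and `risk` are kept as separate values rather than folded straight into one number, a view can show a decent composite alongside the weak risk score behind it.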
Compliance is handled outside both layers entirely. A missing ITAR registration or AS9100 gap is not a scoring input: it is a hard flag that renders independently and can cap the recommendation regardless of composite score.
The engine is built in Python across five modules with strict separation of concerns. Scoring, compliance, and recommendation are fully deterministic. Claude generates the plain-language narrative and improvement actions from the computed scores: it never generates or modifies a number.
The recommendation engine produces one of four outcomes in fixed order of severity: Expand relationship, Maintain with monitoring, Issue corrective action plan, Initiate re-sourcing evaluation. Compliance caps are applied after the score-based recommendation. Caps can only increase severity, never reduce it. A supplier with a high composite score and a critical compliance gap will receive a more severe recommendation than the score alone would produce. The AI narrative receives the final recommendation as a fixed input and explains it. It cannot change it.
Performance Layer: scores historical supplier behavior across four dimensions.
Risk & Resilience Layer: scores forward-looking exposure across five dimensions.
Compliance caps are applied after the score-based recommendation using ordinal index comparison, never string matching. Caps only increase recommendation severity, never reduce it. A supplier scoring 91 (Expand relationship) with missing ITAR registration on applicable work is capped at Issue corrective action plan. The cap reason is passed directly to the AI narrative, which must acknowledge it explicitly.
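The cap rule can be sketched roughly as follows, assuming the four outcomes are modeled as an `IntEnum` ordered by severity. The `cap_triggered` and `cap_escalated` names follow the test suite's terminology; the `CapResult` type and `apply_cap` function are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Recommendation(IntEnum):
    """Outcomes in fixed order of severity; a higher value is more severe."""
    EXPAND = 0
    MAINTAIN = 1
    CORRECTIVE_ACTION = 2
    RESOURCE = 3


@dataclass(frozen=True)
class CapResult:
    final: Recommendation
    cap_triggered: bool   # a compliance cap applied to this supplier
    cap_escalated: bool   # the cap actually raised the severity
    cap_reason: Optional[str]


def apply_cap(score_based: Recommendation,
              cap: Optional[Recommendation],
              reason: Optional[str] = None) -> CapResult:
    """Apply a compliance cap by comparing ordinal values, never label strings."""
    if cap is None:
        return CapResult(score_based, False, False, None)
    final = max(score_based, cap)  # severity can only go up
    return CapResult(final, True, cap > score_based, reason)


# A supplier scoring in the Expand band with ITAR missing on applicable work.
capped = apply_cap(Recommendation.EXPAND, Recommendation.CORRECTIVE_ACTION,
                   "ITAR registration missing on applicable work")
```

Keeping `cap_triggered` separate from `cap_escalated` covers the same-severity case: a cap equal to or below the score-based outcome is still recorded and can still be surfaced, even though the recommendation itself does not change.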
Each certification is evaluated against an applicability toggle: ITAR, CMMC, and NADCAP flags only apply when the scope of work makes them relevant. A supplier without ITAR registration on commercial-only work receives no flag. A supplier without ITAR registration on defense-applicable work receives a critical flag and a recommendation cap.
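As an illustration of the applicability toggle, the sketch below raises a flag only when a certification is both applicable and not held. The `CertCheck` type and `compliance_flag` function are hypothetical, and treating every missing applicable certification as critical is an assumption; the text above only states that severity for ITAR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CertCheck:
    name: str          # e.g. "ITAR", "CMMC", "NADCAP"
    held: bool         # supplier holds the certification or registration
    applicable: bool   # scope of work makes the certification relevant


def compliance_flag(check: CertCheck) -> Optional[str]:
    """Return a flag message only when an applicable certification is missing."""
    if not check.applicable or check.held:
        return None
    # Assumption: every missing applicable certification is flagged as critical.
    return f"CRITICAL: {check.name} missing on applicable work"


commercial_only = compliance_flag(CertCheck("ITAR", held=False, applicable=False))
defense_work = compliance_flag(CertCheck("ITAR", held=False, applicable=True))
```

Here `commercial_only` is `None`, while `defense_work` carries a critical flag that a downstream cap step could act on.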
The AI layer serves one purpose: translate a structured score profile into a readable supplier evaluation brief and a specific, actionable set of improvement recommendations. Every score and recommendation is computed before the AI is called. The prompt passes computed scores as fixed facts and instructs the model that implying a different recommendation is a defect.
When compliance caps are triggered, whether or not they escalated the recommendation, the prompt explicitly surfaces the gap and requires the narrative to address it. If ITAR is missing on applicable work, the brief must state "this supplier is disqualified from ITAR-relevant work until registration is obtained." Softened compliance language is treated as a prompt failure.
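One way to picture the "scores as fixed facts" contract is a prompt builder plus a post-check on the returned narrative. This is a sketch under assumptions: the prompt wording, `build_prompt`, and `narrative_is_acceptable` are hypothetical, and no model call is shown.

```python
from typing import Dict, Optional

REQUIRED_ITAR_STATEMENT = (
    "this supplier is disqualified from ITAR-relevant work "
    "until registration is obtained"
)


def build_prompt(scores: Dict[str, float], recommendation: str,
                 cap_reason: Optional[str], itar_missing: bool = False) -> str:
    """Assemble a prompt that passes already-computed values as fixed facts."""
    lines = [
        "Write a supplier evaluation brief and improvement actions.",
        "The values below are fixed facts. Do not recompute or contradict them.",
        "Implying a different recommendation is a defect.",
        f"Final recommendation: {recommendation}",
    ]
    lines += [f"- {name}: {value:.1f}" for name, value in scores.items()]
    if cap_reason:
        lines.append(f"Compliance gap (must be addressed explicitly): {cap_reason}")
    if itar_missing:
        lines.append(f'The brief must state: "{REQUIRED_ITAR_STATEMENT}."')
    return "\n".join(lines)


def narrative_is_acceptable(narrative: str, itar_missing: bool) -> bool:
    """Reject narratives that soften or omit the required ITAR statement."""
    if not itar_missing:
        return True
    return REQUIRED_ITAR_STATEMENT.lower() in narrative.lower()


prompt = build_prompt({"Composite": 91.0}, "Issue corrective action plan",
                      "ITAR registration missing on applicable work", itar_missing=True)
```

The post-check is deliberately crude (a substring match), but it turns softened compliance language into something that fails loudly instead of passing review unnoticed.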
| Feature | What It Does | Why It Matters |
|---|---|---|
| Data Confidence Badge | Scores data quality 0–100 from months of history, transaction volume, data source, and audit recency. Renders as High / Medium / Low badge alongside composite score. | A 92 composite on 18 months of audited data is not the same as a 92 on three POs and a vendor self-report. The badge makes that distinction visible. |
| What-If Sensitivity | Computes composite lift from realistic improvements on the two lowest-scoring performance dimensions. Filters out zero and negative deltas so that only genuine improvements are shown. | Converts the scorecard from a snapshot into a planning tool. Tells the sourcing manager which dimension improvement has the most leverage before the supplier review meeting. |
| Commodity Profile Weights | Five profiles (Standard, Machined Parts, Electronics, Raw Material, Castings/Forgings) shift dimension weights within each layer. Layer weights remain user-controlled via scoring mode. | Electronics suppliers should be weighted heavily on geographic risk. Castings should be weighted heavily on quality. A single fixed weight table treats all commodities identically; commodity profiles do not. |
| Spend Criticality Multiplier | Annual spend field applies a penalty to single-source dimension score. $250K+ triggers 10% reduction; $1M+ triggers 20% reduction. | Single-source on a $2K/year part is not the same risk as single-source on a $2M/year part. Consequence scales with exposure. |
| Multi-Supplier Comparison | Session-state comparison table accumulates up to 5 suppliers. Plotly heatmap with green-to-red diverging scale across all 9 dimensions. Best fit by scoring mode shown for all three modes simultaneously. | Supplier decisions are rarely made in isolation. The comparison view surfaces which supplier wins under each evaluation lens and lets the sourcing manager see the tradeoff before committing. |
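The Spend Criticality Multiplier row can be read as a small step function. The sketch below assumes the reduction is multiplicative on the single-source dimension score and that the $250K and $1M thresholds are inclusive; the function names are hypothetical.

```python
def spend_penalty_factor(annual_spend: float) -> float:
    """Factor applied to the single-source score based on annual spend (USD)."""
    if annual_spend >= 1_000_000:
        return 0.80  # $1M+ triggers a 20% reduction
    if annual_spend >= 250_000:
        return 0.90  # $250K+ triggers a 10% reduction
    return 1.00      # below $250K: no penalty


def adjusted_single_source_score(raw_score: float, annual_spend: float) -> float:
    """Scale the raw 0-100 single-source score by the spend penalty."""
    return raw_score * spend_penalty_factor(annual_spend)


low_spend = adjusted_single_source_score(50.0, 2_000)       # $2K/year part
high_spend = adjusted_single_source_score(50.0, 2_000_000)  # $2M/year part
```

With the same raw score, the $2M/year part ends up with a lower single-source score than the $2K/year part, which is the "consequence scales with exposure" behavior described in the table.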
The Score Lineage expander in the UI renders a Plotly Sankey diagram showing exactly how weighted dimension scores flow into layer scores, and how layer scores combine into the composite. Each link's width is proportional to its weighted contribution. A sourcing manager being challenged on a score can open the lineage view and walk any reviewer through the exact math, dimension by dimension and weight by weight.
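To make the lineage idea concrete, here is a rough sketch of how the Sankey inputs (node labels plus source, target, and value lists) might be derived from the deterministic math. The `build_lineage_links` function, the input shape, and the example dimension names are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# {layer name: {dimension name: (score, weight)}}
Layers = Dict[str, Dict[str, Tuple[float, float]]]


def build_lineage_links(layers: Layers, layer_weights: Dict[str, float]):
    """Return (labels, source, target, value) lists describing score flow."""
    labels: List[str] = []
    index: Dict[str, int] = {}

    def node(label: str) -> int:
        # Assign each label a stable integer id on first use.
        if label not in index:
            index[label] = len(labels)
            labels.append(label)
        return index[label]

    source: List[int] = []
    target: List[int] = []
    value: List[float] = []
    composite = node("Composite")
    for layer_name, dims in layers.items():
        layer = node(layer_name)
        total_weight = sum(weight for _, weight in dims.values())
        layer_score = 0.0
        for dim_name, (score, weight) in dims.items():
            contribution = score * weight / total_weight  # weighted contribution
            layer_score += contribution
            source.append(node(dim_name))
            target.append(layer)
            value.append(contribution)
        # Layer-to-composite link carries the layer's weighted share.
        source.append(layer)
        target.append(composite)
        value.append(layer_score * layer_weights[layer_name])
    return labels, source, target, value


labels, source, target, value = build_lineage_links(
    {"Performance": {"Delivery": (90, 1), "Quality": (70, 1)},
     "Risk & Resilience": {"Single Source": (40, 1), "Geographic": (60, 1)}},
    {"Performance": 0.5, "Risk & Resilience": 0.5},
)
```

These lists match the shape that `plotly.graph_objects.Sankey` accepts via `node=dict(label=labels)` and `link=dict(source=source, target=target, value=value)`, so the rendered link widths come directly from the same numbers the scorer produced.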
This is not decoration. In aerospace and defense sourcing, every recommendation is potentially reviewed by program managers, contracts, and legal. A scorecard that can't explain itself in a challenge setting is a liability. Score lineage makes the deterministic math fully transparent without requiring the reviewer to read code.
The test suite enforces 92 assertions including: PPM band boundary correctness, lead time piecewise interpolation (no band cliffs), validation bounds across all input fields, cap logic separation (cap_triggered vs cap_escalated), same-severity compliance triggering, sensitivity filtering, weight table integrity across all commodity profiles, and confidence score label assignment.
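The "no band cliffs" property can be illustrated with a piecewise-linear lead time score and a continuity check at a band edge. The breakpoints below are invented for the example and are not the engine's actual bands; `lead_time_score` is a hypothetical name.

```python
from bisect import bisect_right

# Hypothetical (lead time in days, score) breakpoints, for illustration only.
BREAKPOINTS = [(30, 100.0), (60, 80.0), (120, 50.0), (180, 20.0)]


def lead_time_score(days: float) -> float:
    """Interpolate linearly between breakpoints so scores never jump at a band edge."""
    xs = [x for x, _ in BREAKPOINTS]
    if days <= xs[0]:
        return BREAKPOINTS[0][1]
    if days >= xs[-1]:
        return BREAKPOINTS[-1][1]
    i = bisect_right(xs, days)  # index of the first breakpoint strictly above days
    (x0, y0), (x1, y1) = BREAKPOINTS[i - 1], BREAKPOINTS[i]
    return y0 + (y1 - y0) * (days - x0) / (x1 - x0)


def test_no_cliff_at_band_edge() -> None:
    # One day either side of a breakpoint should move the score only slightly.
    assert abs(lead_time_score(59) - lead_time_score(61)) < 2.0


test_no_cliff_at_band_edge()
```

A step-function banding would fail this kind of test, since a supplier at 61 days would score a full band lower than one at 59 days.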
The Supplier Performance & Risk Scorecard is the deepest engine in the portfolio so far, with the most scoring dimensions and the most complex recommendation logic. It is designed to sit downstream of the Part Prioritization Framework (which surfaces which suppliers need attention first) and inform the Make vs. Buy Decision Framework (which decides whether to source at all).
Built to close the gap between "we track OTD in a spreadsheet" and "we have a structured, defensible view of supplier health across performance, risk, and compliance." Deterministic by design. Explainable by requirement. Aerospace-specific by intent. The engine produces the assessment. The sourcing engineer makes the call.