Vendor Normalization Engine

User Manual & Job Aid | Version 1.0 | March 2026

What is the Vendor Normalization Engine?
Prerequisites — What You Need Before Running
Quick Start — Typical Workflow
First-Time Setup (SetupWorkbook)
The Config Sheet — All Settings Explained
The Run Sheet — Button Reference
Step 1 — Name Clustering (How It Works)
Step 2 — Oracle Reduce
Step 2b — Inactive Vendor Reduce (Optional)
Step 3 — Vendor Match
Understanding the Output Sheets
The VN_Review Sheet — Daily Analyst Workspace
Archiving Results for the Junior Analyst
The Archive Workbook — Summary Tab
Cancelling a Run in Progress
The Log Sheet
Algorithm Reference — For Advanced Users
Troubleshooting & Common Scenarios
Glossary

1 What is the Vendor Normalization Engine?

The Vendor Normalization Engine (VNE) is an Excel macro workbook that identifies duplicate and near-duplicate names within any large list, groups those names into similarity clusters, and cross-references each cluster against a system extract to surface matching identifiers and flag unrecognised entries. The result is a structured, colour-coded review package ready for analyst sign-off.

The tool was built with Accounts Payable supplier lists as the primary use case, but it is not limited to that context. Any dataset that provides a name column and a corresponding identifier column can serve as both the source list and the reference extract. Practical applications include:

AP supplier names — group near-duplicate vendor names and match to Oracle vendor numbers
Manufacturer names — detect inconsistent manufacturer entries and reconcile against a manufacturer code reference
Customer names — identify duplicate customer records and cross-reference against a customer ID extract
Item descriptions — cluster similar item or material descriptions and match to an item master
GL account descriptions — detect naming inconsistencies across a chart of accounts

The reference extract (Oracle or otherwise) is connected via two configurable column headers on the Config sheet — one for the name column and one for the identifier column. No column names are hard-coded in the matching logic. The labels Vendor Number and Vendor Name that appear throughout the output sheets are the tool's internal terminology for the identifier and name roles respectively, regardless of what those columns are called in your source data.

In a large organisation, names accumulate over time with inconsistent spelling, punctuation, capitalisation, and abbreviations. A single supplier such as Capita may appear as CAPITA PLC, Capita Ltd, CAPITA LIMITED, and Capita, (with a trailing comma), all referring to the same entity. Without automated grouping, an analyst must identify these variants manually row by row — a process that is both time-consuming and error-prone at scale.

The VNE automates this by scoring every pair of names for similarity, merging similar names into clusters, and then attempting to match each cluster to a record in the reference extract. The final output is a colour-coded, collapsible worksheet that an analyst can work through systematically to confirm whether clustered names represent the same entity, flag potential duplicates, and verify system identifiers.

Who is this for?
Senior analyst / processor — configures the source and reference files, runs the macro, interprets the high-level results, and determines whether the output is ready to hand off.
Junior analyst — receives the archived workbook, opens the VN_Review tab, and performs the granular row-by-row verification against the system of record.

2 Prerequisites — What You Need Before Running

Before running any step, confirm that your Excel environment is ready and that you have the required input files available in an acceptable format. This section covers every prerequisite in detail, including what each file must contain, where it can be stored, and how to handle common variations in file structure.

2.1 Excel version and macro security

Supported Excel versions

The VNE is compatible with Excel 2016, 2019, 2021, and Microsoft 365 on Windows. It is not compatible with Excel for Mac (the FileDialog folder picker and certain charting methods used by the Archive module are Windows-only). Excel Online (browser) is not supported — macros do not run in the browser version.

Enabling macros when you open the workbook

When you open the .xlsm file, Excel displays a yellow security bar below the ribbon that reads "SECURITY WARNING — Macros have been disabled." Click Enable Content on that bar. Without this, no buttons or macros will function. You will need to click Enable Content each time you open the file unless you add the file's folder to Excel's Trusted Locations (see below).

Adding a Trusted Location (recommended for regular use)

If the workbook is stored in a location you control (local drive or a specific network share), you can prevent the security prompt from appearing on every open:

In Excel, go to File > Options > Trust Center > Trust Center Settings.
Click Trusted Locations, then Add new location.
Browse to the folder where the VNE workbook is stored. Tick "Subfolders of this location are also trusted" if you want the whole folder tree to be trusted.
Click OK and restart Excel. The workbook will now open with macros enabled automatically.

Only trust locations you control. Do not add a network share as a Trusted Location unless you are certain that no one else can place files there. Trusted Locations bypass all macro security warnings for any file opened from that path.

If the Developer tab is not visible

The Developer tab is needed to assign macros to buttons after running SetupWorkbook. To show it: File > Options > Customise Ribbon → tick Developer in the right-hand list → click OK.

2.2 Input file 1 — Name Source (required for Step 1)

What it is

The Name Source is any file that contains the raw list of names you want to cluster and analyse. In an AP context this is typically a daily or weekly transaction extract, a vendor master export, or a purchase order line-item dump. In other contexts it might be a manufacturer directory, a customer list, or an item description export.

Minimum requirements

At least one column containing names. The column can have any header — you specify the header (or column position) on the Config sheet.
A single header row. The engine reads row 1 as headers and all subsequent rows as data. If your file has no header row, use a column letter or number instead of a header name in C7 (e.g. enter A to use column A).
The file must be closed before running Step 1. Excel cannot read a file that is locked open by another application or another Excel instance.

Supported formats

Format	Extension	Notes
Excel Workbook	`.xlsx`	Standard format. Recommended.
Excel Macro-Enabled	`.xlsm`	Accepted. Macros in the source file are not executed.
Excel 97–2003	`.xls`	Accepted. May be slower to open for large files.
CSV (comma-separated)	`.csv`	Accepted. Opened in text mode to preserve leading zeros. All columns treated as text — no automatic date or number conversion.

File stored on a network drive

Network paths are fully supported. Enter the UNC path in the Config sheet, e.g. \\server\shares\AP\Daily. The machine running the VNE must have read access to that path. If the path requires authentication, ensure you are connected to the network share before running the macro.

File name changes daily

Many AP extracts are named with a date stamp, e.g. AP_Extract_14MAR2026.xlsx, AP_Extract_15MAR2026.xlsx. You must update C5 on the Config sheet each day to reflect the new file name, or use the Open Wizard button to re-browse to the new file. The folder path in C4 only needs to be set once if the file always lands in the same folder.

Source data in this workbook

If you paste your name list directly into a sheet inside the VNE workbook, you can skip the external file entirely. Leave C4 and C5 blank. Enter the sheet tab name in C6 (or leave C6 blank to auto-detect by column header). The engine will read from that sheet without opening any external file.

Tip — what makes a clean Name Source. The engine handles mixed case, leading/trailing spaces, and punctuation variations automatically during normalisation. You do not need to clean the source before running. However, if your source contains many blank rows or entirely non-name entries (e.g. account codes mixed in with vendor names), consider filtering those out first so they do not inflate the cluster count with irrelevant single-entry clusters.

2.3 Input file 2 — Reference Extract (required for Step 2)

What it is

The Reference Extract is the system-of-record file used to look up an official identifier for each name cluster. In an AP context this is the Oracle vendor master export. In other contexts it might be a manufacturer code list, a customer master export, or an item master — any file that pairs a name with a system identifier.

Minimum requirements

An identifier column — the system ID (vendor number, item code, customer ID, etc.). Any column header or position. Configured via C13.
A name column — the official name in the system that will be compared against the clustered names from the source. Any column header or position. Configured via C14.
A single header row. If the file has no header row, use column letters or numbers in C13/C14.
The file must be closed before running Step 2.

Column header examples by use case

Use case	Identifier column (C13)	Name column (C14)
AP supplier list	`VENDOR_NUMBER`	`VENDOR_NAME`
Manufacturer directory	`MFR_CODE`	`MANUFACTURER_NAME`
Customer master	`CUSTOMER_ID`	`CUSTOMER_NAME`
Item master	`ITEM_NUMBER`	`ITEM_DESCRIPTION`
No header row (column B = ID, column C = Name)	`B`	`C`

File size considerations

The Reference Extract can be very large — Oracle vendor master exports of 500,000+ rows are common and are handled without issue. Step 2 reads the entire file into memory and deduplicates it, so the primary constraint is available RAM. On a machine with 8 GB RAM, files up to approximately 1 million rows process without difficulty. For files above that size, ensure no other large applications are running concurrently.

Duplicate rows in the Reference Extract

The extract almost certainly contains duplicate identifier + name combinations (one row per transaction or account line is typical). This is expected — Step 2 exists precisely to deduplicate the extract before Step 3 uses it. You do not need to pre-clean or deduplicate the reference file.

2.4 Input file 3 — Inactive / Deactivated Records (optional, Step 2b)

What it is

The Inactive file is a secondary reference that lists records that have been deactivated or retired in the system, along with the date of deactivation. In AP this is the Oracle inactive vendors report. This file is entirely optional — if you do not have one, or do not need inactive matching, leave C29 blank and Step 2b will be skipped automatically.

Minimum requirements (if used)

A name column — the name of the deactivated entity. Configured via C31.
An identifier column — the system ID of the deactivated entity. Configured via C32.
A date column — the date the record was deactivated. Configured via C33. The date value is preserved as-is in the output (no reformatting).

Merged title row handling

Some Oracle inactive vendor reports are formatted with a large merged title cell in row 1 (e.g. "Oracle AP Inactive Suppliers — Exported 14-Mar-2026") and the actual column headers in row 2. The engine detects this automatically: if row 1 does not contain recognisable column headers, it advances to row 2 before scanning. You do not need to manually delete the title row from the file before running.

When to run Step 2b

Step 2b only needs to be re-run when a new inactive vendors file arrives. In most organisations this is quarterly, coinciding with the full Oracle extract refresh. Once Inactive_Unique exists in the workbook, it remains available for all subsequent Step 3 runs until it is overwritten by a new Step 2b run.

2.5 Step independence — what needs re-running and when

Trigger event	Steps to re-run	Steps to skip
New daily / weekly name source batch	Step 1, Step 3 (or use Quick Rerun)	Steps 2 and 2b — reference data is unchanged
New quarterly Oracle extract	Step 2, then Step 3 (and Step 1 if the name source also changed)	Step 2b if the inactive file has not changed
New quarterly inactive vendors file	Step 2b, then Step 3	Steps 1 and 2 if unchanged
Changed a Config setting (threshold, window, column header)	Re-run the affected step and all downstream steps	Unaffected upstream steps
Full refresh (all files updated)	All steps in order: 2 → 2b → 1 → 3 (or use Run All)	Nothing

3 Quick Start — Typical Workflow

First-time configuration takes approximately 10 minutes. Every subsequent daily run takes a single click via Quick Rerun.

1	Open the workbook. Go to the Run sheet. If the Run and Config sheets do not yet exist, click Setup Workbook in Developer > Macros, or see Section 4.
2	Fill in the Config sheet. At minimum, provide: vendor name column header, Oracle vendor number column header, Oracle vendor name column header. Point each section to the correct source file. See Section 5 for every field.
3	Run Step 2 — Oracle Reduce. This is a one-time (quarterly) step that reads the Oracle extract and produces the Oracle_Unique reference sheet. Click Step 2 — Oracle Reduce on the Run sheet.
4	(Optional) Run Step 2b — Inactive Vendor Reduce. If an inactive vendors file is configured, click Step 2b — Inactive Vendor Reduce. This produces the Inactive_Unique reference sheet.
5	Run Step 1 — Name Clustering. This reads the daily vendor name source and groups similar names. Click Step 1 — Name Clustering on the Run sheet. A progress bar will appear. For large datasets this step may take several minutes.
6	Run Step 3 — Vendor Match. This matches each cluster to Oracle vendor numbers and produces both VN_Matched and VN_Review.
7	Review the results. Open VN_Review. Each cluster appears as a bold summary row with variant detail rows collapsed underneath. Click + to expand a cluster and inspect the spelling variants and their Oracle match status.
8	Archive and hand off. When the first-pass review is complete and the data looks ready for the junior analyst, click Archive Results on the Run sheet. The engine exports a clean, macro-free `.xlsx` workbook to the folder of your choice. See Section 13.

Tip — subsequent daily runs. Once Steps 2 and 2b have been run for the current quarter, you only need to repeat Steps 1 and 3 each day. Click Quick Rerun on the Run sheet — it runs both steps back-to-back using the settings already saved on the Config sheet, with no form to fill in.

4 First-Time Setup (SetupWorkbook)

The VNE workbook ships as a macro-enabled file (.xlsm) containing all the VBA modules but none of the working sheets. The first time you open it, the Run, Config, and Log sheets do not yet exist. A one-time setup procedure creates those sheets, populates all default labels and values, and prepares the workbook for use. This takes approximately two minutes.

4.1 Run SetupWorkbook

1	Open the `.xlsm` file. Click Enable Content on the yellow security bar if prompted (see Section 2.1).
2	Press Alt + F8 to open the Macro dialog. Alternatively: go to Developer > Code > Macros. If you do not see the Developer tab, go to File > Options > Customise Ribbon and tick Developer.
3	In the macro list, locate and select VNE3_Setup.SetupWorkbook. If the list is long, type `Setup` in the search box at the top to filter it. Click Run.
4	The macro runs silently for a few seconds, then displays a confirmation message: "Workbook set up. Run sheet → assign macros to buttons. Config sheet → fill in column C. Log sheet → run history is written here automatically." Click OK.
5	Three new sheet tabs will now be visible at the bottom of the workbook: Run (amber tab), Config (light-blue tab), and Log. Navigate to the Run sheet to continue.

Safe to re-run at any time. SetupWorkbook checks each cell before writing — it only writes a default value if the cell is currently blank. Any value you have already typed in column C of the Config sheet is preserved. You can safely re-run SetupWorkbook after a VBA module update to refresh labels and add any newly added Config rows without losing your settings.

4.2 Assign macros to buttons on the Run sheet

SetupWorkbook creates nine button shapes on the Run sheet and writes the target macro name next to each one (column D). The buttons are not yet connected to the macros — you must assign them manually once. This is a one-time step per workbook.

1 On the Run sheet, right-click the first button shape. A context menu appears. Choose Assign Macro…

2 In the Assign Macro dialog, locate the macro name shown in column D next to that button. Click it in the list, then click OK. The button is now linked.

Repeat for each button. The full assignment table is:

Button label	Macro to assign
Open Wizard	`VNE7_Orchestrator.ShowForm`
Quick Rerun (use saved settings)	`VNE7_Orchestrator.QuickRerun`
Step 1 — Name Clustering	`VNE7_Orchestrator.RunStep1_Cluster`
Step 2 — Oracle Reduce	`VNE7_Orchestrator.RunStep2_OracleReduce`
Step 2b — Inactive Vendor Reduce	`VNE7_Orchestrator.RunStep2b_InactiveReduce`
Step 3 — Vendor Match	`VNE7_Orchestrator.RunStep3_Match`
Run All (Steps 1 + 2 + 2b + 3)	`VNE7_Orchestrator.RunAll`
Setup Workbook	`VNE3_Setup.SetupWorkbook`
Archive Results	`VNE9_Archive.RunArchive`

4 Save the workbook (Ctrl + S). Button assignments are saved with the file.

Tip — testing a button. After assigning, click any button once with the left mouse button (not right-click). The macro will run. For the Setup Workbook button, clicking it will re-run SetupWorkbook — this is safe and will just refresh labels. For step buttons (Step 1, Step 2, etc.), you will see a confirmation prompt before anything is processed.

If a button does nothing when clicked: right-click it and check Assign Macro — if the assigned macro name shows "[macro name] (not found)", the module has not been imported into the workbook. Verify that all VBA modules (VNE1_Config.bas through VNE9_Archive.bas) are present in the VBA Editor (Alt + F11, check the left-hand Project tree).

4.3 Save the workbook in macro-enabled format

The workbook must be saved as .xlsm (Excel Macro-Enabled Workbook) to retain the VBA modules and button assignments. If you save it as .xlsx, Excel will strip out all macros.

To confirm the file type: check the title bar — it should show the file name ending in .xlsm. If you see a "Some features may be lost" warning when saving, choose Keep Current Format (not Use Office Open XML Format).

4.4 What SetupWorkbook creates

Sheet	Tab colour	Purpose	User edits?
Run	Amber	Nine button shapes, each labelled with its function and the corresponding macro name. This is the primary operating screen — all day-to-day interactions happen here.	No — do not add data to this sheet. Button labels are refreshed by SetupWorkbook.
Config	Light blue	All user-configurable settings. Column B = label (read-only), column C = your value, column D = hint / default. Every path, file name, column header, threshold, and archive setting lives here.	Yes — this is the only sheet you should edit. All settings are entered in column C.
Log	Default	Timestamped record of every macro run — module name, outcome, row counts, file paths. Written automatically.	No — written by the macros. You may read it and filter it but should not edit it.

The output sheets (VN_Clusters, Oracle_Unique, Inactive_Unique, VN_Matched, VN_Review) do not exist yet after setup. They are created automatically the first time each corresponding step is run.

5 The Config Sheet — All Settings Explained

All user-configurable settings are stored on the Config sheet in column C. Column B contains the read-only label and column D the hint or default reminder. No settings exist anywhere else in the workbook — the Config sheet is the single point of configuration. If you need to change a path, column header, threshold, or archive setting at any point, this is the only place to edit it.

How to read the Config sheet layout. Each setting has three columns:

Column B — read-only label (do not edit)
Column C — your value (the only column you type in)
Column D — hint text explaining the field and showing the default where applicable

Section headers (amber rows) divide the settings into logical groups. Blue-tinted rows between groups are padding only.

5.1 Vendor Name Source — rows 4–7 (Step 1 input)

This section tells the engine where to find the raw list of names to be clustered. This is the only input consumed by Step 1.

C4 — Vendor Source Path

The folder path containing the source file. Examples:

C:\AP\Daily Extracts — local drive
\\fileserver\AP\Extracts — UNC network path
D:\Projects\VNE\Input — alternate local drive

Leave blank if the source data is on a sheet inside this workbook (no external file is needed). Do not include a trailing backslash — the engine appends one automatically.

C5 — Vendor Source File Name

The exact file name including extension. Examples:

AP_Extract_14MAR2026.xlsx
Vendor_List_Q1_2026.csv
raw_names.xls

Leave blank to read from a sheet inside this workbook. If both C4 and C5 are blank, C6 and C7 are used to locate a sheet in this workbook.

If your file name changes with each extract (e.g. date-stamped names), you must update this cell each time a new file arrives. There is no wildcard support — the exact name must be entered.

C6 — Vendor Source Sheet Name

The sheet tab name within the file (external or internal). Examples: Sheet1, AP Export, Vendor_Names.

Leave blank to auto-detect. When blank, the engine scans all sheets in the file and returns the first sheet whose header row contains the column specified in C7. This is convenient when the sheet name varies between file versions. If your file has only one data sheet, you can reliably leave this blank.

C7 — Vendor Name Column Header (Required)

Identifies which column in your source file contains the names to cluster. Three input formats are accepted:

Format	Example	When to use
Header name (text)	`VENDOR_NAME`	Your file has a header row and you know the column heading. This is the most reliable option — it works regardless of column position.
Column letter	`C`	Your file has no header row, or you prefer to specify position directly.
Column number	`3`	Same as column letter but numeric. Column A = 1, B = 2, C = 3, etc.

This cell cannot be left blank — it is the only truly required field for Step 1. If it is blank when Step 1 is run, the engine will display a validation error and stop.

Header name matching is case-insensitive. Entering vendor_name, Vendor_Name, or VENDOR_NAME will all work if your column header is VENDOR_NAME.

5.2 Reference Extract — rows 10–14 (Step 2 input)

This section tells the engine where to find the system-of-record reference file used to match clusters to official identifiers. In an AP context this is the Oracle vendor master export. In other contexts it may be any file containing an identifier + name pairing.

C10 — Reference File Path

Folder path of the reference extract, using the same format as C4. Leave blank if the data is in a sheet inside this workbook. Examples:

C:\Oracle Exports\Q1 2026
\\server\finance\oracle\vendor_master

C11 — Reference File Name

Exact file name including extension. Leave blank to read from a sheet in this workbook. The reference extract is typically provided by your Finance or IT team on a quarterly basis — the file name here will need to be updated each quarter.

C12 — Reference Sheet Name

Sheet tab name within the file. Leave blank to auto-detect by the column headers specified in C13 and C14. If the file has multiple sheets, leaving this blank will return the first sheet that contains both required column headers.

C13 — Identifier Column (Required)

The column in your reference file that contains the system identifier — the code or number that uniquely identifies each record. This value is carried into the output sheets as VENDOR_NUMBER (the tool's internal label for this role). Examples:

Your column header	Use case
`VENDOR_NUMBER`	Oracle AP vendor master
`SUPPLIER_ID`	Alternative Oracle export format
`MFR_CODE`	Manufacturer directory
`CUSTOMER_ID`	Customer master
`ITEM_NUMBER`	Item master
`B`	No header row, identifier in column B

Preserve leading zeros. If your identifiers are numeric codes with leading zeros (e.g. 00123456), use a .csv source file rather than .xlsx for the reference extract. The engine opens CSV files in text mode, which prevents Excel from stripping leading zeros. Alternatively, ensure the identifier column in your .xlsx file is formatted as Text before saving.

C14 — Name Column (Required)

The column in your reference file containing the official name associated with each identifier. This is the value that Step 3 normalises and compares against the clustered names from Step 1. The column is labelled VENDOR_NAME in the output sheets.

Enter the exact header name, a column letter, or a column number — same three formats as C7. A header name is recommended for reliability.

The name in this column does not need to be a perfect match to the names in your source list — the engine normalises both sides (strips suffixes, punctuation, case) before comparing. However, the more the source names resemble the reference names after normalisation, the higher the match rate will be.

5.3 Output Sheet Names — rows 17–20

By default, the four output sheets are named VN_Clusters, Oracle_Unique, VN_Matched, and Inactive_Unique. If your organisation has a naming convention or you are running the VNE for a specific project and want descriptive tab names, you can override the defaults here.

Cell	Output sheet	Default	When you might override
`C17`	Step 1 clusters output	`VN_Clusters`	Running VNE for manufacturers: rename to `MFR_Clusters`
`C18`	Step 2 reference unique output	`Oracle_Unique`	Non-Oracle source: rename to `Ref_Unique` or `MFR_Master`
`C19`	Step 3 full match output	`VN_Matched`	Rename to `MFR_Matched` for clarity in a manufacturer context
`C20`	Step 2b inactive unique output	`Inactive_Unique`	Rename to `Retired_Items` or `Discontinued_MFR`

Leave blank to use defaults. If any cell in this section is blank, the engine uses the default name shown in column D. Defaults are pre-filled by SetupWorkbook but can be cleared if you prefer to use custom names. If you rename an output sheet after it has been created, update the corresponding Config cell to match — otherwise the engine will create a new sheet with the default name and leave the old one orphaned.

5.4 Algorithm Tuning — rows 23–25

These three settings control the clustering algorithm's sensitivity and speed. The defaults were calibrated against real AP data and are appropriate for most use cases. Do not change them unless you have a specific reason and have validated the effect on a test dataset.

C23 — Cluster Similarity Threshold (default: `0.85`)

Two names must score at or above this value to be merged into the same cluster. The scale runs from 0 (completely different) to 1 (identical). In practice:

Value	Behaviour	Risk
`0.90 – 1.00`	Very conservative. Only near-identical names cluster together (e.g. same name with one letter difference). Most spelling variations will land in separate clusters.	High miss rate — obvious variants not grouped.
`0.85` (default)	Balanced. Handles capitalisation, punctuation, common suffixes, and minor spelling differences. Most vendor name variations encountered in AP practice cluster correctly.	Occasional false positives in densely similar name spaces.
`0.75 – 0.84`	More aggressive. Catches more distant variants (e.g. abbreviations, missing words). Some unrelated names with similar character sequences may be incorrectly grouped.	False positive rate rises noticeably below 0.80.
`Below 0.75`	Very aggressive. Clusters will be large and likely to contain unrelated names. Not recommended for production use.	High false positive rate. Results unreliable.

When to consider a lower threshold: if your source data uses extreme abbreviations (e.g. ATLS GRHMN instead of ATLAS GRAHAM), or names that are always shortened in transactions, a threshold of 0.80–0.82 may produce better results. Test on a small known dataset first.

When to consider a higher threshold: if your dataset contains many legitimate vendor names that are genuinely similar to each other (e.g. a directory of related companies such as Smith Engineering and Smith Consulting), a threshold of 0.87–0.90 may reduce false merges.

C24 — Match Similarity Threshold (default: `0.82`)

This field is reserved for future use. Step 3 currently performs exact matching only (after normalisation) — it does not perform fuzzy matching against the reference extract. The value here has no effect on current results. It is retained in the Config sheet to support a planned future enhancement that will add a fuzzy fallback pass for names with no exact match. Leave it at the default.

C25 — Clustering Window Size (default: `150`)

The sliding window controls how many adjacent names are compared against each entry in the sorted list. A larger window catches more distant variant pairs but increases processing time significantly (approximately proportional to window² × unique name count).

Dataset size (unique names)	Recommended window	Approx. Step 1 time
Up to 5,000	`150` (default)	2–5 minutes
5,000 – 15,000	`150`	5–15 minutes
15,000 – 30,000	`100`	10–20 minutes
30,000 – 60,000	`75`	15–30 minutes
Above 60,000	`50`	20–45 minutes

Reducing the window means the engine will only compare each name against the nearest N names in sorted order. Names that are similar but alphabetically distant may not be caught. In practice, most genuine near-duplicate vendor name variants are alphabetically close (e.g. Capita variants all start with C), so even a window of 75 captures the majority of true clusters.

Do not change algorithm settings mid-run. If you change the threshold or window and then run only Step 1 without re-running Step 3, the output sheets will be inconsistent — clusters will be based on the new settings but the match results will be based on the old clusters. Always re-run Step 3 (or both steps via Quick Rerun) after changing any algorithm setting.

5.5 Inactive / Deactivated Records — rows 28–33 (Step 2b input)

This entire section is optional. If you do not have an inactive records file, or do not need to distinguish between active and inactive matches, leave all six cells blank. The engine will skip Step 2b automatically and the Inactive_Unique sheet will not be created. The MATCH_NOTE column in output will contain only "exact match" or "no match found".

C28 — Inactive File Path

Folder path of the inactive records file. Same format as C4. Leave blank if the file is a sheet in this workbook.

C29 — Inactive File Name (leave blank to skip Step 2b)

Exact file name including extension. This is the master switch for Step 2b. If this cell is blank, the engine ignores all other settings in this section and skips Step 2b entirely — whether you run it individually, via Run All, or via Quick Rerun. You do not need to configure or clear the other cells in this section to suppress Step 2b; simply leaving C29 blank is sufficient.

C30 — Inactive Sheet Name

Sheet tab name within the inactive file. Leave blank to default to the first sheet. The engine automatically detects merged title rows in row 1 — if row 1 appears to be a report header rather than column headers, it advances to row 2 before scanning. You do not need to modify the file before running.

C31 — Inactive Name Column (Required if C29 is filled)

The column containing the name of the deactivated entity. Header name, column letter, or column number. Examples: Vendor Name, SUPPLIER_NAME, B, 2.

C32 — Inactive Identifier Column (Required if C29 is filled)

The column containing the system identifier of the deactivated entity. Header name, column letter, or number. Examples: Vendor Number, SUPPLIER_ID.

C33 — Inactive Date Column (Required if C29 is filled)

The column containing the date the record was deactivated. Header name, column letter, or number. Examples: Vendor Inactive Date, INACTIVE_DATE, Deactivation Date.

The date value is read as-is and stored without reformatting. It appears in the VENDOR_INACTIVE_DATE column of Inactive_Unique, VN_Matched, and VN_Review, giving the analyst visibility of when the entity was retired.

5.6 Review Highlight Thresholds — rows 36–37

These two settings control the colour-coding applied to summary rows in the VN_Review sheet. They let you tune the visual priority system to your dataset's characteristics.

C36 — Minimum distinct identifiers to flag (default: `2`)

A cluster's summary row in VN_Review is highlighted (amber) when the cluster contains this many or more distinct system identifiers matched across its variants. A cluster with two different vendor numbers for what appears to be the same supplier is a potential duplicate supplier setup and warrants review.

Setting this to 1 would highlight every cluster that has any match at all — producing too much noise in most datasets. The default of 2 means you only see amber rows when there is a genuine multi-ID situation.

Raising this value (e.g. to 3) means only clusters with three or more distinct IDs receive a highlight. Useful if your dataset legitimately contains a large number of single-company groups that have two Oracle records for different legal entities.

C37 — Red escalation if total occurrences ≥ (default: `7`)

Within already-amber clusters (those meeting the C36 threshold), the highlight is escalated to red if the total occurrence count across all variants in the cluster is at or above this value. Red rows in VN_Review represent the highest-risk items — high-volume, multi-ID clusters with many transactions potentially split across duplicate system records.

Lowering this value produces more red rows — more items are treated as high priority. Useful for datasets with a high proportion of low-volume transactions, where even 3–4 occurrences indicates a recurrent issue.

Raising this value restricts red rows to only the highest-volume clusters. Useful when the dataset is very large and you want the analyst to focus exclusively on the very top items.

5.7 Archive Settings — rows 40–41

These two settings provide defaults for the Archive dialog that appears when you click Archive Results on the Run sheet. They can be changed directly in the dialog at archive time — updating them there also saves the new values back to these cells automatically, so they become the new default for the next run.

C40 — Archive Save Folder

Default folder path where archive files are saved. Examples:

C:\AP\Review Archives
\\server\AP\VNE Archives\2026

If the specified folder does not exist when Archive Results is run, the macro will attempt to create it. If it cannot (e.g. no write permission to the network path), you will receive an error and be prompted to choose a different location. Leave blank to be prompted with an empty path in the Archive dialog each time — you will then need to type or browse to the destination manually on every archive run.

C41 — Archive Base File Name

The base name used to construct the archive file name. The macro appends the run date and a version suffix automatically:

BaseName_DD_MMM_YYYY.xlsx
BaseName_DD_MMM_YYYY_V2.xlsx (if V1 already exists)

Choose a base name that identifies the dataset and context. Examples:

Context	Suggested base name	Resulting file name
AP supplier review	`VNE_Review`	`VNE_Review_14_MAR_2026.xlsx`
Quarterly full refresh	`VNE_Q1_2026`	`VNE_Q1_2026_14_MAR_2026.xlsx`
Specific operating unit	`VNE_EMEA_Suppliers`	`VNE_EMEA_Suppliers_14_MAR_2026.xlsx`
Manufacturer review project	`MFR_NormReview`	`MFR_NormReview_14_MAR_2026.xlsx`

Avoid characters that are not valid in Windows file names: \ / : * ? " < > |. Spaces are permitted but underscores are recommended for compatibility with systems that do not handle spaces in file paths.

5.8 Config sheet — minimum viable configuration

For users who want to get running as quickly as possible, the following three cells are the absolute minimum required before Step 1 and Step 3 can execute:

Cell	What to enter	Why it's required
`C7`	Header of the name column in your source file	Step 1 cannot identify which column to read without this.
`C13`	Header of the identifier column in your reference file	Step 2 cannot build Oracle_Unique without this.
`C14`	Header of the name column in your reference file	Step 2 cannot build Oracle_Unique without this.

All path fields (C4, C5, C10, C11) can be left blank if your source data is already on sheets inside this workbook. All other settings have functional defaults and do not need to be populated before your first run.

6 The Run Sheet — Button Reference

The Run sheet is the primary entry point. Every macro function is accessible from here. Column B shows the button label, column C the description, and column D the macro being called.

Button	What it does	When to use it
Open Wizard	Opens the configuration wizard form (frmVNE). Allows you to browse to source files, map column headers, choose which steps to run, and save the settings to the Config sheet — all from a single guided dialog.	First time setup, or when source files have changed and you need to update mappings.
Quick Rerun	Runs Steps 1 and 3 (Clustering + Match) back-to-back immediately, reading all settings from the Config sheet. No form is displayed.	Daily re-run with a new vendor name batch when Oracle data has not changed.
Step 1 — Name Clustering	Runs the clustering algorithm only. Reads the vendor name source, groups similar names, and writes VN_Clusters.	When you need to re-cluster after the vendor name file changes, without re-running the match (e.g. to check clustering quality before committing a full run).
Step 2 — Oracle Reduce	Reads the Oracle extract, deduplicates at the Vendor Number + Vendor Name level, and writes Oracle_Unique.	When a new quarterly Oracle extract arrives.
Step 2b — Inactive Vendor Reduce	Reads the Inactive Vendors file and writes Inactive_Unique. Skipped automatically if no inactive file is configured.	When a new quarterly inactive vendors report arrives.
Step 3 — Vendor Match	Matches each cluster against Oracle_Unique (active vendors) and then against Inactive_Unique (if available). Writes VN_Matched and VN_Review.	After clustering is complete, to produce the final review sheets.
Run All (Steps 1 + 2 + 2b + 3)	Runs all four steps in sequence with a confirmation prompt. Step 2b is skipped if not configured. This is equivalent to clicking each step button in order.	Full refresh when both the vendor name source and Oracle data have changed.
Setup Workbook	Re-creates or refreshes the Run, Config, and Log sheets. Safe to re-run at any time.	First-time setup, or after importing the macro modules into a new workbook.
Archive Results	Exports all output sheets (plus a Summary tab) to a clean, macro-free `.xlsx` file for handoff to the junior analyst.	When the first-pass review is complete and the data is ready to hand off.

7 Step 1 — Name Clustering (How It Works)

What this step produces

Step 1 reads the vendor name source, groups similar names into numbered clusters, and writes the results to the VN_Clusters sheet.

Columns in VN_Clusters

Column	Description
`CLUSTER_ID`	An integer that uniquely identifies a group of similar names. All rows sharing the same CLUSTER_ID were determined to be similar enough to represent the same vendor.
`VENDOR_NAME`	The original vendor name as it appeared in the source file, with no modification.
`OCCURRENCE_COUNT`	How many times this exact name appeared in the source file.
`RANK`	Rank within the cluster by occurrence count. Rank 1 = most frequent name in the cluster, which is used as the cluster's representative name in downstream sheets.
`CLUSTER_SIZE`	Total number of distinct name variants in this cluster (same value on every row of the cluster). A CLUSTER_SIZE of 1 means the name appeared in only one form — no duplicates were detected.

How the algorithm groups names

The engine uses a sorted sliding-window union-find approach:

All unique vendor names are cleaned and normalised (case, punctuation, common suffixes such as Ltd, Inc, LLC removed).
Names are sorted alphabetically. This ensures that similar names end up adjacent in the sorted list, making the sliding window effective.
For each name, the engine compares it against the next N names in the sorted list (the window). N defaults to 150 and is configurable.
Each pair of names is scored using a combined similarity metric (Levenshtein distance and a token-sort comparison). Pairs scoring at or above the threshold (default 0.85) are merged into the same cluster.
This pass is performed twice — once sorted by cleaned name, once sorted by a token-sorted key. The second pass catches reordered names such as EATON CHAR-LYNN vs CHAR-LYNN EATON that may be far apart alphabetically but close in token-sorted order.
A complete-linkage guard prevents chain-explosion: before merging two names, the engine also verifies that their respective cluster roots score above the threshold. This prevents a large catch-all cluster from forming simply because A ~ B and B ~ C, when A and C are completely unrelated.

Example cluster output

CLUSTER_ID 316 — CLUSTER_SIZE 12

ATLAS GRAHAM (RANK 1, occ: 4148)

ATLAS GRAHAM LTD

Atlas Graham

ATLAS GRAHAM LIMITED

atlas graham

ATLAS-GRAHAM

Atlas Graham Ltd.

... 6 more variants

Single-name clusters are expected. A large number of CLUSTER_SIZE 1 entries is normal — it means those vendor names appear in only one consistent form and have no detected near-duplicates. The review work focuses on clusters with CLUSTER_SIZE > 1.

Processing time

For a dataset of approximately 5,000 unique names with the default window of 150, Step 1 typically completes in 3–8 minutes depending on hardware. The progress bar shows elapsed time and percentage completion. The bar resets to 0% between the two clustering passes — this is normal behaviour.

8 Step 2 — Oracle Reduce

What this step produces

Step 2 reads the Oracle vendor master extract and collapses it to one row per unique Vendor Number + Vendor Name combination, writing the result to the Oracle_Unique sheet.

Why deduplication is needed

The raw Oracle extract typically contains one row per transaction or account line, meaning the same Vendor Number + Vendor Name pair may appear thousands of times. Step 3's matching logic requires a clean, non-redundant reference list. Step 2 reduces a file that might contain 500,000 rows to a manageable list of perhaps 50,000–60,000 unique vendor records.

Columns in Oracle_Unique

Column	Description
`VENDOR_NUMBER`	The Oracle vendor number as it appeared in the source file. Leading zeros are preserved.
`VENDOR_NAME`	The Oracle vendor name associated with this vendor number in the extract.
`OCCURRENCE_COUNT`	How many times this Vendor Number + Vendor Name combination appeared in the source extract.

Tip — this step is quarterly. Once Oracle_Unique has been produced from the current Oracle extract, you do not need to re-run Step 2 until the next Oracle extract arrives. Repeated daily runs only need Steps 1 and 3.

9 Step 2b — Inactive Vendor Reduce (Optional)

What this step produces

Step 2b reads the Inactive Vendors file (a separate Oracle report listing deactivated vendors) and writes one row per unique Vendor Number + Vendor Name pair to the Inactive_Unique sheet.

Skipping Step 2b

If cell C29 (Inactive File Name) on the Config sheet is blank, Step 2b is automatically skipped — both when run individually and when triggered by Run All. The VN_Matched and VN_Review sheets produced by Step 3 will omit the inactive-vendor matching phase (Phase B), and the VENDOR_INACTIVE_DATE column will be blank for all rows.

Merged title row handling

Some Oracle inactive vendor reports have a merged title row in row 1, with actual column headers in row 2. The engine detects this automatically — if row 1 appears to be a single merged cell or does not contain recognisable header values, it advances to row 2 before scanning for headers.

Columns in Inactive_Unique

Column	Description
`VENDOR_NUMBER`	Oracle vendor number.
`VENDOR_NAME`	Vendor name as it appears in the inactive report.
`VENDOR_INACTIVE_DATE`	The date on which the vendor was deactivated in Oracle.

10 Step 3 — Vendor Match

What this step produces

Step 3 takes every cluster from VN_Clusters and attempts to find a matching Oracle vendor record for each name variant within the cluster. It writes two output sheets:

VN_Matched — full results, one row per name variant
VN_Review — summarised and grouped view for the analyst

Matching logic — two phases

For each name variant within a cluster, the engine attempts to find a match in two phases:

Phase A — Active Oracle match. The cleaned form of the name variant is compared against every cleaned name in Oracle_Unique. An exact match (after normalisation) assigns the Oracle vendor number and sets MATCH_NOTE = "exact match".
Phase B — Inactive vendor match. If Phase A finds no active Oracle match, the engine checks Inactive_Unique (if available). A match there sets MATCH_NOTE = "inactive vendor" and includes the inactive date. This outcome signals that the vendor exists in Oracle but has been deactivated — the AP team should verify whether payments are still being processed.
If neither phase finds a match, the row is written with MATCH_NOTE = "no match found". This typically indicates a new vendor not yet set up in Oracle, or a name variant with too many differences to match after normalisation (such as a severe abbreviation or typing error).

Step 3 requires Oracle_Unique to exist. If Step 2 has not been run since the workbook was last set up, Step 3 will fail or produce incomplete results. Always run Step 2 at least once per Oracle extract cycle before running Step 3.

11 Understanding the Output Sheets

VN_Clusters — raw cluster data

Produced by Step 1. One row per unique vendor name variant. AutoFilter enabled. Use this sheet to verify clustering quality — filter by CLUSTER_SIZE > 1 to see all multi-variant groups, or filter by a specific CLUSTER_ID to see all variants in one cluster.

Oracle_Unique — active vendor reference

Produced by Step 2. One row per unique Vendor Number + Vendor Name pair in Oracle. AutoFilter enabled. Use this as a reference when cross-checking vendor numbers in the review process.

Inactive_Unique — inactive vendor reference

Produced by Step 2b. One row per unique inactive Vendor Number + Vendor Name pair. AutoFilter enabled. Columns: VENDOR_NUMBER, VENDOR_NAME, VENDOR_INACTIVE_DATE.

VN_Matched — full match results

Produced by Step 3. One row per name variant × Oracle match combination. This is the flat, unformatted version of the results — it contains every row including variants with no Oracle match. AutoFilter enabled.

Column	Description
`CLUSTER_ID`	Cluster number from Step 1.
`CLUSTER_REP_NAME`	The most-frequently-occurring name in the cluster (Rank 1 from VN_Clusters).
`CLUSTER_TOTAL_COUNT`	Total occurrence count across all name variants in the cluster.
`VENDOR_NUMBER`	Oracle vendor number found by Phase A or B. Blank if no match.
`ORACLE_VENDOR_NAME`	The vendor name as it appears in Oracle_Unique or Inactive_Unique. Blank if no match.
`MATCH_NOTE`	exact match — active Oracle vendor found inactive vendor — found in Inactive_Unique only no match found — not found in either reference
`VENDOR_INACTIVE_DATE`	Inactive date from Inactive_Unique. Populated only for "inactive vendor" rows.

12 The VN_Review Sheet — Daily Analyst Workspace

VN_Review is the primary working sheet for the AP analyst. It shows only multi-member clusters (CLUSTER_SIZE > 1) — single-name clusters are excluded as they require no review. It presents results in a collapsed, grouped view that allows the analyst to triage quickly at the cluster level and then drill into individual variants as needed.

Row structure

Each cluster is represented by two types of rows:

Summary row (bold, always visible): shows the CLUSTER_ID, representative name, and total occurrence count for the cluster. This row is visible even when the cluster is collapsed, giving the analyst enough information to decide whether to expand and review the cluster in detail.
Detail rows (grouped, collapsed by default): one row per name variant, showing the specific Oracle match result for that variant (vendor number, Oracle name, MATCH_NOTE, inactive date). Expand by clicking the + button on the left margin, or by pressing Alt + Shift + =.

Colour coding

Colour	Meaning
White / alternating blue — summary rows	Standard multi-variant cluster. Contains spelling variants but no flags have been triggered.
Amber / yellow — summary row	The cluster has 2 or more distinct Oracle vendor numbers (configurable via `C36`). This signals that what may be one real-world vendor is recorded under multiple Oracle IDs — a potential duplicate vendor setup.
Red — summary row	The cluster is amber-flagged AND the total occurrence count is at or above the high-volume threshold (default 7, configurable via `C37`). These are high-priority clusters for immediate review.
Gray — detail rows	Standard variant rows within a cluster.

How an analyst works through VN_Review

Start with the red rows. These are high-volume clusters with multiple Oracle vendor numbers — they carry the highest risk of duplicate payments. Click + to expand and verify whether the variants represent the same legal entity.
Work through amber rows next. Multiple vendor numbers within one cluster may be legitimate (e.g. separate operating divisions) or may indicate a data entry error. Confirm with the AP team or vendor master.
Review "no match found" variants. Filter the MATCH_NOTE column for no match found. These names did not match any active or inactive Oracle record after normalisation. Possible causes: new vendor not yet set up, abbreviation too extreme, or a typo so severe that the normalisation did not resolve it.
Flag "inactive vendor" rows for AP team verification. If a name variant matches an inactive Oracle record, payment may still be processing against a deactivated vendor. The VENDOR_INACTIVE_DATE column shows when the vendor was deactivated.

AutoFilter is enabled on VN_Review. You can filter by MATCH_NOTE, CLUSTER_ID, or any other column. Note that filtering will hide grouped rows — collapse all groups before applying a filter for best results. Use Data > Group > Show Detail or press Alt + Shift + = to restore groups after filtering.

13 Archiving Results for the Junior Analyst

When the first-pass review on VN_Review is complete and the data is ready for granular verification by the junior analyst, use the Archive Results button to export a clean handoff package.

What the archive contains

The archive is a standard Excel .xlsx workbook with no macros. It contains:

A Summary tab (first position) — see Section 14
VN_Review (second position) — the collapsed analyst workspace
VN_Clusters — raw cluster data
Oracle_Unique — active vendor reference
Inactive_Unique — inactive vendor reference (if Step 2b was run)
VN_Matched — full flat match results

The Run, Config, and Log sheets are excluded — the junior analyst does not need configuration data or the macro infrastructure.

Archive procedure

1	Click Archive Results on the Run sheet. If VN_Review does not exist (Steps 1 and 3 have not been run), the macro will display an error and stop.
2	The Archive dialog opens, pre-filled with the save folder and base file name from the Config sheet (`C40`/`C41`). Save to folder — type a path directly or click Browse to use a folder picker. Base file name — the name without date suffix, e.g. `VNE_Review`. The File preview at the bottom of the dialog shows the exact file name that will be created, updated as you type. If the file already exists, the preview notes that a version suffix will be appended.
3	Click Archive. The macro copies the output sheets, builds the Summary tab, and saves the file. A confirmation message shows the full path of the saved file.

File naming

The archive file name is constructed as follows:

BaseName_DD_MMM_YYYY.xlsx

For example: VNE_Review_14_MAR_2026.xlsx

If that file already exists in the chosen folder, a version suffix is appended automatically:

VNE_Review_14_MAR_2026_V1.xlsx
VNE_Review_14_MAR_2026_V2.xlsx (if V1 also exists)

Your chosen folder and base name are saved back to the Config sheet for use as the default in the next archive run.

Tip — folder creation. If the specified save folder does not exist, the macro will attempt to create it automatically. If the folder cannot be created (e.g. due to network permissions), the macro will display an error and stop before any file is written.

14 The Archive Workbook — Summary Tab

The Summary tab is generated automatically and placed as the first sheet in the archive workbook. It is designed to give the junior analyst an immediate orientation to the dataset they are about to review, without requiring them to open any other tab first.

Summary tab content

Cluster Analysis

A statistics block covering:

Total vendor names processed from the source file
Total clusters identified
Single-name clusters (unique vendors with no detected variants)
Multi-variant clusters (2+ spelling variations — the primary review area)
Large clusters (size ≥ threshold — high-priority items)

Oracle Match Summary

Total active Oracle unique vendors in reference
Inactive vendor records loaded (if Step 2b was run)
Total rows in VN_Matched, broken down by MATCH_NOTE value

What to Expect in This Workbook

A guide section explaining each tab the analyst will encounter, with a plain-language description of its content and purpose.

Top Clusters by Variant Count

A table listing the 10 clusters with the most distinct name variants. Each CLUSTER_ID in this table is a hyperlink — clicking it navigates directly to that cluster's summary row in the VN_Review tab, allowing the analyst to jump straight to the highest-complexity items.

Clusters that meet the amber threshold are highlighted in amber. Rows in the table that also meet the red threshold are highlighted in amber (the full red treatment is on VN_Review itself).

Charts

Three charts are generated automatically:

Top 10 Clusters by Variant Count — horizontal bar chart showing the 10 largest clusters ranked by the number of distinct name variants.
Oracle Match Status — pie chart breaking down the VN_Matched rows by MATCH_NOTE value (exact match / inactive vendor / no match found). A large "no match" slice warrants investigation into whether Oracle vendor setup is complete.
Cluster Size Distribution — column chart showing how many clusters fall into each size band (Size 1, Size 2, Size 3–4, Size 5–6, Size 7–9, Size 10+). A healthy dataset has the vast majority of clusters at Size 1, with a small tail of larger clusters.

A chart data area (columns H–I on the Summary sheet) contains the raw numbers behind each chart in tabular form. The CLUSTER_ID values in this table are also hyperlinked to VN_Review.

The Summary tab is read-only context. The junior analyst should not edit this tab. Its purpose is orientation, not data entry. All review work is done on the VN_Review tab.

15 Cancelling a Run in Progress

While any step is running, the progress bar form displays a Cancel button. Clicking it sets an internal abort flag. The engine checks this flag at the end of each step — the current step will complete normally, and the engine will stop before starting the next step.

Cancel completes the current step before stopping. For Step 1 (Name Clustering), which may take several minutes, pressing Cancel will not interrupt the current clustering pass immediately. The stop will occur after the current pass finishes. If you need to stop immediately, close the workbook and reopen it without saving.

If a run is cancelled mid-way through a multi-step sequence (e.g. Quick Rerun cancels after Step 1 completes but before Step 3), the VN_Clusters sheet will have been updated with fresh Step 1 results, but VN_Matched and VN_Review will still reflect the previous run. Re-run Step 3 individually to bring those sheets up to date.

16 The Log Sheet

The Log sheet records a timestamped entry for every macro run. It is written automatically and does not require any action from the user.

Column	Content
`Timestamp`	Date and time the entry was written.
`Source`	The module that wrote the entry (e.g. `VNE5_NameCluster`, `VNE9_Archive`).
`Message`	A summary of the outcome, e.g. OK: 4604 clusters, 5434 rows processed, or an error description if the step failed.

Use the Log sheet for auditing. The Log provides a record of when each step was run, what source file was read, and how many records were processed. This is useful when reconciling discrepancies between archive versions or confirming which Oracle extract was used for a given review cycle.

17 Algorithm Reference — For Advanced Users

This section describes the internal mechanics of the clustering algorithm for users who need to interpret unusual results or adjust configuration parameters.

Name normalisation (CleanVendorName)

Before any comparison, each vendor name is normalised:

Convert to uppercase.
Strip common legal suffixes (LTD, LIMITED, INC, LLC, PLC, CORP, CORPORATION, CO, and others).
Remove all punctuation except internal spaces.
Collapse multiple spaces to a single space and trim.

CAPITA PLC, Capita Ltd, and CAPITA LIMITED all normalise to CAPITA and therefore match exactly in Step 3.

Similarity scoring (ScorePair)

The score between two normalised names is computed as follows:

Common suffix token stripping. Tokens that appear at the end of both names (e.g. both names end in DRILLING SUPPLY) are stripped before scoring. This prevents a high score for names that share only a generic industry suffix, ensuring the discriminating prefix receives full weight.
Levenshtein similarity. A standard edit-distance similarity score normalised to the range 0–1, where 1 = identical and 0 = no characters in common.
Prefix boost. If one name is a prefix of the other (e.g. ATLAS vs ATLAS GRAHAM), the score receives a small upward adjustment. This boost applies to genuine prefix matches only — it does not apply when the shared portion occurs in the middle or end of both names.
Final score. The higher of the Levenshtein similarity and the prefix-boosted score is used. Names scoring ≥ 0.85 (default) are candidates for merging.

Complete-linkage guard

Without this guard, a chain of transitively similar names can collapse into a single enormous cluster: if A ~ B and B ~ C but A and C are unrelated, all three would merge. The VNE prevents this by requiring that the cluster roots (the representative names of each cluster being merged) also score above threshold. Two single-name entries are always allowed to merge — the guard only applies when at least one side is already a multi-member cluster.

Scenario: why a pair did not cluster

If you believe two names should be in the same cluster but are not, possible causes are:

The normalised forms score below 0.85. Try lowering C23 to 0.80 and re-running Step 1 — but validate that this does not create false merges elsewhere.
The names are far apart in sorted order and fell outside the sliding window. Try increasing the window size in C25.
The complete-linkage guard blocked the merge because the names' current cluster roots score below threshold. This is by design — it means accepting the merge would introduce a chain-explosion risk.

Scenario: why a cluster is too large / contains unrelated names

If a single cluster contains names that appear unrelated, possible causes are:

The threshold is too low for the data. Consider raising C23 to 0.87 or 0.90.
The names share a common generic word after suffix stripping (e.g. all names in the cluster contain the word SERVICES). Review whether the suffix stripping list should include that word for this dataset.
Window size is too large, introducing comparisons between names that are semantically unrelated but coincidentally similar in character sequence.

18 Troubleshooting & Common Scenarios

Error: "The VN_Review sheet was not found"

You clicked Archive Results but Step 3 has not been run yet. Run Step 3 (or Quick Rerun) before archiving.

Error: "Config sheet not found — run SetupWorkbook first"

The workbook has not been set up. Press Alt + F8, select VNE3_Setup.SetupWorkbook, and run it.

Error: "Vendor Name Column Header is required"

Cell C7 on the Config sheet is blank. Enter the exact column header of the vendor name column in your source file (e.g. VENDOR_NAME) and try again.

Error: "File not found" when running a step

The file path and/or file name on the Config sheet does not point to an existing file. Verify that the folder and file name are correct, that the file is not currently open in another Excel instance, and that you have read access to the network location.

The progress bar jumps to 100% and then resets to 0%

This is normal behaviour during Step 1. The clustering algorithm runs two passes over the data (Pass A and Pass B). The progress bar completes 0→100% for each pass. The brief reset between passes is intentional — it indicates that the first pass completed successfully and the second pass has started.

Step 1 takes a very long time

Processing time for Step 1 scales with the number of unique vendor names × the window size. For datasets above 10,000 unique names, consider reducing the window size from 150 to 75 in C25. This reduces recall (some distant variant pairs may not be caught) but significantly reduces processing time. For very large datasets (>30,000 unique names), reduce to 50.

VN_Review shows no rows at all

Step 3 filters VN_Review to show only clusters with CLUSTER_SIZE > 1. If every name in the source file is unique (no repeated or similar vendor names), all clusters will have size 1 and VN_Review will be empty. Check VN_Clusters to confirm whether any multi-member clusters were produced.

All rows in VN_Review show "no match found"

This typically means Oracle_Unique was produced from a different Oracle extract than the vendor name source — the vendor names in the source do not correspond to any active Oracle records in the current Oracle extract. Verify that Steps 2 and 3 are reading from the correct files and that the column headers are mapped correctly.

The archive file name contains the wrong date

The date in the archive file name is the date and time the archive macro was run — not the date of the underlying data. If you need the data date in the file name, edit the base file name in the Archive dialog to include it manually, e.g. VNE_Review_Q1_2026.

Cluster IDs differ between runs on the same data

CLUSTER_ID values are assigned sequentially in the order clusters are first encountered during output writing. They are not stable identifiers — the same group of vendor names may receive a different CLUSTER_ID in a subsequent run if the sort order or grouping changes. Do not use CLUSTER_ID as a persistent key in external systems.

A vendor that should be inactive is showing as "exact match" (active)

If a vendor has been deactivated in Oracle but the current Oracle_Unique sheet was produced before the deactivation (i.e. Step 2 was last run before the vendor was deactivated), the vendor will still appear as active in the reference data. Re-run Step 2 with the latest Oracle extract to resolve this.

19 Glossary

Term	Definition
Cluster	A group of vendor name variants that the algorithm has determined are similar enough to represent the same real-world vendor. Identified by a CLUSTER_ID.
Cluster size	The number of distinct name variants in a cluster. CLUSTER_SIZE = 1 means only one spelling was found; CLUSTER_SIZE = 12 means twelve different spellings were grouped together.
Representative name	The name variant within a cluster with the highest occurrence count (RANK = 1). Used as the cluster label in VN_Matched and VN_Review.
Exact match	A vendor name variant whose normalised form matched a record in Oracle_Unique exactly. Displayed in MATCH_NOTE as "exact match".
Inactive vendor	A vendor name variant that matched a record in Inactive_Unique but not in Oracle_Unique. The vendor exists in Oracle but has been deactivated.
No match found	A vendor name variant that did not match any record in Oracle_Unique or Inactive_Unique after normalisation. May indicate a new vendor, a severe abbreviation, or a data entry error.
Complete-linkage guard	An internal safeguard that prevents the clustering algorithm from forming extremely large catch-all clusters by requiring that cluster roots (not just leaf members) also score above the similarity threshold before a merge is permitted.
Sliding window	The technique used in Step 1 to limit the number of pair comparisons. Only names within a window of N adjacent entries in the sorted list are compared, rather than every possible pair (which would be computationally prohibitive).
Oracle extract	A full export of the Oracle vendor master, typically produced quarterly by the Finance or IT team. Used in Step 2 to build Oracle_Unique.
Inactive vendors file	A separate Oracle report listing vendors with a deactivation date. Used in Step 2b to build Inactive_Unique.
Archive workbook	The macro-free `.xlsx` file produced by the Archive Results function. Intended for distribution to the junior analyst for granular review and manual verification against Oracle.
Quick Rerun	A single-click entry point that runs Steps 1 and 3 back-to-back using the settings already saved on the Config sheet. No form is displayed.
Token-sort key	A form of the vendor name where the individual words (tokens) have been sorted alphabetically before being re-joined. Used in Pass B of the clustering algorithm to catch names whose words appear in different orders.

Vendor Normalization Engine | Job Aid v1.0 | March 2026 | For internal use only | To report issues or request enhancements, contact Darek Krawczynski. ytcodes.dev

Vendor Normalization Engine

Contents

1 What is the Vendor Normalization Engine?

2 Prerequisites — What You Need Before Running

2.1 Excel version and macro security

Supported Excel versions

Enabling macros when you open the workbook

Adding a Trusted Location (recommended for regular use)

If the Developer tab is not visible

2.2 Input file 1 — Name Source (required for Step 1)

What it is

Minimum requirements

Supported formats

File stored on a network drive

File name changes daily

Source data in this workbook

2.3 Input file 2 — Reference Extract (required for Step 2)

What it is

Minimum requirements

Column header examples by use case

File size considerations

Duplicate rows in the Reference Extract

2.4 Input file 3 — Inactive / Deactivated Records (optional, Step 2b)

What it is

Minimum requirements (if used)

Merged title row handling

When to run Step 2b

2.5 Step independence — what needs re-running and when

3 Quick Start — Typical Workflow

4 First-Time Setup (SetupWorkbook)

4.1 Run SetupWorkbook

4.2 Assign macros to buttons on the Run sheet

4.3 Save the workbook in macro-enabled format

4.4 What SetupWorkbook creates

5 The Config Sheet — All Settings Explained

5.1 Vendor Name Source — rows 4–7 (Step 1 input)

C4 — Vendor Source Path

C5 — Vendor Source File Name

C6 — Vendor Source Sheet Name

C7 — Vendor Name Column Header (Required)

5.2 Reference Extract — rows 10–14 (Step 2 input)

C10 — Reference File Path

C11 — Reference File Name

C12 — Reference Sheet Name

C13 — Identifier Column (Required)

C14 — Name Column (Required)

5.3 Output Sheet Names — rows 17–20

5.4 Algorithm Tuning — rows 23–25

C23 — Cluster Similarity Threshold (default: 0.85)

C24 — Match Similarity Threshold (default: 0.82)

C25 — Clustering Window Size (default: 150)

5.5 Inactive / Deactivated Records — rows 28–33 (Step 2b input)

C28 — Inactive File Path

C29 — Inactive File Name (leave blank to skip Step 2b)

C30 — Inactive Sheet Name

C31 — Inactive Name Column (Required if C29 is filled)

C32 — Inactive Identifier Column (Required if C29 is filled)

C33 — Inactive Date Column (Required if C29 is filled)

5.6 Review Highlight Thresholds — rows 36–37

C36 — Minimum distinct identifiers to flag (default: 2)

C37 — Red escalation if total occurrences ≥ (default: 7)

5.7 Archive Settings — rows 40–41

C40 — Archive Save Folder

C41 — Archive Base File Name

5.8 Config sheet — minimum viable configuration

6 The Run Sheet — Button Reference

7 Step 1 — Name Clustering (How It Works)

What this step produces

Columns in VN_Clusters

How the algorithm groups names

Example cluster output

Processing time

8 Step 2 — Oracle Reduce

What this step produces

Why deduplication is needed

Columns in Oracle_Unique

9 Step 2b — Inactive Vendor Reduce (Optional)

What this step produces

Skipping Step 2b

Merged title row handling

C23 — Cluster Similarity Threshold (default: `0.85`)

C24 — Match Similarity Threshold (default: `0.82`)

C25 — Clustering Window Size (default: `150`)

C36 — Minimum distinct identifiers to flag (default: `2`)

C37 — Red escalation if total occurrences ≥ (default: `7`)