User Manual & Job Aid | Version 1.0 | March 2026
The Vendor Normalization Engine (VNE) is an Excel macro workbook that identifies duplicate and near-duplicate names within any large list, groups those names into similarity clusters, and cross-references each cluster against a system extract to surface matching identifiers and flag unrecognised entries. The result is a structured, colour-coded review package ready for analyst sign-off.
The tool was built with Accounts Payable supplier lists as the primary use case, but it is not limited to that context. Any dataset that provides a name column and a corresponding identifier column can serve as both the source list and the reference extract. Practical applications include:
The reference extract (Oracle or otherwise) is connected via two configurable column headers on the Config sheet — one for the name column and one for the identifier column. No column names are hard-coded in the matching logic. The labels Vendor Number and Vendor Name that appear throughout the output sheets are the tool's internal terminology for the identifier and name roles respectively, regardless of what those columns are called in your source data.
In a large organisation, names accumulate over time with inconsistent spelling, punctuation, capitalisation, and abbreviations. A single supplier such as Capita may appear as CAPITA PLC, Capita Ltd, CAPITA LIMITED, and Capita, (with a trailing comma), all referring to the same entity. Without automated grouping, an analyst must identify these variants manually row by row — a process that is both time-consuming and error-prone at scale.
The VNE automates this by scoring every pair of names for similarity, merging similar names into clusters, and then attempting to match each cluster to a record in the reference extract. The final output is a colour-coded, collapsible worksheet that an analyst can work through systematically to confirm whether clustered names represent the same entity, flag potential duplicates, and verify system identifiers.
Before running any step, confirm that your Excel environment is ready and that you have the required input files available in an acceptable format. This section covers every prerequisite in detail, including what each file must contain, where it can be stored, and how to handle common variations in file structure.
The VNE is compatible with Excel 2016, 2019, 2021, and Microsoft 365 on Windows. It is not compatible with Excel for Mac (the FileDialog folder picker and certain charting methods used by the Archive module are Windows-only). Excel Online (browser) is not supported — macros do not run in the browser version.
When you open the .xlsm file, Excel displays a yellow security bar below the
ribbon that reads "SECURITY WARNING — Macros have been disabled." Click
Enable Content on that bar. Without this, no buttons or macros will
function. You will need to click Enable Content each time you open the file unless you
add the file's folder to Excel's Trusted Locations (see below).
If the workbook is stored in a location you control (local drive or a specific network share), you can prevent the security prompt from appearing on every open:
The Developer tab is needed to assign macros to buttons after running SetupWorkbook. To show it: File > Options > Customise Ribbon → tick Developer in the right-hand list → click OK.
The Name Source is any file that contains the raw list of names you want to cluster and analyse. In an AP context this is typically a daily or weekly transaction extract, a vendor master export, or a purchase order line-item dump. In other contexts it might be a manufacturer directory, a customer list, or an item description export.
C7 (e.g. enter A to use column A).| Format | Extension | Notes |
|---|---|---|
| Excel Workbook | .xlsx |
Standard format. Recommended. |
| Excel Macro-Enabled | .xlsm |
Accepted. Macros in the source file are not executed. |
| Excel 97–2003 | .xls |
Accepted. May be slower to open for large files. |
| CSV (comma-separated) | .csv |
Accepted. Opened in text mode to preserve leading zeros. All columns treated as text — no automatic date or number conversion. |
Network paths are fully supported. Enter the UNC path in the Config sheet, e.g.
\\server\shares\AP\Daily. The machine running the VNE must have read
access to that path. If the path requires authentication, ensure you are connected to
the network share before running the macro.
Many AP extracts are named with a date stamp, e.g. AP_Extract_14MAR2026.xlsx,
AP_Extract_15MAR2026.xlsx. You must update C5 on the Config
sheet each day to reflect the new file name, or use the Open Wizard
button to re-browse to the new file. The folder path in C4 only needs
to be set once if the file always lands in the same folder.
If you paste your name list directly into a sheet inside the VNE workbook, you can
skip the external file entirely. Leave C4 and C5 blank.
Enter the sheet tab name in C6 (or leave C6 blank to
auto-detect by column header). The engine will read from that sheet without opening
any external file.
The Reference Extract is the system-of-record file used to look up an official identifier for each name cluster. In an AP context this is the Oracle vendor master export. In other contexts it might be a manufacturer code list, a customer master export, or an item master — any file that pairs a name with a system identifier.
C13.C14.C13/C14.| Use case | Identifier column (C13) | Name column (C14) |
|---|---|---|
| AP supplier list | VENDOR_NUMBER | VENDOR_NAME |
| Manufacturer directory | MFR_CODE | MANUFACTURER_NAME |
| Customer master | CUSTOMER_ID | CUSTOMER_NAME |
| Item master | ITEM_NUMBER | ITEM_DESCRIPTION |
| No header row (column B = ID, column C = Name) | B | C |
The Reference Extract can be very large — Oracle vendor master exports of 500,000+ rows are common and are handled without issue. Step 2 reads the entire file into memory and deduplicates it, so the primary constraint is available RAM. On a machine with 8 GB RAM, files up to approximately 1 million rows process without difficulty. For files above that size, ensure no other large applications are running concurrently.
The extract almost certainly contains duplicate identifier + name combinations (one row per transaction or account line is typical). This is expected — Step 2 exists precisely to deduplicate the extract before Step 3 uses it. You do not need to pre-clean or deduplicate the reference file.
The Inactive file is a secondary reference that lists records that have been deactivated
or retired in the system, along with the date of deactivation. In AP this is the Oracle
inactive vendors report. This file is entirely optional — if you do not have one, or do
not need inactive matching, leave C29 blank and Step 2b will be skipped
automatically.
C31.C32.C33. The date value is preserved as-is in the output (no reformatting).Some Oracle inactive vendor reports are formatted with a large merged title cell in row 1 (e.g. "Oracle AP Inactive Suppliers — Exported 14-Mar-2026") and the actual column headers in row 2. The engine detects this automatically: if row 1 does not contain recognisable column headers, it advances to row 2 before scanning. You do not need to manually delete the title row from the file before running.
Step 2b only needs to be re-run when a new inactive vendors file arrives. In most organisations this is quarterly, coinciding with the full Oracle extract refresh. Once Inactive_Unique exists in the workbook, it remains available for all subsequent Step 3 runs until it is overwritten by a new Step 2b run.
| Trigger event | Steps to re-run | Steps to skip |
|---|---|---|
| New daily / weekly name source batch | Step 1, Step 3 (or use Quick Rerun) | Steps 2 and 2b — reference data is unchanged |
| New quarterly Oracle extract | Step 2, then Step 3 (and Step 1 if the name source also changed) | Step 2b if the inactive file has not changed |
| New quarterly inactive vendors file | Step 2b, then Step 3 | Steps 1 and 2 if unchanged |
| Changed a Config setting (threshold, window, column header) | Re-run the affected step and all downstream steps | Unaffected upstream steps |
| Full refresh (all files updated) | All steps in order: 2 → 2b → 1 → 3 (or use Run All) | Nothing |
First-time configuration takes approximately 10 minutes. Every subsequent daily run takes a single click via Quick Rerun.
| 1 | Open the workbook. Go to the Run sheet. If the Run and Config sheets do not yet exist, click Setup Workbook in Developer > Macros, or see Section 4. |
| 2 | Fill in the Config sheet. At minimum, provide: vendor name column header, Oracle vendor number column header, Oracle vendor name column header. Point each section to the correct source file. See Section 5 for every field. |
| 3 | Run Step 2 — Oracle Reduce. This is a one-time (quarterly) step that reads the Oracle extract and produces the Oracle_Unique reference sheet. Click Step 2 — Oracle Reduce on the Run sheet. |
| 4 | (Optional) Run Step 2b — Inactive Vendor Reduce. If an inactive vendors file is configured, click Step 2b — Inactive Vendor Reduce. This produces the Inactive_Unique reference sheet. |
| 5 | Run Step 1 — Name Clustering. This reads the daily vendor name source and groups similar names. Click Step 1 — Name Clustering on the Run sheet. A progress bar will appear. For large datasets this step may take several minutes. |
| 6 | Run Step 3 — Vendor Match. This matches each cluster to Oracle vendor numbers and produces both VN_Matched and VN_Review. |
| 7 | Review the results. Open VN_Review. Each cluster appears as a bold summary row with variant detail rows collapsed underneath. Click + to expand a cluster and inspect the spelling variants and their Oracle match status. |
| 8 |
Archive and hand off. When the first-pass review is complete and the
data looks ready for the junior analyst, click Archive Results on the
Run sheet. The engine exports a clean, macro-free .xlsx workbook to the
folder of your choice. See Section 13.
|
The VNE workbook ships as a macro-enabled file (.xlsm) containing all the
VBA modules but none of the working sheets. The first time you open it, the
Run, Config, and Log sheets do not yet exist. A one-time
setup procedure creates those sheets, populates all default labels and values, and
prepares the workbook for use. This takes approximately two minutes.
| 1 |
Open the .xlsm file. Click Enable Content on the
yellow security bar if prompted (see Section 2.1).
|
| 2 |
Press Alt + F8 to open the Macro dialog. Alternatively: go to Developer > Code > Macros. If you do not see the Developer tab, go to File > Options > Customise Ribbon and tick Developer. |
| 3 |
In the macro list, locate and select VNE3_Setup.SetupWorkbook. If the list is long, type Setup in the search box at the top to filter it.
Click Run.
|
| 4 |
The macro runs silently for a few seconds, then displays a confirmation message: "Workbook set up. Run sheet → assign macros to buttons. Config sheet → fill in column C. Log sheet → run history is written here automatically." Click OK. |
| 5 | Three new sheet tabs will now be visible at the bottom of the workbook: Run (amber tab), Config (light-blue tab), and Log. Navigate to the Run sheet to continue. |
SetupWorkbook creates nine button shapes on the Run sheet and writes the target macro name next to each one (column D). The buttons are not yet connected to the macros — you must assign them manually once. This is a one-time step per workbook.
| 1 | On the Run sheet, right-click the first button shape. A context menu appears. Choose Assign Macro… | ||||||||||||||||||||
| 2 | In the Assign Macro dialog, locate the macro name shown in column D next to that button. Click it in the list, then click OK. The button is now linked. | ||||||||||||||||||||
| 3 |
Repeat for each button. The full assignment table is:
|
||||||||||||||||||||
| 4 | Save the workbook (Ctrl + S). Button assignments are saved with the file. |
VNE1_Config.bas through VNE9_Archive.bas) are present in the
VBA Editor (Alt + F11, check the left-hand Project tree).
The workbook must be saved as .xlsm (Excel Macro-Enabled Workbook) to
retain the VBA modules and button assignments. If you save it as .xlsx,
Excel will strip out all macros.
To confirm the file type: check the title bar — it should show the file name ending in
.xlsm. If you see a "Some features may be lost" warning when saving,
choose Keep Current Format (not Use Office Open XML Format).
| Sheet | Tab colour | Purpose | User edits? |
|---|---|---|---|
| Run | Amber | Nine button shapes, each labelled with its function and the corresponding macro name. This is the primary operating screen — all day-to-day interactions happen here. | No — do not add data to this sheet. Button labels are refreshed by SetupWorkbook. |
| Config | Light blue | All user-configurable settings. Column B = label (read-only), column C = your value, column D = hint / default. Every path, file name, column header, threshold, and archive setting lives here. | Yes — this is the only sheet you should edit. All settings are entered in column C. |
| Log | Default | Timestamped record of every macro run — module name, outcome, row counts, file paths. Written automatically. | No — written by the macros. You may read it and filter it but should not edit it. |
The output sheets (VN_Clusters, Oracle_Unique, Inactive_Unique, VN_Matched, VN_Review) do not exist yet after setup. They are created automatically the first time each corresponding step is run.
All user-configurable settings are stored on the Config sheet in column C. Column B contains the read-only label and column D the hint or default reminder. No settings exist anywhere else in the workbook — the Config sheet is the single point of configuration. If you need to change a path, column header, threshold, or archive setting at any point, this is the only place to edit it.
This section tells the engine where to find the raw list of names to be clustered. This is the only input consumed by Step 1.
The folder path containing the source file. Examples:
C:\AP\Daily Extracts — local drive\\fileserver\AP\Extracts — UNC network pathD:\Projects\VNE\Input — alternate local driveLeave blank if the source data is on a sheet inside this workbook (no external file is needed). Do not include a trailing backslash — the engine appends one automatically.
The exact file name including extension. Examples:
AP_Extract_14MAR2026.xlsxVendor_List_Q1_2026.csvraw_names.xlsLeave blank to read from a sheet inside this workbook. If both C4 and C5 are blank, C6 and C7 are used to locate a sheet in this workbook.
If your file name changes with each extract (e.g. date-stamped names), you must update this cell each time a new file arrives. There is no wildcard support — the exact name must be entered.
The sheet tab name within the file (external or internal). Examples:
Sheet1, AP Export, Vendor_Names.
Leave blank to auto-detect. When blank, the engine scans all sheets in the file and returns the first sheet whose header row contains the column specified in C7. This is convenient when the sheet name varies between file versions. If your file has only one data sheet, you can reliably leave this blank.
Identifies which column in your source file contains the names to cluster. Three input formats are accepted:
| Format | Example | When to use |
|---|---|---|
| Header name (text) | VENDOR_NAME |
Your file has a header row and you know the column heading. This is the most reliable option — it works regardless of column position. |
| Column letter | C |
Your file has no header row, or you prefer to specify position directly. |
| Column number | 3 |
Same as column letter but numeric. Column A = 1, B = 2, C = 3, etc. |
This cell cannot be left blank — it is the only truly required field for Step 1. If it is blank when Step 1 is run, the engine will display a validation error and stop.
vendor_name,
Vendor_Name, or VENDOR_NAME will all work if your column header is
VENDOR_NAME.
This section tells the engine where to find the system-of-record reference file used to match clusters to official identifiers. In an AP context this is the Oracle vendor master export. In other contexts it may be any file containing an identifier + name pairing.
Folder path of the reference extract, using the same format as C4. Leave blank if the data is in a sheet inside this workbook. Examples:
C:\Oracle Exports\Q1 2026\\server\finance\oracle\vendor_masterExact file name including extension. Leave blank to read from a sheet in this workbook. The reference extract is typically provided by your Finance or IT team on a quarterly basis — the file name here will need to be updated each quarter.
Sheet tab name within the file. Leave blank to auto-detect by the column headers specified in C13 and C14. If the file has multiple sheets, leaving this blank will return the first sheet that contains both required column headers.
The column in your reference file that contains the system identifier — the code or number
that uniquely identifies each record. This value is carried into the output sheets as
VENDOR_NUMBER (the tool's internal label for this role). Examples:
| Your column header | Use case |
|---|---|
VENDOR_NUMBER | Oracle AP vendor master |
SUPPLIER_ID | Alternative Oracle export format |
MFR_CODE | Manufacturer directory |
CUSTOMER_ID | Customer master |
ITEM_NUMBER | Item master |
B | No header row, identifier in column B |
00123456), use a .csv source file rather
than .xlsx for the reference extract. The engine opens CSV files in text mode,
which prevents Excel from stripping leading zeros. Alternatively, ensure the identifier
column in your .xlsx file is formatted as Text before saving.
The column in your reference file containing the official name associated with each
identifier. This is the value that Step 3 normalises and compares against the clustered
names from Step 1. The column is labelled VENDOR_NAME in the output sheets.
Enter the exact header name, a column letter, or a column number — same three formats as C7. A header name is recommended for reliability.
The name in this column does not need to be a perfect match to the names in your source list — the engine normalises both sides (strips suffixes, punctuation, case) before comparing. However, the more the source names resemble the reference names after normalisation, the higher the match rate will be.
By default, the four output sheets are named VN_Clusters,
Oracle_Unique, VN_Matched, and Inactive_Unique.
If your organisation has a naming convention or you are running the VNE for a specific
project and want descriptive tab names, you can override the defaults here.
| Cell | Output sheet | Default | When you might override |
|---|---|---|---|
C17 |
Step 1 clusters output | VN_Clusters |
Running VNE for manufacturers: rename to MFR_Clusters |
C18 |
Step 2 reference unique output | Oracle_Unique |
Non-Oracle source: rename to Ref_Unique or MFR_Master |
C19 |
Step 3 full match output | VN_Matched |
Rename to MFR_Matched for clarity in a manufacturer context |
C20 |
Step 2b inactive unique output | Inactive_Unique |
Rename to Retired_Items or Discontinued_MFR |
These three settings control the clustering algorithm's sensitivity and speed. The defaults were calibrated against real AP data and are appropriate for most use cases. Do not change them unless you have a specific reason and have validated the effect on a test dataset.
0.85)Two names must score at or above this value to be merged into the same cluster. The scale runs from 0 (completely different) to 1 (identical). In practice:
| Value | Behaviour | Risk |
|---|---|---|
0.90 – 1.00 |
Very conservative. Only near-identical names cluster together (e.g. same name with one letter difference). Most spelling variations will land in separate clusters. | High miss rate — obvious variants not grouped. |
0.85 (default) |
Balanced. Handles capitalisation, punctuation, common suffixes, and minor spelling differences. Most vendor name variations encountered in AP practice cluster correctly. | Occasional false positives in densely similar name spaces. |
0.75 – 0.84 |
More aggressive. Catches more distant variants (e.g. abbreviations, missing words). Some unrelated names with similar character sequences may be incorrectly grouped. | False positive rate rises noticeably below 0.80. |
Below 0.75 |
Very aggressive. Clusters will be large and likely to contain unrelated names. Not recommended for production use. | High false positive rate. Results unreliable. |
When to consider a lower threshold: if your source data uses extreme abbreviations (e.g. ATLS GRHMN instead of ATLAS GRAHAM), or names that are always shortened in transactions, a threshold of 0.80–0.82 may produce better results. Test on a small known dataset first.
When to consider a higher threshold: if your dataset contains many legitimate vendor names that are genuinely similar to each other (e.g. a directory of related companies such as Smith Engineering and Smith Consulting), a threshold of 0.87–0.90 may reduce false merges.
0.82)This field is reserved for future use. Step 3 currently performs exact matching only (after normalisation) — it does not perform fuzzy matching against the reference extract. The value here has no effect on current results. It is retained in the Config sheet to support a planned future enhancement that will add a fuzzy fallback pass for names with no exact match. Leave it at the default.
150)The sliding window controls how many adjacent names are compared against each entry in the sorted list. A larger window catches more distant variant pairs but increases processing time significantly (approximately proportional to window² × unique name count).
| Dataset size (unique names) | Recommended window | Approx. Step 1 time |
|---|---|---|
| Up to 5,000 | 150 (default) | 2–5 minutes |
| 5,000 – 15,000 | 150 | 5–15 minutes |
| 15,000 – 30,000 | 100 | 10–20 minutes |
| 30,000 – 60,000 | 75 | 15–30 minutes |
| Above 60,000 | 50 | 20–45 minutes |
Reducing the window means the engine will only compare each name against the nearest N names in sorted order. Names that are similar but alphabetically distant may not be caught. In practice, most genuine near-duplicate vendor name variants are alphabetically close (e.g. Capita variants all start with C), so even a window of 75 captures the majority of true clusters.
This entire section is optional. If you do not have an inactive records file, or do not
need to distinguish between active and inactive matches, leave all six cells blank.
The engine will skip Step 2b automatically and the Inactive_Unique sheet will
not be created. The MATCH_NOTE column in output will contain only
"exact match" or "no match found".
Folder path of the inactive records file. Same format as C4. Leave blank if the file is a sheet in this workbook.
Exact file name including extension. This is the master switch for Step 2b. If this cell is blank, the engine ignores all other settings in this section and skips Step 2b entirely — whether you run it individually, via Run All, or via Quick Rerun. You do not need to configure or clear the other cells in this section to suppress Step 2b; simply leaving C29 blank is sufficient.
Sheet tab name within the inactive file. Leave blank to default to the first sheet. The engine automatically detects merged title rows in row 1 — if row 1 appears to be a report header rather than column headers, it advances to row 2 before scanning. You do not need to modify the file before running.
The column containing the name of the deactivated entity. Header name, column letter,
or column number. Examples: Vendor Name, SUPPLIER_NAME,
B, 2.
The column containing the system identifier of the deactivated entity. Header name,
column letter, or number. Examples: Vendor Number, SUPPLIER_ID.
The column containing the date the record was deactivated. Header name, column letter, or
number. Examples: Vendor Inactive Date, INACTIVE_DATE,
Deactivation Date.
The date value is read as-is and stored without reformatting. It appears in the
VENDOR_INACTIVE_DATE column of Inactive_Unique, VN_Matched,
and VN_Review, giving the analyst visibility of when the entity was retired.
These two settings control the colour-coding applied to summary rows in the VN_Review sheet. They let you tune the visual priority system to your dataset's characteristics.
2)A cluster's summary row in VN_Review is highlighted (amber) when the cluster contains this many or more distinct system identifiers matched across its variants. A cluster with two different vendor numbers for what appears to be the same supplier is a potential duplicate supplier setup and warrants review.
Setting this to 1 would highlight every cluster that has any match at all — producing too much noise in most datasets. The default of 2 means you only see amber rows when there is a genuine multi-ID situation.
Raising this value (e.g. to 3) means only clusters with three or more distinct IDs receive a highlight. Useful if your dataset legitimately contains a large number of single-company groups that have two Oracle records for different legal entities.
7)Within already-amber clusters (those meeting the C36 threshold), the highlight is escalated to red if the total occurrence count across all variants in the cluster is at or above this value. Red rows in VN_Review represent the highest-risk items — high-volume, multi-ID clusters with many transactions potentially split across duplicate system records.
Lowering this value produces more red rows — more items are treated as high priority. Useful for datasets with a high proportion of low-volume transactions, where even 3–4 occurrences indicates a recurrent issue.
Raising this value restricts red rows to only the highest-volume clusters. Useful when the dataset is very large and you want the analyst to focus exclusively on the very top items.
These two settings provide defaults for the Archive dialog that appears when you click Archive Results on the Run sheet. They can be changed directly in the dialog at archive time — updating them there also saves the new values back to these cells automatically, so they become the new default for the next run.
Default folder path where archive files are saved. Examples:
C:\AP\Review Archives\\server\AP\VNE Archives\2026If the specified folder does not exist when Archive Results is run, the macro will attempt to create it. If it cannot (e.g. no write permission to the network path), you will receive an error and be prompted to choose a different location. Leave blank to be prompted with an empty path in the Archive dialog each time — you will then need to type or browse to the destination manually on every archive run.
The base name used to construct the archive file name. The macro appends the run date and a version suffix automatically:
BaseName_DD_MMM_YYYY.xlsx
BaseName_DD_MMM_YYYY_V2.xlsx (if V1 already exists)
Choose a base name that identifies the dataset and context. Examples:
| Context | Suggested base name | Resulting file name |
|---|---|---|
| AP supplier review | VNE_Review |
VNE_Review_14_MAR_2026.xlsx |
| Quarterly full refresh | VNE_Q1_2026 |
VNE_Q1_2026_14_MAR_2026.xlsx |
| Specific operating unit | VNE_EMEA_Suppliers |
VNE_EMEA_Suppliers_14_MAR_2026.xlsx |
| Manufacturer review project | MFR_NormReview |
MFR_NormReview_14_MAR_2026.xlsx |
Avoid characters that are not valid in Windows file names: \ / : * ? " < > |.
Spaces are permitted but underscores are recommended for compatibility with systems that
do not handle spaces in file paths.
For users who want to get running as quickly as possible, the following three cells are the absolute minimum required before Step 1 and Step 3 can execute:
| Cell | What to enter | Why it's required |
|---|---|---|
C7 |
Header of the name column in your source file | Step 1 cannot identify which column to read without this. |
C13 |
Header of the identifier column in your reference file | Step 2 cannot build Oracle_Unique without this. |
C14 |
Header of the name column in your reference file | Step 2 cannot build Oracle_Unique without this. |
All path fields (C4, C5, C10, C11) can be left blank if your source data is already on sheets inside this workbook. All other settings have functional defaults and do not need to be populated before your first run.
The Run sheet is the primary entry point. Every macro function is accessible from here. Column B shows the button label, column C the description, and column D the macro being called.
| Button | What it does | When to use it |
|---|---|---|
| Open Wizard | Opens the configuration wizard form (frmVNE). Allows you to browse to source files, map column headers, choose which steps to run, and save the settings to the Config sheet — all from a single guided dialog. | First time setup, or when source files have changed and you need to update mappings. |
| Quick Rerun | Runs Steps 1 and 3 (Clustering + Match) back-to-back immediately, reading all settings from the Config sheet. No form is displayed. | Daily re-run with a new vendor name batch when Oracle data has not changed. |
| Step 1 — Name Clustering | Runs the clustering algorithm only. Reads the vendor name source, groups similar names, and writes VN_Clusters. | When you need to re-cluster after the vendor name file changes, without re-running the match (e.g. to check clustering quality before committing a full run). |
| Step 2 — Oracle Reduce | Reads the Oracle extract, deduplicates at the Vendor Number + Vendor Name level, and writes Oracle_Unique. | When a new quarterly Oracle extract arrives. |
| Step 2b — Inactive Vendor Reduce | Reads the Inactive Vendors file and writes Inactive_Unique. Skipped automatically if no inactive file is configured. | When a new quarterly inactive vendors report arrives. |
| Step 3 — Vendor Match | Matches each cluster against Oracle_Unique (active vendors) and then against Inactive_Unique (if available). Writes VN_Matched and VN_Review. | After clustering is complete, to produce the final review sheets. |
| Run All (Steps 1 + 2 + 2b + 3) | Runs all four steps in sequence with a confirmation prompt. Step 2b is skipped if not configured. This is equivalent to clicking each step button in order. | Full refresh when both the vendor name source and Oracle data have changed. |
| Setup Workbook | Re-creates or refreshes the Run, Config, and Log sheets. Safe to re-run at any time. | First-time setup, or after importing the macro modules into a new workbook. |
| Archive Results | Exports all output sheets (plus a Summary tab) to a clean, macro-free
.xlsx file for handoff to the junior analyst. |
When the first-pass review is complete and the data is ready to hand off. |
Step 1 reads the vendor name source, groups similar names into numbered clusters, and writes the results to the VN_Clusters sheet.
| Column | Description |
|---|---|
CLUSTER_ID |
An integer that uniquely identifies a group of similar names. All rows sharing the same CLUSTER_ID were determined to be similar enough to represent the same vendor. |
VENDOR_NAME |
The original vendor name as it appeared in the source file, with no modification. |
OCCURRENCE_COUNT |
How many times this exact name appeared in the source file. |
RANK |
Rank within the cluster by occurrence count. Rank 1 = most frequent name in the cluster, which is used as the cluster's representative name in downstream sheets. |
CLUSTER_SIZE |
Total number of distinct name variants in this cluster (same value on every row of the cluster). A CLUSTER_SIZE of 1 means the name appeared in only one form — no duplicates were detected. |
The engine uses a sorted sliding-window union-find approach:
For a dataset of approximately 5,000 unique names with the default window of 150, Step 1 typically completes in 3–8 minutes depending on hardware. The progress bar shows elapsed time and percentage completion. The bar resets to 0% between the two clustering passes — this is normal behaviour.
Step 2 reads the Oracle vendor master extract and collapses it to one row per unique Vendor Number + Vendor Name combination, writing the result to the Oracle_Unique sheet.
The raw Oracle extract typically contains one row per transaction or account line, meaning the same Vendor Number + Vendor Name pair may appear thousands of times. Step 3's matching logic requires a clean, non-redundant reference list. Step 2 reduces a file that might contain 500,000 rows to a manageable list of perhaps 50,000–60,000 unique vendor records.
| Column | Description |
|---|---|
VENDOR_NUMBER |
The Oracle vendor number as it appeared in the source file. Leading zeros are preserved. |
VENDOR_NAME |
The Oracle vendor name associated with this vendor number in the extract. |
OCCURRENCE_COUNT |
How many times this Vendor Number + Vendor Name combination appeared in the source extract. |
Step 2b reads the Inactive Vendors file (a separate Oracle report listing deactivated vendors) and writes one row per unique Vendor Number + Vendor Name pair to the Inactive_Unique sheet.
If cell C29 (Inactive File Name) on the Config sheet is blank, Step 2b is
automatically skipped — both when run individually and when triggered by Run All.
The VN_Matched and VN_Review sheets produced by Step 3 will omit the
inactive-vendor matching phase (Phase B), and the VENDOR_INACTIVE_DATE
column will be blank for all rows.
Some Oracle inactive vendor reports have a merged title row in row 1, with actual column headers in row 2. The engine detects this automatically — if row 1 appears to be a single merged cell or does not contain recognisable header values, it advances to row 2 before scanning for headers.
| Column | Description |
|---|---|
VENDOR_NUMBER |
Oracle vendor number. |
VENDOR_NAME |
Vendor name as it appears in the inactive report. |
VENDOR_INACTIVE_DATE |
The date on which the vendor was deactivated in Oracle. |
Step 3 takes every cluster from VN_Clusters and attempts to find a matching Oracle vendor record for each name variant within the cluster. It writes two output sheets:
For each name variant within a cluster, the engine attempts to find a match in two phases:
Produced by Step 1. One row per unique vendor name variant. AutoFilter enabled. Use this sheet to verify clustering quality — filter by CLUSTER_SIZE > 1 to see all multi-variant groups, or filter by a specific CLUSTER_ID to see all variants in one cluster.
Produced by Step 2. One row per unique Vendor Number + Vendor Name pair in Oracle. AutoFilter enabled. Use this as a reference when cross-checking vendor numbers in the review process.
Produced by Step 2b. One row per unique inactive Vendor Number + Vendor Name pair.
AutoFilter enabled. Columns: VENDOR_NUMBER, VENDOR_NAME,
VENDOR_INACTIVE_DATE.
Produced by Step 3. One row per name variant × Oracle match combination. This is the flat, unformatted version of the results — it contains every row including variants with no Oracle match. AutoFilter enabled.
| Column | Description |
|---|---|
CLUSTER_ID |
Cluster number from Step 1. |
CLUSTER_REP_NAME |
The most-frequently-occurring name in the cluster (Rank 1 from VN_Clusters). |
CLUSTER_TOTAL_COUNT |
Total occurrence count across all name variants in the cluster. |
VENDOR_NUMBER |
Oracle vendor number found by Phase A or B. Blank if no match. |
ORACLE_VENDOR_NAME |
The vendor name as it appears in Oracle_Unique or Inactive_Unique. Blank if no match. |
MATCH_NOTE |
exact match — active Oracle vendor found inactive vendor — found in Inactive_Unique only no match found — not found in either reference |
VENDOR_INACTIVE_DATE |
Inactive date from Inactive_Unique. Populated only for "inactive vendor" rows. |
VN_Review is the primary working sheet for the AP analyst. It shows only multi-member clusters (CLUSTER_SIZE > 1) — single-name clusters are excluded as they require no review. It presents results in a collapsed, grouped view that allows the analyst to triage quickly at the cluster level and then drill into individual variants as needed.
Each cluster is represented by two types of rows:
| Colour | Meaning |
|---|---|
| White / alternating blue — summary rows | Standard multi-variant cluster. Contains spelling variants but no flags have been triggered. |
| Amber / yellow — summary row | The cluster has 2 or more distinct Oracle vendor numbers (configurable via
C36). This signals that what may be one real-world vendor is recorded
under multiple Oracle IDs — a potential duplicate vendor setup. |
| Red — summary row | The cluster is amber-flagged AND the total occurrence count is at or above the
high-volume threshold (default 7, configurable via C37). These are
high-priority clusters for immediate review. |
| Gray — detail rows | Standard variant rows within a cluster. |
When the first-pass review on VN_Review is complete and the data is ready for granular verification by the junior analyst, use the Archive Results button to export a clean handoff package.
The archive is a standard Excel .xlsx workbook with no macros. It contains:
The Run, Config, and Log sheets are excluded — the junior analyst does not need configuration data or the macro infrastructure.
| 1 | Click Archive Results on the Run sheet. If VN_Review does not exist (Steps 1 and 3 have not been run), the macro will display an error and stop. |
| 2 |
The Archive dialog opens, pre-filled with the save folder and base file
name from the Config sheet (C40/C41).
|
| 3 | Click Archive. The macro copies the output sheets, builds the Summary tab, and saves the file. A confirmation message shows the full path of the saved file. |
The archive file name is constructed as follows:
BaseName_DD_MMM_YYYY.xlsx
For example: VNE_Review_14_MAR_2026.xlsx
If that file already exists in the chosen folder, a version suffix is appended automatically:
VNE_Review_14_MAR_2026_V1.xlsx
VNE_Review_14_MAR_2026_V2.xlsx (if V1 also exists)
Your chosen folder and base name are saved back to the Config sheet for use as the default in the next archive run.
The Summary tab is generated automatically and placed as the first sheet in the archive workbook. It is designed to give the junior analyst an immediate orientation to the dataset they are about to review, without requiring them to open any other tab first.
A statistics block covering:
A guide section explaining each tab the analyst will encounter, with a plain-language description of its content and purpose.
A table listing the 10 clusters with the most distinct name variants. Each CLUSTER_ID in this table is a hyperlink — clicking it navigates directly to that cluster's summary row in the VN_Review tab, allowing the analyst to jump straight to the highest-complexity items.
Clusters that meet the amber threshold are highlighted in amber. Rows in the table that also meet the red threshold are highlighted in amber (the full red treatment is on VN_Review itself).
Three charts are generated automatically:
A chart data area (columns H–I on the Summary sheet) contains the raw numbers behind each chart in tabular form. The CLUSTER_ID values in this table are also hyperlinked to VN_Review.
While any step is running, the progress bar form displays a Cancel button. Clicking it sets an internal abort flag. The engine checks this flag at the end of each step — the current step will complete normally, and the engine will stop before starting the next step.
If a run is cancelled mid-way through a multi-step sequence (e.g. Quick Rerun cancels after Step 1 completes but before Step 3), the VN_Clusters sheet will have been updated with fresh Step 1 results, but VN_Matched and VN_Review will still reflect the previous run. Re-run Step 3 individually to bring those sheets up to date.
The Log sheet records a timestamped entry for every macro run. It is written automatically and does not require any action from the user.
| Column | Content |
|---|---|
Timestamp |
Date and time the entry was written. |
Source |
The module that wrote the entry (e.g. VNE5_NameCluster,
VNE9_Archive). |
Message |
A summary of the outcome, e.g. OK: 4604 clusters, 5434 rows processed, or an error description if the step failed. |
This section describes the internal mechanics of the clustering algorithm for users who need to interpret unusual results or adjust configuration parameters.
Before any comparison, each vendor name is normalised:
CAPITA PLC, Capita Ltd, and CAPITA LIMITED all normalise to
CAPITA and therefore match exactly in Step 3.
The score between two normalised names is computed as follows:
Without this guard, a chain of transitively similar names can collapse into a single enormous cluster: if A ~ B and B ~ C but A and C are unrelated, all three would merge. The VNE prevents this by requiring that the cluster roots (the representative names of each cluster being merged) also score above threshold. Two single-name entries are always allowed to merge — the guard only applies when at least one side is already a multi-member cluster.
If you believe two names should be in the same cluster but are not, possible causes are:
C23 to 0.80 and
re-running Step 1 — but validate that this does not create false merges elsewhere.C25.If a single cluster contains names that appear unrelated, possible causes are:
C23 to 0.87 or 0.90.You clicked Archive Results but Step 3 has not been run yet. Run Step 3 (or Quick Rerun) before archiving.
The workbook has not been set up. Press Alt + F8, select
VNE3_Setup.SetupWorkbook, and run it.
Cell C7 on the Config sheet is blank. Enter the exact column header of the
vendor name column in your source file (e.g. VENDOR_NAME) and try again.
The file path and/or file name on the Config sheet does not point to an existing file. Verify that the folder and file name are correct, that the file is not currently open in another Excel instance, and that you have read access to the network location.
This is normal behaviour during Step 1. The clustering algorithm runs two passes over the data (Pass A and Pass B). The progress bar completes 0→100% for each pass. The brief reset between passes is intentional — it indicates that the first pass completed successfully and the second pass has started.
Processing time for Step 1 scales with the number of unique vendor names × the window size.
For datasets above 10,000 unique names, consider reducing the window size from 150 to 75 in
C25. This reduces recall (some distant variant pairs may not be caught) but
significantly reduces processing time. For very large datasets (>30,000 unique names),
reduce to 50.
Step 3 filters VN_Review to show only clusters with CLUSTER_SIZE > 1. If every name in the source file is unique (no repeated or similar vendor names), all clusters will have size 1 and VN_Review will be empty. Check VN_Clusters to confirm whether any multi-member clusters were produced.
This typically means Oracle_Unique was produced from a different Oracle extract than the vendor name source — the vendor names in the source do not correspond to any active Oracle records in the current Oracle extract. Verify that Steps 2 and 3 are reading from the correct files and that the column headers are mapped correctly.
The date in the archive file name is the date and time the archive macro was run —
not the date of the underlying data. If you need the data date in the file name, edit
the base file name in the Archive dialog to include it manually, e.g.
VNE_Review_Q1_2026.
CLUSTER_ID values are assigned sequentially in the order clusters are first encountered during output writing. They are not stable identifiers — the same group of vendor names may receive a different CLUSTER_ID in a subsequent run if the sort order or grouping changes. Do not use CLUSTER_ID as a persistent key in external systems.
If a vendor has been deactivated in Oracle but the current Oracle_Unique sheet was produced before the deactivation (i.e. Step 2 was last run before the vendor was deactivated), the vendor will still appear as active in the reference data. Re-run Step 2 with the latest Oracle extract to resolve this.
| Term | Definition |
|---|---|
| Cluster | A group of vendor name variants that the algorithm has determined are similar enough to represent the same real-world vendor. Identified by a CLUSTER_ID. |
| Cluster size | The number of distinct name variants in a cluster. CLUSTER_SIZE = 1 means only one spelling was found; CLUSTER_SIZE = 12 means twelve different spellings were grouped together. |
| Representative name | The name variant within a cluster with the highest occurrence count (RANK = 1). Used as the cluster label in VN_Matched and VN_Review. |
| Exact match | A vendor name variant whose normalised form matched a record in Oracle_Unique exactly. Displayed in MATCH_NOTE as "exact match". |
| Inactive vendor | A vendor name variant that matched a record in Inactive_Unique but not in Oracle_Unique. The vendor exists in Oracle but has been deactivated. |
| No match found | A vendor name variant that did not match any record in Oracle_Unique or Inactive_Unique after normalisation. May indicate a new vendor, a severe abbreviation, or a data entry error. |
| Complete-linkage guard | An internal safeguard that prevents the clustering algorithm from forming extremely large catch-all clusters by requiring that cluster roots (not just leaf members) also score above the similarity threshold before a merge is permitted. |
| Sliding window | The technique used in Step 1 to limit the number of pair comparisons. Only names within a window of N adjacent entries in the sorted list are compared, rather than every possible pair (which would be computationally prohibitive). |
| Oracle extract | A full export of the Oracle vendor master, typically produced quarterly by the Finance or IT team. Used in Step 2 to build Oracle_Unique. |
| Inactive vendors file | A separate Oracle report listing vendors with a deactivation date. Used in Step 2b to build Inactive_Unique. |
| Archive workbook | The macro-free .xlsx file produced by the Archive Results function.
Intended for distribution to the junior analyst for granular review and manual
verification against Oracle. |
| Quick Rerun | A single-click entry point that runs Steps 1 and 3 back-to-back using the settings already saved on the Config sheet. No form is displayed. |
| Token-sort key | A form of the vendor name where the individual words (tokens) have been sorted alphabetically before being re-joined. Used in Pass B of the clustering algorithm to catch names whose words appear in different orders. |