Disambiguating Organizations and Groups
Organizations (legal entities that have been formed for business or social purposes) and Groups (groups of people formed together for a specific purpose) are used throughout GRC and SecOps software.
- They are used to define owning organizations of Accounts within the various GRC and SecOps products.
- They are used to define Contributors to content within GRC and SecOps products.
- They produce Assets that need to be configured.
- They can be named as Threat Actors (either as organizations or groups).
The Use of Organizations and Groups in various GRC and SecOps schemas
We’ve examined the various standards we have available to us, and we have to tell you, there’s a lot of room for ambiguity here.
StratML
StratML doesn’t have an API to find its content, and therefore, there really isn’t much information about organizations. Basically they track an organization’s name and description.
OSCAL
OSCAL tracks organizations as parties, and has more information than StratML, but not much more. They track a full name as well as a short-name (Apple Computer, Inc. vs. Apple), as well as the addresses and phone numbers of the organization.
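To make those fields concrete, here is a minimal Python sketch of a party record as it might appear in OSCAL JSON metadata. The property names follow our reading of the OSCAL metadata model. The organization, address, and phone number are made-up placeholders, not real data.

```python
import json
import uuid

# Hypothetical organization; every value below is a placeholder.
party = {
    "uuid": str(uuid.uuid4()),          # OSCAL identifies parties by UUID
    "type": "organization",             # as opposed to "person"
    "name": "Example Widgets, Inc.",    # full name
    "short-name": "Example Widgets",    # abbreviated name
    "addresses": [
        {
            "type": "work",
            "addr-lines": ["100 Example Street"],
            "city": "Springfield",
            "state": "IL",
            "postal-code": "00000",
            "country": "US",
        }
    ],
    "telephone-numbers": [{"type": "office", "number": "+1-555-0100"}],
}

# Serialize the record the way it would sit inside an OSCAL JSON document.
print(json.dumps(party, indent=2))
```

Note that nothing in this record says what industry the organization is in or which other names it goes by, which is where the ambiguity starts.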
STIX
STIX tracks an organization as an identity and then has contact information for that identity that can be mapped to addresses and phone numbers. It also has an identity class, though we couldn't find much information about what categorization scheme that class belongs to.
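As best we can tell from the STIX 2.1 specification, `identity_class` draws on a short open vocabulary of entity types rather than an industry categorization scheme, and `contact_information` is a single free-text string. The sketch below builds a minimal identity dictionary under those assumptions. The helper name `make_identity` and the sample organization are ours, for illustration only.

```python
import uuid
from datetime import datetime, timezone

# Values from the STIX 2.1 identity-class open vocabulary (our reading of the spec).
IDENTITY_CLASSES = {"individual", "group", "system", "organization", "class", "unknown"}


def make_identity(name: str, identity_class: str, contact_information: str = "") -> dict:
    """Build a minimal STIX-style identity dictionary (illustrative helper)."""
    # The vocabulary is "open", so STIX itself tolerates other values;
    # this sketch is deliberately stricter to catch typos.
    if identity_class not in IDENTITY_CLASSES:
        raise ValueError(f"unexpected identity_class: {identity_class!r}")
    # STIX timestamps are UTC with a trailing "Z"; keep millisecond precision.
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
    identity = {
        "type": "identity",
        "spec_version": "2.1",
        "id": f"identity--{uuid.uuid4()}",
        "created": now,
        "modified": now,
        "name": name,
        "identity_class": identity_class,
    }
    if contact_information:
        identity["contact_information"] = contact_information
    return identity


org = make_identity("Example Widgets, Inc.", "organization", "100 Example Street, +1-555-0100")
grp = make_identity("Example Intrusion Set Crew", "group")
```

Because the contact details are free text, mapping them to structured addresses and phone numbers still takes parsing on the consumer's side.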
OASIS LegalXML
OASIS’ LegalXML tracks an organization and its various legal names, but provides little else.
ISO 19770-3
ISO tracks about the same as many of the others, including physical address and primary email address.
CPE v2.3
CPE version 2.3 only tracks the name of an organization. And many times it gets that wrong too.
SWID
The intended replacement for CPE is the Software Identification (SWID) tag. This, too, only tracks organization names.
VERIS
VERIS tracks organizations and groups as either actors or victims. It tracks the physical addresses of both (where it has them), as well as the industry the actor or victim falls into if it is an organization (we don’t know which industry classification system it uses).
Manual Disambiguation is arduous
Manual disambiguation is arduous and tedious. And because there are so many (relatively) inexpensive APIs out there, why do it manually? To disambiguate organizations manually without relying on external systems like ClearBit or ZoomInfo, consider the following strategies:
Use Unique Identifiers:
- Assign a unique internal identifier to each organization in your database.
- Use universally recognized identifiers such as tax identification numbers (TINs), employer identification numbers (EINs), or DUNS numbers.
Enhanced Metadata:
- Collect and store additional metadata for each organization, such as:
- Industry sector
- Headquarters location
- Year of establishment
- Primary contact details
Contextual Keywords:
- Implement a tagging system with contextual keywords related to the organization's activities, products, or services.
Structured Naming Conventions:
- Develop and enforce a consistent naming convention that includes the organization's full legal name, any commonly used acronyms, and location-specific identifiers.
Data Normalization and Standardization:
- Apply data normalization techniques to standardize organization names and other attributes (e.g., converting all names to a consistent format like uppercase or lowercase).
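A minimal sketch of this kind of name normalization, using only the Python standard library. The list of legal suffixes is a small hypothetical sample, and a real deployment would need a much longer, jurisdiction-aware list.

```python
import re
import unicodedata

# Hypothetical, deliberately short list of legal-form suffixes to strip.
LEGAL_SUFFIXES = {
    "inc", "incorporated", "llc", "ltd", "limited",
    "corp", "corporation", "co", "company",
}


def normalize_org_name(name: str) -> str:
    """Return a comparison key for an organization name."""
    # Fold Unicode compatibility forms and case so visually equal names match.
    text = unicodedata.normalize("NFKC", name).casefold()
    # Replace punctuation with spaces, then split on any run of whitespace.
    tokens = re.sub(r"[^\w\s]", " ", text).split()
    # Drop trailing legal-form words such as "Inc" or "LLC".
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


print(normalize_org_name("Apple Computer, Inc."))
print(normalize_org_name("  APPLE   Computer Inc "))
```

Both calls produce the same key, which is the point: the key is for matching, and the original legal name should still be stored untouched alongside it.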
Cross-referencing:
- Cross-reference organizations with publicly available databases, such as government registries or industry directories, to verify and update information.
Machine Learning Algorithms:
- Use machine learning algorithms to identify and merge duplicate records by analyzing patterns in the data (e.g., name similarity, address matching).
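As a stand-in for a trained model, the sketch below flags candidate duplicates with plain string similarity from Python's `difflib`. This is not machine learning, just the simplest baseline for the name-similarity idea. The function name and the 0.85 threshold are assumptions you would tune against your own data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_candidate_duplicates(names, threshold=0.85):
    """Return (name_a, name_b, score) for every pair scoring at or above threshold."""
    candidates = []
    for a, b in combinations(names, 2):
        # Compare case-insensitively; ratio() is 1.0 for identical strings.
        score = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
        if score >= threshold:
            candidates.append((a, b, score))
    return candidates


records = ["Acme Corporation", "Acme Corpration", "Globex"]
for a, b, score in find_candidate_duplicates(records):
    print(f"possible duplicate: {a!r} ~ {b!r} ({score:.2f})")
```

Pairwise comparison grows quadratically with the number of records, so anything beyond a few thousand names needs blocking (for example, comparing only names that share a first token) before scoring.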
Human Review and Verification:
- Implement a manual review process where discrepancies or potential duplicates are flagged for human verification.
Integrate with CRMs and ERPs:
- Synchronize with Customer Relationship Management (CRM) and Enterprise Resource Planning (ERP) systems to ensure consistent data across platforms.
Data Enrichment:
- Utilize internal data enrichment processes by leveraging existing data from different departments or functions within your organization.
Geocoding:
- Use geocoding to add geographical coordinates to organization addresses, which helps distinguish between entities with similar names in different locations.
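Once coordinates are in hand, the comparison itself is simple. The sketch below assumes each record has already been geocoded (no geocoding service is called here) and uses the haversine formula to decide whether two same-named records could plausibly be the same site. The 1 km cutoff and the function names are illustrative assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def probably_same_site(coords_a, coords_b, max_km=1.0):
    """True when two (lat, lon) pairs are within max_km of each other."""
    return haversine_km(*coords_a, *coords_b) <= max_km


# Two hypothetical records that share a name but sit on opposite coasts.
east = (40.0, -75.0)
west = (34.0, -118.0)
print(probably_same_site(east, west))
```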
Regular Audits and Data Cleansing:
- Conduct regular audits and data cleansing operations to maintain data quality and ensure that any ambiguities are promptly addressed.
Implementing a combination of these methods will help enhance the accuracy of organization data in your GRC and SecOps software, reducing reliance on external services – but only if you have enough manpower and whiskey to withstand the onslaught of work you are about to undertake.
Disambiguation services
Several tools can help disambiguate organizations and groups by providing detailed information and ensuring data accuracy. Here are some of the best tools for this purpose:
Clearbit
- Provides real-time data enrichment, including company details, employee information, and firmographics.
- Integrates with various platforms like CRMs, marketing tools, and custom applications.
ZoomInfo
- Offers comprehensive business intelligence data, including company insights, employee information, and market analysis.
- Features tools for lead generation, data enrichment, and account management.
Dun & Bradstreet (D&B)
- Provides extensive business information through its D-U-N-S Number system, which uniquely identifies business entities.
- Includes credit reports, risk assessments, and market intelligence.
LinkedIn Sales Navigator
- Utilizes LinkedIn's vast network to provide detailed company and professional profiles.
- Offers advanced search and filtering options to find and verify organizational details.
DiscoverOrg (now part of ZoomInfo)
- Delivers detailed company profiles, including organizational charts, technology stacks, and decision-maker contact information.
- Helps with market segmentation, lead generation, and sales prospecting.
InsideView
- Provides real-time company data, market insights, and news alerts.
- Integrates with CRM systems to enrich contact and account data.
Hoover’s (D&B Hoovers)
- Offers comprehensive business information and industry analysis.
- Includes tools for prospecting, market research, and competitive intelligence.
LeadGenius
- Combines machine learning with human intelligence to deliver accurate company data and insights.
- Focuses on data enrichment, lead generation, and market research.
Owler
- Provides crowdsourced business insights, competitive intelligence, and company profiles.
- Includes tools for tracking competitors and industry trends.
SalesIntel
- Offers verified B2B contact data, including company information and decision-maker contacts.
- Focuses on accuracy through human verification and regular data updates.
UpLead
- Provides a database of verified B2B contacts and company information.
- Features data enrichment, email verification, and lead generation tools.
Seamless.AI
- Uses artificial intelligence to deliver accurate contact and company data.
- Includes tools for prospecting, data enrichment, and CRM integration.
FullContact
- Offers contact and company data enrichment, including social profiles and demographic information.
- Integrates with various platforms for seamless data synchronization.
DataFox (Oracle)
- Provides AI-driven company insights and real-time data enrichment.
- Features tools for account scoring, market intelligence, and lead generation.
Data.com (formerly Jigsaw, now integrated with Salesforce)
- Provides business contact data and company profiles.
- Integrates seamlessly with Salesforce for data enrichment and lead generation.
These tools can significantly enhance your ability to disambiguate organizations and groups by providing detailed, accurate, and up-to-date information. Selecting the right tool will depend on your specific needs, such as integration capabilities, data accuracy, and the scope of information required.
Here are some of the caveats you’ll need to embrace and overcome if you want to track Organizations and Groups using the Common Data Format.
Caveat 1 - Multiple Industry Classification Codes
Several widely used industry classification standards help organizations and analysts categorize businesses, professionals, computer systems, and other products based on their activities and functions. These standards include the North American Industry Classification System (NAICS)1, Standard Industrial Classification (SIC)2, Global Industry Classification Standard (GICS)3, and United Nations Standard Products and Services Code (UNSPSC)4. There is also a non-governmental organization called the Institute for Public Procurement that runs the National Institute of Governmental Purchasing (NIGP)5, which purports to be a universal taxonomy for identifying commodities and services in procurement systems. In addition, both LinkedIn6 and Clearbit7 maintain their own industry classification codes.
- NAICS codes are used in the Economic Census as well as in a number of different databases (Business Source Complete, Mergent Online, Ibis World, etc. all use NAICS). While the NAICS system is more up-to-date than the SIC system, many databases still use SIC codes.
- GICS is used by investors and analysts to identify, compare, and contrast a firm's competitors.
- Many US State and foreign national entities use either the NIGP or the UNSPSC to standardize the description and classification of the goods and services purchased by those governmental organizations. This level of standardization enables a procurement division to analyze and strategically source goods and services, and to take full advantage of the electronic commerce (eCommerce) capabilities of today’s marketplace. O*NET uses the UNSPSC for categorizing products, services, and job descriptions in its employment definitions8. UNSPSC is also built into the Common Platform Enumeration (CPE)9 and Software Identification (SWID) tags10 used within Security Technical Implementation Guides11.
- Both LinkedIn and Clearbit are used in record enrichment of organizations and people for marketing and other purposes.
These classification standards typically organize industries, products, and services in a hierarchical structure with multiple levels of detail. At the top of the hierarchy are parent categories representing broader industry groups. Next, parent categories can have child categories, representing more specific industry segments. Finally, child categories can have sub-child categories, and this structure can continue with additional levels of detail as needed. Relationships between these categories can help to understand how industries are related or connected.
Let us take a closer look at each of the classification systems:
- NAICS: The North American Industry Classification System (NAICS) is a standard for classifying businesses based on economic activities developed by the United States, Canada, and Mexico. Every five years, the government updates NAICS codes, which organizations use for various purposes such as statistical analysis, regulatory compliance, and market research. The NAICS hierarchy follows a 2–6-digit code structure, with each level providing more specific industry details.
- SIC: The United States-based classification system, the Standard Industrial Classification system, categorizes businesses into industry groups based on their primary economic activities. Although NAICS has largely replaced it, some contexts still use SIC codes. The SIC system employs a 4-digit code structure. The first two digits represent major industry groups; the remaining two provide additional details.
- GICS: The Global Industry Classification Standard, developed by MSCI and S&P Global, is an industry taxonomy designed for investment research and portfolio management. GICS divides the global economy into 11 sectors, 24 industry groups, 69 industries, and 158 sub-industries. Companies are assigned a unique GICS code based on their primary business activities.
- UNSPSC: The United Nations Standard Products and Services Code is a global classification system for products and services used in eCommerce, procurement, and supply chain management. The UNSPSC hierarchy comprises segments, families, classes, and commodities, providing a standardized framework for classifying goods and services across various sectors.
- NIGP: Periscope Holdings manages and licenses the National Institute for Government Purchasing’s NIGP Commodity/Services Code and NIGP Consulting program. These offerings are designed specifically for public procurement organizations.
- LinkedIn: The professional networking platform LinkedIn uses its proprietary industry classification system to categorize companies and professionals based on their activities and sectors. While not as detailed as other classification systems, LinkedIn’s taxonomy allows for easier networking and discovery of professionals and businesses within specific industries. LinkedIn’s classification system includes over 140 industries, which users can choose from when creating or updating their profiles.
- Clearbit: Clearbit, a company specializing in data enrichment and lead generation, provides an industry classification system that helps with marketing automation, sales intelligence, and customer data management. Clearbit’s taxonomy uses multiple data sources to map companies to sectors and sub-sectors, giving a more detailed view of the market landscape. However, it is not as widely adopted as other classification systems.
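For the purely numeric systems, the parent/child hierarchy described above can be recovered from the code itself. The sketch below derives the ancestry of a NAICS code (2 to 6 digits, one level per added digit) and of an 8-digit UNSPSC code (segment, family, class, commodity, two digits each, zero-padded). The function names are ours, and the level labels follow the terminology commonly used for each system.

```python
NAICS_LEVELS = {
    2: "sector",
    3: "subsector",
    4: "industry group",
    5: "NAICS industry",
    6: "national industry",
}
UNSPSC_LEVELS = ("segment", "family", "class", "commodity")


def naics_ancestry(code: str) -> list:
    """Return [(level, prefix), ...] from sector down to the given code."""
    if not (code.isascii() and code.isdigit() and 2 <= len(code) <= 6):
        raise ValueError(f"not a NAICS code: {code!r}")
    # Note: a few sectors span several 2-digit prefixes (e.g. 31-33), so the
    # 2-digit prefix alone does not always name a unique sector.
    return [(NAICS_LEVELS[n], code[:n]) for n in range(2, len(code) + 1)]


def unspsc_ancestry(code: str) -> list:
    """Return [(level, zero-padded code), ...] from segment down to commodity."""
    if not (code.isascii() and code.isdigit() and len(code) == 8):
        raise ValueError(f"not an 8-digit UNSPSC code: {code!r}")
    return [
        (level, code[: 2 * i].ljust(8, "0"))
        for i, level in enumerate(UNSPSC_LEVELS, start=1)
    ]


print(naics_ancestry("541511"))
print(unspsc_ancestry("43211503"))
```

The same trick works for SIC (first two digits give the major group), but it breaks down for GICS-style proprietary taxonomies unless you also hold the vendor's lookup tables.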
The Problem
The problem, of course, is that each of these methodologies is designed without a great deal of thought about interaction with the others. Each is also designed without a public-input curation methodology. None of them is tied to a standardized dictionary. And none of them has a complete technique for exporting between formats.
Here is a table summarizing which classification codes each product uses:
Product | NAICS | SIC | GICS | UNSPSC | NIGP | LinkedIn | ClearBit |
---|---|---|---|---|---|---|---|
Clearbit | Yes | No | No | No | No | Yes | N/A |
ZoomInfo | Yes | Yes | No | No | No | Yes | No |
Dun & Bradstreet | Yes | Yes | No | No | No | Yes | No |
LinkedIn Sales Nav. | Yes | No | No | No | No | Yes | No |
DiscoverOrg | Yes | Yes | No | No | No | Yes | No |
InsideView | Yes | Yes | No | No | No | Yes | No |
Hoover’s (D&B Hoovers) | Yes | Yes | No | No | No | Yes | No |
LeadGenius | Yes | Yes | No | No | No | Yes | No |
Owler | Yes | Yes | No | No | No | Yes | No |
SalesIntel | Yes | Yes | No | No | No | Yes | No |
UpLead | Yes | Yes | No | No | No | Yes | No |
Seamless.AI | Yes | Yes | No | No | No | Yes | No |
Data.com (Salesforce) | Yes | Yes | No | No | No | Yes | No |
FullContact | Yes | No | No | No | No | Yes | No |
DataFox (Oracle) | Yes | Yes | No | No | No | Yes | No |
Legend
- NAICS: North American Industry Classification System
- SIC: Standard Industrial Classification
- GICS: Global Industry Classification Standard
- UNSPSC: United Nations Standard Products and Services Code
- NIGP: National Institute of Governmental Purchasing
- LinkedIn: LinkedIn Industry Categories
- ClearBit: Refers to integration with ClearBit data services
This table provides a clear overview of which classification codes each product uses or supports, helping you choose the right tool based on your specific classification needs.
The problem for GRC and SecOps tools is one of assigning an industry code to technology products that use either the CPE or SWID system to identify hardware and software. The problem comes into play when you map actual products into the SIC, NAICS, and UNSPSC systems, or any non-technical product into CPE. The CPE and SWID standards contain product names, versions, and a well-thought-out host of other metadata. However, the SIC, NAICS, and UNSPSC systems cannot accept new products because their identification schemes leave no room for them. UNSPSC, for example, allows only 100 items per sub-distinction, because each level of its code is just two digits wide. That doesn’t work.
Caveat 2 - Normalizing naming records
Names are not normalized across the GRC and SecOps records communities.
- STIGs rely on CPE organization naming conventions.
- OSCAL-based System Security Plans rely on SWID organization naming conventions which are UNSPSC based.
- STIX, TAXII, VERIS, StratML, OASIS, and the rest have their own organization and group naming conventions.
UNSPSC records have several non-normalized problems associated with them.
- The UNSPSC ID is not a primary key ID in the truest sense; meaning that if the record is moved within the hierarchy, the ID will change and will not move with the record. Therefore, a permanent, primary key ID must be associated with the current UNSPSC record so that if the record is moved in the hierarchy, the primary key ID can be associated and moved with it even though the UNSPSC ID changes.
- The UNSPSC hierarchy is not a standardized hierarchy of parent IDs related to children IDs. Currently it is simply a numeric extension of the UNSPSC ID. Therefore, an external hierarchical ID system and sort system must be put into place.
- The naming conventions outlined in https://catmaster.unspsc.org/Help/index.html are not always followed. Capitalization and pluralization must be reset.
- There is no easy differentiation between a product and a service segment, category, etc. within the UNSPSC. Services are supposed to lie within a certain numeric range. However, not all services do lie within that range, nor should they. Therefore, a hierarchical level indicator that distinguishes between products and services should be tracked as well.
- Not all records have definitions. Therefore, definitions must be added according to a standardized methodology such as the Common Asset Enumerator definition algorithm mentioned above.
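One way to picture the first two fixes is a record that carries its own permanent key and an explicit parent reference, with the UNSPSC code demoted to an ordinary, changeable attribute. This is a minimal sketch under that assumption. The class and field names are hypothetical, and the titles and codes are sample values only.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ClassificationRecord:
    title: str
    unspsc_code: str                          # may change when the record moves
    parent_key: Optional[uuid.UUID] = None    # explicit link to the parent record
    is_service: bool = False                  # product/service flag, tracked separately
    key: uuid.UUID = field(default_factory=uuid.uuid4)  # permanent primary key


def move_record(record, new_code, new_parent_key):
    """Re-home a record in the hierarchy; only the code and parent change."""
    record.unspsc_code = new_code
    record.parent_key = new_parent_key


old_parent = ClassificationRecord("Sample class A", "43211500")
new_parent = ClassificationRecord("Sample class B", "43211600")
item = ClassificationRecord("Sample commodity", "43211503", parent_key=old_parent.key)

key_before = item.key
move_record(item, "43211699", new_parent.key)
print(item.key == key_before, item.unspsc_code)
```

Anything that references the record (definitions, mappings to CPE names, audit history) can then point at `key` and survive a reclassification.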
CPE records have several non-normalized problems associated with them.
- There is no identification system, other than the CPE name associated with CPE records. There is no concept of a primary key within the CPE system. Therefore, a permanent, primary key ID must be associated with the current CPE record so that if the record is moved in the hierarchy, the primary key ID can be associated and moved with it even though the CPE ID changes.
- The CPE hierarchy is based upon a parsing of the CPE name. Therefore, an external hierarchical ID system and sort system must be put into place.
- Not all records have definitions. Therefore, definitions must be added according to a standardized methodology such as the Common Asset Enumerator definition algorithm mentioned above.
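To show what "a parsing of the CPE name" means in practice, the sketch below splits a CPE 2.3 formatted string into its eleven attributes and derives a vendor/product/version path from it. The attribute order follows the CPE 2.3 naming specification. The function names are ours, and the escape handling is simplified: it treats `\:` as an escaped colon but does not handle every quoting rule in the specification.

```python
import re

CPE23_ATTRIBUTES = (
    "part", "vendor", "product", "version", "update", "edition",
    "language", "sw_edition", "target_sw", "target_hw", "other",
)


def parse_cpe23(cpe: str) -> dict:
    """Split a CPE 2.3 formatted string into a dict of its attributes."""
    # Split on colons that are not preceded by a backslash (simplified escaping).
    fields = re.split(r"(?<!\\):", cpe)
    if len(fields) != 13 or fields[0] != "cpe" or fields[1] != "2.3":
        raise ValueError(f"not a CPE 2.3 formatted string: {cpe!r}")
    return dict(zip(CPE23_ATTRIBUTES, fields[2:]))


def hierarchy_path(cpe: str) -> tuple:
    """Derive the implied vendor > product > version path from the name."""
    attrs = parse_cpe23(cpe)
    return (attrs["vendor"], attrs["product"], attrs["version"])


name = "cpe:2.3:a:microsoft:internet_explorer:8.0.6001:beta:*:*:*:*:*:*"
print(hierarchy_path(name))
```

The organization here is nothing more than the `vendor` string, so a rename or an acquisition changes every name beneath it, which is why a separate permanent key matters.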
As for the others, there doesn’t seem to be much rhyme or reason for Organization and Group naming and we could find no naming standards for them.
UCF API as a proposed solution
An alternative is to use the UCF API.
- Uses artificial intelligence to deliver Organization and Group data that has already been cross-referenced between the various category systems. Members of the UCF team have published a peer-reviewed paper on the subject.
- Uses a suite of artificial intelligence algorithms to normalize names and produce a list of all the names the organization or group might be referenced under.
- The output of the UCF API matches the Common Data Format for both Organizations and Groups.
Footnotes
1. “NAICS & SIC Identification Tools.”
2. “SEC.Gov | Division of Corporation Finance: Standard Industrial Classification (SIC) Code List.”
3. “GICS - Global Industry Classification Standard.”
4. “UNSPSC Home.”
5. “Https.”
6. pema-s, “Industry Codes V2 - LinkedIn.”
7. “Clearbit API Documentation For Developers.”
8. “UNSPSC Reference - O*NET 20.1 Data Dictionary at O*NET Resource Center.”
9. “NVD - CPE.”
10. “NVD - SWID”; Waltermire and Cheikes, “Forming Common Platform Enumeration (CPE) Names from Software Identification (SWID) Tags.”
11. “MetaBuilder (SWID Tag Generation Java API 0.6.1 API).”