08 Feb 2023 Data Governance with Apache Atlas: Custom Types in Atlas (Part 3 of 3)
In the first blog post in this series we introduced you to Apache Atlas, the open-source solution for data governance, and ran through the details of its web user interface, capabilities, and architecture. The second blog post in the series looked at Atlas implementation in the leading data platform, Cloudera. Today, let’s dive even deeper and learn how to leverage Atlas to build services that are specific to our custom use case.
Apache Atlas Type Definition
We explained Entity Type in our first blog post: similar to a class, it is the specification of a metadata object, whereas an Entity in Atlas is an instance of that metadata object. As with inheritance in object-oriented programming, a custom type inherits from a parent entity type. Below we can see the basic system types that are pre-defined in Atlas:
Figure 1: Base Type System in Atlas
- Referenceable – This type represents all entities that can be found using the special qualifiedName attribute.
- Asset – This type extends Referenceable and adds attributes like name, description, and owner. Name is a required attribute (isOptional=false), whilst the others are optional. The purpose of Referenceable and Asset is to provide modellers with a way of enforcing consistency when defining and querying entities of their types. Having these fixed sets of attributes allows applications and user interfaces to make convention-based assumptions about what default attributes they can expect of types.
- DataSet – This type extends Referenceable and Asset; conceptually, it represents any type that stores data. Hive tables, Sqoop RDBMS tables, and other types all extend DataSet in Atlas. It is reasonable to assume that types that extend DataSet have a schema, in the sense that they have an attribute that describes the dataset’s attributes.
- Infrastructure – This type, which extends Referenceable and Asset, is frequently used as a supertype for infrastructure-related metadata objects such as clusters and hosts.
- Process – This type extends Referenceable and Asset; conceptually, it can represent any data transformation process. A process type has two specific attributes: inputs and outputs.
In the class diagram above, we can see that the Atlas type system is extensible. New custom types that inherit from one or more existing types can be added to the current Atlas data model.
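To make the inheritance idea concrete, here is a minimal Python sketch of how attributes accumulate down the hierarchy. It is a simplified model of the base types described above, not the Atlas implementation: the attribute lists only contain the attributes named in this article, and `my_custom_table` and `rowCount` are hypothetical names used purely for illustration.

```python
# Simplified model of the base types described above: each type lists its
# direct supertypes and the attributes it adds itself.
BASE_TYPES = {
    "Referenceable": {"superTypes": [], "attributes": ["qualifiedName"]},
    "Asset": {"superTypes": ["Referenceable"],
              "attributes": ["name", "description", "owner"]},
    "DataSet": {"superTypes": ["Asset"], "attributes": []},
    "Infrastructure": {"superTypes": ["Asset"], "attributes": []},
    "Process": {"superTypes": ["Asset"], "attributes": ["inputs", "outputs"]},
}


def inherited_attributes(type_name, types=BASE_TYPES):
    """Return all attributes of a type, supertypes first, without duplicates."""
    collected = []
    for parent in types[type_name]["superTypes"]:
        for attr in inherited_attributes(parent, types):
            if attr not in collected:
                collected.append(attr)
    for attr in types[type_name]["attributes"]:
        if attr not in collected:
            collected.append(attr)
    return collected


# A hypothetical custom type that extends DataSet picks up everything above it.
custom_types = dict(BASE_TYPES)
custom_types["my_custom_table"] = {"superTypes": ["DataSet"],
                                   "attributes": ["rowCount"]}
print(inherited_attributes("my_custom_table", custom_types))
# ['qualifiedName', 'name', 'description', 'owner', 'rowCount']
```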
Building a Custom Type
Figure 2: cp_person Entity Object
Now let’s take a detailed look at a basic model so that we can build on it and scale the implementation of a custom type. For the sake of simplicity, let’s base our use case around a very popular piece of software that most of us work with every day: Microsoft Excel.
We are all familiar with Excel. Every Excel workbook has at least one worksheet on which we proudly present tabular data. Each workbook is then saved in a folder in our file system. That’s our use case!
We have a system instance of type cp_system that is owned by a person (of type cp_person) and contains folders of type cp_folder, where we store files of type cp_file. Each file includes worksheets of type cp_worksheet, which let us store tabular data in columns of type cp_column. The entities are connected to each other, and each link has its own name.
Building an Object Model
Figure 3: cp_person Entity Object
The object model for the custom entity cp_person is shown in the figure above. In general, an entity has properties and associations; some properties, like qualifiedName, are inherited from the underlying Atlas type Referenceable. From this object model we can derive a definition in a format that the Atlas REST interface accepts. We’ll talk about the REST interface later, but for now, let’s take the JSON below, which defines an entity type:
{
  "superTypes": [ "Referenceable" ],
  "name": "Entity Type Name",
  "description": "Entity Description",
  "attributeDefs": [
    /* Attribute Definitions */
    { "name": "name", "typeName": "string", "isOptional": false, "cardinality": "SINGLE", "valuesMinCount": -1, "valuesMaxCount": 1, "isUnique": false, "isIndexable": true, "includeInNotification": false, "searchWeight": -1 }
  ],
  "relationshipDefs": [
    /* Relationship Definitions */
    { "name": "relationshipend-name", "typeName": "relationship-end-type", "isOptional": true, "cardinality": "SET", "valuesMinCount": -1, "valuesMaxCount": -1, "isUnique": false, "isIndexable": false, "includeInNotification": false, "searchWeight": -1, "relationshipTypeName": "relationship-name", "isLegacyAttribute": false }
  ]
}
Generic Entity Object structure
The definition of an entity type contains a few crucial components. In the above JSON example, we have the entity type’s name, its description, inheritance information (superTypes), attributes (attributeDefs), and relationships (relationshipDefs). An entity type can have many attributes and relationships.
Both attributeDefs, which holds the entity’s attribute definitions, and relationshipDefs, which holds the specifics of its relationships with other entities, expect a list. An attribute definition has some significant parameters, as seen in the definition above:
- name: the name of the attribute.
- typeName: the attribute’s type name.
- isOptional: if true, the attribute’s value is not required.
- cardinality: whether the attribute holds a single value or a collection (SINGLE, LIST, or SET).
- isUnique: a flag indicating whether the attribute’s value must be unique, so that it can serve as a key.
- isIndexable: a flag indicating whether the attribute should be indexed so that it can be searched.
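Writing these attribute blocks by hand gets repetitive, so a small helper can generate them. The sketch below is only an illustration: `attribute_def` is a hypothetical function name, and its defaults (including setting valuesMaxCount to 1 for SINGLE and -1 otherwise) simply mirror the values used in the template above rather than any rule taken from the Atlas documentation.

```python
import json


def attribute_def(name, type_name="string", is_optional=True,
                  cardinality="SINGLE", is_unique=False, is_indexable=True):
    """Build one attributeDefs entry using the fields shown in the template."""
    if cardinality not in ("SINGLE", "LIST", "SET"):
        raise ValueError(f"unsupported cardinality: {cardinality}")
    return {
        "name": name,
        "typeName": type_name,
        "isOptional": is_optional,
        "cardinality": cardinality,
        "valuesMinCount": -1,
        # Mirrors the template: 1 for a single value, -1 for collections.
        "valuesMaxCount": 1 if cardinality == "SINGLE" else -1,
        "isUnique": is_unique,
        "isIndexable": is_indexable,
        "includeInNotification": False,
        "searchWeight": -1,
    }


# A required string attribute, as used for a person's email address.
print(json.dumps(attribute_def("email", is_optional=False), indent=2))
```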
{
  "category": "RELATIONSHIP",
  "name": "relationship-name",
  "description": "relationship description",
  "relationshipCategory": "ASSOCIATION",
  "attributeDefs": [],
  "propagateTags": "NONE",
  "endDef1": { "type": "end1Type", "name": "relationshipend1name", "isContainer": false, "cardinality": "SET", "isLegacyAttribute": false },
  "endDef2": { "type": "end2Type", "name": "relationshipend2name", "isContainer": false, "cardinality": "SET", "isLegacyAttribute": false }
}
Similarly, a relationship definition has certain key parameters too. As in the above example, a relationship between two entities has two ends, defined in endDef1 and endDef2, which identify the two entity types involved.
In addition, we have:
- name: the relationship’s name.
- description: a description of the relationship and the entities involved in it.
- relationshipCategory: the category the relationship belongs to (ASSOCIATION, AGGREGATION, or COMPOSITION).
- propagateTags: whether classifications (tags) propagate across the relationship, and in which direction (NONE, ONE_TO_TWO, TWO_TO_ONE, or BOTH).
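The same idea works for relationship definitions. The following is a minimal sketch, assuming the field names from the template above; `relationship_end` and `relationship_def` are hypothetical helper names, and the example values are the placeholders from the template.

```python
RELATIONSHIP_CATEGORIES = ("ASSOCIATION", "AGGREGATION", "COMPOSITION")
PROPAGATE_TAGS = ("NONE", "ONE_TO_TWO", "TWO_TO_ONE", "BOTH")


def relationship_end(entity_type, attribute_name, cardinality="SET",
                     is_container=False):
    """Describe one end of a relationship (endDef1 or endDef2)."""
    return {
        "type": entity_type,
        "name": attribute_name,
        "isContainer": is_container,
        "cardinality": cardinality,
        "isLegacyAttribute": False,
    }


def relationship_def(name, description, end1, end2,
                     category="ASSOCIATION", propagate_tags="NONE"):
    """Build a relationship definition with the fields shown in the template."""
    if category not in RELATIONSHIP_CATEGORIES:
        raise ValueError(f"unknown relationshipCategory: {category}")
    if propagate_tags not in PROPAGATE_TAGS:
        raise ValueError(f"unknown propagateTags value: {propagate_tags}")
    return {
        "category": "RELATIONSHIP",
        "name": name,
        "description": description,
        "relationshipCategory": category,
        "attributeDefs": [],
        "propagateTags": propagate_tags,
        "endDef1": end1,
        "endDef2": end2,
    }


example = relationship_def(
    "relationship-name",
    "relationship description",
    relationship_end("end1Type", "relationshipend1name"),
    relationship_end("end2Type", "relationshipend2name"),
)
print(example["endDef1"]["type"], "<->", example["endDef2"]["type"])
```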
Now we should be able to understand entity definitions, so let’s attempt to define the cp_person sample type.
Figure 3 shows three associations for cp_person:
- ownerToFolders (of type cp_folder).
- ownerToFiles (of type cp_file).
- ownerToSystems (of type cp_system).
In addition to these associations, cp_person has properties such as name and email address (both of type string). We’ll design our entity definition template using the information we’ve got so far:
{
  "superTypes": [ "Referenceable" ],
  "name": "cp_person",
  "description": "The cp_person is a person working for the organization and can have access to the files and folders as a part of his work.",
  "attributeDefs": [
    { "name": "name", "typeName": "string", "isOptional": false, "cardinality": "SINGLE", "valuesMinCount": -1, "valuesMaxCount": 1, "isUnique": false, "isIndexable": true, "includeInNotification": false, "searchWeight": -1 },
    { "name": "email", "typeName": "string", "isOptional": false, "cardinality": "SINGLE", "valuesMinCount": -1, "valuesMaxCount": 1, "isUnique": false, "isIndexable": true, "includeInNotification": false, "searchWeight": -1 }
  ],
  "relationshipDefs": [
    { "name": "ownerToFolders", "typeName": "cp_folder", "isOptional": true, "cardinality": "SET", "valuesMinCount": -1, "valuesMaxCount": -1, "isUnique": false, "isIndexable": false, "includeInNotification": false, "searchWeight": -1, "relationshipTypeName": "cp_folder_owner_relationship", "isLegacyAttribute": false },
    { "name": "ownerToFiles", "typeName": "cp_file", "isOptional": true, "cardinality": "SET", "valuesMinCount": -1, "valuesMaxCount": -1, "isUnique": false, "isIndexable": false, "includeInNotification": false, "searchWeight": -1, "relationshipTypeName": "cp_file_owner_relationship", "isLegacyAttribute": false },
    { "name": "ownerToSystems", "typeName": "cp_system", "isOptional": true, "cardinality": "SET", "valuesMinCount": -1, "valuesMaxCount": -1, "isUnique": false, "isIndexable": false, "includeInNotification": false, "searchWeight": -1, "relationshipTypeName": "cp_system_owner_relationship", "isLegacyAttribute": false }
  ]
}
In our example, we set superTypes to Referenceable, so the cp_person entity type inherits the qualifiedName attribute. In the same way as for cp_person, we will also introduce the entity types cp_folder, cp_file, cp_system, cp_worksheet and cp_column. Additionally, we will create the relationships that connect these entity types:
{
  "category": "RELATIONSHIP",
  "name": "cp_folder_owner_relationship",
  "description": "relationship generic_folder_owner_assignment",
  "relationshipCategory": "ASSOCIATION",
  "attributeDefs": [],
  "propagateTags": "NONE",
  "endDef1": { "type": "cp_folder", "name": "folderOwner", "isContainer": false, "cardinality": "SET", "isLegacyAttribute": false },
  "endDef2": { "type": "cp_person", "name": "ownerToFolders", "isContainer": false, "cardinality": "SET", "isLegacyAttribute": false }
}
Relationship between Folder and Person
{
  "category": "RELATIONSHIP",
  "name": "cp_file_owner_relationship",
  "description": "The relationship between a file and person",
  "relationshipCategory": "ASSOCIATION",
  "attributeDefs": [],
  "propagateTags": "NONE",
  "endDef1": { "type": "cp_file", "name": "fileOwner", "isContainer": false, "cardinality": "SET", "isLegacyAttribute": false },
  "endDef2": { "type": "cp_person", "name": "ownerToFiles", "isContainer": false, "cardinality": "SET", "isLegacyAttribute": false }
}
Relationship between File and Person
{
  "category": "RELATIONSHIP",
  "name": "cp_system_owner_relationship",
  "description": "relationship generic_system_owner_relationship",
  "relationshipCategory": "ASSOCIATION",
  "attributeDefs": [],
  "propagateTags": "NONE",
  "endDef1": { "type": "cp_system", "name": "systemOwner", "isContainer": false, "cardinality": "SINGLE", "isLegacyAttribute": false },
  "endDef2": { "type": "cp_person", "name": "ownerToSystem", "isContainer": false, "cardinality": "SINGLE", "isLegacyAttribute": false }
}
Relationship between System and Person
Figure 4: Overall cp_person Definition in Apache Atlas
Registering Atlas Types
So far in this series we’ve learned about entity types, their attributes, and how to define an entity type with custom attributes. But these definitions are of no use until they are registered in a running Atlas environment.
Luckily, Apache Atlas offers a nice REST interface for adding entity types, entities, classifications, tags, etc. We’ll use a few of the routes mentioned below to register our custom types into our Atlas environment:
Route | Method and Description
---|---
/api/atlas/v2/types/typedefs | POST: Create new type definitions; GET: List all the definitions that are registered in Atlas
/api/atlas/v2/entity/bulk | POST: Create entities of a known type
/api/atlas/v2/types/entitydef/name/{typeName} | GET: Get an entity type definition by type name
/api/atlas/v2/relationship | POST: Create a relationship between entities
Registering Custom Types in Atlas
We created cp_person as our first entity type, and now we’re going to add it to Apache Atlas using the REST API. Even though cp_person appears to be complete, it contains references to types that do not exist yet, such as cp_file, cp_folder, and cp_system. Using what we learned from cp_person and the model below as a guide, we will define the remaining entity types:
Figure 5: Complete Object Model of Excel use case
Based on the model described above, entity definitions will be created for all linked entity types. We’ll also establish relationships. Entity and relationship definitions must be enclosed in the structure specified below in order to register all entities to Apache Atlas at once:
{
  "entityDefs": [
    { /* cp_person Definition */ },
    { /* cp_folder Definition */ … … }
  ],
  "relationshipDefs": [
    { /* cp_folder_owner_assignment */ },
    { /* cp_file_owner_assignment */ } … …
  ]
}
Now we push the complete set of definitions to Apache Atlas to register them:
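Before pushing the envelope, it can help to check that every type it refers to is actually defined, since a definition like cp_person points at other custom types. The sketch below is one possible check under stated assumptions: `build_typedefs` and `undefined_references` are hypothetical helper names, and the entity and relationship dictionaries are trimmed to the fields the check needs.

```python
BUILT_IN_TYPES = ("Referenceable", "Asset", "DataSet", "Infrastructure", "Process")


def build_typedefs(entity_defs, relationship_defs):
    """Wrap entity and relationship definitions in the registration envelope."""
    return {"entityDefs": list(entity_defs),
            "relationshipDefs": list(relationship_defs)}


def undefined_references(typedefs, known_types=BUILT_IN_TYPES):
    """List type names that are referenced but neither defined nor built in."""
    defined = {e["name"] for e in typedefs["entityDefs"]} | set(known_types)
    referenced = set()
    for entity in typedefs["entityDefs"]:
        referenced.update(entity.get("superTypes", []))
        for rel_attr in entity.get("relationshipDefs", []):
            referenced.add(rel_attr["typeName"])
    for rel in typedefs["relationshipDefs"]:
        referenced.add(rel["endDef1"]["type"])
        referenced.add(rel["endDef2"]["type"])
    return sorted(referenced - defined)


# Trimmed-down versions of the definitions shown earlier in this article.
person = {"name": "cp_person", "superTypes": ["Referenceable"]}
folder_owner = {
    "name": "cp_folder_owner_relationship",
    "endDef1": {"type": "cp_folder", "name": "folderOwner"},
    "endDef2": {"type": "cp_person", "name": "ownerToFolders"},
}

typedefs = build_typedefs([person], [folder_owner])
print(undefined_references(typedefs))  # ['cp_folder']
```

Once a cp_folder definition is added to entityDefs, the same call returns an empty list.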
Figure 6: Create Entity Definitions
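For readers who prefer a script to a REST client, the call can be sketched with Python’s standard library. The host, port, and credentials below are assumptions for a local test instance, and HTTP Basic authentication is assumed as well; secured clusters may require a different mechanism. `build_request` and `send` are hypothetical helper names.

```python
import base64
import json
import urllib.request

ATLAS_URL = "http://localhost:21000"  # assumption: adjust to your Atlas host and port
TYPEDEFS_ROUTE = "/api/atlas/v2/types/typedefs"


def build_request(path, payload=None, user="admin", password="admin"):
    """Prepare (but do not send) a request: POST with a payload, otherwise GET."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {"Authorization": f"Basic {token}", "Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    method = "GET" if data is None else "POST"
    return urllib.request.Request(ATLAS_URL + path, data=data,
                                  headers=headers, method=method)


def send(request):
    """Send a prepared request and return (HTTP status, decoded JSON body)."""
    with urllib.request.urlopen(request) as response:
        return response.status, json.loads(response.read().decode("utf-8"))


# Fill the two lists with the entity and relationship definitions from above.
typedefs = {"entityDefs": [], "relationshipDefs": []}
request = build_request(TYPEDEFS_ROUTE, typedefs)
print(request.get_method(), request.full_url)
# status, body = send(request)  # uncomment to call a running Atlas instance
```

Separating request construction from sending keeps the payload easy to inspect before anything reaches the server.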
We get the response from Atlas straight away, confirming that the types have been registered:
Figure 7: Response confirming Registration of Entity Definitions
Now we can verify their creation either by using a different REST API route, or by simply viewing them in the Atlas UI, as seen below:
Figure 8: Atlas UI displays new Entity Types
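The REST-based check can be sketched as follows. This assumes the body returned by a GET on /api/atlas/v2/types/typedefs contains an entityDefs list whose items carry a name field, matching the structure we posted; the sample response here is hand-written for illustration, not captured from a server, and both function names are hypothetical.

```python
from urllib.parse import quote


def entitydef_path(type_name):
    """Route for fetching a single entity type definition by its name."""
    return "/api/atlas/v2/types/entitydef/name/" + quote(type_name, safe="")


def missing_entity_types(expected_names, typedefs_response):
    """Return the expected type names that are absent from a typedefs listing."""
    registered = {d["name"] for d in typedefs_response.get("entityDefs", [])}
    return sorted(set(expected_names) - registered)


expected = ["cp_person", "cp_folder", "cp_file", "cp_system",
            "cp_worksheet", "cp_column"]

# Hand-written stand-in for the JSON body of a typedefs listing.
sample_response = {"entityDefs": [{"name": "cp_person"}, {"name": "cp_folder"},
                                  {"name": "cp_file"}, {"name": "cp_system"}]}

print(entitydef_path("cp_person"))
print(missing_entity_types(expected, sample_response))
# ['cp_column', 'cp_worksheet']
```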
Pushing Data to Entity Types in Atlas
The entity definitions we have registered are only templates for entity data. Next, we’ll create actual entities of these types by pushing data to Atlas:
Figure 9: Creating cp_system Entity in Atlas via the REST Interface
Figure 9a: Entity of Type cp_system created
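As a rough guide to what such a request body looks like, the sketch below builds a payload for the /api/atlas/v2/entity/bulk route. It uses cp_person because its attributes (name and email) are listed in this article; the same shape would apply to cp_system. The qualifiedName format and the sample values are assumptions, and `entity` and `bulk_payload` are hypothetical helper names.

```python
import json


def entity(type_name, qualified_name, **attributes):
    """Describe one entity instance of a registered type."""
    return {
        "typeName": type_name,
        "attributes": {"qualifiedName": qualified_name, **attributes},
    }


def bulk_payload(*entities):
    """Wrap one or more entities for the bulk creation route."""
    return {"entities": list(entities)}


# Sample values are invented for illustration only.
payload = bulk_payload(
    entity("cp_person", "jane.doe@example",
           name="Jane Doe", email="jane.doe@example.com"),
)
print(json.dumps(payload, indent=2))
```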
We will now try to create different entities by pushing JSON data to the REST interface.
Some entities refer to other entities through a relationship; in that case, the referenced entity must exist before we can create the entity that refers to it.
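One way to work out a safe creation order is a topological sort. The sketch below uses `graphlib` from the Python standard library (Python 3.9 or later). The dependency map is an assumption derived from the use case description (a system is owned by a person, folders live in a system, and so on); your own model may need a different one.

```python
from graphlib import TopologicalSorter

# Assumed dependencies: each type maps to the types that must exist first.
DEPENDS_ON = {
    "cp_person": set(),
    "cp_system": {"cp_person"},
    "cp_folder": {"cp_system", "cp_person"},
    "cp_file": {"cp_folder", "cp_person"},
    "cp_worksheet": {"cp_file"},
    "cp_column": {"cp_worksheet"},
}


def creation_order(depends_on):
    """Return the types ordered so that every dependency comes first."""
    return list(TopologicalSorter(depends_on).static_order())


print(creation_order(DEPENDS_ON))
# ['cp_person', 'cp_system', 'cp_folder', 'cp_file', 'cp_worksheet', 'cp_column']
```

If the map ever contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is a useful signal that the model needs another look.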
Push Relationships to Atlas
Let’s try to create a simple relationship between two entities. In the illustration below we establish relationships between several types; the two ends of each relationship are identified by unique attributes, such as qualifiedName.
Figure 10: Building Relationships between two Entities.
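A request body for the /api/atlas/v2/relationship route can be sketched like this, with each end identified by its type name and qualifiedName. The example uses the cp_folder_owner_relationship defined earlier, with cp_folder as end 1 and cp_person as end 2; the qualifiedName values are invented for illustration, and `object_ref` and `relationship` are hypothetical helper names.

```python
import json


def object_ref(type_name, qualified_name):
    """Reference an existing entity by its type and unique qualifiedName."""
    return {"typeName": type_name,
            "uniqueAttributes": {"qualifiedName": qualified_name}}


def relationship(type_name, end1, end2):
    """Build a relationship instance between two existing entities."""
    return {"typeName": type_name, "end1": end1, "end2": end2}


# The ends follow the order of endDef1 and endDef2 in the relationship definition.
payload = relationship(
    "cp_folder_owner_relationship",
    object_ref("cp_folder", "reports@example"),    # assumed qualifiedName
    object_ref("cp_person", "jane.doe@example"),   # assumed qualifiedName
)
print(json.dumps(payload, indent=2))
```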
A relationship is established once we have received a response status of 200 from the API call:
Figure 11: Relationships between cp_system and cp_folder.
We can confirm this by looking at the Relationships tab in the Atlas UI. It is clear that the referenced entity has a relationship set up.
We have also set up a GitHub repository for testing purposes: when you run the definitions and send example data to the appropriate API calls, you’ll see a list of entities with the necessary relationships. The screenshot below shows the entities created in Atlas after we executed all of the definitions in the repository’s full_model folder:
Figure 12: Entities created in Atlas
Conclusions
This article has shown how to create our own custom entity types in Apache Atlas. The same approach can be used to develop connectors for services that are not directly integrated with Apache Atlas, allowing the end user to build connected governance for services that are not currently part of the Cloudera stack. For more details on the code mentioned in this series, you can refer to our GitHub.
If you would like further information about this solution and how it can meet your specific needs, don’t hesitate to contact us; our certified experts will be more than happy to help and guide you along your journey through Business Intelligence, Big Data, Advanced Analytics, and more!