The Apption Data Assessment Tool (A-DAT) is an open source project that provides a solution for:
- Bootstrapping data science projects and recognizing over 30 data types
- Uploading CSV files to any cloud SQL Server database (or local)
- Generating an optimized data schema
- Summarizing data fields and providing data quality metrics.
For these tasks, users would traditionally need to invest in expensive solutions built with numerous functions where data transformation has a narrow focus. This solution analyzes the data file in detail and proposes a schema accompanied with informative charts and graphs for each field in the file. The users can customize and control the final database table schema. Finally, the users can upload the data from the file into a database in the cloud.
While many ETL tools already address this issue, they can require significant effort to create packages – even for simple files – and end up being a bottleneck in any data exploration or science project.
This solution provides a minimal workflow, without any data transformations, to load any unstructured file to the cloud in few simple steps and with zero development effort.
The solution can be split into four main sections:
- Connect: The first interaction with the tool allows the user to input settings that the application uses to perform its analysis. These include the source data filename, file format details, and database connection strings.
- Analyze: The initial analysis phase performs a first pass over the data, getting a high-level view of the data, and displays results to the user. Primary keys, basic types, data completion, and other statistics are determined at this phase.
- Classification: The classification step does an in-depth analysis of each field and assigns it a data type. The user can then perform the simple task of verifying which fields will be added to the schema. Sensitive data along with graphs displaying the spread of data within each field are also displayed for the user to make decisions on whether they need to override the application’s suggested data type.
- Upload: This final step allows the user to view the schema to be used in creating the database table. If the user is satisfied with the final schema, the data can then be transferred from the flat file into the database.
|DAIR Cloud Platform
|The DAIR Cloud Platform provides a web-based user interface for participants to access DAIR resources.
|The data recognition component focuses on analyzing the file column by column and assigning appropriate metadata to each column. This component will recognize the major SQL types such as varchar, int, float.
|Sensitive Data Detection
|Sensitive data detection will be used to flag any potentially sensitive data in the files such as names, addresses, etc.
|Web and Electron Interface
|The solution can be executed either inside the Electron framework as a desktop application or executed on the cloud using a standard web portal.
|Cloud Data Upload
|This component will use optimized data transfer interfaces to upload the data from the unstructured file into a cloud database.
|This report will show the result of the analysis and the results of the sensitive data analysis.
Apption Data Assessment Tool Deployment
The deployment of the solution will consist of the following steps:
- Creating a Docker Host in DAIR
- Deploying the Apption Data Assessment Tool (A-DAT) tool through Morpheus
- Deploying a SQL Server instance (optional)
Preparing for Deployment
The Apption Data Assessment Tool (A-DAT) is available in the public Docker Hub in the Apption Repository and will be used in Morpheus to deploy the solution. No additional configuration is required in Morpheus to access the Docker image.
To be able to launch a Docker image, it is necessary to create at least one Docker host. This is done in Infrastructure -> Hosts with Add Azure Docker Host is the recommended type (see figure below).
Screen 2: Configure the host
The following fields are mandatory:
- Cloud – Select the cloud service where you want the host to reside, eg. Azure-Canada
- Name – Choose any name for your Docker host, eg. MSFTdockerHost
Then click “Next”.
Screen 3: Configure Node
At least 2 Cores/4GB is recommended. 7-8GB RAM is recommended to host both SQL Server and the Apption Data Assessment Tool.
The following fields are mandatory:
- Image Type – Select ‘Default’
- Security Groups – Select your default security group
- Availability Set – Select ‘No Availability Set’ (not required)
- Plan – At minimum, select ‘Basic A3 – 4 Core, 7GB Memory’
- Resource Pool – Select your default resource pool
- Volumes (data) – 80GB
- Networks – Select your default network
- Public IP – Select ‘Assign Public IP (EIP)’
You can use the default settings in the remaining screens and click ‘Complete’ in the Review screen. At the end of this step, the Docker host is created on Morpheus and after a short time, should have a “green” health status icon present.
To create an AWS Docker host, the process is almost identical to the Azure host but instead of selecting Azure-Canada for the cloud option, the AWS-Canada option should be used.
Step 2 (Optional): Setup SQL Server on Morpheus
Microsoft SQL Server on Linux can be used as target for the Apption Data Assessment Tool (see deploying SQL Server on Linux through Docker for details)
The SQL Server Image is available on Docker Hub and doesn’t require any additional configuration on the repositories. The following screens assume that a Docker Host has already been created.
Two environment variables need to be configured on the Docker setup:
|Necessary to accept the end user agreement. Otherwise the instance will not start.
|Password for SQL Server SA Administrator
Screen 1: Create SQL Server Docker instance
This is done in Provisioning -> Instances and clicking Add. From the next dialog, select ‘Docker’ for the instance type and click ‘Next’.
The following fields are mandatory to create the SQL Server Docker Instance:
- Group – Select ‘Azure-Canada’ (alternatively select AWS-Canada)
- Cloud – Select ‘Azure’ (alternatively select AWS-Canada)
- Name – Choose any name for your SQL Server Docker instance
- Environment Tag – Select any option (Staging, Production, etc)
Click Next to configure the Docker instance (see figure below).
Screen 2: Setup instance
All above fields are mandatory:
- Layout – Select ‘Single Container’
- Plan – Select a minimum ‘4GB Memory’, ’40GB Storage’
- Volumes – ‘40GB’
- Docker Image – Enter the location of the SQL server Docker image: mcr.microsoft.com/mssql/server
- Image Version – Enter the image version of the Docker image: 2017-latest-ubuntu
- Docker Hub Registry – Select ‘Docker Hub’ (Public)
- Log Volume (default)
- Config Volume (default)
- Data Volume (default)
- Expose ports section – essentially opening up the server for other apps to connect to – for SQL, tcp port 1433 needs to be open
- Environment Variables (see below) – Enter the following:
- ACCEPT_EULA with a value Y
- SA_PASSWORD with any password you choose for your SQL Sever
Once SQL Server has been deployed, it is possible to test the connection using SQL Server Management Studio (SSMS). SSMS can be downloaded here.
The Server name has to follow the convention <IP>,<port> using the public IP and port referenced from the Instances summary screen (e.g. 220.127.116.11:10000 in the above screen). Note: Microsoft SQL Server Management Studio chose to use a comma “,” rather than a colon to seperate the IP and port number for the Server name input field shown below.
Follow the instructions on the wizard to complete the database setup.
Deploying Apption Data Assessment Tool in DAIR Cloud
While the application does a lot of heavy lifting in the back end, one of our goals was to make the front end as simple and intuitive as possible. There is no configuration required once the application is up and running; users simply need to know the location of their data files and the connection string of their database.
Once a Docker host has been created, the creating a Docker instance is easy.
The following details will be used during the configuration/deployment of the Apption Data Assessment Tool containerized application.
Docker Image: apption/adat-os.
Image Version: latest
Expose Port: 8000 (for access via the web interface)
Screen 1: Create Docker instance
This is done in Provisioning -> Instances -> Add. From the next dialogue, select ‘Docker’ for the instance type and click ‘Next’. Screen 2: Configure Docker image and ports (using configuration details provided above)
Similar to the SQL setup, all fields here are mandatory:
- Layout – Select ‘Single Container’
- Plan – Use a minimum 1GB Memory, 10GB Storage
- Volumes – ‘10GB’
- Docker Image – Use Apption’s image: apption/adat-os
- Image Version – latest
- Docker Hub Registry – Select ‘Docker Hub’ (Public)
- Log Volume (default)
- Config Volume (default)
- Data Volume (default)
- Expose Ports Section – It is important that ‘http port 8000’ is exposed for the app to run
Finally, to test the Web Interface, the HTTP link on the instance status will open a new browser window pointing to the host and port for the Docker image.
Refer to the Solution Overview for explanation of the user interface and of the four Steps.
Important note: To access the A-DAT web application, you must provision your machine’s external IP in the Azure Security Group; otherwise, the connection will be refused. [Screen shot to be added once Morpheus v3.6.4 update in production]
Working with Source Code
The following section describes how to set up your development environment should you chose to experiment with the application code, rebuild it, and deploy the new version to see the effects of your changes. To work with the source code, it must be pulled from the CANARIE repository (Gogs) and launched in Visual Studio 2019 (interactive development environment or IDE). From the IDE, the following Run options are available:
- Local Web App execution
- Standalone Electron App
The following software must be installed
1. Microsoft Visual Studio 2019
2. .NET Core 3 Tools – Available from: https://dotnet.microsoft.com/download/dotnet-core/
4. The latest Blazor Language Services extension from the Visual Studio Marketplace. Available from: https://marketplace.visualstudio.com/items?itemName=aspnet.blazor
5. Docker for Developers Available from: https://www.docker.com/get-started
6. Microsoft PowerShell
Prerequisites 1-4 must be in place to allow the Apption Data Assessment Tool to successfully build in Visual Studio.
Prerequisites 5-6 are needed to transfer the binaries from the Apption Data Assessment Tool build into a Docker image for distribution.
Creation of a Docker Image
This step assumes that the Apption Data Assessment Tool source files have already been cloned locally and the solution can be opened and built successfully within Visual Studio. Prerequisites 1-4 above must be satisfied.
The .Net core environment and Blazor extension allows a Windows executable to run on a Linux host. This allows a Docker image to be built on the default Linux OS.
We will be creating a Docker image by pulling a Microsoft-supplied Docker image that already has .Net core and Blazor pre-installed and then adding extra libraries and executables on top. The final result will be a Docker image which can be pulled and run by any Docker client.
Publishing or Building the Apption Data Assessment Tool Locally
- Right-click on the ‘WebAppMaterialize.Server’ project in Solution Explorer and click on ‘Open Folder in File Explorer’.
- This will open a FileExplorer. Drill down the ‘bin’, ‘Debug’ and ‘netcoreapp2.1’ directories.
- You will see a set of .dll, .json, .config files and a file called ‘Dockerfile’ (which has no extension).
- Click in the space just after the full directory path near the top of File Explorer and type Ctrl-C to ‘copy’ the full directory path into the clipboard.
- Go back to Visual Studio and Right-click on the ‘WebAppMaterialize.Server project in Solution Explorer and click on ‘Publish…’.
- Choose ‘FolderProfile’ then click on the ‘Configure…’ link.
- Click within the Target location box and type Ctrl-V to ‘paste’ in the directory path.
- Add ‘/publish’ (without quotes) to the end of the path and press the ‘Save’ button.
Once this process completes, return to the File Explorer. There should be a new directory called ‘publish’ containing about 50 files and three directories. This completes the local publish or build step.
Creating a Docker Image
This step creates a Docker image from the publish directory. It uses Docker client commands within a PowerShell environment.
- Launch PowerShell from the Windows Start menu.
- Confirm that Docker is correctly installed by using the ‘docker version’ command. If successful, it will report on Client and Server configuration.
- Enter ‘cd “’ (cd with one double-quote) and right-click to paste in the project directory from the previous step. You may have to return to the File Manager if this is not still in your copy buffer.
- Add ‘”’ (a closing double-quote) and press enter.
- Enter the ‘ls’ command. Verify that the file called ‘Dockerfile’ (no extension) is in the listing.
- Enter the command ‘docker build -t webappmaterialize .’ (note the trailing ‘.’)
- Docker will now pull a Docker image from Microsoft and copy the executables from the publish directory to create a new Docker image.
- Enter the ‘docker images’ command.
The last command should produce the following output:
Once the Docker image has been created successfully, it can be deployed and run from any Docker client (on your local development machine or redeployed to the Docker host in DAIR). The newly built Docker image can be executed using the following command:
‘docker run -p 8000:80 --rm -it webappmaterialize’
It will produce the following output :
Hosting environment: Production Content root path: /App Now listening on: http://[::]: Application started. Press Ctrl+C to shut down.
At this point you will be able to run A-DAT from a browser from the URL: ‘localhost:8000’
To stop this process, use Ctrl-C.
Apption Data Assessment Tool User Guide
Once successfully deployed, the application is accessible by entering its URL in a web browser. You will be directed to the A-DAT landing page, which looks like the following image.
Step 1: Connect
This is the landing page of the app where the user enters information about the source data file and target database. The “Start Analysis” button will only be enabled once a data source file has been uploaded. It is important that the user also determine what separator is used in the file and whether the file contains header information.
The target database is not mandatory but must be specified if you wish to upload the data from the file into your database. You must specify the internal database IP in the Target Connection String since you are connecting from within the cloud network. You can determine the internal IP of your DB from the Instance Summary view of your database and selecting the ‘Network’ tab as in the screen below.
The data may still be run through analysis only (not uploaded to the DB) if desired.
Once you have completed entering the required information, click the “Start Analysis” button to begin the analytics process. The following page is an example of what will be displayed once the initial analysis has completed.
Step 2: Analysis
The “Analysis” screen displays the results of the initial pass through the data. In this section, all the fields are listed on the left side of the screen. Detailed information for a selected field will be displayed on the remaining part of the screen. This can include field name, basic data type (String or Number), and the potential storage type. Statistics and sample data from the analysis are displayed on the right side for consideration where data uniqueness is also determined. The initial pass provides a simple schema should you choose to use it. Some users may wish to classify fields in more detail, and therefore would move on to the next step by clicking the “Classify Fields” button (not shown) at the bottom of the field list on the left. Once the classification analysis is complete, the following screen should appear:
Step 3: Classification
The field list on the left will now have a more accurate interpretation of each field. The application identifies the specific type of data for each field and indicates to the user if a field contains potentially sensitive information by placement of the icon in the Sensitive column. The “Data Type” drop down list contains all potential data types ranked by probability. More data statistics are introduced as well as new graphs giving a visualization of the data in each field. Scrolling down will reveal a data type probability graph as well. The image below is an example of the probability graph for a FirstName field.
At this point, the user has the ability to make any number of adjustments at the field or file level, such as:
- Change the table name for the database
- Select which fields should be imported
- Change the field names
- Change the data type if necessary
- Change the storage type (used for the schema generation)
- Toggle primary key settings for unique fields
- Select whether fields are nullable
Once the changes are made, simply click the “View Schema” button to progress to the final section of the data transport process.
Step 4: Upload
Once you have completed ‘using the application’ and uploaded your data to your database, simply close the browser tab to terminate the application. In order to free up DAIR Cloud resources, you must delete the Docker Host and Container(s) you created. To clean up both Host and Containers at once, under Infrastructure Hosts, click on the name of the Host you wish to delete, and then click on the ‘Delete’ button. Once presented with the dialogue below, select all three options, type ‘DELETE’ in the bottom field and click on the ‘DELETE’ button. After a short time, the Host and any Docker instances attached to it will be deleted.
The solution is self-contained and only requires a SQL server database to upload the selected file to a database. The schema for the database is generated by the application and there are no other database dependencies.
Security & Network
The solution relies on proven .Net core security (ASP.NET Core, Blazor) and the other security considerations depend on the network configuration of the deployment. A single port inbound is required to access the application through HTTPS. It is recommended to set up the target database within a private network to limit the open ports in the firewall.
The application server should have its firewall settings configured to allow connectivity to the database servers. For testing purposes, if the SQL server and application are running on separate Docker containers, ports 80 and 1433 would need to be exposed so that the containers are able to communicate with each other.
The solution is designed to use multiple cores and effectively use the CPUs to balance the work load. The smaller the file, the faster the processing and uploading of the file. The data upload to the database uses bulk loading to optimize the insertions. However, this mode can be changed in the code to use standard inserts and to accommodate different database configurations.
The application uses standard web infrastructure and can leverage firewalls and load balancers to increase availability using multiple servers.
User Interface (UI)
The user interface is built using Materialize (https://materializecss.com) and any desired UI changes would need to use Materialize to design these changes. It is possible to introduce other open source design interfaces such as Bootstrap, however this would prove challenging as it follows different CSS standards.
The strength of the application lies in its ability to analyze fields and determine the datatypes of these fields. Currently 30 different data recognizers are provided in the solution.
To extend the API with additional data recognizers, the users can add a new class in the recognizers project (called RecognizerTools). The instructions below allow users to add new data recognizers:
- Copy and paste the TemplateRecognizer.cs, and rename it to your desired name
- Open the your new recognizer, change the class attributes (StorageTypes) to any potential storageType of this recognizer and change the return value inside the GetDescription() function with the name of the recognizer
- Change the implemented interfaces by the characteristics of the new recognizer
- Basic Interfaces: INumberRecognizer, ILetterRecognizer, ILetterWithNumberRecognizer
- Length Interfaces: IShortStringRecognizer (0,4], IMediumStringRecognizer (4,11), ILongStringRecognizer [11,…)
- Sensitive Interface: ISensitiveRecognizer
- Must choose one of the basic interfaces, and can choose one or more than one of the length interfaces. Choose the Sensitive interface depending on the type of the recognizer
- Implement the matching algorithm inside the ValidateData() function
- (Optional) – Customize your graph data by changing the GetStatus() function and replace your own way of collecting data with IncrementStats() function
- Look into MoneyRecognizer.cs for more details
- By default, the length of the data is collected
- Add your new DataType to the DataType enumeration types (DataTools.ColumnMetadata.cs)
e.g. If you add a MoneyRecognizer, then add “Money” to the DataType enums.
- Add `RegisterRecognizer(“XXX”, DataType.XXX); ` inside the RegisterRecognizers() function (RecognizerTools.SecondPass.cs) where XXX is your previous created DataType’s name
In a development environment, it is estimated that the solution will cost around $100/month assuming that the database and web server are hosted on an EC2 T2 Large image and run continuously.
On a production environment, it will be necessary to add more nodes and enable load balancing. Each additional node (EC2 T2 Small) will add a cost of approximately $20/month.
The license of the project is The Mozilla Public License v2. All the dependent libraries from Nuget are under compatible licenses.