DAIR BoosterPacks are free, curated packages of cloud-based tools and resources about a specific emerging technology, built by experienced Canadian businesses who have built products or services using that technology and are willing to share their expertise.
This BoosterPack was created and authored by: Apption
This package introduces a user-friendly solution for:
- Analyzing unstructured data, identifying data types, and providing storage recommendations
- Identifying sensitive data such as first and last names
- Converting data from unstructured sources onto cloud (or other) SQL Server databases in a few guided steps
The Apption Data Assessment Tool is built on the .NET Core platform and can be launched in the CANARIE DAIR Cloud or executed in Electron.NET (embedded browser).
Please see the Sample Solution page for more information on the Solution including how to deploy the sample application.
The Solution showcases the following technologies: Docker, ASP.NET Core, Blazor, Electron.NET.
Machine learning and analytics in complex systems frequently require the addition of external data sets to generate new insights. These data sets are often unstructured, with a large number of columns, and sensitive data might be hidden in poorly described columns.
Today, to integrate these unstructured files, a data engineer requires many tools and significant effort to understand the data, perform QA, and load the data into a central repository. These tools are expensive and feature-rich; they include data transformation and analysis, but often with a narrow focus.
Also, if the files contain sensitive information, the environment might require specific security considerations. In Canada, PIPEDA (the Personal Information Protection and Electronic Documents Act) requires corporations to put safeguards around the handling of any personal information.
Existing ETL tools require significant effort to create packages, even for simple files, and end up being a bottleneck in any data exploration or data science project. This solution provides a simple 4-step workflow covering the most common tasks.
In addition to the application features, this solution can be used as a template to integrate with the following technologies:
- .NET Core 3 on Linux
- Docker deployment with .NET Core
- Electron.NET for packaging web pages as a standalone application
- Visual Studio Solution with common code for Docker and Electron packages
Scalable & Portable Design
The code base is designed for portability across multiple operating systems (Linux, Windows, macOS) and hosts (Docker, Electron). The underlying architecture follows patterns that enable the efficient handling of large data sets.
The API is extensible and other analyzers can be added to identify new data types.
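As an illustration of what "adding an analyzer" can look like, here is a minimal, language-neutral sketch written in Python. It is not the solution's C# API: the registry `ANALYZERS`, the `register` decorator, `detect_type`, and the individual analyzers are all hypothetical names chosen for this example. The idea is only that each analyzer is a small, self-contained check, and that recognizing a new data type means registering one more check.

```python
import re
from datetime import datetime
from typing import Callable, Dict, List

# Hypothetical registry (not the solution's API): label -> check on one cell value.
Analyzer = Callable[[str], bool]
ANALYZERS: Dict[str, Analyzer] = {}


def register(label: str) -> Callable[[Analyzer], Analyzer]:
    """Decorator that adds an analyzer to the registry under `label`."""
    def wrap(fn: Analyzer) -> Analyzer:
        ANALYZERS[label] = fn
        return fn
    return wrap


@register("integer")
def is_integer(value: str) -> bool:
    return re.fullmatch(r"[+-]?\d+", value.strip()) is not None


@register("date")
def is_date(value: str) -> bool:
    # Assumes ISO-style dates only; real files would need more formats.
    try:
        datetime.strptime(value.strip(), "%Y-%m-%d")
        return True
    except ValueError:
        return False


# Extending the tool: a new data type is one more registered function.
@register("email")
def is_email(value: str) -> bool:
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value.strip()) is not None


def detect_type(values: List[str]) -> str:
    """Return the first registered label that accepts every non-empty value."""
    cells = [v for v in values if v.strip()]
    if not cells:
        return "text"
    for label, analyzer in ANALYZERS.items():
        if all(analyzer(cell) for cell in cells):
            return label
    return "text"  # fallback when no analyzer matches the whole column
```

With this sketch, `detect_type(["2020-01-31", "1999-12-01"])` yields `"date"`, and a column mixing numbers and words falls back to `"text"`. Analyzers are tried in registration order, so more specific checks would be registered before more general ones.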
The diagram below illustrates the structure of the solution.
Reference information about the underlying technologies used in creating the solution can be found here: .NET Core, Blazor, Electron, and Docker.
The table below provides a non-comprehensive list of links to tutorials the author has found to be most useful.
|ASP.NET Core||ASP.NET Core is a cross-platform, high-performance, open-source framework for building modern, cloud-based, Internet-connected applications.|
|Electron.NET||Electron.NET (built using Electron https://electronjs.org) is a tool that lets users host .NET apps across different platforms.|
|Docker||Docker runs applications packaged as Docker images in containers on a Docker Engine. Containers are isolated from one another but share the host operating system's kernel, unlike full virtual machines.|
- Secure the Docker image
- Write secure ASP.NET Core code
- Configure sensible firewall rules and limit network access as much as possible for Docker deployments
Securing Data at Rest
- SQL Server Always Encrypted – available on all editions
- SQL Server TDE – requires Enterprise Edition
Multi-platform hybrid web application for desktop and web
Regulatory landscape for Data in Canada
- Data Residency issues https://blog.privacylawyer.ca/2011/04/cloud-computing-and-privacy-faq.html
- Encryption for sensitive data: https://www.priv.gc.ca/en/privacy-topics/privacy-laws-in-canada/the-personal-information-protection-and-electronic-documents-act-pipeda/pipeda_brief/
Tips and Traps
- Working with Blazor: This new technology has significantly simplified web development by allowing you to write all the front-end logic in C#. However, the lifecycle/rendering of the components needs to be understood for complex user interactions.
  - An example of Blazor C# can be found in the WebAppMaterialize project, in the folder components, and the subfolder pages. Any file ending in .razor will contain client-side C#.
- .NET Core: It is important to understand IoC (Inversion of Control) and Dependency Injection in order to architect the application properly and design the services.
  - In UploadController.cs in the WebAppMaterialize project, the constructor is an example of constructor injection, a type of dependency injection.
- Large File Upload: Code on both the client side and the server side was required to implement XHR file upload (the technology splits the file into chunks instead of sending one large upload). jQuery was used on the web client interface, and a custom controller was developed for the server side.
  - The upload functionality can be found in UploadController.cs in the WebAppMaterialize project, in the Controllers folder.
- Multi-threading: Reactive design with Rx.NET was used to streamline the processing pipeline across multiple threads. Configuring the scheduler threading was key to separating event processing from the UI feedback updates.
  - Multi-threading examples can be found in the StreamReadFileAsync function in FileAnalyzer.cs in the DataTools project.
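
The constructor-injection and chunked-upload tips above can be illustrated together in a small sketch. It is written in Python for brevity and is not the code in UploadController.cs: `ChunkStore`, `InMemoryChunkStore`, `UploadHandler`, and `split_into_chunks` are hypothetical names, and the real controller may track chunks differently. The sketch shows a handler that receives its storage dependency through the constructor, and a client-side helper that splits a payload into chunks, which the server side then reassembles in order.

```python
from typing import Dict, List, Protocol


class ChunkStore(Protocol):
    """Hypothetical storage abstraction that the upload handler depends on."""
    def save(self, upload_id: str, index: int, data: bytes) -> None: ...
    def count(self, upload_id: str) -> int: ...
    def assemble(self, upload_id: str) -> bytes: ...


class InMemoryChunkStore:
    """Simple implementation for the sketch; a real store might write to disk."""
    def __init__(self) -> None:
        self._chunks: Dict[str, Dict[int, bytes]] = {}

    def save(self, upload_id: str, index: int, data: bytes) -> None:
        self._chunks.setdefault(upload_id, {})[index] = data

    def count(self, upload_id: str) -> int:
        return len(self._chunks.get(upload_id, {}))

    def assemble(self, upload_id: str) -> bytes:
        parts = self._chunks[upload_id]
        # Join by chunk index so out-of-order arrival does not matter.
        return b"".join(parts[i] for i in sorted(parts))


class UploadHandler:
    # Constructor injection: the store is passed in rather than created here,
    # so it can be swapped (for example, in tests) without changing this class.
    def __init__(self, store: ChunkStore) -> None:
        self._store = store

    def receive_chunk(self, upload_id: str, index: int, total: int, data: bytes) -> bool:
        """Store one chunk; return True once all `total` chunks have arrived."""
        self._store.save(upload_id, index, data)
        return self._store.count(upload_id) == total

    def finish(self, upload_id: str) -> bytes:
        return self._store.assemble(upload_id)


def split_into_chunks(payload: bytes, chunk_size: int) -> List[bytes]:
    """Client-side step: cut the payload into pieces of at most `chunk_size` bytes."""
    return [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]


handler = UploadHandler(InMemoryChunkStore())
chunks = split_into_chunks(b"0123456789", 4)
for index, chunk in enumerate(chunks):
    done = handler.receive_chunk("upload-1", index, len(chunks), chunk)
```

After the loop, `done` is true and `handler.finish("upload-1")` returns the original bytes. In an ASP.NET Core application, the framework's built-in dependency injection container would typically supply the constructor argument instead of the manual wiring shown here.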