Data Hub Architecture in AWS

Ramesh Selvaraj
Oct 19, 2021

This post describes my experience creating a Data Hub architecture in AWS.

The reference architecture for the data hub in AWS is shown below.

The approach below is numbered to match the components in the architecture above.

1. Direct Connect:

• Evaluate the Direct Connect bandwidth available for the data hub project and recommend an increase if needed.

• Evaluate the bandwidth needed for both the one-time data ingestion and the incremental data ingestion.

• If the one-time ingestion volume is high and the Direct Connect bandwidth cannot support it, consider AWS Snowball as an alternative way to transfer the data.
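
The bandwidth check above comes down to simple arithmetic. Below is a minimal Python sketch of it. The function names, the 80% link utilization default, and the "maximum acceptable days" threshold are my own assumptions for illustration and should be replaced with figures from the actual project.

```python
def transfer_days(volume_tb: float, bandwidth_gbps: float, utilization: float = 0.8) -> float:
    """Estimate the days needed to move volume_tb terabytes over a network link.

    utilization is the assumed share of the link this project can really use.
    """
    if volume_tb < 0 or bandwidth_gbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("volume must be >= 0, bandwidth > 0, utilization in (0, 1]")
    bits = volume_tb * 1e12 * 8                      # decimal terabytes to bits
    effective_bps = bandwidth_gbps * 1e9 * utilization
    return bits / effective_bps / 86400              # seconds to days


def needs_offline_transfer(volume_tb: float, bandwidth_gbps: float,
                           max_days: float, utilization: float = 0.8) -> bool:
    """Return True when the link cannot finish the load within max_days."""
    return transfer_days(volume_tb, bandwidth_gbps, utilization) > max_days
```

For example, under these assumptions a 100 TB one-time load over a 1 Gbps connection takes roughly 11.6 days, so with a one-week window it would be a candidate for Snowball, while a 10 Gbps connection would finish in a little over a day.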

2. Secondary Direct Connect:

• For high availability, a backup Direct Connect connection with the same bandwidth as the primary connection is recommended.

3. SAP HANA:

• SAP HANA is installed on EC2 with a multi-client deployment.

• Memory requirements and instance sizing for HANA need to be calculated from the volume of data planned for ingestion and the incremental daily data load.

4. Qlik Replicate:

• Qlik Replicate is installed on EC2 with a multi-client deployment in auto scaling mode. It is deployed in a private subnet behind an Elastic Load Balancer.

• As an alternative approach, AWS Arch. will evaluate the feasibility of containerising Qlik in ECS with Fargate.

• The EC2 instance size and type are to be chosen based on the one-time ingestion volume and the daily ingestion volume.

5. S3 Buckets for Analytical Data Lake:

• S3 buckets with folders for the raw, format, conform, and publish layers will be created with the necessary IAM controls, client naming conventions, and client security controls.
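
As an illustration of the layer folders and IAM controls described above, here is a minimal Python sketch. It assumes a single bucket with a "client/layer/" prefix layout. That layout, the function names, and the example bucket and client names are hypothetical and not part of the reference architecture.

```python
LAYERS = ("raw", "format", "conform", "publish")


def layer_prefixes(client: str) -> dict:
    """Map each data lake layer to its S3 key prefix for one client."""
    return {layer: f"{client}/{layer}/" for layer in LAYERS}


def read_only_policy(bucket: str, client: str, layer: str) -> dict:
    """Build an IAM policy document granting read access to one client layer only."""
    prefix = layer_prefixes(client)[layer]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is limited to keys under the layer prefix.
                "Sid": "ListLayerPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Object reads are limited to the same prefix.
                "Sid": "ReadLayerObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }
```

Calling read_only_policy("datahub-lake", "acme", "publish") would give a consumer role access to the publish layer of one client without exposing the raw layer, which is one way to apply least privilege per layer.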

6. EMR:

• EMR managed scaling is used to automatically resize the cluster for the best performance at the lowest possible cost.

• EMR workloads for Spark, Hive, and Presto will run in separate clusters and be managed by Amazon Managed Workflows for Apache Airflow (MWAA).

• For high availability in Hive, the EMR cluster can be launched with multiple master nodes.

• Hudi is supported in Amazon EMR and is automatically installed when you choose Spark, Hive, or Presto when deploying your EMR cluster.
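
To make the managed scaling point above more concrete, here is a minimal Python sketch that builds the policy structure the EMR API expects. The helper name and the capacity numbers in the example are assumptions for illustration. Real limits depend on the workload of each cluster.

```python
def managed_scaling_policy(min_units: int, max_units: int, max_core_units: int) -> dict:
    """Build an EMR managed scaling policy with instance-based compute limits."""
    if not 1 <= min_units <= max_units:
        raise ValueError("expected 1 <= min_units <= max_units")
    if max_core_units > max_units:
        raise ValueError("max_core_units cannot exceed max_units")
    return {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
            # Cap core nodes so that scaling beyond this uses task nodes.
            "MaximumCoreCapacityUnits": max_core_units,
        }
    }
```

With boto3, the result can be attached to a running cluster through boto3.client("emr").put_managed_scaling_policy(ClusterId=..., ManagedScalingPolicy=policy), where the cluster ID is the one assigned to the Spark, Hive, or Presto cluster.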

7. Talend:

• Talend is installed on EC2 with a multi-client deployment in auto scaling mode.

• As an alternative approach, AWS Arch. will evaluate the feasibility of containerising Talend in ECS with Fargate.

• It is deployed in a private subnet behind an Elastic Load Balancer.

8. Amazon Managed Workflows for Apache Airflow (MWAA):

• Data hub workflows and orchestration are fully managed by MWAA.

• MWAA is a managed AWS service. Scalability, availability, and security are managed by MWAA.

• Data is encrypted using KMS.

• Role-based authentication and authorization are controlled by AWS IAM.

• As an alternative approach, AWS Arch. will evaluate the feasibility of using AWS Step Functions.

9. Collibra:

• Collibra is used in the SaaS model.

• Users are given least-privilege access, and the necessary IAM controls are to be implemented.

• Collibra SaaS connectivity is routed back through the on-premises proxies over the Direct Connect connection and then outbound to the internet.

Security:

• AWS Security Hub, Amazon GuardDuty, and Amazon Macie can be used for security monitoring, security controls, and compliance.

• The data hub also follows the security principles below.

• Data is encrypted in transit and at rest.

• Reduce blast radius

• Least privilege access

• Apply security at all layers

• Enable Traceability

• Automate security

• Prepare for an incident by having incident management and investigation policy and processes that align to your organizational requirements

• Implement detective controls by processing logs and events and by monitoring, which allows for auditing, automated analysis, and alarming.

• CloudTrail logs AWS API calls, CloudWatch provides monitoring of metrics with alarming, and AWS Config provides configuration history.
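
The principle of encrypting data in transit and at rest can be enforced on the data lake buckets with a bucket policy. Below is a minimal Python sketch that builds one. The function name and the example bucket name are hypothetical, and the choice of SSE-KMS as the required encryption is my assumption based on the use of KMS elsewhere in this design.

```python
import json


def encryption_bucket_policy(bucket: str) -> dict:
    """Build a bucket policy that denies non-TLS access and non-KMS uploads."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # In transit: reject any request not made over HTTPS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # At rest: reject uploads that do not request SSE-KMS.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
                },
            },
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON, ready to review before attaching it to a bucket.
    print(json.dumps(encryption_bucket_policy("datahub-lake"), indent=2))
```

One caveat with this sketch: the second statement also rejects uploads that omit the encryption header and rely on default bucket encryption, so ingestion tools would need to send the header explicitly or the statement would need to be relaxed.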

Ramesh Selvaraj

Enterprise Cloud Architect, Sr. Director (Cloud), AWS 5x Certified, Virtusa, London