AWS Architect (Resilience, Testability & Scalability)

  • American IT Systems
  • New York, New York
  • Full Time

Job Title: AWS Architect/SRE (Resilience, Testability & Scalability)

Location: NYC, NY/Fort Mill, SC

Long-term Contract

Note: Looking for a seasoned AWS Architect with 12+ years of experience, who has worked in Financial/Wealth Management sectors and possesses skills in AWS Glue and Chaos Engineering/Testing.

Role Overview:

We are looking for a hands-on, technically strong Resilience, Testability & Scalability Lead to drive engineering excellence across our data platforms and cloud-based applications. This role is critical to ensuring system uptime, test automation maturity, performance at scale, and architectural resilience to meet stringent regulatory and service-level demands.

The ideal candidate will have a deep background in designing highly available systems, implementing robust disaster recovery, managing scalable cloud infrastructure, and building automated, testable, and observable platforms, especially within AWS and Kubernetes environments.

Key Responsibilities:

  • Design and implement high availability and failover strategies across multi-zone AWS deployments
  • Lead the development and execution of disaster recovery and business continuity plans, including RTO/RPO validation and cross-region strategies
  • Define testability strategies, test data management frameworks, and performance testing protocols
  • Enable infrastructure and application resilience by introducing circuit breakers, retry patterns, service meshes, and graceful degradation mechanisms
  • Establish real-time monitoring, alerting, and log aggregation frameworks using tools like CloudWatch and Prometheus
  • Drive test automation and quality engineering best practices, integrating with CI/CD pipelines
  • Optimize application and data layer performance through query tuning, caching, and indexing strategies
  • Scale data processing using distributed frameworks like Apache Spark, and implement event-driven stream processing with Kafka
  • Collaborate with platform, DevOps, and SRE teams to ensure resource efficiency, cost control, and performance SLAs
  • Contribute to regulatory readiness by enforcing security, encryption, and audit logging standards

Required Skills & Experience:

Infrastructure Resilience & DR:

  • Multi-AZ deployments, auto-scaling, load balancing, circuit breakers
  • Disaster recovery design: backup/restore, cross-region replication, RTO/RPO

Monitoring & Observability:

  • Experience with CloudWatch, Prometheus, log aggregators
  • Set up alerting for incident response, latency, throughput, and error rates

Application Resilience & Security:

  • Error handling, service degradation, exponential backoff
  • Security best practices: IAM policies, encryption at rest and in transit
  • Familiarity with FINRA/SIPC compliance standards (preferred)

Test Automation & Quality:

  • Unit testing (e.g., pytest), integration testing, E2E automation
  • Test data generation, synthetic data, environment provisioning
  • Performance testing using JMeter or Gatling; stress and capacity testing
  • Code reviews, static analysis, data validation, anomaly detection

Scalability & Optimization:

  • Horizontal scaling using Kubernetes, Docker, service discovery
  • API Gateway, caching layers (Redis, Memcached), DB partitioning
  • Connection pooling, capacity planning, cost-aware architecture

Data & Stream Processing:

  • Spark cluster management, parallel processing, big data optimization
  • Kafka-based messaging, windowing, and aggregation for real-time data

Preferred Qualifications:

  • Experience in financial services or regulated environments
  • Familiarity with enterprise data and platform modernization initiatives
  • AWS or Kubernetes certifications
  • Strong communication skills and cross-functional collaboration experience

Job ID: 483752308
Originally Posted on: 7/2/2025
