Analysis of Cloud Providers' Use of High-End Systems

云厂商高端系统使用情况分析

Last updated: 2025-12-12 14:57:47

Comprehensive Analysis of Cloud Infrastructure Architectures: Mainframes, Engineered Systems, and Distributed Consistency

云基础设施架构综合分析:大型机、工程化系统与分布式一致性

1. Introduction: The Divergence of Computational Philosophies

1. 引言:计算哲学的分歧

The contemporary digital infrastructure landscape is characterized by a fundamental tension between two distinct architectural philosophies: the deterministic, hardware-centric model of the mainframe, and the probabilistic, software-defined model of distributed cloud computing. While the prevailing market narrative suggests a linear technological progression from the former to the latter, a rigorous technical analysis reveals a more complex reality. High-value, low-latency workloads—particularly within the financial services sector—continue to necessitate the specific consistency guarantees traditionally associated with mainframe architectures. This requirement has precipitated the emergence of "Engineered Systems" such as Oracle Exadata, which hybridize these opposing philosophies, as well as the implementation of sophisticated software-level integrity mechanisms like ZFS and Write-Ahead Logging (WAL) within cloud environments to emulate hardware-level reliability.1

当代数字基础设施格局的特点在于两种截然不同的架构哲学之间存在的根本张力:即大型机的确定性、硬件为中心的模型,与分布式云计算的概率性、软件定义的模型。尽管当前的市场叙事暗示了从前者向后者线性的技术演进,但严谨的技术分析揭示了一个更为复杂的现实。高价值、低延迟的工作负载——特别是在金融服务领域——继续需要传统上与大型机架构相关的特定一致性保证。这种需求促使了诸如Oracle Exadata等“工程化系统”的出现,这些系统混合了上述两种对立的哲学,同时也推动了在云环境中实施复杂的软件级完整性机制(如ZFS和预写式日志WAL),以模拟硬件级的可靠性 1。

This report provides an exhaustive, bilingual examination of these competing paradigms. It analyzes the granular economic models that drive architectural decisions, the divergence in fault tolerance strategies between hardware and software, the specific implementation of hybrid systems in banking, and the mathematical realities of distributed data consistency. By deconstructing these elements, we elucidate how modern cloud providers and enterprises are navigating the trade-offs between the "digital vault" security of mainframes and the elastic scalability of the cloud.

本报告提供了对这些竞争范式的详尽双语审查。它分析了驱动架构决策的精细经济模型、硬件与软件在容错策略上的分歧、混合系统在银行业中的具体实施,以及分布式数据一致性的数学现实。通过解构这些要素,我们阐明了现代云提供商和企业如何在大型机的“数字金库”式安全性与云的弹性可扩展性之间权衡取舍。

2. The Economic and Operational Divergence: Cloud vs. Mainframe Models

2. 经济与运营的分歧:云与大型机模型

The transition from mainframe computing to cloud infrastructure is frequently framed as a cost-reduction strategy; however, the underlying economic models represent fundamentally incompatible methods of valuing computation. A profound understanding of these billing structures is a prerequisite for architectural comprehension, as the technical architecture often mirrors the billing model—a manifestation of Conway’s Law applied to economics.

从大型机计算向云基础设施的过渡通常被构建为一种降低成本的策略;然而,底层的经济模型代表了根本上不兼容的计算估值方法。深刻理解这些计费结构是理解架构的前提,因为技术架构往往反映了计费模型——这是康威定律在经济学上的体现。

2.1 The Hyper-Granularity of Cloud Billing Mechanisms

2.1 云计费机制的超细颗粒度

Modern cloud billing has evolved far beyond simplistic "per hour" compute charges, introducing a level of granularity that effectively monetizes every discrete Input/Output (I/O) operation. As detailed in the Atomic Cloud Billing Guide and other provider terms, the decomposition of services into billable micro-units creates a complex financial landscape.4

现代云计费已经远远超越了简单的“按小时”计算费用,引入了有效地将每一次离散的输入/输出(I/O)操作货币化的颗粒度水平。正如Atomic Cloud计费指南和其他提供商条款中所详述的,服务被分解为可计费的微单元,创造了一个复杂的财务图景 4。

One of the most significant differentiators is the billing of I/O throughput. Providers explicitly charge for "Virtual Server I/O per Million," meaning that bills are calculated per million I/O operations performed during the billing cycle. This contrasts sharply with on-premise hardware, where I/O capacity is a fixed capital cost. Consequently, a "chatty" application that performs frequent small reads and writes—typical of legacy mainframe designs—can generate exorbitant costs in a cloud environment due to this metric.5

最显著的区别之一是对I/O吞吐量的计费。提供商明确收取“每百万次虚拟服务器I/O”的费用,这意味着账单是按照计费周期内执行的I/O操作次数(以百万次为单位)来计算的。这与本地硬件形成鲜明对比,后者的I/O容量是固定的资本成本。因此,一个执行频繁小规模读写操作的“健谈”应用程序——这是传统大型机设计的典型特征——由于这一指标,可能会在云环境中产生极其高昂的成本 5。

Furthermore, network egress is rigorously metered. "Firewall Egress Traffic" is billed based on the gigabytes of data that exit the provider's firewall. This creates a distinct economic incentive known as "data gravity," compelling organizations to keep data resident within the provider's ecosystem to avoid these fees. Unlike a mainframe's internal bus or a private data center's dark fiber, where moving data between a core ledger and a reporting system incurs no marginal cost, cloud architectures penalize data mobility across boundaries.5

此外,网络出口流量受到严格计量。“防火墙出口流量”是根据离开提供商防火墙的数据千兆字节数来计费的。这创造了一种被称为“数据引力”的独特经济激励,迫使组织将数据保留在提供商的生态系统中以避免这些费用。与大型机的内部总线或私有数据中心的暗光纤不同(在这些环境中,在核心账本和报告系统之间移动数据不产生边际成本),云架构会对跨边界的数据移动进行惩罚 5。

The complexity extends to storage and disaster recovery (DR) models. Backup storage is often aggregated hourly, with values summed and divided by the hours in the period to determine an average usage. More critically, "DR Reservation" fees introduce a sophisticated insurance-like model. Providers may charge a reservation fee—calculated as a percentage (e.g., 20%) of the production CPU, RAM, and storage costs—merely to hold the capacity in reserve. This is separate from the costs incurred during actual data transfer or I/O consumption in the DR environment. This decoupling of "reservation" from "utilization" forces architects to model financial risk alongside technical risk.5

这种复杂性延伸到了存储和灾难恢复(DR)模型。备份存储通常按小时聚合,将数值求和并除以期间的小时数以确定平均使用量。更为关键的是,“DR预留”费用引入了一种复杂的类保险模型。提供商可能会收取预留费——按生产CPU、RAM和存储成本的百分比(例如20%)计算——仅仅是为了保留容量。这是独立于在DR环境中实际数据传输或I/O消耗所产生的费用的。这种“预留”与“使用”的脱钩迫使架构师在建模技术风险的同时也要建模财务风险 5。
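
To make the interaction of these metered dimensions concrete, the sketch below computes a hypothetical monthly charge from the units described above: I/O billed per million operations, egress billed per gigabyte, backup storage averaged hourly, and a DR reservation taken as 20% of production costs. All rates and usage figures are invented for illustration and do not correspond to any provider's price list.

```python
# Illustrative cost model for the billing units described above.
# All rates and usage figures are hypothetical placeholders.

HOURS_IN_MONTH = 730

def average_backup_gb(hourly_samples):
    """Backup storage billed as the average of hourly measurements."""
    return sum(hourly_samples) / len(hourly_samples)

def monthly_bill(io_operations, egress_gb, backup_hourly_gb,
                 prod_cpu_cost, prod_ram_cost, prod_storage_cost):
    # "Virtual Server I/O per Million": charged per million operations.
    io_cost = (io_operations / 1_000_000) * 0.05               # $ per million I/O (assumed)

    # "Firewall Egress Traffic": charged per GB leaving the provider.
    egress_cost = egress_gb * 0.09                              # $ per GB egress (assumed)

    # Backup storage: hourly samples summed and averaged over the period.
    backup_cost = average_backup_gb(backup_hourly_gb) * 0.02    # $ per GB-month (assumed)

    # "DR Reservation": a percentage of production compute/storage cost,
    # charged even if the DR site is never activated.
    dr_reservation = 0.20 * (prod_cpu_cost + prod_ram_cost + prod_storage_cost)

    return io_cost + egress_cost + backup_cost + dr_reservation

if __name__ == "__main__":
    hourly_backup = [500 + (h % 24) * 2 for h in range(HOURS_IN_MONTH)]  # GB per hour
    total = monthly_bill(
        io_operations=4_000_000_000,   # a "chatty" workload: 4 billion I/Os per month
        egress_gb=1_200,
        backup_hourly_gb=hourly_backup,
        prod_cpu_cost=3_000, prod_ram_cost=1_500, prod_storage_cost=900,
    )
    print(f"Estimated monthly bill: ${total:,.2f}")
```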

2.2 The Deterministic Mainframe Economic Model

2.2 确定性的大型机经济模型

In contrast to the granular, consumption-based cloud model, the mainframe economic model has historically revolved around metrics like "MIPS" (Millions of Instructions Per Second) or "MSU" (Millions of Service Units). While cloud billing is elastic and variable (OpEx), mainframe billing is often deterministic and capacity-based. The mainframe acts as a "digital vault"—a substantial upfront investment that offers predictable performance for high-volume transactions without the variability of micro-transactional billing.2

与精细的、基于消费的云模型相反,大型机经济模型历史上一直围绕着诸如“MIPS”(每秒百万条指令)或“MSU”(百万服务单元)等指标。虽然云计费是弹性且可变的(运营支出OpEx),但大型机计费通常是确定性的且基于容量的。大型机充当“数字金库”——这是一笔巨大的前期投资,为大容量事务提供可预测的性能,而没有微交易计费的变动性 2。

The friction encountered when migrating from mainframe to cloud often stems from this billing mismatch. A mainframe workload is optimized for massive I/O throughput and CPU density within a single architectural footprint. "Lifting and shifting" such a workload to the cloud often results in significant financial inefficiencies because the application logic is not designed to minimize the granular I/O, storage retrieval, and egress costs inherent to the cloud model. Reports indicate that up to 94% of enterprises overspend in the cloud due to ineffective management of these resources and a lack of understanding of cost allocation tags.6

从大型机迁移到云时遇到的摩擦通常源于这种计费不匹配。大型机工作负载针对单一架构足迹内的大规模I/O吞吐量和CPU密度进行了优化。将此类工作负载“直接迁移”到云端通常会导致显著的财务低效,因为应用程序逻辑的设计并未考虑到最小化云模型固有的精细I/O、存储检索和出口成本。报告显示,高达94%的企业因对这些资源管理无效且缺乏对成本分摊标签的理解而在云端超支 6。

3. Philosophies of Fault Tolerance: Hardware vs. Software

3. 容错哲学:硬件与软件

The defining differentiator between mainframe and cloud infrastructure extends beyond physical location (on-premise versus remote) to the fundamental philosophy of failure management. These architectures represent opposing views on where the responsibility for reliability should reside: within the specialized hardware or distributed across the software stack.

大型机和云基础设施之间的决定性区别不仅仅在于物理位置(本地与远程),还延伸到了故障管理的根本哲学。这些架构代表了关于可靠性责任应归于何处的对立观点:是归于专用硬件内部,还是分布在软件栈中。

3.1 Hardware-Centric Fault Tolerance: The Mainframe Approach

3.1 硬件为中心的容错:大型机方法

Mainframes are engineered around a philosophy of "Hardware Reliability." The system is designed to mask faults at the physical level, ensuring that the operating system and application layers remain oblivious to underlying component failures. This approach classifies faults based on duration—transient versus permanent—rather than source.7

大型机是围绕“硬件可靠性”的哲学构建的。该系统的设计旨在物理层面上掩盖故障,确保操作系统和应用层对底层组件的故障一无所知。这种方法根据持续时间——瞬态与永久——而不是来源来对故障进行分类 7。

Transient faults, such as a bit flip caused by cosmic radiation or a momentary voltage spike, are handled through hardware-level masking or voting logic. For example, redundant processors may execute the same instruction simultaneously; if their results diverge, the system logic votes to determine the correct outcome without interrupting the workflow. Permanent faults trigger a transparent system reconfiguration, bypassing the failed component instantaneously.7

瞬态故障,例如由宇宙辐射引起的比特翻转或瞬间电压尖峰,通过硬件级掩蔽或投票逻辑来处理。例如,冗余处理器可以同时执行相同的指令;如果它们的结果出现分歧,系统逻辑会投票决定正确的结果,而不会中断工作流。永久性故障会触发透明的系统重配置,瞬间绕过故障组件 7。
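
As a software analogue of the voting logic described above, the sketch below implements a simple majority vote over redundant execution units. Real mainframes perform this masking in dedicated circuitry; the Python model is purely conceptual.

```python
from collections import Counter

def majority_vote(results):
    """Return the value agreed upon by a majority of redundant units.

    Models the voting logic of redundant processors: a single diverging
    result (a transient fault) is outvoted and masked from the caller.
    """
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority: permanent fault suspected, reconfigure")
    return value

def redundant_execute(instruction, units):
    """Run the same instruction on every redundant unit and vote on the outcome."""
    return majority_vote([unit(instruction) for unit in units])

if __name__ == "__main__":
    healthy = lambda x: x * 2
    flipped = lambda x: (x * 2) ^ 1   # simulates a transient bit flip in one unit
    # The faulty unit is outvoted; the caller never observes the fault.
    print(redundant_execute(21, [healthy, healthy, flipped]))  # -> 42
```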

This architecture allows mainframes to operate as "digital vaults," offering ultra-secure and predictable environments for workloads where consistency is paramount. The reliability is such that mainframe uptime is frequently measured in decades. This is achieved because the hardware itself—including processors, memory modules, and I/O channels—is deeply redundant. If a primary processor fails, a backup processor assumes control instantly, preserving the system state and preventing a crash of the operating system or application.2

这种架构使得大型机能够像“数字金库”一样运作,为一致性至关重要的工作负载提供超安全和可预测的环境。其可靠性之高,以至于大型机的正常运行时间经常以十年为单位来衡量。这是因为硬件本身——包括处理器、内存模块和I/O通道——具有深度的冗余。如果主处理器出现故障,备用处理器会立即接管控制权,保存系统状态并防止操作系统或应用程序崩溃 2。

3.2 Software-Centric Fault Tolerance: The Cloud Approach

3.2 软件为中心的容错:云方法

Conversely, cloud computing operates on a philosophy of "Software Fault Tolerance." In this paradigm, the underlying hardware is treated as commoditized and inherently unreliable. Reliability is not a property of the individual component but an emergent property of the software distributed across massive clusters of these unreliable components.9

相反,云计算基于“软件容错”的哲学运作。在这个范式中,底层硬件被视为商品化的且本质上不可靠。可靠性不是单个组件的属性,而是分布在大量这些不可靠组件集群之上的软件的涌现属性 9。

Cloud providers such as AWS, Azure, and Google Cloud operate strictly distributed models where reliability is achieved through redundancy across availability zones. If a physical server fails, the orchestration software detects the failure and migrates the workload to a healthy server. This model necessitates a shift in application logic known as the "Cattle vs. Pets" paradigm. In the mainframe model, servers are "pets"—unique, indispensable, and nursed back to health. In the cloud model, servers are "cattle"—replaceable and interchangeable. Applications must be "cloud-native," capable of handling connection drops, latency spikes, and node failures gracefully without data loss.2

AWS、Azure和Google Cloud等云提供商运营严格的分布式模型,通过跨可用区的冗余来实现可靠性。如果物理服务器出现故障,编排软件会检测到故障并将工作负载迁移到健康的服务器。这种模型需要应用程序逻辑发生转变,即所谓的“牛与宠物”范式。在大型机模型中,服务器是“宠物”——独特、不可或缺且需要精心护理。在云模型中,服务器是“牛”——可替代且可互换。应用程序必须是“云原生”的,能够优雅地处理连接中断、延迟峰值和节点故障,而不会丢失数据 2。

While mainframes offer vertical scalability (adding more power to a single node) and determinism, the cloud offers horizontal scalability (adding more nodes). The trade-off is that distributed systems introduce non-determinism—network delays and eventual consistency issues that mainframes generally avoid. Security in the cloud is based on a "Shared Responsibility Model," where the provider secures the infrastructure (data centers, power, hardware), but the customer is responsible for securing the applications and data configurations built on top.2

虽然大型机提供垂直可扩展性(向单个节点添加更多算力)和确定性,但云提供水平可扩展性(添加更多节点)。权衡之处在于分布式系统引入了非确定性——网络延迟和最终一致性问题,这通常是大型机所避免的。云中的安全性基于“责任共担模型”,提供商负责基础设施(数据中心、电力、硬件)的安全,但客户负责保护构建在其上的应用程序和数据配置 2。

3.3 The Nuance of Resilience vs. Fault Tolerance

3.3 弹性与容错的细微差别

There is a technically significant distinction between "Fault Tolerance" and "Resilience" in this context.

在这种语境下,“容错”与“弹性”之间存在技术上显著的区别。

Fault Tolerance: The system continues to operate without interruption or loss of performance despite the occurrence of a fault. The end-user remains completely unaware of the issue. This is the operational ideal of the mainframe.12

容错 (Fault Tolerance): 尽管发生故障,系统仍能不中断且不损失性能地继续运行。最终用户对问题完全不知情。这是大型机的运营理想 12。

Resilience: The system adapts to the error, maintaining service while potentially accepting some impact on performance or requiring a retry mechanism (e.g., a slower response time while a replica is promoted). This is the operational reality of the cloud.12

弹性 (Resilience): 系统适应错误,维持服务,但可能接受一定的性能影响或需要重试机制(例如,在副本被提升期间响应时间变慢)。这是云的运营现实 12。

For banking core systems and high-frequency trading environments, "Resilience"—where a transaction might fail and require retrying—is often unacceptable compared to "Fault Tolerance," where the transaction completes despite hardware failure. This distinction explains the persistence of mainframes in critical financial sectors.1

对于银行核心系统和高频交易环境而言,“弹性”——即交易可能失败并需要重试——与“容错”(交易尽管硬件故障仍能完成)相比通常是不可接受的。这一区别解释了大型机在关键金融领域持续存在的原因 1。

4. The Hybrid Apex: Oracle Exadata and Engineered Systems

4. 混合巅峰:Oracle Exadata与工程化系统

Between the monolithic mainframe and the distributed commodity cloud lies the "Engineered System." Oracle Exadata is the preeminent example of this category, effectively functioning as the "Mainframe of the Cloud." It attempts to integrate the hardware-software cohesion of the mainframe into the x86 ecosystem to solve the performance bottlenecks of distributed systems.

在单体大型机和分布式商品云之间存在着“工程化系统”。Oracle Exadata是这一类别的卓越代表,有效地充当了“云端大型机”。它试图将大型机的硬件-软件内聚性整合到x86生态系统中,以解决分布式系统的性能瓶颈。

4.1 Architecture of Exadata: Hardware-Software Co-Design

4.1 Exadata架构:硬件-软件协同设计

Exadata is not merely a server; it is a comprehensive database machine composed of database servers and intelligent storage servers connected by a high-speed, low-latency fabric, specifically RDMA over Converged Ethernet (RoCE) or InfiniBand.14

Exadata不仅仅是一台服务器;它是一台综合的数据库机器,由数据库服务器和通过高速低延迟网络连接的智能存储服务器组成,具体使用的是基于融合以太网的RDMA(RoCE)或InfiniBand 14。

The critical advantage lies in the integration: the hardware (servers, networking, storage) and software (Oracle Database, Exadata System Software) are engineered together. This allows for optimizations that are impossible in generic cloud environments where the database software is agnostic to the underlying hardware. Exadata is available in multiple deployment models, including on-premise, "Cloud@Customer" (where the cloud hardware resides in the client's data center), and as a public cloud service (Exadata Cloud Service). This flexibility allows financial institutions like Fibabanka and LALUX to maintain strict data sovereignty and compliance while accessing cloud operating models.15

关键优势在于集成:硬件(服务器、网络、存储)和软件(Oracle数据库、Exadata系统软件)是协同工程化的。这使得在通用云环境中不可能实现的优化成为可能,因为在通用云中,数据库软件对底层硬件是一无所知的。Exadata提供多种部署模式,包括本地部署、“Cloud@Customer”(云硬件驻留在客户数据中心)以及作为公共云服务(Exadata Cloud Service)。这种灵活性使得像Fibabanka和LALUX这样的金融机构能够在保持严格的数据主权和合规性的同时,访问云运营模型 15。

4.2 Key Technology: SQL Offloading and Smart Scan

4.2 关键技术:SQL卸载与智能扫描

The defining feature of Exadata is "SQL Offloading" via "Smart Scan," a mechanism designed to eliminate the I/O bottlenecks inherent in traditional architectures. In a standard system, the database server requests blocks of data from the storage array, pulls them across the network, and then filters them in memory to find the relevant rows. This creates massive data movement overhead.17

Exadata的标志性特征是通过“智能扫描”实现的“SQL卸载”,这是一种旨在消除传统架构中固有的I/O瓶颈的机制。在标准系统中,数据库服务器从存储阵列请求数据块,通过网络拉取它们,然后在内存中过滤它们以找到相关行。这产生了巨大的数据移动开销 17。

The Smart Scan Solution: Exadata pushes the SQL processing logic down to the storage cells. The database server sends the query predicate (e.g., WHERE amount > 200) directly to the storage server. The storage server filters the data locally and sends only the matching rows back to the database server. This dramatically reduces network traffic and CPU load on the database nodes.17

智能扫描解决方案: Exadata将SQL处理逻辑下推到存储单元。数据库服务器将查询谓词(例如 WHERE amount > 200)直接发送给存储服务器。存储服务器在本地过滤数据,并仅将匹配的行发送回数据库服务器。这极大地减少了网络流量和数据库节点的CPU负载 17。

Specifically, Smart Scan performs several offloaded functions:

具体而言,智能扫描执行多项卸载功能:

Predicate Filtering: Evaluating conditions like amount > 200 at the storage layer.19

Column Filtering: Implementing "Projection," where only the requested columns (e.g., SELECT customer_name) are returned, rather than the entire row.19

Decryption Offload: CPU-intensive decryption tasks are handled by the storage processors, freeing up database server resources for transaction processing.19

Hybrid Columnar Compression (HCC): Decompression runs in the Exadata Storage Servers, allowing scans on compressed data to proceed rapidly.19

Administrators can control this feature via initialization parameters. The parameter CELL_OFFLOAD_PROCESSING enables or disables the feature, while CELL_OFFLOAD_PLAN_DISPLAY controls whether the SQL EXPLAIN PLAN output explicitly shows the predicates being evaluated by the storage cells (labeled as "STORAGE" predicates).17

管理员可以通过初始化参数控制此功能。参数 CELL_OFFLOAD_PROCESSING 启用或禁用该功能,而 CELL_OFFLOAD_PLAN_DISPLAY 控制SQL EXPLAIN PLAN输出是否显式显示由存储单元评估的谓词(标记为“STORAGE”谓词)17。
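
The sketch below contrasts a conventional scan, in which every block is pulled to the database server before filtering, with a Smart Scan-style pushdown in which predicate and column filtering run where the data resides. It models only the data-movement difference; the table, rows, and column names are invented, and this is not Oracle code.

```python
# Conceptual model of Smart Scan offloading: filtering happens where the
# data lives, so only matching rows and columns cross the interconnect.
# The table, columns, and predicate below are illustrative only.

TRANSACTIONS = [  # rows as they would sit in the storage cells
    {"txn_id": i, "customer_name": f"cust-{i}", "amount": (i * 37) % 500}
    for i in range(100_000)
]

def traditional_scan(rows, predicate, columns):
    """Database server pulls every block, then filters in its own memory."""
    shipped = list(rows)                       # everything crosses the network
    matching = [r for r in shipped if predicate(r)]
    return [{c: r[c] for c in columns} for r in matching], len(shipped)

def smart_scan(rows, predicate, columns):
    """Storage cell applies the predicate and projection locally."""
    shipped = [{c: r[c] for c in columns} for r in rows if predicate(r)]
    return shipped, len(shipped)               # only matches cross the network

if __name__ == "__main__":
    pred = lambda r: r["amount"] > 200         # WHERE amount > 200
    cols = ["customer_name"]                   # SELECT customer_name
    _, shipped_full = traditional_scan(TRANSACTIONS, pred, cols)
    _, shipped_smart = smart_scan(TRANSACTIONS, pred, cols)
    print(f"rows shipped without offload: {shipped_full}")
    print(f"rows shipped with offload:    {shipped_smart}")
```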

4.3 High Availability via MAA (Maximum Availability Architecture)

4.3 通过MAA(最大可用性架构)实现高可用性

Oracle's "Maximum Availability Architecture" (MAA) on Exadata is designed to mimic mainframe reliability standards. It utilizes Real Application Clusters (RAC) for active-active database nodes and Automatic Storage Management (ASM) for mirroring data across storage cells. If a storage cell fails, ASM automatically rebalances data to restore redundancy without downtime. For disaster recovery, it integrates with Data Guard to replicate data to a standby site, ensuring that the "Single Point of Failure" is eliminated at every tier.21

Oracle在Exadata上的“最大可用性架构”(MAA)旨在模拟大型机的可靠性标准。它利用Real Application Clusters (RAC)实现双活数据库节点,并利用Automatic Storage Management (ASM)在存储单元之间镜像数据。如果存储单元发生故障,ASM会自动重新平衡数据以恢复冗余,且无需停机。在灾难恢复方面,它与Data Guard集成,将数据复制到备用站点,确保在每一层消除“单点故障” 21。

5. Distributed Data Consistency: The Challenge of the "Single Writer"

5. 分布式数据一致性:“单一写入者”的挑战

While Engineered Systems solve I/O bottlenecks through hardware integration, the distributed nature of the cloud creates profound challenges in data consistency. In a mainframe, memory is coherent and access is centralized. In a distributed cloud, the "truth" is fragmented across regions, necessitating complex consensus mechanisms.

虽然工程化系统通过硬件集成解决了I/O瓶颈,但云的分布式特性在数据一致性方面带来了深刻的挑战。在大型机中,内存是一致的,访问是集中的。在分布式云中,“真相”碎片化地分布在各个区域,需要复杂的共识机制。

5.1 The Single Writer Principle and Consistency Models

5.1 单一写入者原则与一致性模型

To maintain strong consistency in a distributed system, architects often revert to the "Single Writer Principle." This principle dictates that to avoid conflicts, only one actor (thread, node, or region) is allowed to write to a specific piece of data at any given time. This effectively serializes updates, ensuring a clean, linear history without the complexity of merge conflicts.23

为了在分布式系统中维持强一致性,架构师经常回归“单一写入者原则”。该原则规定,为了避免冲突,在任何给定时间,只允许一个参与者(线程、节点或区域)写入特定的数据片段。这有效地序列化了更新,确保了清晰、线性的历史记录,而没有合并冲突的复杂性 23。

In multi-region database deployments, such as Amazon Aurora Global Database, a common design pattern implementing this principle is "Read Local, Write Global." In this architecture, applications in all regions perform reads locally to minimize latency, but all writes are forwarded to a primary "writer" region. This writer node replicates changes to the read-only replicas in other regions. While this introduces latency for write operations, it prevents "split-brain" scenarios where two regions act as primary and accept conflicting updates simultaneously.24

在多区域数据库部署中,例如Amazon Aurora全球数据库,实施这一原则的常见设计模式是“本地读,全局写”。在这种架构中,所有区域的应用程序在本地执行读取以最小化延迟,但所有写入都被转发到主“写入”区域。该写入节点将更改复制到其他区域的只读副本。虽然这为写入操作引入了延迟,但它防止了“脑裂”场景,即两个区域同时充当主节点并接受冲突的更新 24。
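
A minimal sketch of the "Read Local, Write Global" pattern follows, assuming a hypothetical client-side router with one designated writer region and read replicas elsewhere. The class and region names are placeholders, not Aurora APIs, and replication is modeled synchronously for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class RegionalRouter:
    """Single-writer routing: reads stay local, all writes go to one region."""
    local_region: str
    writer_region: str
    # Placeholder in-memory stores standing in for regional database endpoints.
    replicas: dict = field(default_factory=dict)

    def read(self, key):
        # Read from the local, possibly slightly stale replica (low latency).
        return self.replicas.setdefault(self.local_region, {}).get(key)

    def write(self, key, value):
        # Every write is forwarded to the single writer region, the only place
        # where conflicting updates can be serialized.
        primary = self.replicas.setdefault(self.writer_region, {})
        primary[key] = value
        self._replicate(key, value)

    def _replicate(self, key, value):
        # Asynchronous replication to read-only regions (modeled synchronously here).
        for region, store in self.replicas.items():
            if region != self.writer_region:
                store[key] = value

if __name__ == "__main__":
    router = RegionalRouter(local_region="eu-west-1", writer_region="us-east-1",
                            replicas={"us-east-1": {}, "eu-west-1": {}})
    router.write("account:42", {"balance": 100})   # crosses regions to the writer
    print(router.read("account:42"))                # served from the local replica
```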

Systems like Azure Cosmos DB offer a spectrum of consistency levels to bridge the gap between strong consistency and performance. "Session Consistency," for instance, guarantees that within a single client session, the user typically experiences "Read-Your-Writes" consistency. This assumes a single writer session or the sharing of a session token, providing a predictable experience for the user while allowing the backend to remain eventually consistent across the broader cluster.25

像Azure Cosmos DB这样的系统提供了一系列一致性级别,以弥合强一致性与性能之间的差距。例如,“会话一致性”保证在单个客户端会话内,用户通常体验到“读己之写”的一致性。这假设是单个写入者会话或共享会话令牌,为用户提供可预测的体验,同时允许后端在更广泛的集群中保持最终一致性 25。
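
Session consistency is typically enforced with a session token recording the last write position the client has observed; a replica that has not yet applied that position cannot serve the read. The sketch below models this handshake with an invented log sequence number and is an illustration of the idea, not Cosmos DB's actual protocol.

```python
class Primary:
    """The single writer; every write advances a log sequence number (LSN)."""
    def __init__(self):
        self.lsn = 0
        self.data = {}

    def write(self, key, value):
        self.lsn += 1
        self.data[key] = value
        return self.lsn                       # token handed back to the client

class Replica:
    """A read replica that has applied the log up to `applied_lsn`."""
    def __init__(self):
        self.applied_lsn = 0
        self.data = {}

    def catch_up(self, primary, up_to):
        # Toy replication: copy the primary's state up to a log position.
        self.data = dict(primary.data)
        self.applied_lsn = up_to

class SessionClient:
    """Session consistency: reads must reflect this session's own writes."""
    def __init__(self, primary, replica):
        self.primary, self.replica = primary, replica
        self.token = 0                        # session token = last observed LSN

    def write(self, key, value):
        self.token = self.primary.write(key, value)

    def read(self, key):
        if self.replica.applied_lsn >= self.token:
            return self.replica.data.get(key)       # replica is fresh enough
        return self.primary.data.get(key)           # otherwise read through to primary

if __name__ == "__main__":
    p, r = Primary(), Replica()
    session = SessionClient(p, r)
    session.write("profile:7", "updated")
    print(session.read("profile:7"))   # served by the primary: replica lags the token
    r.catch_up(p, p.lsn)
    print(session.read("profile:7"))   # now served by the local replica
```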

5.2 Distributed Locking: Redlock and its Controversies

5.2 分布式锁:Redlock及其争议

When multiple writers must exist and coordinate, distributed locking becomes necessary. However, distributed locks are inherently more fragile than local mutexes. On a single machine, the OS kernel enforces mutual exclusion; if a thread holding a lock crashes, the OS can release it. In a distributed system, if a node holding a lock fails, the lock might persist indefinitely unless a lease or timeout mechanism is employed.27

当必须存在多个写入者并进行协调时,分布式锁就变得必要。然而,分布式锁本质上比本地互斥锁更脆弱。在单台机器上,操作系统内核强制执行互斥;如果持有锁的线程崩溃,操作系统可以释放它。在分布式系统中,如果持有锁的节点发生故障,除非采用租约或超时机制,否则锁可能会无限期持续 27。

A prominent solution is the Redlock algorithm used with Redis. It attempts to provide distributed locking by requiring a client to acquire locks from a majority (N/2 + 1) of Redis nodes. Crucially, Redlock relies on "wall clock" time to determine lock validity. This reliance introduces a significant vulnerability: clock skew. If the system clock on a server jumps (e.g., due to NTP synchronization), it might expire a lock prematurely while the client still believes it holds it. This violation of the lock's mutual exclusion property can lead to data corruption in systems requiring strict serialization. Critics argue that for true correctness, systems should rely on fencing tokens or monotonic clocks rather than time-of-day clocks.28

一个著名的解决方案是Redis使用的Redlock算法。它试图通过要求客户端从多数(N/2 + 1)Redis节点获取锁来提供分布式锁定。关键是,Redlock依赖“挂钟”时间来确定锁的有效性。这种依赖引入了一个显著的漏洞:时钟偏差。如果服务器上的系统时钟发生跳变(例如,由于NTP同步),它可能会在客户端仍认为自己持有锁的情况下过早地使锁过期。这种对锁互斥属性的违反可能会导致需要严格序列化的系统中的数据损坏。批评者认为,为了真正的正确性,系统应该依赖屏蔽令牌(fencing tokens)或单调时钟,而不是日历时钟 28。
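
To show why fencing tokens are preferred over validity based purely on wall-clock time, the sketch below has a lock service issue monotonically increasing tokens and has the storage layer reject any write carrying a token older than the newest it has accepted. The class names are invented; the pattern follows the general fencing-token argument rather than any specific product.

```python
import itertools

class LockService:
    """Hands out leases tagged with a monotonically increasing fencing token."""
    def __init__(self):
        self._tokens = itertools.count(1)

    def acquire(self, resource):
        # A real service would also track lease expiry per resource;
        # this toy model only issues the token.
        return next(self._tokens)

class FencedStorage:
    """Accepts a write only if its fencing token is not older than the newest seen."""
    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            # A client whose lease silently expired (GC pause, clock jump) is rejected
            # instead of overwriting data written by the newer lock holder.
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.value = value

if __name__ == "__main__":
    locks, storage = LockService(), FencedStorage()
    old_token = locks.acquire("ledger")    # client A acquires, then stalls
    new_token = locks.acquire("ledger")    # lease expires, client B acquires
    storage.write(new_token, "B's update")
    try:
        storage.write(old_token, "A's late update")   # A wakes up and tries to write
    except PermissionError as exc:
        print(exc)
```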

To mitigate scalability bottlenecks in locking, strategies such as Fine-Grained Locking are employed. Instead of locking an entire dataset or counter, the resource is split into "buckets" (sharding). For example, a high-concurrency counter can be split into N buckets; threads pick a random bucket to increment, reducing the probability of collision to 1/N. This technique, combined with "Optimistic Concurrency Control" (OCC)—which uses version numbers and Compare-And-Swap (CAS) operations instead of heavy locks—allows distributed systems to scale write throughput effectively.29

为了减轻锁定中的可扩展性瓶颈,采用了诸如细粒度锁定等策略。不是锁定整个数据集或计数器,而是将资源拆分为“桶”(分片)。例如,高并发计数器可以拆分为N个桶;线程随机选择一个桶进行递增,将碰撞概率降低到1/N。这种技术与“乐观并发控制”(OCC)——使用版本号和比较并交换(CAS)操作代替重型锁——相结合,使得分布式系统能够有效地扩展写入吞吐量 29。
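
The sketch below combines the two techniques just described: a counter sharded into N buckets to reduce contention, with each bucket updated through an optimistic compare-and-swap retry loop rather than a heavyweight lock. The bucket count and the in-process CAS stand-in are illustrative assumptions.

```python
import random
import threading

class ShardedCounter:
    """Fine-grained sharding plus optimistic concurrency control (version + CAS)."""
    def __init__(self, buckets=16):
        # Each bucket holds (version, value); version numbers drive the CAS check.
        self._buckets = [(0, 0) for _ in range(buckets)]
        self._guard = threading.Lock()   # stands in for the store's atomic CAS primitive

    def _compare_and_swap(self, index, expected_version, new_value):
        with self._guard:                # atomicity of a single CAS, not a global lock
            version, _ = self._buckets[index]
            if version != expected_version:
                return False             # another writer won the race; caller retries
            self._buckets[index] = (version + 1, new_value)
            return True

    def increment(self, amount=1):
        index = random.randrange(len(self._buckets))   # collision probability ~ 1/N
        while True:                                    # optimistic retry loop
            version, value = self._buckets[index]
            if self._compare_and_swap(index, version, value + amount):
                return

    def total(self):
        return sum(value for _, value in self._buckets)

if __name__ == "__main__":
    counter = ShardedCounter()
    threads = [threading.Thread(target=lambda: [counter.increment() for _ in range(1000)])
               for _ in range(8)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter.total())   # 8000
```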

6. Data Integrity at the Bit Level: ZFS and Checksums

6. 比特级的数据完整性:ZFS与校验和

In the absence of mainframe hardware that guarantees data integrity through specialized circuitry, software file systems must assume the hardware is fallible. ZFS (Zettabyte File System) represents the gold standard for this "end-to-end" software integrity, effectively implementing mainframe-grade reliability on commodity hardware.

在缺乏通过专用电路保证数据完整性的大型机硬件的情况下,软件文件系统必须假设硬件是不可靠的。ZFS(泽字节文件系统)代表了这种“端到端”软件完整性的黄金标准,有效地在商用硬件上实现了大型机级的可靠性。

6.1 The Threat: Silent Data Corruption and Bit Rot

6.1 威胁:静默数据损坏与比特腐烂

Commodity hardware is imperfect. Disk firmware contains bugs, cables suffer from electromagnetic interference, and cosmic rays can flip bits in RAM. Research indicates that memory errors occur at rates ranging from 25,000 to 70,000 FIT (Failures In Time) per Mbit. A single bit flip in a file system's metadata can render an entire storage pool unreadable.31

商用硬件是不完美的。磁盘固件包含漏洞,电缆遭受电磁干扰,宇宙射线可能会翻转RAM中的比特。研究表明,内存错误的发生率在每兆比特25,000到70,000 FIT(故障率单位)之间。文件系统元数据中的单个比特翻转可能导致整个存储池无法读取 31。

Traditional file systems typically rely on the hard drive to report errors. However, if a drive "silently" writes corrupt data (a Phantom Write) or writes data to the wrong sector, the drive controller may report success. Without an external validation mechanism, the file system accepts the corruption as valid data. While ECC (Error-Correcting Code) RAM helps, it typically only catches single-bit errors within the memory module itself, failing to detect corruption that occurs during transit across the bus or network interfaces.32

传统文件系统通常依赖硬盘驱动器来报告错误。然而,如果驱动器“静默”地写入损坏的数据(幻影写入)或将数据写入错误的扇区,驱动器控制器可能会报告成功。没有外部验证机制,文件系统会将损坏的数据作为有效数据接收。虽然ECC(纠错码)RAM有所帮助,但它通常只能捕获内存模块本身的单比特错误,无法检测到在跨总线或网络接口传输过程中发生的损坏 32。

6.2 ZFS End-to-End Integrity via Merkle Trees

6.2 ZFS通过Merkle树实现的端到端完整性

ZFS addresses this by adopting a "trust nothing" philosophy. Unlike systems that simply checksum the data block, ZFS checksums the pointer to the data. This creates a self-validating Merkle tree structure.

ZFS通过采用“互不信任”的哲学来解决这个问题。与仅仅校验数据块的系统不同,ZFS校验指向数据的指针。这创建了一个自我验证的Merkle树结构。

Mechanism: When a parent block points to a child block (data), the parent stores the checksum of the child. When ZFS reads the child block, it calculates the checksum and compares it to the value stored in the parent. If the drive returns the wrong block (a misdirected read) or corrupt data, the checksums will not match. Because the checksum is stored separately from the data itself (in the parent), ZFS can detect "Phantom Writes" where the data on disk is internally consistent but incorrect in the context of the file system tree.3

机制: 当父块指向子块(数据)时,父块存储子块的校验和。当ZFS读取子块时,它计算校验和并将其与存储在父块中的值进行比较。如果驱动器返回错误的块(误导读取)或损坏的数据,校验和将不匹配。由于校验和与数据本身分开存储(在父块中),ZFS可以检测到“幻影写入”,即磁盘上的数据内部一致但在文件系统树的上下文中是不正确的 3。

Self-Healing: In a mirrored or RAID-Z configuration, if ZFS detects a checksum mismatch, it does not merely report an error. It automatically reads the redundant copy from another mirror or parity block. If the copy is valid, ZFS repairs the corrupt block on the fly and returns the correct data to the application. This capability transforms the system from being merely error-detecting to being error-correcting, effectively implementing "Resilience" that mimics the "Fault Tolerance" of mainframes.3

自愈: 在镜像或RAID-Z配置中,如果ZFS检测到校验和不匹配,它不仅仅是报告错误。它会自动从另一个镜像或奇偶校验块读取冗余副本。如果副本有效,ZFS会即时修复损坏的块并将正确的数据返回给应用程序。这种能力将系统从仅仅是错误检测转变为错误纠正,有效地实现了模仿大型机“容错”的“弹性” 3。
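
The following toy model captures the two behaviours described above: a block pointer stores the checksum of its child, every read is verified against that stored value, and a mismatch triggers a read from the redundant copy plus an in-place repair. It is a conceptual sketch, not ZFS internals; the hash choice and class names are arbitrary.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class BlockPointer:
    """Parent-side pointer: records where the child lives and what it must hash to."""
    def __init__(self, mirror_a, mirror_b, address, data: bytes):
        self.mirrors = [mirror_a, mirror_b]
        self.address = address
        self.expected = checksum(data)       # checksum kept in the parent, not the child
        for m in self.mirrors:
            m[address] = data

    def read(self):
        for mirror in self.mirrors:
            data = mirror[self.address]
            if checksum(data) == self.expected:
                # Self-healing: rewrite any sibling copy that fails verification.
                for other in self.mirrors:
                    if checksum(other[self.address]) != self.expected:
                        other[self.address] = data
                return data
        raise IOError("all copies failed checksum verification")

if __name__ == "__main__":
    disk_a, disk_b = {}, {}
    ptr = BlockPointer(disk_a, disk_b, address=7, data=b"ledger row 42")
    disk_a[7] = b"ledger row 43"            # silent corruption on one mirror
    print(ptr.read())                        # correct data returned from the healthy mirror
    print(disk_a[7])                         # corrupted copy has been repaired in place
```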

6.3 Copy-on-Write (CoW) and Transactional Integrity

6.3 写时复制(CoW)与事务完整性

ZFS utilizes a transactional Copy-on-Write (CoW) model to ensure the file system is always consistent on disk, eliminating the need for fsck (file system check) after a crash.

ZFS利用事务性写时复制(CoW)模型来确保文件系统在磁盘上始终是一致的,消除了崩溃后进行 fsck(文件系统检查)的需要。

The CoW Process: When data is modified, ZFS does not overwrite the existing data in place (which would risk data loss if power failed mid-write). Instead, it writes the new data to a newly allocated block. Once the write is complete and verified, ZFS updates the metadata pointers up the tree to point to the new block. The final step is an atomic update of the "Uberblock" (the root of the file system). This ensures that the file system transitions atomically from "Valid State A" to "Valid State B." If power fails at any point before the Uberblock update, the system simply reverts to State A, as the old data was never overwritten.35

CoW过程: 当数据被修改时,ZFS不会原地覆盖现有数据(如果中途断电,这会有数据丢失的风险)。相反,它将新数据写入新分配的块。一旦写入完成并经过验证,ZFS会更新树向上的元数据指针以指向新块。最后一步是“Uberblock”(文件系统的根)的原子更新。这确保了文件系统从“有效状态A”原子地过渡到“有效状态B”。如果在Uberblock更新之前的任何一点断电,系统只需恢复到状态A,因为旧数据从未被覆盖 35。

This architecture also enables instantaneous snapshots and clones. Since data is immutable, a snapshot is simply a copy of the metadata pointers at a specific point in time. It consumes no additional disk space until new data diverges from the snapshot, allowing for efficient backups and testing environments.35

这种架构还实现了即时快照和克隆。由于数据是不可变的,快照只是特定时间点元数据指针的副本。在数据与快照发生偏离之前,它不消耗额外的磁盘空间,从而允许高效的备份和测试环境 35。
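
Below is a simplified model of the copy-on-write commit path and the snapshot property it enables: new data lands in freshly allocated blocks, a new pointer tree is built beside the old one, and the only in-place change is the final atomic swap of the root pointer. The structure is deliberately reduced to a single directory block and is not the on-disk ZFS layout.

```python
class CowStore:
    """Copy-on-write: modifications never overwrite live blocks in place."""
    def __init__(self):
        self.blocks = {}            # block_id -> immutable payload
        self.next_id = 0
        self.uberblock = self._write_block({})   # root points at an empty tree

    def _write_block(self, payload):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = payload          # always a newly allocated block
        return block_id

    def update(self, key, value):
        # 1. Copy the current tree (here a single directory block) and modify the copy.
        tree = dict(self.blocks[self.uberblock])
        tree[key] = self._write_block(value)
        new_root = self._write_block(tree)
        # 2. Atomic commit: one pointer switch from "valid state A" to "valid state B".
        #    A crash before this line leaves the old root, and the old data, fully intact.
        self.uberblock = new_root

    def snapshot(self):
        # A snapshot is just a saved copy of the root pointer; no data is copied.
        return self.uberblock

    def read(self, key, root=None):
        tree = self.blocks[root if root is not None else self.uberblock]
        return self.blocks[tree[key]]

if __name__ == "__main__":
    store = CowStore()
    store.update("balance", 100)
    snap = store.snapshot()
    store.update("balance", 250)
    print(store.read("balance"))        # 250 (live tree)
    print(store.read("balance", snap))  # 100 (snapshot still sees the old blocks)
```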

7. Write-Ahead Logging (WAL): The Universal Journal

7. 预写式日志(WAL):通用日志

Underpinning both mainframes and modern cloud databases is the concept of the Write-Ahead Log (WAL). This technique serves as the fundamental guarantor of atomicity and durability (the 'A' and 'D' in ACID properties).

支撑大型机和现代云数据库的是预写式日志(WAL)的概念。这项技术作为原子性和持久性(ACID属性中的'A'和'D')的基本保障。

7.1 Historical Origins and Concept

7.1 历史起源与概念

The concept of logging data changes originated in the mainframe era (e.g., IBM System R) but has become ubiquitous in systems like PostgreSQL and Oracle. WAL dictates a strict rule: a modification must be written to a secure, append-only log on stable storage before it is applied to the main database page.

记录数据更改的概念起源于大型机时代(例如IBM System R),但在PostgreSQL和Oracle等系统中已变得无处不在。WAL规定了一条严格的规则:修改必须先被写入稳定存储上的安全、仅追加的日志中,然后才能应用到主数据库页面。

This solves the "torn page" problem, where a system crash occurs halfway through writing a database block to disk. Without WAL, the database would be left in a corrupted state. With WAL, upon restart, the database system enters "Crash Recovery" mode. It replays the log to "Redo" committed transactions that hadn't reached the data files and "Undo" uncommitted transactions that were partially written.37

这解决了“页面撕裂”问题,即在将数据库块写入磁盘的中途发生系统崩溃。没有WAL,数据库将处于损坏状态。有了WAL,重启后,数据库系统进入“崩溃恢复”模式。它重放日志以“重做”尚未到达数据文件的已提交事务,并“撤销”部分写入的未提交事务 37。

7.2 Implementation in Modern Systems

7.2 现代系统中的实现

In modern systems like PostgreSQL, the WAL is implemented as a set of sequential segment files (typically 16MB each). The WAL records the "Delta" (the change) rather than the entire page. Because sequential writes to a log file are significantly faster than random writes to data pages, WAL also serves as a performance optimization.

在像PostgreSQL这样的现代系统中,WAL被实现为一组顺序段文件(通常每个16MB)。WAL记录的是“Delta”(变化)而不是整个页面。由于对日志文件的顺序写入比对数据页面的随机写入快得多,WAL也作为一种性能优化手段。

Furthermore, WAL is central to replication. In cloud architectures, rather than replicating the raw data files (which is bandwidth-heavy), the primary node sends the stream of WAL records to the standby replicas. The replicas replay these records to stay synchronized. This "Log Shipping" mechanism allows for Point-in-Time Recovery (PITR), where an administrator can restore a database to a specific second in history by replaying the WAL up to that exact moment.39

此外,WAL是复制的核心。在云架构中,主节点不是复制原始数据文件(这会占用大量带宽),而是将WAL记录流发送到备用副本。副本重放这些记录以保持同步。这种“日志传送”机制允许进行时间点恢复(PITR),管理员可以通过重放WAL直到确切时刻,将数据库恢复到历史上的特定秒 39。
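
A minimal write-ahead-log sketch follows: each change is appended and fsynced to an append-only log before the in-memory "data page" is modified, and recovery rebuilds state by replaying the log. The file name and record format are invented, and the model covers only the redo path, not PostgreSQL's actual WAL format.

```python
import json
import os

class TinyWal:
    """Write-ahead logging: log first, apply second, replay after a crash."""
    def __init__(self, path="tiny.wal"):
        self.path = path
        self.pages = {}                      # the "data files", rebuilt on recovery
        if os.path.exists(path):
            self._recover()

    def put(self, key, value):
        record = json.dumps({"op": "put", "key": key, "value": value})
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(record + "\n")         # append-only, sequential write
            log.flush()
            os.fsync(log.fileno())           # the delta is durable before the page changes
        self.pages[key] = value              # only now touch the data page

    def _recover(self):
        # Crash recovery: redo every logged change in order to rebuild the pages.
        with open(self.path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                if record["op"] == "put":
                    self.pages[record["key"]] = record["value"]

if __name__ == "__main__":
    db = TinyWal()
    db.put("account:42", 100)
    # Simulate a crash by discarding in-memory state and re-opening: the log replays.
    db = TinyWal()
    print(db.pages.get("account:42"))        # 100
```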

8. Conclusions: Architectural Convergence and Stratification

8. 结论:架构融合与分层

The inquiry "Do cloud providers use IBM Mainframes?" yields a nuanced conclusion. While hyperscalers (AWS, Google, Azure) rely primarily on commodity hardware and distributed software architectures rather than physical mainframes for their general compute pools, the principles of the mainframe have been deconstructed and reimplemented in software, while specific high-value workloads continue to demand specialized engineered systems.

关于“云提供商是否使用IBM大型机?”的探究得出了一个微妙的结论。虽然超大规模云提供商(AWS、Google、Azure)在其通用计算池中主要依赖商用硬件和分布式软件架构,而不是物理大型机,但大型机的原则已被解构并在软件中重新实现,而特定的高价值工作负载继续需要专门的工程化系统。

Workload Stratification: Core banking ledgers, settlement systems, and insurance backends largely remain on mainframes or migrate to Engineered Systems like Oracle Exadata. These workloads require hardware-level fault tolerance and deterministic latency that pure cloud architectures struggle to guarantee economically.1
工作负载分层: 核心银行账本、结算系统和保险后端主要保留在大型机上或迁移到像Oracle Exadata这样的工程化系统。这些工作负载需要纯云架构难以在经济上保证的硬件级容错和确定性延迟 1。

Architectural Convergence: The cloud is adopting mainframe concepts through software. ZFS implements the "end-to-end integrity" that mainframe hardware used to provide.3 Oracle Exadata physically reintegrates compute and storage to solve the I/O latency problems inherent in distributed systems, effectively building a "cloud mainframe".17 Even billing models are evolving, with "Reserved Instances" mimicking the predictable cost structures of mainframe capacity planning.
架构融合: 云正在通过软件采纳大型机概念。ZFS实现了大型机硬件过去提供的“端到端完整性” 3。Oracle Exadata物理地重新整合了计算和存储,以解决分布式系统固有的I/O延迟问题,有效地构建了“云端大型机” 17。甚至计费模型也在演变,“预留实例”模仿了大型机容量规划的可预测成本结构。

The Hybrid Future: The industry is moving towards a hybrid integration model. "Cloud@Customer" deployments bring the cloud's operational flexibility to the mainframe's physical location, bridging the gap between data sovereignty requirements and the desire for cloud elasticity. The future is not a total replacement of the mainframe but a rigorous application of the "Right Workload, Right Platform" strategy.16
混合未来: 行业正朝着混合集成模型发展。“Cloud@Customer”部署将云的运营灵活性带到了大型机的物理位置,弥合了数据主权要求与云弹性需求之间的差距。未来不是完全取代大型机,而是严格应用“合适的工作负载,合适的平台”策略 16。

Works cited

Exadata Cloud Increases Financial Services Insight and Agility - Oracle, accessed December 12, 2025,

Mainframe vs. Cloud: Who Wins the Security Battle? | by Thomas Joseph | Medium, accessed December 12, 2025,

Zettabyte reliability with flexible end-to-end data integrity - IEEE Xplore, accessed December 12, 2025,

Cloud Terminology | CloudBank, accessed December 12, 2025,

ATOMIC CLOUD BILLING TERMINOLOGY, accessed December 12, 2025,

Understanding Your Cloud Bill: A Beginner's Guide - Umbrella, accessed December 12, 2025,

Reliability analysis of a hardware and software fault tolerant parallel processor - IEEE Xplore, accessed December 12, 2025,

Your Mainframe is the Original Cloud - Cloud Computing | Precisely, accessed December 12, 2025,

What Is Fault Tolerance? | Creating a Fault-tolerant System - Fortinet, accessed December 12, 2025,

(PDF) A Comparative Analysis of Hardware and Software Fault Tolerance: Impact on Software Reliability Engineering - ResearchGate, accessed December 12, 2025,

Mainframe Vs Cloud Computing: Know the Similarities and Differences, accessed December 12, 2025,

Fault tolerance - Wikipedia, accessed December 12, 2025,

Cloud and Mainframe: A Perfect Match - SHARE'd Intelligence, accessed December 12, 2025,

Oracle MAA with ExaCC and ExaCM, accessed December 12, 2025,

Oracle Exadata Database Service, accessed December 12, 2025,

Exadata Cloud@Customer - Oracle, accessed December 12, 2025,

CELL_OFFLOAD_PROCESSING - Oracle Help Center, accessed December 12, 2025,

Difference between Offloading and Smart scan - Oracle Forums, accessed December 12, 2025,

Offloading Data Search and Retrieval Processing - Oracle Help Center, accessed December 12, 2025,

3.1.2 cell_offload_plan_display - Oracle Help Center, accessed December 12, 2025,

High Availability Overview and Best Practices - Oracle Help Center, accessed December 12, 2025,

Oracle Best Practices for High Availability, accessed December 12, 2025,

Understanding the Single-Writer Principle - Software Engineering Stack Exchange, accessed December 12, 2025,

Scale applications using multi-Region Amazon EKS and Amazon Aurora Global Database: Part 1 - AWS, accessed December 12, 2025,

Consistency level choices - Azure Cosmos DB - Microsoft Learn, accessed December 12, 2025,

Consistency Guarantees in Distributed Systems Explained Simply | by Kousik Nath | Medium, accessed December 12, 2025,

Locking In Distributed Systems. Content | by Himani Prasad - Medium, accessed December 12, 2025,

How to do distributed locking | Hacker News, accessed December 12, 2025,

Decrement counter with high concurrency in distributed system - Software Engineering Stack Exchange, accessed December 12, 2025,

How to break through the distributed lock performance bottleneck of large model storage?, accessed December 12, 2025,

End-to-end Data Integrity for File Systems: A ZFS Case Study - USENIX, accessed December 12, 2025,

true end to end data integrity? | TrueNAS Community, accessed December 12, 2025,

End-to-end Data Integrity for File Systems: A ZFS Case Study - Computer Sciences Dept., accessed December 12, 2025,

OpenZFS - Data Security vs. Integrity - Klara Systems, accessed December 12, 2025,

ZFS Essentials: Copy-on-write & Snapshots - Open-E, accessed December 12, 2025,

Why The ZFS Copy On Write File System Is Better Than A Journaling One - YouTube, accessed December 12, 2025,

What You Need to Know About Write-Ahead Logging (WAL) - CelerData, accessed December 12, 2025,

Write-ahead logging - Wikipedia, accessed December 12, 2025,

Write-Ahead Logs: The Unsung Hero of Database Reliability — How a Simple Logging Pattern Powers… - Medium, accessed December 12, 2025,