数据库可靠性:中英双语整理

最后更新于:2025-12-12 14:57:47

Reliability Architecture of PostgreSQL 18 and MongoDB 8 in Single-Node Debian/ZFS Environments

单节点 Debian/ZFS 环境下 PostgreSQL 18 与 MongoDB 8 的可靠性架构

1. Introduction to Data Integrity Paradigms

1. 数据完整性范式介绍

1.1 Defining Data Integrity and Its Scope

1.1 定义数据完整性及其范围

Data integrity is fundamentally the maintenance and assurance of data accuracy and consistency over its entire life-cycle. It serves as a critical aspect of the design, implementation, and usage of any system that stores, processes, or retrieves data.1 While the term is broad in scope and may have widely different meanings depending on the specific context—even under the same general umbrella of computing—its core objective remains constant. Data integrity is the direct opposite of data corruption. The overall intent of any data integrity technique is to ensure that data is recorded exactly as intended and, upon later retrieval, to ensure the data remains identical to when it was originally recorded.1

数据完整性从根本上讲是在数据的整个生命周期中维护并确保其准确性和一致性。它是任何存储、处理或检索数据的系统在设计、实现和使用中的关键方面 1。虽然该术语涉及范围广泛,即使在计算这一相同的大背景下,根据具体上下文的不同,也可能具有截然不同的含义,但其核心目标保持不变。数据完整性是数据损坏的直接对立面。任何数据完整性技术的总体目的都是确保数据完全按照预期进行记录,并且在随后的检索中,确保数据与最初记录时保持完全一致 1。

Any unintended changes to data resulting from a storage, retrieval, or processing operation constitute a failure of data integrity. These failures can stem from various sources, including malicious intent, unexpected hardware failure, and human error.1 Depending on the specific data involved, the manifestation of such failures can range from benign anomalies, such as a single pixel in an image appearing a different color than originally recorded, to catastrophic events like the loss of vacation pictures or business-critical information.1 It is crucial to distinguish data integrity from data security. Data security is the discipline of protecting data from unauthorized parties. If changes result from unauthorized access, it may be a failure of data security; however, the alteration itself represents a failure of data integrity.1

任何由存储、检索或处理操作导致的数据非预期更改都构成了数据完整性的失败。这些失败可能源于多种来源,包括恶意意图、意外硬件故障和人为错误 1。根据所涉及的具体数据,此类失败的表现形式可能从良性异常(例如图像中单个像素显示的颜色与最初记录不同)到灾难性事件(例如丢失度假照片或业务关键信息)不等 1。区分数据完整性和数据安全至关重要。数据安全是保护数据免受未经授权方侵害的学科。如果更改是由未经授权的访问引起的,那可能是数据安全的失败;然而,更改本身代表了数据完整性的失败 1。

Organizations can maintain data integrity through the implementation of integrity constraints, which define the strict rules and procedures surrounding actions such as the deletion, insertion, and update of information.2 These mechanisms ensure that the database rejects mutually exclusive possibilities and preserves the fidelity of the recorded information.1

组织可以通过实施完整性约束来维护数据完整性,这些约束定义了围绕信息删除、插入和更新等操作的严格规则和程序 2。这些机制确保数据库拒绝互斥的可能性,并保持记录信息的保真度 1。
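
A minimal psql sketch of such constraints, assuming a hypothetical database appdb with a pre-existing customers table; the primary key enforces entity integrity, the foreign key referential integrity, and the CHECK clause domain integrity:

这里是一个此类约束的最小 psql 示例,假设存在一个名为 appdb 的数据库和一张已有的 customers 表;主键实施实体完整性,外键实施引用完整性,CHECK 子句实施域完整性:

    # Hypothetical schema; names are illustrative, not from the source.
    psql -d appdb <<'SQL'
    CREATE TABLE orders (
        order_id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- entity integrity
        customer_id bigint NOT NULL REFERENCES customers(id),        -- referential integrity
        quantity    integer NOT NULL CHECK (quantity > 0)            -- domain integrity
    );
    -- Rejected by the CHECK constraint, preserving integrity:
    INSERT INTO orders (customer_id, quantity) VALUES (1, -5);
    SQL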

1.2 The Dichotomy of Physical and Logical Integrity

1.2 物理完整性与逻辑完整性的二分法

To understand reliability in a storage stack involving Debian, ZFS, and databases like PostgreSQL or MongoDB, one must distinguish between physical and logical integrity. Data corruption can manifest itself as either physical or logical corruption.3

要理解涉及 Debian、ZFS 以及 PostgreSQL 或 MongoDB 等数据库的存储栈中的可靠性,必须区分物理完整性和逻辑完整性。数据损坏可表现为物理损坏或逻辑损坏 3。

Physical Integrity

物理完整性

Physical integrity refers to protecting the accuracy, correctness, and wholeness of data when it is stored and retrieved. This form of integrity is typically compromised by environmental and hardware issues such as power outages, storage erosion, natural disasters, or hackers targeting database functions, all of which prevent accurate data storage and retrieval.2 Physical data integrity is the preservation of data completeness, accessibility, and correctness while data is at rest or in transit.4 In the specific context of database blocks, physical corruption manifests as an invalid checksum or header, or when the block structure identifies itself as damaged.3

物理完整性是指在存储和检索数据时保护其准确性、正确性和完整性。这种形式的完整性通常会受到环境和硬件问题的破坏,例如断电、存储侵蚀、自然灾害或针对数据库功能的黑客攻击,所有这些都会阻碍准确的数据存储和检索 2。物理数据完整性是指在数据静止或传输过程中保持数据的完整性、可访问性和正确性 4。在数据库块的具体语境中,物理损坏表现为无效的校验和或头部,或者当块结构标识自身已损坏时 3。

Logical Integrity

逻辑完整性

Logical integrity focuses on the preservation of data consistency and completeness when it is accessible by many stakeholders and applications across departments, disciplines, and locations.4 Logical corruption happens when a data block has a valid checksum and physically appears correct, but the content inside the block is logically inconsistent.3 For example, accidental or incorrect modification of application data by a user or application is a primary cause of logical corruption. A scenario might involve an engineer who performs an update but forgets to formulate the predicate such that it updates only a single record, and instead accidentally updates (and commits) changes to thousands of records.5

逻辑完整性侧重于在跨部门、学科和地点的许多利益相关者和应用程序访问数据时,保持数据的一致性和完整性 4。逻辑损坏发生在数据块具有有效的校验和且物理上看起来正确,但块内的内容在逻辑上不一致的情况下 3。例如,用户或应用程序对应用程序数据的意外或错误修改是逻辑损坏的主要原因。一个场景可能涉及工程师执行更新操作,但忘记正确制定谓词以使其仅更新单条记录,反而意外更新(并提交)了对数千条记录的更改 5。
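
One defensive pattern against exactly this scenario is to run the update inside an explicit transaction and verify the affected row count before committing. A minimal sketch, using a hypothetical accounts table:

针对上述场景的一种防御模式是在显式事务中执行更新,并在提交前核对受影响的行数。以下是一个最小示例,使用假设的 accounts 表:

    # The DO block aborts the transaction unless exactly one row was updated.
    psql -d appdb <<'SQL'
    BEGIN;
    DO $$
    DECLARE n bigint;
    BEGIN
        UPDATE accounts SET balance = balance - 100 WHERE account_id = 42;
        GET DIAGNOSTICS n = ROW_COUNT;
        IF n <> 1 THEN
            RAISE EXCEPTION 'expected 1 row, updated %', n;  -- forces a rollback
        END IF;
    END $$;
    COMMIT;
    SQL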

Logical integrity ensures that data remains unchanged while being used in different ways through relational databases and can be enforced in both hierarchical and relational systems.2 It comprises four different formats, including entity integrity, which relies on primary keys and features of relational systems that store data within tables.2 Furthermore, interblock corruption, where the corruption occurs between blocks rather than within a single block, can only be classified as a logical corruption.3

逻辑完整性确保数据在通过关系数据库以不同方式使用时保持不变,并且可以在分层和关系系统中强制执行 2。它包含四种不同的格式,其中包括实体完整性,它依赖于主键和将数据存储在表中的关系系统的特性 2。此外,块间损坏(即损坏发生在块之间而不是单个块内)只能被归类为逻辑损坏 3。

2. The Threat Landscape: Silent Data Corruption

2. 威胁形势:静默数据损坏

2.1 Characteristics of Silent Data Corruption

2.1 静默数据损坏的特征

One of the biggest challenges in designing storage systems is providing the reliability and availability that users expect. Once their data is stored, users expect it to be persistent forever and perpetually available. However, an important threat to reliable storage of data is silent data corruption.6 Unlike explicit failures where a drive reports an error, silent corruption enters the data path undetected. To develop suitable protection mechanisms against data corruption, it is essential to understand its characteristics. Large-scale studies have been conducted to analyze corruption instances recorded in production storage systems.6

设计存储系统的最大挑战之一是提供用户期望的可靠性和可用性。一旦数据被存储,用户就期望它永远持久并永久可用。然而,数据可靠存储的一个重要威胁是静默数据损坏 6。与驱动器报告错误的显式故障不同,静默损坏会在未被检测到的情况下进入数据路径。为了开发针对数据损坏的合适保护机制,必须了解其特征。已经进行了大规模研究,以分析生产存储系统中记录的损坏实例 6。

In a comprehensive study of 1.53 million disk drives over a period of 41 months, researchers analyzed three classes of corruption: checksum mismatches, identity discrepancies, and parity inconsistencies. They focused on checksum mismatches, as these occurred most frequently. The study found more than 400,000 instances of checksum mismatches over the 41-month period.6

在一项针对 153 万个磁盘驱动器、为期 41 个月的综合研究中,研究人员分析了三类损坏:校验和不匹配、身份差异和奇偶校验不一致。由于校验和不匹配发生得最频繁,他们将其作为研究重点。研究发现,在这 41 个月期间出现了超过 400,000 个校验和不匹配的实例 6。

The analysis revealed several critical trends regarding the nature of these corruptions:

分析揭示了关于这些损坏性质的几个关键趋势:

Hardware Vulnerability: Nearline disks (and their adapters) develop checksum mismatches an order of magnitude more often than enterprise-class disk drives.6 This distinction is vital for architects selecting hardware for database servers.
硬件脆弱性:近线磁盘(及其适配器)产生校验和不匹配的频率比企业级磁盘驱动器高出一个数量级 6。这一区别对于为数据库服务器选择硬件的架构师至关重要。

Locality: Checksum mismatches within the same disk are not independent events. They show high spatial and temporal locality, meaning that if one error occurs, the probability of adjacent errors in space or time increases significantly.6
局部性:同一磁盘内的校验和不匹配不是独立事件。它们显示出高度的空间和时间局部性,这意味着如果发生一个错误,在空间或时间上发生相邻错误的概率会显著增加 6。

Systemic Correlation: Checksum mismatches across different disks in the same storage system are not independent.6 This suggests that corruption can be systemic, potentially affecting multiple drives simultaneously within a single storage array or server.
系统相关性:同一存储系统中不同磁盘之间的校验和不匹配不是独立的 6。这表明损坏可能是系统性的,可能同时影响单个存储阵列或服务器内的多个驱动器。

Silent data corruption is often referred to as the "backup killer." If corruption occurs and is not detected by the file system or database, the corrupted data may be backed up. When a restore is eventually needed, the backup itself contains the corruption, rendering it useless for recovering valid data.7

静默数据损坏通常被称为“备份杀手”。如果发生损坏且未被文件系统或数据库检测到,则损坏的数据可能会被备份。当最终需要恢复时,备份本身包含损坏,导致其无法用于恢复有效数据 7。

2.2 Memory Corruption vs. Disk Corruption

2.2 内存损坏与磁盘损坏

While disk media, firmware, controllers, and the buses that connect them can corrupt data, higher-level storage software is responsible for detecting and recovering from these corruptions.8 File and storage systems have evolved various techniques to handle disk corruption, such as checksums and redundancy (mirrored or parity-based form).8

虽然磁盘介质、固件、控制器以及连接它们的总线可能会损坏数据,但更高级别的存储软件负责检测这些损坏并从中恢复 8。文件和存储系统已经发展出各种技术来处理磁盘损坏,例如校验和及冗余(镜像或基于奇偶校验的形式)8。

However, the effects of memory corruption on data integrity have historically been largely ignored in file system design. Hardware-based memory corruption occurs as both transient soft errors and repeatable hard errors due to a variety of radiation mechanisms. Recent studies have confirmed their presence in modern systems. Furthermore, software bugs can lead to "wild writes" into random memory contents, polluting memory.8

然而,内存损坏对数据完整性的影响在历史上一直主要被文件系统设计所忽略。由于各种辐射机制,基于硬件的内存损坏会以瞬态软错误和可重复硬错误的形式发生。最近的研究证实了它们在现代系统中的存在。此外,软件错误可能导致对随机内存内容的“疯狂写入”,从而污染内存 8。

Modern file systems cache a large amount of data in memory for performance. As memory capacity grows, file systems may cache data for a long time, making them increasingly susceptible to memory corruptions.9 A study focusing on Sun's ZFS performed fault injection experiments to understand this dynamic. The results showed that while ZFS is robust to a wide range of disk faults, it fails to maintain data integrity in the presence of memory corruptions. A single bit flip in memory has non-negligible chances of causing failures, including reading/writing corrupt data or causing system crashes.8 This reveals a critical gap: data integrity at the memory level is not preserved as robustly as on-disk integrity.9

现代文件系统为了性能会在内存中缓存大量数据。随着内存容量的增长,文件系统可能会长时间缓存数据,使其越来越容易受到内存损坏的影响 9。一项针对 Sun ZFS 的研究进行了故障注入实验,以了解这种动态。结果表明,虽然 ZFS 对广泛的磁盘故障具有鲁棒性,但在存在内存损坏的情况下,它无法保持数据完整性。内存中的单个比特翻转有不可忽视的几率导致故障,包括读取/写入损坏的数据或导致系统崩溃 8。这揭示了一个关键差距:内存级别的数据完整性不如磁盘上的完整性保存得那样稳健 9。

3. The Physical Layer Defense: ZFS Implementation

3. 物理层防御:ZFS 实现

3.1 End-to-End Checksums and Self-Healing

3.1 端到端校验和与自愈

ZFS employs a mechanism known as end-to-end checksums to combat silent data corruption. Unlike traditional systems that might rely on the drive's internal error correction, ZFS detects data corruption upon reading from the media. Checksums are stored in a generic block pointer, physically separate from the data block itself, which prevents a single localized corruption from invalidating both the data and its verification hash.9

ZFS 采用一种称为端到端校验和的机制来对抗静默数据损坏。与可能依赖驱动器内部纠错的传统系统不同,ZFS 在从介质读取时检测数据损坏。校验和存储在通用的块指针中,在物理上与数据块本身分开,这防止了单个局部损坏同时使数据及其验证哈希无效 9。

The advantages of end-to-end checksums in ZFS include:

ZFS 中端到端校验和的优点包括:

Detection on Read: It detects data corruption immediately upon reading from the media.10
读取时检测:它在从介质读取时立即检测数据损坏 10。

Automatic Repair: Blocks that are detected as corrupt are automatically repaired if possible. This self-healing capability relies on RAID protection in suitably configured pools or redundant copies (managed via the zfs copies property).10 If a checksum mismatch occurs, ZFS uses the redundancy (Mirror, RAID-Z) or up to three copies (ditto blocks) to recover the correct data.9
自动修复:如果可能,被检测为损坏的块会自动修复。这种自愈能力依赖于配置适当的池中的 RAID 保护或冗余副本(通过 zfs copies 属性管理)10。如果发生校验和不匹配,ZFS 使用冗余(镜像,RAID-Z)或最多三个副本(ditto 块)来恢复正确的数据 9。

Scrubbing: Periodic scrubs can check data to detect and repair latent media degradation (bit rot) and corruption from other sources.10 This proactive measure ensures that errors are caught before they accumulate beyond the system's ability to repair them (see the sketch after this list).
清理(Scrubbing):定期清理可以检查数据,以检测和修复潜在的介质退化(位腐烂)以及来自其他来源的损坏 10。这种主动措施确保错误在积累到超出系统修复能力之前被捕获(参见列表后的示例)。

Replication Security: Checksums on ZFS replication streams (using zfs send and zfs receive) ensure the data received is not corrupted by intervening storage or transport mechanisms.10
复制安全性:ZFS 复制流(使用 zfs send 和 zfs receive)上的校验和确保接收到的数据不会被中间存储或传输机制损坏 10。
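
The properties and operations above map to a handful of commands. A minimal sketch, assuming a pool named tank and a dataset tank/db (names are illustrative):

上述属性和操作对应少数几个命令。以下是一个最小示例,假设存在名为 tank 的池和数据集 tank/db(名称仅为示意):

    zfs set copies=2 tank/db    # keep two copies (ditto blocks) of every block
    zfs get checksum tank/db    # show the active checksum algorithm (fletcher4 by default)
    zpool scrub tank            # read and verify every block in the pool
    zpool status -v tank        # report checksum errors and self-healing repairs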

3.2 Transactional Integrity and Copy-On-Write

3.2 事务完整性与写时复制

ZFS utilizes Copy-On-Write (COW) transactions to manage data updates. This design ensures that the disk image is always consistent. When data is modified, it is written to a new block rather than overwriting the existing one. Only after the write is confirmed successful are the pointers updated. This mechanism, combined with checksums, contributes to ZFS's "provable end-to-end data integrity".9

ZFS 利用写时复制 (COW) 事务来管理数据更新。这种设计确保磁盘映像始终一致。当数据被修改时,它被写入一个新的块,而不是覆盖现有的块。只有在确认写入成功后,指针才会更新。这种机制与校验和相结合,促成了 ZFS 的“可证明的端到端数据完整性” 9。

Other file systems with native end-to-end checksumming and integrity checking include Microsoft's Resilient File System (ReFS). In the general IO path, error correcting codes (ECC) and cyclic redundancy checks (CRCs) will catch the majority of errors. RAID types that run checksums also help to catch errors, typically protecting storage arrays.7 However, the integration of these features within ZFS—specifically the combination of COW, checksums in pointers, and storage pool management—provides a robust defense against physical corruption.9

其他具有原生端到端校验和及完整性检查的文件系统包括微软的弹性文件系统 (ReFS)。在一般的 IO 路径中,纠错码 (ECC) 和循环冗余校验 (CRC) 将捕获大多数错误。运行校验和的 RAID 类型也有助于捕获错误,通常用于保护存储阵列 7。然而,ZFS 内部这些功能的集成——特别是 COW、指针中的校验和以及存储池管理的结合——提供了针对物理损坏的强大防御 9。

4. Database Layer I: PostgreSQL 18 Architecture

4. 数据库层 I:PostgreSQL 18 架构

4.1 Data Page Checksums: Configuration and Behavior

4.1 数据页校验和:配置与行为

PostgreSQL employs data page checksums to detect corruption within database files. The system uses a fast FNV-1a hash to calculate checksums, which is optimized for performance.11

PostgreSQL 采用数据页校验和来检测数据库文件内的损坏。系统使用一种针对性能优化的快速 FNV-1a 哈希来计算校验和 11。

Default Enablement in Version 18

版本 18 中的默认启用

A significant shift in reliability policy has occurred with PostgreSQL 18. Version 18 enables data-checksums by default. In earlier versions, initializing a cluster with initdb required the specific flag --data-checksums to activate this feature. Now, the default behavior is to enable them, and users wishing to opt out must explicitly use the new option --no-data-checksums.11 This change reflects a prioritization of data integrity, with the release notes acknowledging that while the overhead is non-zero, it is accepted for the benefit of data integrity.11 Benchmarking studies have found the penalty is usually less than 2% for normal workloads on typical hardware.11

PostgreSQL 18 的可靠性策略发生了重大转变。版本 18 默认启用 data-checksums。在早期版本中,使用 initdb 初始化集群需要特定的标志 --data-checksums 来激活此功能。现在,默认行为是启用它们,希望退出的用户必须显式使用新选项 --no-data-checksums 11。这一变化反映了对数据完整性的重视,发布说明承认虽然开销非零,但为了数据完整性的益处而接受了它 11。基准测试研究发现,在典型硬件上,对于正常工作负载,其代价通常小于 2% 11。
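
A short sketch of the version 18 defaults and the explicit opt-out, with illustrative data directories:

以下简要示例展示版本 18 的默认行为和显式退出选项,数据目录仅为示意:

    initdb -D /srv/pg18/main                         # PostgreSQL 18: checksums on by default
    initdb -D /srv/pg18/nochk --no-data-checksums    # explicit opt-out (new in 18)
    psql -c 'SHOW data_checksums;'                   # prints "on" when checksums are active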

Operational Management

操作管理

Data checksums are a full cluster-level property and cannot be specified individually for databases or tables.13 The state of checksums can be verified by viewing the read-only configuration variable data_checksums using the command SHOW data_checksums. A result of "ON" indicates that data-page checksums are active.11

数据校验和是整个集群级别的属性,不能为数据库或表单独指定 13。校验和的状态可以通过使用命令 SHOW data_checksums 查看只读配置变量 data_checksums 来验证。结果为“ON”表示数据页校验和处于活动状态 11。

To modify this setting on an existing cluster, PostgreSQL provides the pg_checksums utility. This tool can check, enable, or disable checksums. However, it is important to note that checksums cannot be toggled while the server is running; the cluster must be shut down cleanly before using pg_checksums.11

为了在现有集群上修改此设置,PostgreSQL 提供了 pg_checksums 工具。该工具可以检查、启用或禁用校验和。但是,必须注意,无法在服务器运行时切换校验和;在使用 pg_checksums 之前,必须干净地关闭集群 11。
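
A minimal offline workflow for pg_checksums, assuming a Debian-style service name and an illustrative data directory:

一个最小的 pg_checksums 离线操作流程,假设使用 Debian 风格的服务名和示意性的数据目录:

    sudo systemctl stop postgresql               # cluster must be shut down cleanly
    pg_checksums --check  -D /srv/pg18/main      # verify checksums on all pages
    pg_checksums --enable -D /srv/pg18/main      # rewrite every page with a checksum
    sudo systemctl start postgresql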

Error Handling and Recovery

错误处理与恢复

When attempting to recover from page corruptions, the strict validation of checksums can sometimes impede recovery efforts. In such scenarios, it may be necessary to bypass checksum protection. Administrators can achieve this by temporarily setting the configuration parameter ignore_checksum_failure.13 This allows the database to read potentially corrupt pages to salvage data, though it carries the risk of processing logically incorrect information.

当尝试从页面损坏中恢复时,校验和的严格验证有时会阻碍恢复工作。在这种情况下,可能需要绕过校验和保护。管理员可以通过临时设置配置参数 ignore_checksum_failure 来实现这一点 13。这允许数据库读取可能损坏的页面以挽救数据,尽管这会带来处理到逻辑上不正确信息的风险。
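
A salvage sketch: the setting is superuser-only and should stay enabled only for the duration of the rescue; the table name is hypothetical:

一个数据抢救示例:该参数仅限超级用户使用,且只应在抢救期间启用;表名为假设:

    psql -d appdb <<'SQL'
    SET ignore_checksum_failure = on;  -- warn on checksum mismatch instead of erroring
    \copy damaged_table TO 'salvage.csv' WITH (FORMAT csv)
    SQL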

4.2 Write-Ahead Log (WAL) and Durability

4.2 预写式日志 (WAL) 与持久性

The Write-Ahead Log (WAL) is the backbone of PostgreSQL's durability. For persistent data structures, WAL records are written so that recent changes can be accurately rebuilt during crash recovery.14

预写式日志 (WAL) 是 PostgreSQL 持久性的支柱。对于持久化的数据结构,系统会写入 WAL 记录,以便在崩溃恢复时准确重建最近的更改 14。

CRC Algorithm Implementation

CRC 算法实现

WAL records are protected by cyclic redundancy checks. Documentation for PostgreSQL has recently been updated to correct a long-standing inaccuracy: while previous texts stated the use of CRC-32, the system has actually used CRC-32C since version 9.5 (commit 5028f22). The "C" variant of CRC-32 is optimized for performance and hardware acceleration.15 This correction also standardizes the nomenclature in documentation to "CRC-32C" (with a dash), aligning with conventions used by Wikipedia.15

WAL 记录受循环冗余校验保护。PostgreSQL 的文档最近已更新,以纠正一个长期存在的不准确之处:虽然之前的文本陈述使用的是 CRC-32,但系统实际上自 9.5 版本(提交 5028f22)以来一直使用 CRC-32C。CRC-32 的“C”变体针对性能和硬件加速进行了优化 15。此更正还将文档中的命名标准化为“CRC-32C”(带连字符),与维基百科使用的惯例保持一致 15。

Scope of Checksum Protection

校验和保护范围

The reliability mechanisms extend beyond just the main data tables:

可靠性机制的保护范围不仅限于主数据表:

WAL Records: Protected by CRC-32C.11
WAL 记录:受 CRC-32C 保护 11。

Two-Phase State Files: Individual state files in pg_twophase are protected by CRC-32C.14
两阶段状态文件:pg_twophase 中的单个状态文件受 CRC-32C 保护 14。

Exclusions: It is important to note that temporary data files used in larger SQL queries for sorts, materializations, and intermediate results are not currently checksummed. Furthermore, WAL records are not written for changes to those files.14
例外:必须注意,用于大型 SQL 查询中的排序、物化和中间结果的临时数据文件目前不进行校验和检查。此外,不会为这些文件的更改写入 WAL 记录 14。

4.3 New Features in PostgreSQL 18 Affecting Storage

4.3 PostgreSQL 18 中影响存储的新功能

PostgreSQL 18 introduces several features and changes that interact with the storage subsystem and reliability protocols 12:

PostgreSQL 18 引入了若干与存储子系统和可靠性协议交互的功能和更改 12:

Asynchronous I/O (AIO): An AIO subsystem is now available that can improve the performance of sequential scans, bitmap heap scans, vacuums, and other operations. This changes the I/O pattern interacting with the underlying OS and filesystem.12
异步 I/O (AIO):现在可以使用 AIO 子系统,它可以提高顺序扫描、位图堆扫描、清理(vacuums)和其他操作的性能。这改变了与底层操作系统和文件系统交互的 I/O 模式 12。

UUIDv7: The system now supports the uuidv7() function for generating timestamp-ordered UUIDs. This has implications for index structure and data locality compared to random UUIDs (see the sketch after this list).12
UUIDv7:系统现在支持 uuidv7() 函数,用于生成按时间戳排序的 UUID。与随机 UUID 相比,这对索引结构和数据局部性有影响(参见列表后的示例)12。

Migration Considerations: pg_upgrade requires matching cluster checksum settings. The new initdb option --no-data-checksums is explicitly useful to upgrade non-checksum old clusters to Version 18, ensuring compatibility without forcing a checksum enablement during upgrade.12
迁移注意事项:pg_upgrade 要求集群校验和设置匹配。initdb 的新选项 --no-data-checksums 对于将未校验和的旧集群升级到版本 18 特别有用,确保了兼容性,而无需在升级期间强制启用校验和 12。

Virtual Generated Columns: These columns compute their values during read operations and are now the default for generated columns.12
虚拟生成列:这些列在读取操作期间计算其值,并且现在是生成列的默认设置 12。
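
A small sketch of uuidv7() as a primary key default; the point is that time-ordered keys make index insertions more clustered than random UUIDs (table name illustrative):

以下是以 uuidv7() 作为主键默认值的小示例,要点在于按时间排序的键比随机 UUID 使索引插入更加聚集(表名仅为示意):

    psql -d appdb <<'SQL'
    CREATE TABLE events (
        id      uuid PRIMARY KEY DEFAULT uuidv7(),  -- timestamp-ordered (PostgreSQL 18)
        payload jsonb
    );
    INSERT INTO events (payload) VALUES ('{"k": 1}');
    SQL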

5. Database Layer II: MongoDB 8 & WiredTiger Durability

5. 数据库层 II:MongoDB 8 & WiredTiger 持久性

5.1 WiredTiger Storage Engine Mechanics

5.1 WiredTiger 存储引擎机制

MongoDB utilizes the WiredTiger storage engine, which relies on a combination of checkpoints and journaling to ensure data durability and integrity.

MongoDB 利用 WiredTiger 存储引擎,该引擎依赖检查点和日志记录的组合来确保数据持久性和完整性。

Journaling and Recovery

日志记录与恢复

As applications write data, MongoDB records the data in the storage layer. To provide durable data, WiredTiger records these changes in a journal. This journal acts as a temporary record of all operations, ensuring durability and consistency. If the system crashes before the data is written to the disk (flushed from memory), MongoDB can use the journal to replay the operations and recover the database to a consistent state, preventing data loss.16

当应用程序写入数据时,MongoDB 将数据记录在存储层。为了提供持久数据,WiredTiger 将这些更改记录在日志中。该日志充当所有操作的临时记录,确保持久性和一致性。如果系统在数据写入磁盘(从内存刷新)之前崩溃,MongoDB 可以使用日志重放操作并将数据库恢复到一致状态,防止数据丢失 16。

Journal File Management

日志文件管理

WiredTiger creates a new journal file approximately every 100 MB of data, because MongoDB caps each journal file at 100 MB. Crucially, the system also has a time-based commit mechanism: it commits to the journal every 100 milliseconds. This frequency is controlled by the parameter storage.journal.commitIntervalMs.18

WiredTiger 大约每 100 MB 数据创建一个新的日志文件,这是由于 MongoDB 使用 100 MB 的日志文件大小限制。至关重要的是,系统有一个基于时间的提交机制:它每 100 毫秒向日志提交一次。此频率由参数 storage.journal.commitIntervalMs 控制 18。
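
A minimal mongod.conf sketch; 100 ms is the default value, and the snippet assumes storage.journal is not already defined in the file. The setting takes effect only at startup:

一个最小的 mongod.conf 示例;100 毫秒是默认值,且该片段假设文件中尚未定义 storage.journal。该设置仅在启动时生效:

    sudo tee -a /etc/mongod.conf <<'YAML'
    storage:
      journal:
        commitIntervalMs: 100
    YAML
    sudo systemctl restart mongod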

Checkpoints

检查点

In parallel with journaling, WiredTiger uses checkpoints. Checkpoints are the mechanism by which the storage engine provides durable data by writing snapshot data to disk. This process interacts with the journal to manage the volume of data that must be replayed upon recovery.17

与日志记录并行,WiredTiger 使用检查点。检查点是存储引擎通过将快照数据写入磁盘来提供持久数据的机制。此过程与日志交互,以管理恢复时必须重放的数据量 17。

5.2 Write Concern: Balancing Performance and Reliability

5.2 写入关注:平衡性能与可靠性

Write concern is the configuration mechanism that describes the level of acknowledgment requested from MongoDB for write operations. It defines how many nodes in a replica set must confirm a write before it is considered successful. By configuring Write Concern appropriately, developers can balance between performance and data reliability.16

写入关注(Write Concern)是描述 MongoDB 对写入操作所请求的确认级别的配置机制。它定义了副本集中必须有多少节点确认写入,该写入才被视为成功。通过适当地配置写入关注,开发人员可以在性能和数据可靠性之间取得平衡 16。

The Write Concern specification includes the following fields:

写入关注规范包括以下字段:

w: The number of instances that must acknowledge the write.
w:必须确认写入的实例数量。

j: A boolean requesting acknowledgment that the write operation has been written to the on-disk journal.
j:一个布尔值,请求确认写入操作已被写入磁盘上的日志。

wtimeout: A time limit to prevent write operations from blocking indefinitely.19
wtimeout:防止写入操作无限期阻塞的时间限制 19。

Durability Levels and Scenarios

持久性级别与场景

The following summary outlines the primary write concern configurations and their implications for reliability 16:

以下摘要列出了主要的写入关注配置及其对可靠性的影响 16:

w: 0: No acknowledgment is requested. This is the fastest option, but a failed write can be lost silently.
w: 0:不请求任何确认。这是最快的选项,但失败的写入可能被静默丢失。

w: 1: The primary acknowledges after applying the write in memory; a crash before the journal flush can still lose the write.
w: 1:主节点在内存中应用写入后即确认;在日志刷新之前崩溃仍可能丢失该写入。

w: "majority": A majority of the data-bearing voting members must acknowledge the write, protecting against the failure of a single replica set member.
w: "majority":大多数承载数据的投票成员必须确认写入,可抵御副本集中单个成员的故障。

j: true: Acknowledgment waits until the write reaches the on-disk journal, providing the strongest durability on a single node.
j: true:确认需等待写入到达磁盘上的日志,在单节点上提供最强的持久性。

Real-Time Persistence

实时持久性

For use cases requiring writes to be persisted to disk as frequently as possible (in real-time), utilizing j: true (or "journal acknowledged") is the solution. This ensures writes are acknowledged only after they are written to the journal and flushed to disk. This can be set as the default on the connection or on individual writes, achieving durability without changing startup parameters.20

对于要求尽可能频繁地(实时)将写入持久保存到磁盘的用例,利用 j: true(或“日志确认”)是解决方案。这确保写入仅在被写入日志并刷新到磁盘后才被确认。这可以设置为连接的默认值或在单个写入上设置,从而在不更改启动参数的情况下实现持久性 20。
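
A short mongosh sketch of a per-operation write concern; database and collection names are illustrative:

以下是一个在单次操作上设置写入关注的 mongosh 简短示例;数据库和集合名称仅为示意:

    mongosh --quiet --eval '
      const appdb = db.getSiblingDB("appdb");
      appdb.orders.insertOne(
        { sku: "A-1", qty: 2 },
        { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
      );
    '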

Implicit Defaults in Replica Sets

副本集中的隐式默认值

The implicit default write concern is w: majority. This ensures durability by requiring replica sets to wait for on-disk journaling by default (controlled by writeConcernMajorityJournalDefault). However, an edge case exists: if the number of data-bearing voting members is not greater than the voting majority (e.g., due to Arbiters), the default reverts to { w: 1 }.19

隐式默认写入关注是 w: majority。这通过要求副本集默认等待磁盘日志记录(由 writeConcernMajorityJournalDefault 控制)来确保持久性。但是,存在一个边缘情况:如果承载数据的投票成员数量不大于投票多数(例如,由于仲裁者),默认值将恢复为 { w: 1 } 19。

6. Integrated Reliability: Debian/ZFS, PostgreSQL, and MongoDB

6. 集成可靠性:Debian/ZFS、PostgreSQL 和 MongoDB

6.1 The Stacked Defense Against "Backup Killers"

6.1 针对“备份杀手”的分层防御

The combination of the technologies discussed creates a layered defense strategy against data loss, particularly silent data corruption (the "backup killer").7

所讨论技术的结合创建了一个针对数据丢失,特别是静默数据损坏(“备份杀手”)的分层防御策略 7。

Layer 1: The File System (ZFS)

第一层:文件系统 (ZFS)

ZFS serves as the foundational layer. By calculating and verifying checksums on every read operation, ZFS ensures that the application layer (PostgreSQL or MongoDB) never receives corrupt data from the disk. If a checksum mismatch occurs, ZFS attempts to self-heal using redundant copies (zfs copies or RAID parity). If repair is impossible, it returns an I/O error rather than silent corruption.9 This prevents the database from processing—and subsequently backing up—invalid data.

ZFS 作为基础层。通过在每次读取操作中计算和验证校验和,ZFS 确保应用层(PostgreSQL 或 MongoDB)永远不会从磁盘接收损坏的数据。如果发生校验和不匹配,ZFS 会尝试使用冗余副本(zfs copies 或 RAID 奇偶校验)进行自愈。如果无法修复,它会返回 I/O 错误而不是静默损坏 9。这防止了数据库处理——并随后备份——无效数据。

Layer 2: The Database Application (PostgreSQL/MongoDB)

第二层:数据库应用程序 (PostgreSQL/MongoDB)

While ZFS protects against disk corruption, it is less resilient to memory corruption.8 This is where application-level checksums become critical.

虽然 ZFS 可以防御磁盘损坏,但它对内存损坏的抵御能力较弱 8。这正是应用层校验和变得至关重要的地方。

PostgreSQL: With data-checksums enabled by default in Version 18, PostgreSQL verifies pages as they are read into the buffer pool. If a memory error flips a bit in a page after ZFS checks it but before the database uses it (or during previous in-memory processing before a write), PostgreSQL's checksums provide a secondary integrity check.11 The protection of WAL records with CRC-32C ensures that recovery logs are also validated.14
PostgreSQL:随着版本 18 默认启用 data-checksums,PostgreSQL 在页面读入缓冲池时对其进行验证。如果内存错误在 ZFS 检查页面之后但在数据库使用它之前(或者在写入之前的先前内存处理期间)翻转了页面中的位,PostgreSQL 的校验和将提供二级完整性检查 11。使用 CRC-32C 保护 WAL 记录确保了恢复日志也得到验证 14。

MongoDB: WiredTiger's use of journaling ensures that operations are durable. By setting { j: true }, the application forces the data through the filesystem cache to the physical disk (and ZFS validation logic), bridging the gap between volatile memory and persistent storage.16
MongoDB:WiredTiger 对日志记录的使用确保了操作的持久性。通过设置 { j: true },应用程序强制数据通过文件系统缓存进入物理磁盘(以及 ZFS 验证逻辑),弥合了易失性内存和持久存储之间的差距 16。

Layer 3: Hardware and Transport

第三层:硬件和传输

In the IO path, error correcting codes (ECC) and CRCs catch the majority of transmission errors before they reach the higher software layers.7

在 IO 路径中,纠错码 (ECC) 和 CRC 在传输错误到达更高软件层之前捕获了大多数错误 7。

6.2 Operational Recommendations and Maintenance

6.2 操作建议与维护

To maximize reliability in this specific single-node Debian/ZFS environment, the following practices are derived from the mechanisms analyzed:

为了在这个特定的单节点 Debian/ZFS 环境中最大化可靠性,从分析的机制中得出以下实践:

ZFS Scrubbing: Regular scrubbing must be scheduled. Since ZFS detects corruption on read, latent corruption in rarely accessed data (bit rot) might go unnoticed until a critical restore. Scrubbing forces a read and verify of all data, allowing ZFS's self-healing to repair errors while redundancy is still available (see the sketch after this list).10
ZFS 清理:必须安排定期清理。由于 ZFS 在读取时检测损坏,罕见访问数据中的潜在损坏(位腐烂)可能会在关键恢复之前一直未被注意。清理强制读取并验证所有数据,允许 ZFS 的自愈功能在冗余仍然可用时修复错误(参见列表后的示例)10。

PostgreSQL Verification: Administrators should confirm that data_checksums are "ON" via SHOW data_checksums.13 For upgrades, use pg_upgrade with awareness that checksum settings must match, or use --no-data-checksums if migrating from a legacy non-checksummed system.12
PostgreSQL 验证:管理员应通过 SHOW data_checksums 确认 data_checksums 为“ON” 13。对于升级,使用 pg_upgrade 时要注意校验和设置必须匹配,或者如果从遗留的未校验和系统迁移,则使用 --no-data-checksums 12。

MongoDB Write Concern: For data that cannot tolerate even a 100ms loss window (the default commit interval), applications must use a write concern of { j: true } to force immediate journal flushing.18
MongoDB 写入关注:对于甚至无法容忍 100 毫秒丢失窗口(默认提交间隔)的数据,应用程序必须使用 { j: true } 的写入关注来强制立即刷新日志 18。
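
For the scrub schedule referenced above, a minimal cron sketch (Debian's zfsutils-linux package ships a comparable periodic job; the pool name tank is illustrative):

对于上文提到的清理计划,这里是一个最小的 cron 示例(Debian 的 zfsutils-linux 包自带类似的周期性任务;池名 tank 仅为示意):

    # Scrub the pool at 02:00 on the first day of every month.
    echo '0 2 1 * * root /usr/sbin/zpool scrub tank' | sudo tee /etc/cron.d/zfs-scrub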

By strictly adhering to these architectural and operational principles, the system leverages the strengths of ZFS's physical integrity guarantees and the logical/application-level protections of PostgreSQL and MongoDB, creating a robust defense against both physical degradation and logical corruption.3

通过严格遵守这些架构和操作原则,系统利用了 ZFS 物理完整性保证的优势以及 PostgreSQL 和 MongoDB 的逻辑/应用层保护,建立了针对物理退化和逻辑损坏的强大防御 3。

Works cited

1. Data integrity - Wikipedia, accessed December 12, 2025.

2. What Is Data Integrity? Why Is It Important? - Fortinet, accessed December 12, 2025.

3. Preventing, Detecting, and Repairing Block Corruption - Oracle Database 12c, accessed December 12, 2025.

4. Data Integrity vs Data Quality: What's the Difference? - lakeFS, accessed December 12, 2025.

5. Logical vs. physical corruption in Oracle - Pythian, accessed December 12, 2025.

6. An Analysis of Data Corruption in the Storage Stack - USENIX, accessed December 12, 2025.

7. Silent Data Corruption, the Backup Killer | Enterprise Storage Forum, accessed December 12, 2025.

8. End-to-end Data Integrity for File Systems: A ZFS Case Study - Computer Sciences Dept., accessed December 12, 2025.

9. End-to-end Data Integrity for File Systems: A ZFS Case Study - USENIX, accessed December 12, 2025.

10. Checksums and Their Use in ZFS - OpenZFS Documentation, accessed December 12, 2025.

11. PostgreSQL 18 enables data-checksums by default - credativ®, accessed December 12, 2025.

12. Documentation: 18: E.2. Release 18 - PostgreSQL, accessed December 12, 2025.

13. Documentation: 18: 28.2. Data Checksums - PostgreSQL, accessed December 12, 2025.

14. Documentation: 18: 28.1. Reliability - PostgreSQL, accessed December 12, 2025.

15. fix CRC algorithm in WAL reliability docs - PostgreSQL, accessed December 12, 2025.

16. Write Concern in MongoDB: A Comprehensive Guide | by Anita Liberatore | Medium, accessed December 12, 2025.

17. Glossary - Database Manual - MongoDB Docs, accessed December 12, 2025.

18. Journaling - Database Manual - MongoDB Docs, accessed December 12, 2025.

19. Write Concern - Database Manual - MongoDB Docs, accessed December 12, 2025.

20. MongoDB - WiredTiger durability option transaction_sync - Stack Overflow, accessed December 12, 2025.