Redis Persistence

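Redis is an in-memory database, so data held only in memory is lost when the process restarts. To survive restarts the data has to be persisted to disk, and Redis provides two persistence mechanisms for this: AOF and RDB.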

AOF Log

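AOF (Append Only File) records every write command that Redis executes by appending it to a file. On restart, Redis only has to re-execute the commands in the file, in order, to rebuild the dataset. (This feature is disabled by default; it is turned on with appendonly yes.)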

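The Redis main thread executes a command first and writes it to the AOF log afterwards. If it logged first, a command that then turned out to be invalid during execution would already be in the log, and the log would have to be corrected after the fact.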
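A minimal sketch of the execute-then-log order, written as a toy in C (the language of the Redis source). The types and names here (toy_db, exec_then_log, OP_BOGUS) are hypothetical and not Redis's; the only point is that a rejected command never reaches the log buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical toy model (not Redis code): one integer value plus an AOF buffer. */
struct toy_db {
    long value;
    char aof_buf[256];
    size_t aof_len;
};

enum toy_op { OP_SET, OP_INCR, OP_BOGUS };

/* Execute the command first; append it to the AOF buffer only if it succeeded.
 * Returns 0 on success, -1 if the command was rejected (nothing is logged). */
int exec_then_log(struct toy_db *db, enum toy_op op, long arg) {
    char line[64];
    int n;
    switch (op) {
    case OP_SET:
        db->value = arg;
        n = snprintf(line, sizeof line, "SET v %ld\n", arg);
        break;
    case OP_INCR:
        db->value += 1;
        n = snprintf(line, sizeof line, "INCR v\n");
        break;
    default:
        return -1;  /* execution failed: the log never sees this command */
    }
    /* toy buffer has a fixed size; a real server grows its buffer instead */
    assert(n > 0 && (size_t)n < sizeof db->aof_buf - db->aof_len);
    memcpy(db->aof_buf + db->aof_len, line, (size_t)n + 1);
    db->aof_len += (size_t)n;
    return 0;
}
```

With this order the log only ever contains commands that executed successfully, so replaying it needs no extra validation.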

AOF Write-Back Policies

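"Writing to the AOF log" as described above actually only copies the data into a kernel buffer; nothing has been written back to disk yet, so a crash at that point can still lose data. Here is the complete AOF write path: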

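After executing a write command, Redis appends the command to the aof_buf buffer.
Redis then calls write() to copy the contents of aof_buf into the AOF file. At this point the data is still not on disk: it has only been copied into the kernel's page cache, where it waits for the kernel to write it out.
Finally, the data is written back to disk. When exactly this happens is controlled by the appendfsync option, which offers three policies:
- always: write back after every write command. Redis calls redis_fsync, which on Linux is fdatasync; unlike fsync, fdatasync skips file metadata that is not needed to read the data back (see 一分钟了解 sync、fsync、fdatasync 系统调用, "sync, fsync and fdatasync explained in one minute").
- everysec: write back once per second. The fdatasync is handed to a background thread, so the main thread does not block on it.
- no: let the operating system decide. The modified pages become dirty pages, and the kernel writes them back when free memory falls below a threshold or when dirty pages have stayed in memory longer than a threshold. See 系统调用、用户缓冲区、内核缓冲区、底层IO知识笔记 ("Notes on system calls, user buffers, kernel buffers and low-level I/O") for details.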
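The write path and the three policies can be sketched as follows. This is a simplified illustration under stated assumptions, not Redis's code: aof_flush, the policy enum and the bg_fsync_due flag are hypothetical names, and for everysec the sketch only reports that an fsync is due instead of running a real background thread.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <unistd.h>

enum fsync_policy { POLICY_ALWAYS, POLICY_EVERYSEC, POLICY_NO };

/* Flush file data only: fdatasync() on Linux (the same idea as Redis's
 * redis_fsync), plain fsync() elsewhere. */
static int sync_file_data(int fd) {
#ifdef __linux__
    return fdatasync(fd);
#else
    return fsync(fd);
#endif
}

/* Step 2: write() the whole buffer into the kernel page cache.
 * Step 3: apply the write-back policy. Returns 0 on success, -1 on error.
 * *bg_fsync_due is set to 1 when an everysec fsync should be handed to a
 * background thread (not done here). */
int aof_flush(int fd, const char *buf, size_t len, enum fsync_policy policy,
              long long now_ms, long long *last_fsync_ms, int *bg_fsync_due) {
    size_t off = 0;
    *bg_fsync_due = 0;
    while (off < len) {
        ssize_t n = write(fd, buf + off, len - off);
        if (n == -1) {
            if (errno == EINTR) continue;  /* interrupted: retry */
            return -1;
        }
        off += (size_t)n;
    }
    switch (policy) {
    case POLICY_ALWAYS:    /* synchronous: data is on disk before we return */
        if (sync_file_data(fd) == -1) return -1;
        *last_fsync_ms = now_ms;
        break;
    case POLICY_EVERYSEC:  /* at most one fsync per second, off the main thread */
        if (now_ms - *last_fsync_ms >= 1000) {
            *bg_fsync_due = 1;
            *last_fsync_ms = now_ms;
        }
        break;
    case POLICY_NO:        /* leave dirty-page write-back to the kernel */
        break;
    }
    return 0;
}
```

With always, a failed sync makes the call return -1; Redis itself goes further and exits in that case rather than carrying on.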
Both everysec and no can obviously still lose data. But can always guarantee that no data is ever lost?

Xiaolin Coding (小林coding) only says that always keeps data loss to a minimum, without explaining why. Many articles online argue as follows:

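In flushAppendOnlyFile there is a check if (server.aof_fsync == AOF_FSYNC_ALWAYS); when it holds, fdatasync() is used to write to disk. Roughly: a write command is first appended to the AOF buffer, and the buffer is written to disk only after the next iteration of the event loop begins. Judging from the order of the calls in the while loop, this is indeed the case. So what gets written to disk now was produced by the previous event-loop iteration, which means that even with always, one iteration's worth of data can be lost. Can Redis guarantee 100% that no data is lost?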

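This argument looks wrong: why would the write have to wait for the next iteration? Reading flushAppendOnlyFile shows that it does not. Its comment states explicitly that "we accumulate all the AOF writes in a memory buffer and write it on disk using this function just before entering the event loop again". The function first writes the contents of aof_buf to the file and then calls fdatasync(); if the fsync fails, Redis exits. Because all of this happens before Redis replies to the clients ("we are required to write the AOF before replying to the client"), every command a client has seen succeed is already on disk, and a crash can only lose commands that were never acknowledged. So, setting aside extreme cases such as file corruption or hardware failure, I think the always policy can be considered not to lose data. The relevant excerpt from the Redis source (elided parts marked ...):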

/* Write the append only file buffer on disk.
 *
 * Since we are required to write the AOF before replying to the client,
 * and the only way the client socket can get a write is entering when
 * the event loop, we accumulate all the AOF writes in a memory
 * buffer and write it on disk using this function just before entering
 * the event loop again.
 *
 * About the 'force' argument:
 *
 * When the fsync policy is set to 'everysec' we may delay the flush if there
 * is still an fsync() going on in the background thread, since for instance
 * on Linux write(2) will be blocked by the background fsync anyway.
 * When this happens we remember that there is some aof buffer to be
 * flushed ASAP, and will try to do that in the serverCron() function.
 *
 * However if force is set to 1 we'll write regardless of the background
 * fsync. */
#define AOF_WRITE_LOG_ERROR_RATE 30 /* Seconds between errors logging. */
void flushAppendOnlyFile(int force) {
	...
    
    nwritten = aofWrite(server.aof_fd,server.aof_buf,sdslen(server.aof_buf));
    
    ...
    /* Perform the fsync if needed. */
    if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
        /* redis_fsync is defined as fdatasync() for Linux in order to avoid
         * flushing metadata. */
        latencyStartMonitor(latency);
        /* Let's try to get this data on the disk. To guarantee data safe when
         * the AOF fsync policy is 'always', we should exit if failed to fsync
         * AOF (see comment next to the exit(1) after write error above). */
        if (redis_fsync(server.aof_fd) == -1) {
            serverLog(LL_WARNING,"Can't persist AOF for fsync error when the "
              "AOF fsync policy is 'always': %s. Exiting...", strerror(errno));
            exit(1);
        }
        latencyEndMonitor(latency);
        latencyAddSampleIfNeeded("aof-fsync-always",latency);
        server.aof_last_incr_fsync_offset = server.aof_last_incr_size;
        server.aof_last_fsync = server.mstime;
        atomicSet(server.fsynced_reploff_pending, server.master_repl_offset);
    } else if (server.aof_fsync == AOF_FSYNC_EVERYSEC &&
               server.mstime - server.aof_last_fsync >= 1000) {
        if (!sync_in_progress) {
            aof_background_fsync(server.aof_fd);
            server.aof_last_incr_fsync_offset = server.aof_last_incr_size;
        }
        server.aof_last_fsync = server.mstime;
    }
}
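To make the ordering argument concrete, here is a toy model of one event-loop iteration under appendfsync always. It is a hypothetical simplification: loop_state, handle_write_cmd and before_sleep are illustrative names, with before_sleep standing for "just before entering the event loop again" from the comment above. Commands are buffered during the iteration, and the buffer is made durable before any reply leaves the server.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, heavily simplified model of one event-loop iteration. */
struct loop_state {
    char aof_buf[128];   /* like aof_buf: commands executed in this iteration */
    size_t aof_len;
    int buffered_cmds;   /* commands sitting in aof_buf */
    int durable_cmds;    /* commands written and fsynced to the AOF */
    int pending_replies; /* replies produced but not yet sent */
    int sent_replies;    /* replies clients have actually received */
};

/* During the iteration: execute, append to the buffer, queue the reply. */
void handle_write_cmd(struct loop_state *s, const char *cmd) {
    size_t n = strlen(cmd);
    assert(s->aof_len + n < sizeof s->aof_buf);
    memcpy(s->aof_buf + s->aof_len, cmd, n + 1);
    s->aof_len += n;
    s->buffered_cmds++;
    s->pending_replies++;
}

/* Just before the loop waits for events again (appendfsync always):
 * flush and fsync first, and only then send the queued replies. */
void before_sleep(struct loop_state *s) {
    /* a real server would write() aof_buf to the AOF fd and fdatasync() it here */
    s->durable_cmds += s->buffered_cmds;
    s->buffered_cmds = 0;
    s->aof_len = 0;
    s->aof_buf[0] = '\0';
    s->sent_replies += s->pending_replies;
    s->pending_replies = 0;
}
```

If the process crashes before before_sleep runs, the commands lost are exactly those whose replies were never sent, so no client was ever told they succeeded.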

Comparison:

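Policy	Write-back timing	Performance	Durability
always	Synchronous, after every write	⭐	⭐⭐⭐
everysec (default)	Once per second (background thread)	⭐⭐	⭐⭐
no	Left to the OS (free memory or dirty-page age crosses a threshold)	⭐⭐⭐	⭐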

AOF Rewrite

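As Redis keeps accepting commands, the AOF file keeps growing, which hurts performance. To keep it from growing without bound, Redis provides AOF rewriting: once the file exceeds a size threshold (configured with auto-aof-rewrite-percentage and auto-aof-rewrite-min-size), a rewrite is triggered to compact it. The idea is simple: only the latest state of each key needs to be kept. The rewrite does not parse the old file; it generates, from the current in-memory dataset, the minimal set of commands that recreates it, so a key that was SET many times ends up as a single SET. To avoid damaging the original file, the new content is written to a separate temporary file, which replaces the old one once the rewrite is complete.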
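A toy sketch of the rewrite idea, assuming a tiny fixed-size keyspace of integer values; struct kv, struct cmd, apply_cmd and rewrite_from_dataset are hypothetical names. As the paragraph above describes, the rewrite step reads only the current dataset and never looks at the old log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_KEYS 8

/* Hypothetical toy keyspace: short string keys with integer values. */
struct kv { char key[16]; long val; int live; };
/* Hypothetical command record: 'S' = SET key val, 'I' = INCR key, 'D' = DEL key. */
struct cmd { char op; const char *key; long val; };

/* Normal command execution: mutate the in-memory dataset. */
void apply_cmd(struct kv *db, const struct cmd *c) {
    struct kv *slot = NULL;
    for (int i = 0; i < MAX_KEYS; i++) {
        if (db[i].live && strcmp(db[i].key, c->key) == 0) { slot = &db[i]; break; }
        if (!db[i].live && slot == NULL) slot = &db[i];  /* first free slot */
    }
    assert(slot != NULL);  /* toy keyspace is fixed-size */
    if (c->op == 'D') { slot->live = 0; return; }
    if (!slot->live) {     /* new key */
        snprintf(slot->key, sizeof slot->key, "%s", c->key);
        slot->val = 0;
        slot->live = 1;
    }
    if (c->op == 'S') slot->val = c->val;
    else if (c->op == 'I') slot->val += 1;
}

/* Rewrite: read only the current dataset and emit one SET per live key.
 * Returns the number of commands written to out. */
int rewrite_from_dataset(const struct kv *db, char *out, size_t cap) {
    size_t used = 0;
    int count = 0;
    assert(cap > 0);
    out[0] = '\0';
    for (int i = 0; i < MAX_KEYS; i++) {
        if (!db[i].live) continue;
        int w = snprintf(out + used, cap - used, "SET %s %ld\n", db[i].key, db[i].val);
        assert(w > 0 && (size_t)w < cap - used);
        used += (size_t)w;
        count++;
    }
    return count;
}
```

Applying SET a 1, INCR a, INCR a, SET b 5, DEL b, SET c 7 and then rewriting yields just two commands, SET a 3 and SET c 7: the intermediate INCRs and the deleted key b leave no trace in the final state.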

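An AOF rewrite, however, is time-consuming. Done in the main process, it would block command execution for a long time, so Redis rewrites in the background: the main process keeps handling commands while a child process performs the rewrite.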

Why use a child process rather than a thread?

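Threads share memory, so modifying shared data would require locks to keep it consistent, and locking would hurt performance. With a child process, the parent and child share memory right after the fork, but only read-only. As soon as either of them modifies a shared page, copy-on-write kicks in and each process gets its own copy of that page, so no locks are needed to keep the data safe.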

How copy-on-write works

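The main process calls fork to create the bgrewriteaof child. The operating system copies the main process's page table for the child, so the two processes have separate virtual address spaces that map to the same physical memory, and the page table entries are marked read-only. When the parent (the Redis main process) writes to such a page, the CPU raises a write-protection fault; the OS's fault handler copies the physical page behind that entry and marks the entry read-write. From then on the parent's changes to that page no longer affect the child, so the child always sees the in-memory data exactly as it was at the moment of the fork.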

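In effect, the cost of the rewrite is spread across many requests: the request that triggers the rewrite pays for the fork, and later requests that hit a write-protection fault pay for copying a physical page. Compared with a blocking rewrite, this approach clearly performs better.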
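Copy-on-write itself is invisible to a user program: whether the kernel copies a page lazily or eagerly, the program observes the same isolation. The sketch below (POSIX C; fork_snapshot_demo is a hypothetical name) demonstrates that isolation, which COW makes cheap: the parent overwrites the data after the fork, yet the child still reads the values from the moment of the fork.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that reads a small "dataset" only after the parent has
 * overwritten it. Returns the sum the child saw (-1 on error) and stores the
 * parent's post-write sum in *parent_sum. */
int fork_snapshot_demo(int *parent_sum) {
    volatile int dataset[4] = {1, 2, 3, 4};  /* stands in for the keyspace */
    int go[2], res[2];
    if (pipe(go) == -1) return -1;
    if (pipe(res) == -1) { close(go[0]); close(go[1]); return -1; }

    pid_t pid = fork();
    if (pid == -1) {
        close(go[0]); close(go[1]); close(res[0]); close(res[1]);
        return -1;
    }
    if (pid == 0) {  /* child: plays the role of the rewrite/bgsave process */
        char c;
        close(go[1]); close(res[0]);
        if (read(go[0], &c, 1) != 1) _exit(1);  /* wait until the parent has written */
        int sum = dataset[0] + dataset[1] + dataset[2] + dataset[3];
        _exit(write(res[1], &sum, sizeof sum) == (ssize_t)sizeof sum ? 0 : 1);
    }
    close(go[0]); close(res[1]);
    for (int i = 0; i < 4; i++) dataset[i] = 100;  /* parent keeps serving writes */
    *parent_sum = dataset[0] + dataset[1] + dataset[2] + dataset[3];

    int child_sum = -1;
    if (write(go[1], "x", 1) != 1 ||
        read(res[0], &child_sum, sizeof child_sum) != (ssize_t)sizeof child_sum)
        child_sum = -1;
    close(go[1]); close(res[0]);
    waitpid(pid, NULL, 0);
    return child_sum;
}
```

Calling it returns 10 (the child's view, 1+2+3+4) while the parent's own sum is 400; no locks are involved.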

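At this point a question naturally arises: the bgrewriteaof child only rewrites the data as it was when the rewrite started, while the main thread keeps processing new requests. How is consistency maintained? The answer is simple: Redis records every write command produced during the rewrite (in practice, into an AOF rewrite buffer). When the child finishes, it notifies the main process through a signal; the main process then appends the entire contents of the AOF rewrite buffer to the new AOF file and uses the new file to replace the existing one.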
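The pre-7.0 bookkeeping can be modeled with in-memory strings standing in for files. This is a hypothetical sketch (rewrite_ctx, feed_write and finish_rewrite are illustrative names); the final copy stands in for the rename() used to swap the files atomically.

```c
#include <assert.h>
#include <string.h>

#define AOF_CAP 256

/* Hypothetical model of the pre-7.0 flow; strings play the role of files. */
struct rewrite_ctx {
    char old_aof[AOF_CAP];      /* existing AOF, still appended to during rewrite */
    char rewrite_buf[AOF_CAP];  /* commands received while the child rewrites */
    int rewriting;
};

static void append(char *dst, size_t cap, const char *s) {
    assert(strlen(dst) + strlen(s) < cap);
    strcat(dst, s);
}

/* Main process: every write goes to the old AOF, and also to the rewrite
 * buffer while a rewrite is in progress. */
void feed_write(struct rewrite_ctx *c, const char *cmd) {
    append(c->old_aof, AOF_CAP, cmd);
    if (c->rewriting) append(c->rewrite_buf, AOF_CAP, cmd);
}

/* Child has signaled completion: append the buffered commands to the child's
 * output, then replace the old AOF with the result. */
void finish_rewrite(struct rewrite_ctx *c, const char *child_output) {
    char new_aof[AOF_CAP] = "";
    append(new_aof, AOF_CAP, child_output);
    append(new_aof, AOF_CAP, c->rewrite_buf);
    memcpy(c->old_aof, new_aof, AOF_CAP);  /* stands in for an atomic rename() */
    c->rewrite_buf[0] = '\0';
    c->rewriting = 0;
}
```

Because the old AOF keeps receiving every write, a failed rewrite loses nothing; the rewrite buffer only matters for the success path.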

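⚠️ Note: the rewrite flow above applies to Redis versions before 7.0. The documentation for newer versions (>= 7.0) describes how the rewrite differs between the old and new versions: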

Redis >= 7.0

Redis forks, so now we have a child and a parent process.

The child starts writing the new base AOF in a temporary file.

The parent opens a new increments AOF file to continue writing updates. If the rewriting fails, the old base and increment files (if there are any) plus this newly opened increment file represent the complete updated dataset, so we are safe.

When the child is done rewriting the base file, the parent gets a signal, and uses the newly opened increment file and child generated base file to build a temp manifest, and persist it.

Profit! Now Redis does an atomic exchange of the manifest files so that the result of this AOF rewrite takes effect. Redis also cleans up the old base file and any unused increment files.

Redis < 7.0

Redis forks, so now we have a child and a parent process.

The child starts writing the new AOF in a temporary file.

The parent accumulates all the new changes in an in-memory buffer (but at the same time it writes the new changes in the old append-only file, so if the rewriting fails, we are safe).

When the child is done rewriting the file, the parent gets a signal, and appends the in-memory buffer at the end of the file generated by the child.

Now Redis atomically renames the new file into the old one, and starts appending new data into the new file.

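In short: in 7.0 the main thread no longer writes every change made during the rewrite to both the AOF buffer and the AOF rewrite buffer, as it did before. Instead, those changes go into a newly opened incremental AOF file, so the rewritten base file plus the incremental file(s) together contain the complete dataset, and a manifest file records which files make up the current AOF.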
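The key step in the 7.0 flow is publishing the new manifest atomically. A minimal sketch, assuming POSIX rename() semantics (the target is replaced atomically): write the manifest to a temporary file, then rename it over the old one. The function name publish_manifest is hypothetical, and the line format only loosely follows Redis 7's manifest, so treat it as illustrative.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes a manifest listing one base file and one increment file to
 * "<path>.tmp", then rename()s it over <path>, so a reader sees either the
 * old manifest or the new one, never a half-written file.
 * Returns 0 on success, -1 on error. */
int publish_manifest(const char *path, const char *base, int base_seq,
                     const char *incr, int incr_seq) {
    char tmp[256];
    if (strlen(path) + 5 > sizeof tmp) return -1;  /* room for ".tmp" + NUL */
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    FILE *f = fopen(tmp, "w");
    if (f == NULL) return -1;
    int ok = fprintf(f, "file %s seq %d type b\n", base, base_seq) > 0 &&
             fprintf(f, "file %s seq %d type i\n", incr, incr_seq) > 0;
    /* a real implementation would also fsync the file before renaming */
    if (fclose(f) != 0 || !ok) { remove(tmp); return -1; }
    return rename(tmp, path) == 0 ? 0 : -1;
}
```

If the process dies before the rename, the old manifest (and therefore the old set of files) is still the one in effect.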

RDB Snapshots

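RDB (Redis Database File) records the contents of Redis's memory at a single point in time, so recovery only requires reading the RDB file back into memory (the RDB file is loaded automatically when the Redis server starts; there is no dedicated command to load it), which is faster than replaying an AOF. However, producing an RDB file means writing the entire dataset to disk, which takes a long time, so snapshots cannot be taken very often. As a result, more data is lost when the server fails.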

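Redis provides two commands to create an RDB file:

- save: creates the snapshot in the main thread, blocking it;
- bgsave: creates a child process to generate the RDB file, so the main thread is not blocked.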

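Redis can also trigger bgsave automatically through configuration. For example, save 300 10 triggers a bgsave once at least 10 changes have been made to the database and at least 300 seconds have passed since the last save.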
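The save-point rule can be sketched as a simple check that runs periodically. This is a simplified, hypothetical version (struct save_param and save_point_reached are illustrative names); Redis's real check handles additional details that are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* One "save <seconds> <changes>" line from the configuration. */
struct save_param { time_t seconds; long long changes; };

/* Returns 1 if any save point is satisfied: at least `changes` writes since
 * the last save AND at least `seconds` elapsed since the last save. */
int save_point_reached(const struct save_param *params, size_t n,
                       long long dirty, time_t now, time_t last_save) {
    for (size_t i = 0; i < n; i++) {
        if (dirty >= params[i].changes && now - last_save >= params[i].seconds)
            return 1;
    }
    return 0;
}
```

With several save lines configured, any one of them being satisfied is enough to start a bgsave, which is why short intervals are usually paired with large change counts.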

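bgsave likewise relies on copy-on-write in the child process, so the main process can keep handling commands while the RDB file is being generated. The flip side is that the RDB file the child creates can only be a snapshot of the moment the child was forked; later changes are not included in it. In the extreme case where every page is modified while the snapshot is being written, every page gets copied and memory usage doubles.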
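A hypothetical bgsave sketch in POSIX C: the parent forks and returns immediately, and the child writes its fork-time copy of the data to a temporary file and then rename()s it into place, so the target file is never seen half-written. The text format is a placeholder, not the RDB format, and bgsave and dump_snapshot are illustrative names.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child side: write the (fork-time copy of the) dataset to tmp, then
 * rename() it over dst. Returns 0 on success, -1 on error. */
static int dump_snapshot(const int *data, int n, const char *tmp, const char *dst) {
    FILE *f = fopen(tmp, "w");
    if (f == NULL) return -1;
    for (int i = 0; i < n; i++) {
        if (fprintf(f, "%d\n", data[i]) < 0) { fclose(f); remove(tmp); return -1; }
    }
    /* a real implementation would also fflush + fsync before renaming */
    if (fclose(f) != 0) { remove(tmp); return -1; }
    return rename(tmp, dst) == 0 ? 0 : -1;
}

/* Hypothetical bgsave: fork and return the child's pid (or -1) at once, so
 * the caller can keep serving commands while the child writes. */
pid_t bgsave(const int *data, int n, const char *tmp, const char *dst) {
    pid_t pid = fork();
    if (pid == 0)
        _exit(dump_snapshot(data, n, tmp, dst) == 0 ? 0 : 1);
    return pid;
}
```

Writes the parent makes after the fork do not reach the file, and the parent must eventually reap the child (for example with waitpid()) and check its exit status to learn whether the save succeeded.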

Comparison

Based on the Redis persistence documentation, AOF and RDB compare as follows:

AOF advantages

Using AOF Redis is much more durable: you can have different fsync policies: no fsync at all, fsync every second, fsync at every query. With the default policy of fsync every second, write performance is still great. fsync is performed using a background thread and the main thread will try hard to perform writes when no fsync is in progress, so you can only lose one second worth of writes.

The AOF log is an append-only log, so there are no seeks, nor corruption problems if there is a power outage. Even if the log ends with a half-written command for some reason (disk full or other reasons) the redis-check-aof tool is able to fix it easily.

Redis is able to automatically rewrite the AOF in background when it gets too big. The rewrite is completely safe as while Redis continues appending to the old file, a completely new one is produced with the minimal set of operations needed to create the current data set, and once this second file is ready Redis switches the two and starts appending to the new one.

AOF contains a log of all the operations one after the other in an easy to understand and parse format. You can even easily export an AOF file. For instance even if you’ve accidentally flushed everything using the FLUSHALL command, as long as no rewrite of the log was performed in the meantime, you can still save your data set just by stopping the server, removing the latest command, and restarting Redis again.

AOF disadvantages

AOF files are usually bigger than the equivalent RDB files for the same dataset.

AOF can be slower than RDB depending on the exact fsync policy. In general with fsync set to every second performance is still very high, and with fsync disabled it should be exactly as fast as RDB even under high load. Still RDB is able to provide more guarantees about the maximum latency even in the case of a huge write load.

RDB advantages

RDB is a very compact single-file point-in-time representation of your Redis data. RDB files are perfect for backups. For instance you may want to archive your RDB files every hour for the latest 24 hours, and to save an RDB snapshot every day for 30 days. This allows you to easily restore different versions of the data set in case of disasters.

RDB is very good for disaster recovery, being a single compact file that can be transferred to far data centers, or onto Amazon S3 (possibly encrypted).

RDB maximizes Redis performances since the only work the Redis parent process needs to do in order to persist is forking a child that will do all the rest. The parent process will never perform disk I/O or alike.

RDB allows faster restarts with big datasets compared to AOF.

On replicas, RDB supports partial resynchronizations after restarts and failovers.

RDB disadvantages

RDB is NOT good if you need to minimize the chance of data loss in case Redis stops working (for example after a power outage). You can configure different save points where an RDB is produced (for instance after at least five minutes and 100 writes against the data set, you can have multiple save points). However you’ll usually create an RDB snapshot every five minutes or more, so in case of Redis stopping working without a correct shutdown for any reason you should be prepared to lose the latest minutes of data.

RDB needs to fork() often in order to persist on disk using a child process. fork() can be time consuming if the dataset is big, and may result in Redis stopping serving clients for some milliseconds or even for one second if the dataset is very big and the CPU performance is not great. AOF also needs to fork() but less frequently and you can tune how often you want to rewrite your logs without any trade-off on durability.

To summarize, the advantages and disadvantages of AOF and RDB are:

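Persistence	Advantages	Disadvantages
AOF	Less data loss; append-only writes, so the file is less prone to corruption; automatic background rewrites keep the file compact; the log format is easy to understand and parse	Larger files; can be slower, depending on the fsync policy
RDB	Well suited to periodic backups and archiving; well suited to disaster recovery; better runtime performance; faster restarts	More data lost on failure; needs fork() more often, which is expensive with large datasets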

AOF + RDB

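RDB restores quickly while AOF loses less data. To combine the strengths of both, Redis 4.0 introduced hybrid persistence, which mixes the AOF log with RDB snapshots (controlled by the aof-use-rdb-preamble option).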

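With hybrid persistence, when the AOF is rewritten the child process first writes the in-memory data into the AOF file in RDB snapshot format, while the commands handled by the main thread in the meantime are recorded in the rewrite buffer and later appended to the AOF file. The resulting AOF file therefore starts with RDB data, followed by modification commands in AOF format. Recovery becomes faster, and thanks to the AOF part at the end, less data is lost on failure.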
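On load, a hybrid AOF can be recognized by its first bytes, because RDB files start with the ASCII magic REDIS followed by a version number. A minimal sketch of such a check; aof_has_rdb_preamble is a hypothetical name, and Redis's own loader is structured differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the file starts with an RDB preamble ("REDIS" magic),
 * 0 if it starts with anything else (e.g. plain AOF commands),
 * -1 if the file cannot be opened. */
int aof_has_rdb_preamble(const char *path) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) return -1;
    char sig[5];
    size_t got = fread(sig, 1, sizeof sig, f);
    fclose(f);
    return got == sizeof sig && memcmp(sig, "REDIS", 5) == 0;
}
```

A loader would load the RDB part first and then replay the AOF commands that follow it, which is where the faster recovery comes from.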

References

Redis persistence

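Xiaolin Coding (小林coding): Redis持久化篇 (Redis persistence)

一分钟了解 sync、fsync、fdatasync 系统调用 (sync, fsync and fdatasync explained in one minute)

系统调用、用户缓冲区、内核缓冲区、底层IO知识笔记 (Notes on system calls, user buffers, kernel buffers and low-level I/O)

【操作系统】写时复制 Copy-on-write ([Operating Systems] Copy-on-write)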