[笔记] CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data

发表于 2021-04-26 更新于 2022-07-26

原文：CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data

Ceph笔记：

Ceph

Dynamic Metadata Management

CRUSH: Data Placement

TL;DR

CRUSH是一种确定性的hash算法，用来计算数据（对象）的存储位置。

Ceph中CRUSH的价值在于MDS和client都通过CRUSH来计算对象位置，这样client可以直接与OSD（Object Storage Device）通信，MDS也不需要维护对象与OSD之间的映射了。

另一方面，相比其它hash算法，CRUSH有以下优势：

相比平坦的hash，CRUSH支持层级结构和权重，可以实现复杂的分布策略。
在有节点变动时，CRUSH可以最小化数据的迁移（根据分布策略），而简单的hash则会导致大量数据迁移。
对failed和overloaded的设备有特殊处理。

阅读全文 »

[笔记] Dynamic Metadata Management for Petabyte-scale File Systems

发表于 2021-04-23 更新于 2022-07-26

原文：Dynamic Metadata Management for Petabyte-scale File Systems

之前Ceph的笔记

TL;DR

这篇文章介绍了Ceph的Metadata Server（MDS）如何管理metadata，尤其是如何做目录树的动态负载均衡。

（不是很有趣……）

阅读全文 »

[笔记] Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications

发表于 2021-04-23 更新于 2022-07-26

原文：Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications

TL;DR

Chord解决的是P2P系统（去中心化）中如何高效确定数据位置的问题。

P2P系统中最常见的方案是将数据按一致性hash分散到不同节点上，这样显著减少了节点增减时产生的数据迁移。这套方案中最重要的是每个节点需要维护一个路由表，在接收到请求时，如果数据不在自己上面，需要根据路由表将请求转发给对应的节点。传统的一致性hash里路由表需要包含全部节点信息，产生了O(N)的维护代价，规模大了就会有问题。

Chord在一致性hash基础上，只需要每个节点维护O(logN)大小的路由表，同样能实现O(logN)的查找，即使路由信息过时了也能回退到O(n)查找。减少了路由表的大小，显著地降低了Chord的维护代价，能在节点不停变化时仍然保持不错的性能。

Dynamo似乎使用了类似Chord的方案。

阅读全文 »

Why Uber Engineering Switched from Postgres to MySQL

发表于 2021-04-22 更新于 2022-07-26

原文：Why Uber Engineering Switched from Postgres to MySQL

Uber之前一直用Postgres作为DBMS（最终版本是9.2），后来遇到了很多困难，最终迁移到了MySQL/InnoDB上。

阅读全文 »

[笔记] Deuteronomy: Transaction Support for Cloud Data

发表于 2021-04-22 更新于 2022-07-26

原文：Deuteronomy: Transaction Support for Cloud Data

与Percolator和Omid类似，Deuteronomy也是为多个数据节点提供分布式事务能力的。在Deuteronomy中，提供事务能力的中心节点称为transactional component（TC），数据节点称为data component（DC）。

与Percolator和Omid不同的是，Deuteronomy类似于ARIES，提供的是悲观锁，不需要MVCC支持。它的TC和DC更像是传统的单机RDBMS的一种功能拆分，相互功能依赖比较多，而不像Percolator和Omid一样依赖于分布式KV。

目测Deuteronomy性能不会太好（没有MVCC）。

阅读全文 »

[笔记] Taking Omid to the Clouds: Fast, Scalable Transactions for Real-Time Cloud Analytics

发表于 2021-04-21 更新于 2022-07-26

原文：Taking Omid to the Clouds: Fast, Scalable Transactions for Real-Time Cloud Analytics

Omid三部曲：

Omid: Lock-free transactional support for distributed data stores

Omid, Reloaded: Scalable and Highly-Available Transaction Processing

Taking Omid to the Clouds: Fast, Scalable Transactions for Real-Time Cloud Analytics

这是Omid的第三篇（2018年），主要改进是提出了两种降低延时的新协议，从而满足新的实时事务的需求（作为Phoenix的后端）。

两种新协议分别是：

Omid Low Latency（Omid LL），将Omid中写CT的工作由TM转交给client，从而分散了负载，降低了延时。
Omid Fast Path（Omid FP），处理单key事务时绕开TM，从而达到原生HBase操作的延时（代价是隔离级别从snapshot isolation（SI）降低到了generalized snapshot isolation（GSI）[1]）。

Omid再改进就要考虑分散冲突检测了。

Omid LL中TM有两个角色，一个是TSO，通常不会是瓶颈；另一个是全局冲突检测。我们知道Omid中冲突检测是per bucket进行的，每个bucket独立判断。假如bucket足够小，小到只有一个key，就相当于每个key独立判断一下冲不冲突。此时TM的必要性就不存在了，client可以直接与data server通信判断。

此时Omid就变成了Percolator的变种，即将commit meta放到了单独的commit table中，而不是与数据在一起。

阅读全文 »

ClickHouse基础库学习：StrongTypedef

发表于 2021-04-20 更新于 2022-07-26

位置：base/common/strong_typedef.h

template <class T, class Tag>
struct StrongTypedef
{
private:
    using Self = StrongTypedef;
    T t;

public:
    using UnderlyingType = T;
    template <class Enable = typename std::is_copy_constructible<T>::type>
    explicit StrongTypedef(const T & t_) : t(t_) {}
    template <class Enable = typename std::is_move_constructible<T>::type>
    explicit StrongTypedef(T && t_) : t(std::move(t_)) {}

    template <class Enable = typename std::is_default_constructible<T>::type>
    StrongTypedef(): t() {}

    StrongTypedef(const Self &) = default;
    StrongTypedef(Self &&) = default;

    Self & operator=(const Self &) = default;
    Self & operator=(Self &&) = default;

    template <class Enable = typename std::is_copy_assignable<T>::type>
    Self & operator=(const T & rhs) { t = rhs; return *this;}

    template <class Enable = typename std::is_move_assignable<T>::type>
    Self & operator=(T && rhs) { t = std::move(rhs); return *this;}

    operator const T & () const { return t; }
    operator T & () { return t; }

    bool operator==(const Self & rhs) const { return t == rhs.t; }
    bool operator<(const Self & rhs) const { return t < rhs.t; }

    T & toUnderType() { return t; }
    const T & toUnderType() const { return t; }
};

#define STRONG_TYPEDEF(T, D) \
    struct D ## Tag {}; \
    using D = StrongTypedef<T, D ## Tag>; \

StrongTypedef提供了T的一种别名，它支持以下操作：

默认构造、复制/移动×构造/赋值。
T到StrongTypedef：复制/移动×构造/赋值。
StrongTypedef到T以及T可以隐式转换的类型U：隐式转换，toUnderType。

关键是StrongTypedef禁止了哪些操作：

T相同但Tag不同的两个StrongTypedef，好处是可以避免语义混淆：

using Days = StrongTypedef<int32_t, Tag0>;
using Months = StrongTypedef<int32_t, Tag1>;
Days d(100);
Months d1 = d; // error!

（说实话我觉得StrongTypedef不应该支持到T的隐式转换，避免当T是某种整数类型时无意的narrowing）

[笔记] Omid, Reloaded: Scalable and Highly-Available Transaction Processing

发表于 2021-04-20 更新于 2022-07-26

原文：Omid, Reloaded: Scalable and Highly-Available Transaction Processing

Omid三部曲：

Omid: Lock-free transactional support for distributed data stores

Omid, Reloaded: Scalable and Highly-Available Transaction Processing

Taking Omid to the Clouds: Fast, Scalable Transactions for Real-Time Cloud Analytics

这是Omid的第二篇（2017年），相比2014年的初版，它在规模和可用性上做了增强。

主要改动有两个：

增加了meta table，将事务的commit信息下放到存储层，减轻了控制层的的负担。
Status Oracle（SO）进一步变为Transaction Manager（TM），负责timestamp分配（TSO）、冲突检测（SO）与写meta table。

相比其它分布式事务的实现，Omid控制层与数据层切分得非常干净，很适合云上部署。

相比Omid 14中事务meta完全放到TM的内存中，Omid 17独立出来的meta table能支持更大规模，在failover时恢复更快（Omid 14需要从头恢复WAL）。

Omid实现了snapshot isolation，相比Spanner的external consistency（strict serializable isolation）和CockroachDB的serializable snapshot isolation，Omid的优点是只读事务不会abort（但这只是设计的tradeoff吧？）。

（Omid很容易扩展为支持write snapshot isolation，类似于CockroachDB的SSI，参见[笔记] A Critique of Snapshot Isolation）

阅读全文 »

[笔记] Omid: Lock-free transactional support for distributed data stores

发表于 2021-04-17 更新于 2022-07-26

原文：Omid: Lock-free transactional support for distributed data stores

Omid三部曲：

Omid: Lock-free transactional support for distributed data stores

Omid, Reloaded: Scalable and Highly-Available Transaction Processing

Taking Omid to the Clouds: Fast, Scalable Transactions for Real-Time Cloud Analytics

Omid的目标是为支持MVCC的分布式KV增加乐观事务（2PC）功能，实现snapshot isolation。

它的特点：

乐观锁（它称为lock-free）。
有中心化的Status Oracle。
依赖于client进行SI需要的判断。

这是Omid的第一篇paper（后面还有2篇），从后续paper来看，这篇paper的实现在规模上还是存在问题，但它确实解决了Percolator的延迟高（尤其是有事务失败时）的问题。

阅读全文 »

[笔记] A Critique of Snapshot Isolation

发表于 2021-03-30 更新于 2022-10-17

原文：A Critique of Snapshot Isolation

这篇文章提出了一种保证了serializability的write-snapshot isolation（WSI）。相比snapshot isolation（SI），WSI是serializable的，且有着不弱于SI的并发性能。相比serializable snapshot isolation（SSI），它的检测成本更低，性能更高。

阅读全文 »