How does Ceph store data? - Behnam Ebrahimpour

To perform storage operations with Ceph Storage Cluster, all Ceph clients regardless of type must:

Connect to the cluster.
Create an I/O contest to a pool.
Set an object name.
Execute a read or write operation for the object.

The Ceph Client retrieves the latest cluster map and the CRUSH algorithm calculates how to map the object to a placement-group, and then calculates how to assign the placement group to a Ceph OSD Daemon dynamically. Client types such as Ceph Block Device and the Ceph Object Gateway perform the last two steps transparently.

To find the object location, all you need is the object name and the pool name. For example:

ceph osd map <poolname> <object-name>

CRUSH algorithm

The CRUSH algorithm determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With an algorithmically determined method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability.

CRUSH requires a map of your cluster, and uses the CRUSH Map to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster.

CRUSH maps

CRUSH maps contain a list of OSDs, a list of ‘buckets’ for aggregating the devices into physical locations, and a list of rules that tell CRUSH how it should replicate data in a Ceph cluster’s pools. By reflecting the underlying physical organization of the installation, CRUSH can model—and thereby address—potential sources of correlated device failures. Typical sources include physical proximity, a shared power source, and a shared network. By encoding this information into the cluster map, CRUSH placement policies can separate object replicas across different failure domains while still maintaining the desired distribution. For example, to address the possibility of concurrent failures, it may be desirable to ensure that data replicas are on devices using different shelves, racks, power supplies, controllers, and/or physical locations.

After you deploy a Ceph cluster, a default CRUSH Map is generated. It is fine for your Ceph sandbox environment. However, when you deploy a large-scale data cluster, you should give significant consideration to developing a custom CRUSH Map, because it will help you manage your Ceph cluster, improve performance and ensure data safety.

For example, if an OSD goes down, a CRUSH Map can help you locate the physical data center, room, row and rack of the host with the failed OSD in the event you need to use on-site support or replace hardware.

Similarly, CRUSH may help you identify faults more quickly. For example, if all OSDs in a particular rack go down simultaneously, the fault may lie with a network switch or power to the rack or the network switch rather than the OSDs themselves.

A custom CRUSH Map can also help you identify the physical locations where Ceph stores redundant copies of data when the placement group(s) associated with a failed host are in a degraded state.

There are three main sections to a CRUSH Map.

OSD devices consist of any object storage device corresponding to a ceph-osd daemon.
Buckets consist of a hierarchical aggregation of storage locations (for example rows, racks, hosts, etc.) and their assigned weights.
Rule sets consist of the manner of selecting buckets.

OSD devices

To map placement groups to OSDs, a CRUSH Map requires a list of OSD devices (the name of the OSD daemon). The list of devices appears first in the CRUSH Map.

   #devices

   device NUM osd.OSD_NAME class CLASS_NAME

For example:

#devices device 0 osd.0 class hdd device 1 osd.1 class ssd device 2 osd.2 class nvme device 3 osd.3 class ssd

The flexibility of the CRUSH Map in controlling data placement is one of the Ceph’s strengths. It is also one of the most difficult parts of the cluster to manage. Device classes automate the most common changes to CRUSH Maps that the administrator needed to do manually previously.

As a general rule, an OSD daemon maps to a single disk.

Ceph clusters are frequently built with multiple types of storage devices: HDD, SSD, NVMe, or even mixed classes of the above. We call these different types of storage devices device classes to avoid confusion between the type property of CRUSH buckets (for example, host, rack, row). Ceph OSDs backed by SSDs are much faster than those backed by spinning disks, making them better suited for certain workloads. Ceph makes it easy to create RADOS pools for different data sets or workloads and to assign different CRUSH rules to control data placement for those pools.

However, setting up the CRUSH rules to place data only on a certain class of device is tedious. Rules work in terms of the CRUSH hierarchy, but if the devices are mixed into the same hosts or racks (as in the sample hierarchy above), they will (by default) be mixed together and appear in the same sub-trees of the hierarchy. Manually separating them out into separate trees involved creating multiple versions of each intermediate node for each device class in previous versions of SUSE Enterprise Storage.

An elegant solution that Ceph offers is to add a property called device class to each OSD. By default, OSDs will automatically set their device classes to either ‘hdd’, ‘ssd’, or ‘nvme’ based on the hardware properties exposed by the Linux kernel. These device classes are reported in a new column of the ceph osd tree command output:

If the automatic device class detection fails, for example because the device driver is not properly exposing information about the device via /sys/block, you can adjust device classes from the command line:

cephuser@adm > ceph osd crush rm-device-class osd.2 osd.3 done removing class of osd(s): 2,3 cephuser@adm > ceph osd crush set-device-class ssd osd.2 osd.3 set osd(s) 2,3 to class 'ssd'

CRUSH rules can restrict placement to a specific device class. For example, you can create a ‘fast’ replicated pool that distributes data only over SSD disks by running the following command:

cephuser@adm > ceph osd crush rule create-replicated RULE_NAME ROOT FAILURE_DOMAIN_TYPE DEVICE_CLASS

For example:

cephuser@adm > ceph osd crush rule create-replicated fast default host ssd

Create a pool named ‘fast_pool’ and assign it to the ‘fast’ rule:

cephuser@adm > ceph osd pool create fast_pool 128 128 replicated fast

The process for creating erasure code rules is slightly different. First, you create an erasure code profile that includes a property for your desired device class. Then, use that profile when creating the erasure coded pool:

cephuser@adm > ceph osd erasure-code-profile set myprofile \ k=4 m=2 crush-device-class=ssd crush-failure-domain=host cephuser@adm > ceph osd pool create mypool 64 erasure myprofile

In case you need to manually edit the CRUSH Map to customize your rule, the syntax has been extended to allow the device class to be specified. For example, the CRUSH rule generated by the above commands looks as follows:

rule ecpool {
  id 2
  type erasure
  min_size 3
  max_size 6
  step set_chooseleaf_tries 5
  step set_choose_tries 100
  step take default class ssd
  step chooseleaf indep 0 type host
  step emit
}

The important difference here is that the ‘take’ command includes the additional ‘class CLASS_NAME‘ suffix.

To list device classes used in a CRUSH Map, run:

cephuser@adm > ceph osd crush class ls [ "hdd", "ssd" ]

To list existing CRUSH rules, run:

cephuser@adm > ceph osd crush rule ls replicated_rule fast

To view details of the CRUSH rule named ‘fast’, run:

cephuser@adm > ceph osd crush rule dump fast { “rule_id”: 1, “rule_name”: “fast”, “ruleset”: 1, “type”: 1, “min_size”: 1, “max_size”: 10, “steps”: [ { “op”: “take”, “item”: -21, “item_name”: “default~ssd” }, { “op”: “chooseleaf_firstn”, “num”: 0, “type”: “host” }, { “op”: “emit” } ] }

To list OSDs that belong to an ‘ssd’ class, run:

cephuser@adm > ceph osd crush class ls-osd ssd 0 1

Buckets

CRUSH maps contain a list of OSDs, which can be organized into a tree-structured arrangement of buckets for aggregating the devices into physical locations. Individual OSDs comprise the leaves on the tree.

Ceph’s deployment tools generate a CRUSH Map that contains a bucket for each host, and a root named ‘default’, which is useful for the default rbd pool. The remaining bucket types provide a means for storing information about the physical location of nodes/buckets, which makes cluster administration much easier when OSDs, hosts, or network hardware malfunction and the administrator needs access to physical hardware.

A bucket has a type, a unique name (string), a unique ID expressed as a negative integer, a weight relative to the total capacity/capability of its item(s), the bucket algorithm ( straw2 by default), and the hash (0 by default, reflecting CRUSH Hash rjenkins1). A bucket may have one or more items. The items may consist of other buckets or OSDs. Items may have a weight that reflects the relative weight of the item.

[bucket-type] [bucket-name] {
  id [a unique negative numeric ID]
  weight [the relative capacity/capability of the item(s)]
  alg [the bucket type: uniform | list | tree | straw2 | straw ]
  hash [the hash type: 0 by default]
  item [item-name] weight [weight]
}

The following example illustrates how you can use buckets to aggregate a pool and physical locations like a data center, a room, a rack and a row.

host ceph-osd-server-1 {
        id -17
        alg straw2
        hash 0
        item osd.0 weight 0.546
        item osd.1 weight 0.546
}

row rack-1-row-1 {
        id -16
        alg straw2
        hash 0
        item ceph-osd-server-1 weight 2.00
}

rack rack-3 {
        id -15
        alg straw2
        hash 0
        item rack-3-row-1 weight 2.00
        item rack-3-row-2 weight 2.00
        item rack-3-row-3 weight 2.00
        item rack-3-row-4 weight 2.00
        item rack-3-row-5 weight 2.00
}

rack rack-2 {
        id -14
        alg straw2
        hash 0
        item rack-2-row-1 weight 2.00
        item rack-2-row-2 weight 2.00
        item rack-2-row-3 weight 2.00
        item rack-2-row-4 weight 2.00
        item rack-2-row-5 weight 2.00
}

rack rack-1 {
        id -13
        alg straw2
        hash 0
        item rack-1-row-1 weight 2.00
        item rack-1-row-2 weight 2.00
        item rack-1-row-3 weight 2.00
        item rack-1-row-4 weight 2.00
        item rack-1-row-5 weight 2.00
}

room server-room-1 {
        id -12
        alg straw2
        hash 0
        item rack-1 weight 10.00
        item rack-2 weight 10.00
        item rack-3 weight 10.00
}

datacenter dc-1 {
        id -11
        alg straw2
        hash 0
        item server-room-1 weight 30.00
        item server-room-2 weight 30.00
}

root data {
        id -10
        alg straw2
        hash 0
        item dc-1 weight 60.00
        item dc-2 weight 60.00
}

Rule sets

CRUSH maps support the notion of ‘CRUSH rules’, which are the rules that determine data placement for a pool. For large clusters, you will likely create many pools where each pool may have its own CRUSH ruleset and rules. The default CRUSH Map has a rule for the default root. If you want more roots and more rules, you need to create them later or they will be created automatically when new pools are created.

A rule takes the following form:

rule rulename {

        ruleset ruleset
        type type
        min_size min-size
        max_size max-size
        step step

}

ruleset

An integer. Classifies a rule as belonging to a set of rules. Activated by setting the ruleset in a pool. This option is required. Default is 0.

type

A string. Describes a rule for either a ‘replicated’ or ‘erasure’ coded pool. This option is required. Default is replicated.

min_size

An integer. If a pool group makes fewer replicas than this number, CRUSH will NOT select this rule. This option is required. Default is 2.

max_size

An integer. If a pool group makes more replicas than this number, CRUSH will NOT select this rule. This option is required. Default is 10.

step take bucket

Takes a bucket specified by a name, and begins iterating down the tree. This option is required. For an explanation about iterating through the tree.

step targetmodenum type bucket-type

target can either be choose or chooseleaf. When set to choose, a number of buckets is selected. chooseleaf directly selects the OSDs (leaf nodes) from the sub-tree of each bucket in the set of buckets.

mode can either be firstn or indep.

Selects the number of buckets of the given type. Where N is the number of options available, if num > 0 && < N, choose that many buckets; if num < 0, it means N – num; and, if num == 0, choose N buckets (all available). Follows step take or step choose.

step emit

Outputs the current value and empties the stack. Typically used at the end of a rule, but may also be used to form different trees in the same rule. Follows step choose.

CRUSH algorithm

CRUSH maps

OSD devices

Buckets

Rule sets

Leave a Reply Cancel reply