Use GridFS or do it yourself

date: 2017-11-27 22:19:09

tags:

category: MongoDB

Created by: Stephan Bösebeck



note: This post is from 2017; I just added some numbers to show the efficiency of this approach. At the time of this post, MongoDB 3.0 was not out yet.

You don't own it till you make it

This is the well-known DIY saying. The question is whether it also applies in this case: storing a huge amount of binary data/files.

Basics

Normally, you would not store binary data like this in a database, as filesystems are generally much more efficient for it. Things get complicated when you also want to be able to find the files again... because then you quickly end up building many different directory structures for the same files, and to avoid wasting storage space, you use links...

That alone is already a reason to store the data in a more structured way. So we store the metadata in a database and reference the filesystem via a path.

This is also the preferred approach as long as the number of files is not too large, because beyond that you quickly reach the limits of the filesystem. Anyone who has ever tried to find a file in a directory containing a million files knows what I'm talking about.

The filesystem itself can handle it, but the tools struggle with it. That's a bit inconvenient, but okay.

But at some point, it no longer makes sense: the inode density would have to be increased just to be able to create enough files. To keep the filesystem from becoming too large (which would slow it down), you might split it across multiple (virtual) disks... and then you run into the problem of scalability...

All in all, not so great.

Into Mongo, but how?

GridFS

Eureka, there is something that is a "standard". Well... that may be true, but unfortunately it's somehow... lousy.
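For context, storing a file via the standard GridFS API of the MongoDB Java driver looks roughly like this. This is only a sketch using a recent driver version; the database name, file name and chunk size are placeholders.

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.gridfs.GridFSBucket;
import com.mongodb.client.gridfs.GridFSBuckets;
import com.mongodb.client.gridfs.model.GridFSUploadOptions;
import org.bson.types.ObjectId;

import java.io.FileInputStream;
import java.io.InputStream;

public class GridFsUpload {
    public static void main(String[] args) throws Exception {
        MongoDatabase db = MongoClients.create("mongodb://localhost:27017")
                                       .getDatabase("files");
        GridFSBucket bucket = GridFSBuckets.create(db);

        //chunkSizeBytes defaults to 255 KB; it can be changed per upload
        GridFSUploadOptions opts = new GridFSUploadOptions().chunkSizeBytes(255 * 1024);
        try (InputStream in = new FileInputStream("/tmp/example.bin")) {
            ObjectId id = bucket.uploadFromStream("example.bin", in, opts);
            System.out.println("stored as " + id);
        }
    }
}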

Why not?

The waste here is considerable. I can give you a very specific example where simple image data was stored: several million images. Especially with MongoDB versions before 3.0, the block size of GridFS is 255 KB, even for the last block! This means that a 257 KB file has to occupy 2 blocks, but the last block contains only 2 KB of actual data... not very efficient.
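To make the arithmetic concrete, here is a tiny sketch that computes the number of chunks and the fill of the last chunk for the example above (the numbers are the ones from the text; the actual on-disk waste also depends on the storage engine's allocation):

public class ChunkMath {
    public static void main(String[] args) {
        final long chunkSize = 255L * 1024;   //GridFS default chunk size
        final long fileSize = 257L * 1024;    //example file from the text

        long chunks = (fileSize + chunkSize - 1) / chunkSize;          //ceiling division
        long lastChunkPayload = fileSize - (chunks - 1) * chunkSize;   //bytes in the last chunk

        System.out.println(chunks + " chunks, last chunk contains "
                + lastChunkPayload / 1024 + " KB of payload out of " + chunkSize / 1024 + " KB");
        //prints: 2 chunks, last chunk contains 2 KB of payload out of 255 KB
    }
}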

Allegedly, this is better from MongoDB 3.0 onwards, but based on our tests I cannot confirm this 100%. It seems that there is still significant waste, especially when reading and writing frequently and when the binary data changes. In fact, we could not find a significant difference in waste between versions 3.2 and 2.4 (and yes, these were separate databases running simultaneously).

In both cases, the waste was often between 80 and 100%! Of course, this depends strongly on the type of data being stored.

Why small block sizes

Smaller block sizes have their advantages, too: they make sense when you want to seek back and forth within the binary data, and they are useful when streaming, since less data has to be held in memory at once. For our purposes, however, this is not helpful.

Building GridFS yourself

What if we increase the block size significantly, given that we only want to store and retrieve the files as a whole?

In the end, we create two kinds of documents:

  1. The document containing the file metadata, typically the file name, permissions, or whatever other information is desired. In particular, it contains a list of IDs pointing to data blocks.
  2. The data blocks themselves. Each data block has only 3 fields: the binary data, an ID (which is referenced from the metadata document), and optionally a hash. This hash enables deduplication at the block level!
  3. We also store a hash of the whole file in the metadata document to detect identical files. Often, the same file shows up under different names/metadata; the hash lets us detect this when storing the file (similar to a hard link in a Unix filesystem). This is deduplication at the file level.

The maximum size of a document in MongoDB is 16 MB, which also limits how much binary data a single field can hold. That's why we set the block size to 15 MB, leaving headroom for the other fields and the BSON overhead.

Since the files we store are uncompressed, the binary data is additionally compressed (zipped) before being stored.
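The post does not show the compression step; a minimal sketch using the JDK's GZIP classes could look like this (the class name is made up, and the real implementation may use a different compression format):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class BlockCompressor {
    //compress a block's payload before putting it into the data field
    public static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }
}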

With all these features (deduplication at file and block level, compressed data, etc.), we have been able to reduce the required storage space drastically. With approximately 50 million files, which would occupy about 11 TB in total, we currently use only around 8 TB of storage space.

@Entity
@CreationTime
public class FileMetaData{
    @Id
    private MorphiumId id;
    private String filename;
    private String mimeType;
    private long size;
    private String description;
    @CreationTime
    private long createdAt;
    @LastChange
    private long lastChange;
    private List<MorphiumId> dataBlocks;
    @Index
    private String fileChecksum;
    //The checksum needs to be set before storing.
    //To check for an existing file, query for it first, usually something like:
    //  morphium.createQueryFor(FileMetaData.class).f("file_checksum").eq(checksum).get()
    //If that returns null => new file. If not, just update the metadata.
    //We use SHA as the checksum algorithm.

    //add getters and setters here
}




@Entity
@Lifecycle
public class DataBlock{
    @Id
    private MorphiumId id;
    private byte[] data;

    @Index(options = {"sparse:true"})
    private String checkSumHex;

    @PreStore
    public void calcChecksum() {
        if (data == null || data.length == 0) {
            checkSumHex = "";
            return;
        }
        try {
            //the original used a dedicated SHA3 helper class; plain MessageDigest
            //does the same job (SHA3-256 is available from Java 9 on)
            java.security.MessageDigest sha = java.security.MessageDigest.getInstance("SHA3-256");
            sha.update(data, 0, data.length);
            checkSumHex = bytesToHex(sha.digest());
        } catch (Exception e) {
            throw new RuntimeException("Creating checksum failed: ", e);
        }
    }

    private String bytesToHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    //Before storing a new block, the caller should look for
    //existing blocks with the same checksum for deduplication

    //getters and setters
}
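To tie the two entities together, the write path could look roughly like this. This is only a sketch under the assumptions above: the setters, the Mongo field names ("file_checksum", "check_sum_hex") and the block-splitting loop are illustrative, not the original implementation.

import de.caluga.morphium.Morphium;

import java.security.MessageDigest;
import java.util.ArrayList;

public class FileStore {
    private static final int BLOCK_SIZE = 15 * 1024 * 1024; //15 MB blocks, see above

    private final Morphium morphium;

    public FileStore(Morphium morphium) {
        this.morphium = morphium;
    }

    public FileMetaData store(String filename, byte[] content) {
        String fileHash = sha3Hex(content);

        //file-level deduplication: same checksum => the file is already there
        FileMetaData existing = morphium.createQueryFor(FileMetaData.class)
                .f("file_checksum").eq(fileHash).get();
        if (existing != null) {
            return existing; //or create a new metadata document pointing to the same blocks
        }

        FileMetaData meta = new FileMetaData();
        meta.setFilename(filename);
        meta.setSize(content.length);
        meta.setFileChecksum(fileHash);
        meta.setDataBlocks(new ArrayList<>());

        for (int off = 0; off < content.length; off += BLOCK_SIZE) {
            int len = Math.min(BLOCK_SIZE, content.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(content, off, chunk, 0, len);
            //this is where the gzip step from above would go

            //block-level deduplication: reuse an existing block with the same checksum
            String blockHash = sha3Hex(chunk);
            DataBlock block = morphium.createQueryFor(DataBlock.class)
                    .f("check_sum_hex").eq(blockHash).get();
            if (block == null) {
                block = new DataBlock();
                block.setData(chunk);
                morphium.store(block); //@PreStore fills in the checksum
            }
            meta.getDataBlocks().add(block.getId());
        }
        morphium.store(meta);
        return meta;
    }

    private String sha3Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA3-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}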