PyCon.DE 2017 Thomas Waldmann – The BorgBackup Project

So this is about BorgBackup. The project is about two and a half years old, but the software is quite a bit older, because the project was forked from another project called Attic. Attic was four or five years old when we forked it, so the code is quite old, but not very well known yet, because Attic did not get much advertising back then. Some guy from Greece who found it in 2013, I think, wrote about Attic.

"I found the holy grail of backups": so he was quite impressed by it, and for me it was the same. When I found Attic back then, it was kind of: oh, that's nice stuff, let's use this. I was using rsync before and I just searched for something more modern, and after looking at a lot of tools, I found Attic. It was somehow the best, and it was in Python. So I hacked on it, and we had to fork the project, because the original project was not moving quickly and there was also no cooperation. So BorgBackup is kind of the fast-moving Attic. Now, a bit about me.

I have been doing Python since about 2001. I think it started with my MoinMoin wiki project; that was basically the reason for me to learn Python. I have also been doing Linux since it came on floppies, and free and open source software. These are some of the projects I was heavily involved in. The first one you maybe know from the python.org wiki, which is running on MoinMoin; the original author is sitting here.

The second one is a dynamic DNS service software written in Django. bepasty is a pastebin that can be used for binary stuff also. The VPN gateway is not a software project; it's basically just some configuration. And BorgBackup: you will soon hear the details. This is my email address at my company, and yeah, I'm doing Python development, so if you are searching for a freelance remote developer, talk to me. So, about Borg: yeah, it's a backup tool.

There are dozens of backup tools, so there should be something special about it. The special thing is that a lot of tools are somehow a pain to use: they are either slow, or not always working, or you can't use them on all your platforms. So this feature set somehow reads rather stupidly simple, but you will see this is the special stuff about Borg. So, about simple: if each of your backups is a full backup, it's very simple to manage.

If you want to delete one, you can just delete it and it will not influence anything else. If you have the usual full, incremental and differential stuff, you have to be careful what you delete, because it might influence other backups. If you want to restore stuff, you can just do a FUSE mount and basically copy your files out of the backup archive, or search for your files. So you don't have to use a lot of command-line commands to find your stuff. Pruning is easy, too.

You can basically define a policy: I want to keep so-and-so many hourly backups, so-and-so many weekly backups and so-and-so many daily backups, and it will just apply that policy. It's a one-line command. The tooling is also very simple: you have just the Borg software, you have SSH for remote stuff, and you just write a shell script and that's it. It's not a complex thing. There is also quite some good documentation, man pages and so on, so you can look up stuff.
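To make the keep-policy idea concrete, here is a minimal sketch in Python. It is not Borg's actual pruning code: the function name `archives_to_keep` and its parameters are made up for illustration, and the rule (keep the newest archive of each of the last N hours, days or weeks) is a simplified assumption about how such a policy can work.

```python
from datetime import datetime


def archives_to_keep(archives, hourly=0, daily=0, weekly=0):
    """Return the archive timestamps that a simple keep-policy would retain.

    For each rule, walk from the newest archive to the oldest and keep the
    newest archive of each period until `count` periods are covered.
    """
    rules = [
        (hourly, lambda t: t.strftime("%Y-%m-%d %H")),
        (daily, lambda t: t.strftime("%Y-%m-%d")),
        (weekly, lambda t: tuple(t.isocalendar())[:2]),  # (ISO year, ISO week)
    ]
    keep = set()
    for count, period_of in rules:
        covered = []
        for t in sorted(archives, reverse=True):
            period = period_of(t)
            if period in covered:
                continue  # a newer archive already represents this period
            if len(covered) == count:
                break  # enough periods covered for this rule
            covered.append(period)
            keep.add(t)
    return keep


archives = [
    datetime(2017, 10, 25, 8), datetime(2017, 10, 25, 20),
    datetime(2017, 10, 24, 20), datetime(2017, 10, 23, 20),
    datetime(2017, 10, 15, 20),
]
for t in sorted(archives_to_keep(archives, daily=2, weekly=2)):
    print("keep", t)
```

Everything not returned by the function would be deleted. Because every archive is a full backup, deleting one does not affect the others.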

We offer a single-file binary. So if you just want to throw it on a machine and it should work, you can use that; you don't need to install header files and compile stuff and do a git checkout or such stuff. It's just a file that includes everything, even Python and all libraries. It's also simple if you can just use the same backup tool on all your machines, so we support Linux, BSD and Mac OS X. Even under Windows you can use it, under Cygwin or with the Linux subsystem for Windows 10.

There is no native Windows support yet because we have no Windows developer, but we could do it if somebody would care for it. Also, we support a lot of file system features: extended attributes, ACLs and so on. And even if you have a strange architecture, like big-endian, it will also work, so there's quite a lot of testing. About the point "efficient": it's extremely fast for unchanged files.

So it's always a full backup that's done, but it will not feel like a full backup, because it's so fast for unchanged files. It will basically feel like a differential backup, although the backup archive includes all the files, not only the changed ones. Chunk deduplication is important: it does not only deduplicate complete files if they are completely identical; it's enough if somehow a piece of the file is the same.

Also, it's not caring about file names; it's just looking at the content. We'll see more details about this later. We also have flexible compression, so we can have either very fast or very good compression. It's not flooding your file system cache: if you read gigabytes of files all the time while doing your backup, usually the file cache of the operating system takes a lot of memory, and maybe you basically flush out other stuff that should be in the cache, just by doing a backup. We avoid this by some special system calls. It's not only in Python.

We also have a bit of C and Cython for being more efficient with memory and also being faster, and we have hardware-accelerated cryptography just by using OpenSSL. So, about safety: there are a lot of checksums. There is some CRC32 on the low level, basically, and there's also a lot of cryptographic hashing and MACing going on. So if something is corrupt, we will notice it. We use transactions.

So if you start a backup and somehow the machine crashes or the connection goes down, there is no problem; it will just roll back the transaction. We are doing a lot of syncing to the file system and atomic file system operations, and the whole thing is like a key-value store, but log-like: we always append at the end, and at the beginning we don't change stuff, except if we delete it. So that's a rather safe thing.
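The combination of "always append" and "roll back on a crash" can be illustrated with a toy log. This is only a sketch under assumptions: the class `LogStore`, its in-memory list and its entry format are hypothetical and say nothing about Borg's real on-disk segment files.

```python
class LogStore:
    """Toy append-only key-value log with commit markers."""

    def __init__(self):
        self.log = []  # existing entries are never modified in place

    def put(self, key, value):
        self.log.append(("PUT", key, value))

    def delete(self, key):
        # a delete is just another entry recorded at the end
        self.log.append(("DELETE", key, None))

    def commit(self):
        self.log.append(("COMMIT", None, None))

    def rollback(self):
        """Drop the uncommitted tail, e.g. after a crash or a lost connection."""
        while self.log and self.log[-1][0] != "COMMIT":
            self.log.pop()

    def replay(self, entries=None):
        """Rebuild the key-value state as of the last commit in `entries`."""
        committed, working = {}, {}
        for op, key, value in (self.log if entries is None else entries):
            if op == "PUT":
                working[key] = value
            elif op == "DELETE":
                working.pop(key, None)
            else:  # COMMIT makes everything so far visible
                committed = dict(working)
        return committed


store = LogStore()
store.put("chunk-1", b"data")
store.commit()
store.put("chunk-2", b"more data")  # the "crash" happens before commit
store.rollback()
print(store.replay())
```

Because nothing at the beginning of the log is rewritten, replaying only an older prefix of the log gives back the older state.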

That helps if something goes wrong. There are also checkpoints while you do the backup. So if you have a long-running backup that runs for days, it will do a checkpoint now and then, and if something goes wrong, you will still have the stuff that you pushed to the repository. You will not have to completely start from the beginning. And you can use off-site repositories, so if your house burns down, that's also kind of a safety feature. It's also secure.

We are using authenticated encryption. So basically the threat model is: we don't trust the repository server. It could be at a hosting company or something, so if somebody looks inside your repository, they should not see anything, because everything is encrypted, the metadata and the data. Because it's authenticated encryption, we can also detect tampering. So if somebody is playing with the bits and just toggling some bits, we will notice it, because we check this.
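How authenticated encryption notices a toggled bit can be shown with a small encrypt-then-MAC sketch. Big caveat: the Python standard library has no AES, so the keystream below is a toy stand-in built from SHA-256 and must not be used for real encryption. The function names `seal` and `open_sealed` and the byte layout are assumptions for illustration, not Borg's format.

```python
import hashlib
import hmac
import os


def _keystream(enc_key, nonce, length):
    """Toy keystream (SHA-256 over key, nonce and a block counter).

    Stand-in for AES in counter mode, for illustration only.
    """
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = enc_key + nonce + counter.to_bytes(8, "big")
        out += hashlib.sha256(block).digest()
        counter += 1
    return bytes(out[:length])


def seal(enc_key, mac_key, plaintext):
    """Encrypt first, then compute the MAC over nonce and ciphertext."""
    nonce = os.urandom(16)
    stream = _keystream(enc_key, nonce, len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, stream))
    tag = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    return tag + nonce + ciphertext


def open_sealed(enc_key, mac_key, blob):
    """Check the MAC before touching the ciphertext, then decrypt."""
    tag, nonce, ciphertext = blob[:32], blob[32:48], blob[48:]
    expected = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed: data was modified")
    stream = _keystream(enc_key, nonce, len(ciphertext))
    return bytes(c ^ k for c, k in zip(ciphertext, stream))


enc_key, mac_key = os.urandom(32), os.urandom(32)
blob = seal(enc_key, mac_key, b"file metadata and data")
print(open_sealed(enc_key, mac_key, blob))
```

The point of checking the MAC first is that tampered data is rejected before any decryption happens.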

There is SSH as the transport for remote repositories, so basically you get all the security properties from SSH. You will have a secure connection, and also, if you use key login, you will have good authentication, and you don't have to care about an extra service's security issues concerning the network exposure. We also support a special append-only mode for repositories. It means that nothing that was already there will change.

We only append at the end. So even if some bad guy is owning your client machine and using Borg to delete stuff, he will not really delete it. The delete will just be recorded at the end, but nothing at the beginning will change, so you can just delete some of the files appended at the end, and everything will be as before. It's free and open source, so you can look in the code about the crypto or some details. Of course, we encrypt client-side, because the server is not trusted: metadata and data. It's authenticated encryption, in the encrypt-then-MAC mode.

This is the more secure mode. It's the counter mode of AES and HMAC-SHA256, or, since 1.1, we also have BLAKE2b; it's also a hash or a MAC, it's just a lot faster. We do counter management: it's important for this counter mode that you never repeat a counter value with the same key, and we have some sort of reservation going on. So even if the connection breaks or something bad happens, it will never repeat counter values.
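One way to picture such a reservation scheme: before using any counter values, persist a higher "reserved until" mark, so that after a crash the next session starts above everything that might have been used. The class below is a hypothetical sketch of that idea, with a plain dict standing in for persistent storage; it is not Borg's actual counter management code.

```python
class CounterReservation:
    """Hand out counter values that are never repeated, even after a crash."""

    def __init__(self, storage, block=1000):
        self.storage = storage  # dict standing in for persistent storage
        self.block = block      # how many values to reserve at once
        self.next = storage.get("reserved_until", 0)
        self.limit = self.next  # nothing reserved yet in this session

    def take(self, n=1):
        """Return the first of n fresh counter values."""
        if self.next + n > self.limit:
            self.limit = self.next + max(n, self.block)
            # persist the reservation BEFORE any of the values are used
            self.storage["reserved_until"] = self.limit
        first = self.next
        self.next += n
        return first


storage = {}
session = CounterReservation(storage, block=10)
print(session.take(), session.take(3))

# Simulated crash: a new session starts from the persisted mark and
# skips the reserved but unused values instead of risking a repeat.
after_crash = CounterReservation(storage, block=10)
print(after_crash.take())
```

The price is that some counter values are skipped after a crash, which is harmless; reusing one with the same key would not be.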

The key material is either on the client, or you can also store it in the repository, in the config of the repository. The key itself is encrypted, so it's no problem, and the encryption is done with PBKDF2 and AES. The repository mode is a bit nicer if you don't have a separate backup of your key. And we support both the old and the new version of OpenSSL. From OpenSSL, we only use libcrypto with the crypto primitives, so nothing complex, so that stuff should work quite okay.

The compression stuff is chunk-based, so it's only a piece of the file, not a full file, usually, except if the file is rather small. There are some algorithms, fast, medium fast and rather slow, and you will get more or less good compression. A nice thing with LZ4 is that it's often faster than if you use no compression at all. Of course LZ4 needs a little bit of time to do the compression, but you have to store less data to disk or to a remote server, so you save more time through the smaller data than LZ4 needs for the compression.

In 1.1, we also have this auto mode. It uses LZ4 for a prediction: basically, can I compress this data? If it looks good, then it uses the expensive compression to get even more out of it. And with borg recreate you can even change the compression mode, if you started with LZ4 and later you want something stronger. About the deduplication stuff: this is one of the main features of Borg. You have to imagine it not only as somehow duplicate files in your file system.

That is one dimension. You might have copies, so you have identical files on the same machine; of course, it will deduplicate these files. Also, if you have a virtual machine, maybe, and a lot of zeros are coming from disk or from the kernel when you read that file, it will deduplicate all those zeros also. This is basically the inner deduplication of the data set; it's just dupes inside your source data. But there is also a historical deduplication, if you're making full backups all the time.

Of course, most of your files will be the same and not change. Some files will change, but a lot of files just won't change, so it will also deduplicate them. And you can also have deduplication between machines: if you move files from one machine to another machine and you back up both machines to the same repository, it will just deduplicate that also, because it already has that data. The same goes if you have the same operating system on all your machines, or the same data on multiple machines. So these are basically the three dimensions of this deduplication.
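All three dimensions fall out of one simple idea: store each chunk under an ID derived from its content, and let archives hold only references. A minimal sketch, assuming a plain SHA-256 as the chunk ID and a hypothetical in-memory `ChunkStore` class (not Borg's repository format):

```python
import hashlib


class ChunkStore:
    """Content-addressed store: identical chunks are stored only once."""

    def __init__(self):
        self.chunks = {}    # chunk id -> chunk data
        self.archives = {}  # archive name -> list of chunk ids (references)

    def backup(self, name, chunks):
        ids = []
        for chunk in chunks:
            chunk_id = hashlib.sha256(chunk).digest()
            self.chunks.setdefault(chunk_id, chunk)  # store only if new
            ids.append(chunk_id)
        self.archives[name] = ids

    def restore(self, name):
        return b"".join(self.chunks[i] for i in self.archives[name])

    def stats(self):
        """Return (total chunk references, unique chunks stored)."""
        total = sum(len(ids) for ids in self.archives.values())
        return total, len(self.chunks)


store = ChunkStore()
store.backup("host1-monday", [b"os files", b"zeros", b"zeros"])
store.backup("host1-tuesday", [b"os files", b"zeros", b"new report"])
store.backup("host2-tuesday", [b"os files", b"other data"])
print(store.stats())
```

The repeated chunk inside one archive, the unchanged chunks on the next day and the shared chunk from a second machine all hit the same stored entry.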

How does it work? It reads the file and cuts it into variable-length chunks. It decides by the content when it should cut: there is a rolling hash that's computed, and if the hash says zero, it will cut. The nice thing about this: you could also cut at fixed positions, but then you have a problem. If your content is shifting a bit to the end or to the beginning, then every chunk would change.

But if you cut by content, then the cutting places will also shift. It's very nice for virtual machine disk files: usually not the whole file changes, but only some sectors in this virtual machine file, and it will only back up these new chunks; everything else that's still the same is already in the repository. You can also rename huge directories and it will still be the same content, so your repository is not growing.

It can look like this. This is actual data from one of my repositories, and the nice thing is, you see: if I had just used tar, I would need 22 terabytes of disk space; with just LZ4 it would still be 18 terabytes; and with the deduplication it's just half a terabyte, so most of the stuff was somehow the same. This is the historical deduplication. And you see total chunks: this is basically the number of references to chunk IDs, and unique chunks is way less, because a lot of chunk references are referencing the same chunk.

In the next version, we will introduce multi-threading. Currently, it's single-threaded, and we plan to use ZeroMQ, so it will use more of your CPU, not just one or half of a core. The GIL might be no big issue, because there is lots of I/O and lots of C code, so we can just release the GIL when doing that stuff. And we will also do some crypto improvements and maybe go to OpenSSL 1.1 as a requirement.

Some stuff about our project; I have to hurry up a bit. We are using Python, Cython and C for the usual reasons: C if it's extremely important to save resources, Cython is more or less glue code and interfacing stuff, and Python is the high-level logic. We use Travis CI; it checks all the pull requests and all the branches on multiple Python versions. We use pytest and tox. pytest is quite nice.

It's not that much boilerplate like the normal unittest stuff, so it's actually fun to write tests, and tox is on top of it, running pytest for every Python version. pyenv is also nice if you want to have a specific Python version, for example 3.4.0; you usually don't get it in your distribution. You can just use pyenv to install any version you like. And if you want to find problems, then always use the oldest point release, the dot-zero release, because it has the most bugs, and people might even have that version.

So if you want to find everything, just use the oldest version, and of course, if you are building something you distribute, you rather use the latest version, because that's the best version. We also use a lot of virtual machines, automated with Vagrant, so we can test on all these operating systems, and even a PowerPC virtual machine is possible using QEMU. If you do that, you have way fewer surprises.

You don't get "oh, it doesn't work on X", because you have tested it, so it usually works. PyInstaller is also a nice thing: it's making a one-file binary of all the stuff you need to run your software. So there is the Python code inside, the Python interpreter and all the shared libraries, except the glibc, which needs to come from the operating system. But it's quite nice: you can just throw it onto your system and run it, and you are done.

You don't need to install a lot of stuff. A word about secure releasing: if you think about it, a lot of people just download some binary somewhere and then run it as root. So what could go wrong? If the binary is tampered with, which could even happen during transmission, then you have a problem. So if you release software, especially if it's binary stuff, rather sign it with GPG; then people can really check whether it is the same stuff that you have produced.

If you just publish a hash like a SHA-256, it's better than nothing, but not much better, because the hash could also be tampered with, and if you check it, it will of course match: an attacker can also compute the hash of the fake binary. So you really have to sign it with a release key that only you have. setuptools_scm is a nice tool. Usually you have to bump your version number somehow, increase it from 1.0 to 1.1 or something; this tool automates this for you.

You can just use tags in git, and setuptools_scm will just compute a version from them. And it's not only the release stuff, it's also the stuff in between. So if you output this in your tool, you exactly know what a user is running, and it's no effort: just change a few lines in your project and you can use it. Sphinx, maybe a lot of you know already.

We have some special stuff: we build a lot automatically from argparse, so all our usage docs and the man pages are basically extracted from Python code, and we don't have to maintain them separately. If you have a README for your project, maybe think of it as an elevator speech. So don't write the installation steps into it; just try basically to sell the stuff, because people read it and then decide if they use your stuff or not. So don't put a lot of other stuff in it. Read the Docs is quite nice.

It hosts your documentation, and it even supports multiple versions of your software, so users can select whether they want to read the docs for 1.0 or 1.1. They have nice mobile support and offer PDF downloads, they also use Sphinx, and they pull your stuff automatically from GitHub, so you don't have to care about the hosting. asciinema somehow looks like a movie, but it's not really a movie. It's just some JavaScript interpreting a JSON file, and you can basically see somebody typing commands, and the output.

The nice thing is that it's rather small. You can just commit it to your repository, and you can even copy and paste stuff from it, because it's not a video; it's just text output by JavaScript. And if you made some typos in the recording, you can just edit the JSON file; you don't have to record your video again. We use GitHub; I think most of you already know this. Maybe worth mentioning is Bountysource.

So if you want to have a way for people to donate funds to you, and make fundraisers, or basically put a bounty on fixing some issues, you can use Bountysource for it. Basically every donation we get comes in over Bountysource, and I usually then just select some tickets and put some money on them. So basically the money gets distributed to the people who do the work and close the ticket. Yeah, milestones are quite nice for release planning, and you can reuse your documentation for the GitHub README.

Also, we have a community repository where people can just say: oh, I've written that nice script for Borg, and then we can just link to them. So yeah, the usual GitHub features. The releases stuff is also quite nice, because you can put all your binaries there, and the signatures also go there, and it's also based on a tag in the repository. Yeah, the usual communication channels: we have a mailing list, IRC and Twitter for support, discussion and release announcements. And you can help.

We have a few developers currently, but there could be more. Just try it, maybe, and if you like it, test it, find bugs, improve docs, whatever. If you use Windows and if you like Windows: we have no Windows developer yet, so that would be a good thing. Or if you use it, you can also donate funds; we are on Bountysource. This is the home page, and you can also grab me outside for questions. Do we have time for questions still? Okay, so I'm in time, great.

Okay, you can always use, for example, rsync or rclone to just copy the stuff elsewhere.

You just should not update both copies, because that causes crypto issues with the counters and stuff. But I think quite some users basically first do the local backup and then somehow sync it to the cloud, in case their house or their company burns down. So that's one mode of operating it. We don't have direct cloud support. We just support putting stuff into directories or talking client-server over SSH, but you need Borg at the other end, so it won't work with Amazon or something, except if you run a server there.

Yeah, well, there are multiple caches.

The question was about syncing caches between machines. Usually the local files cache is about the files you have on that machine, so it won't be useful if you sync it to another machine, because the files might be different. There are also some other caches, but maybe you shouldn't do that; that's somehow too deep into the internals. There is one problem, by the way: if you use multiple machines and you push your stuff to the same repository, then you basically bring the cache out of sync with the repository.

It's like cache coherency, the usual problem. Then Borg will be a bit slower, because it has to rebuild some caches first before it starts to back up, if you run the backups alternatingly. So that's a bit of an unsolved problem yet.

No, it's locked. It might be possible in the future, but not right now.

Yeah, the call to not spoil the file system cache by pumping gigabytes of data through it is fadvise.

You can basically say: okay, I've read that data, but I won't need it anytime soon, so it basically just drops that from the cache. There was some sort of a discussion whether it's good or bad if we do fadvise, but I think overall it is good to do it. There were some people with other opinions, but I think if you somehow flood the cache all the time, that's way worse than if you maybe fadvise away something that needs to get reloaded by some other process.

Yeah, that's a bit tricky.
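Going back to the fadvise call discussed above, this is roughly what it looks like from Python. It is a sketch, not Borg's code (the talk says that part is not only in Python): the helper name `read_without_caching` and the block size are assumptions, while `os.posix_fadvise` and `os.POSIX_FADV_DONTNEED` are the real standard library names, available on Unix-like systems only.

```python
import os


def read_without_caching(path, block_size=1 << 20):
    """Yield a file's content block by block, asking the kernel to drop
    each block from the page cache once it has been handed out."""
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            data = os.read(fd, block_size)
            if not data:
                break
            yield data
            if hasattr(os, "posix_fadvise"):  # not available on every platform
                # hint: "I have read this range and will not need it soon"
                os.posix_fadvise(fd, offset, len(data), os.POSIX_FADV_DONTNEED)
            offset += len(data)
    finally:
        os.close(fd)
```

The call is only advice to the kernel, so the file content read by the program is the same with or without it.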

The point is, we basically do it like Python: there are objects, and there are references to these objects. The fast processing of unchanged files works with this files cache. In the files cache, it has the modification time or the change time, the size of the file and the inode number, and if all these did not change, then the file is still the same. It also has a list of the chunk IDs, and then it will just create an item by using that information.
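The unchanged-file shortcut can be sketched in a few lines. The names `files_cache`, `known_chunks` and `remember` are hypothetical, and a plain dict keyed by path is an assumption; only the compared fields (modification time, size, inode number) and the stored list of chunk IDs come from the description above.

```python
import os

files_cache = {}  # path -> (mtime_ns, size, inode, list of chunk ids)


def known_chunks(path):
    """Return the remembered chunk ids if the file looks unchanged, else None."""
    st = os.stat(path)
    entry = files_cache.get(path)
    if entry and entry[:3] == (st.st_mtime_ns, st.st_size, st.st_ino):
        return entry[3]  # no need to read, chunk or hash the file again
    return None


def remember(path, chunk_ids):
    """Record the file's current stat data together with its chunk ids."""
    st = os.stat(path)
    files_cache[path] = (st.st_mtime_ns, st.st_size, st.st_ino, list(chunk_ids))
```

On a cache hit, a backup only needs to write a new item that references the existing chunk IDs; the file's data is not read at all.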

It will basically create the metadata, and everything else, the data, is already in the repository. Yeah, but it doesn't matter. It's like a hard link: it doesn't matter if it's the first hard link or the second hard link; it's just a reference, basically.

No, not yet, but we have a JSON API meanwhile, so one could write a GUI now. But there is nothing usable yet, except some small web interface, but it's only for very basic use cases, and not for Python 3.

I think, yeah, you can do as many backups as you like, and it's even good, because you should not lose anything: it's completely deduplicated, so there is no redundancy. So better maybe have two of them, or have rather good hardware that does not lose data.

I will also be at the conference until Sunday, at the sprints, for example, so just grab me anytime. So, thank you.


 

By Jimmy Dagger
