The Cost of Bad Design and Disaster Recovery

The consequences of bad design

This week, the laptop I used as my daily driver finally died after less than two years of continuous use.

It was a Vant, from a company in Valencia that designs and sells fully Linux-compatible laptops.

I decided to buy it on the recommendation of a friend who had a good experience with one of their earlier models, perhaps without doing the due diligence I should have done when investing a large amount of money in equipment from an unproven IT vendor.

This is just another testament to the importance of reputation and happy clients in business: they can turn people who are unsure about you into customers.

I had money available, so I decided to invest heavily in a machine with high-end specifications so that I could handle any kind of development workload on my own, such as running Android Studio, Minikube, or virtual machine labs.

However, the laptop had several issues from the beginning that I should not have ignored.

The most annoying of them was that it would wake from suspend at random times, sometimes while in my bag, risking overheating and significantly reducing battery life.

It also had some minor issues, such as a keyboard backlight that would simply stop working after a suspend, leaving me with an ordinary unlit keyboard.

Overheating was one of its main problems: the CPU easily reached 90 °C because the heat sink was too small for its 12th-generation i7, causing thermal throttling.

After the second summer with it, the heat issues got so bad that I decided to apply liquid metal thermal compound to squeeze more performance out of its undersized heat sink.

A leak that let a drop of that liquid metal fall onto the motherboard is what killed it, producing a short circuit.

You might say the destruction of my work equipment came down to my own skill issues, a criticism I will accept given that it was the first time I had used liquid metal.

However, one could argue that the need to use it in the first place was ultimately caused by bad laptop design.

A well-designed laptop would most likely not have had this kind of CPU heat and power management issue, nor would it have forced its owner to hack at it with experimental techniques to fix them.

I do not have the knowledge to assess what led to this bad design, since I deal in software, not hardware, but I can imagine that whoever made the earlier Vant laptops great eventually moved on to better opportunities abroad and the company was unable to keep the magic.

The first time I traveled to a conference outside of Spain, I saw that the best people my country could offer were not in Spain but part of the Spanish diaspora, talented people looking for better opportunities abroad.

But enough with the nationalistic lamentations.

We could also blame the fact that newer generations of Intel processors are getting worse.

For some strange reason, the i5-6200U in the ThinkPad X260 I am currently using as a backup feels more responsive than a 12th-generation i7.

There is also news that 13th- and 14th-generation Intel processors are physically degrading due to faulty microcode.

More broadly, we have had infamous vulnerabilities such as Meltdown and Spectre, which exploited speculative execution in Intel, AMD, and ARM processors.

What this means to me as a consumer is that the semiconductor industry is struggling to maintain the rate of growth predicted by Moore's Law and is beginning to face scaling issues with multi-core CPUs, which manifest as less reliable products and higher temperatures.

This is one of the cascading challenges that Vant engineers might have faced while designing this laptop.

I accepted years ago that computers would not get significantly faster.

The limit on what we can do with computers is not the computers themselves, which are already pushing physical boundaries, but what we can practically build on them with efficient software.

If we want faster software, we could stop using JavaScript and instead use programming languages such as Go, C++, or Rust, which rather than being interpreted or JIT-compiled are compiled ahead of time to native binaries that modern processors execute directly.

I also believe there are a lot of opportunities for improvement in the area of concurrent computation and GPU parallelism.

However, it is also a fact that pipelines in modern processors have become so deep that it is hard to keep them full: branches and data dependencies cause pipeline stalls, which is why every competitive processor manufacturer uses speculative execution in the first place.

I am not a CPU manufacturing expert, and judging the future of silicon requires a deep understanding of the physics and economics involved that I will not pretend to have.

But what I do understand is that Intel has been making a bunch of shitty processors lately and that the market needs to punish them for it.

My two other options are ARM and AMD laptops.

The ARM architecture, although it shows promise with Apple's M1 and M2 and Qualcomm's Snapdragon processors, is still too unstable to use as a Linux daily driver for development purposes.

The remaining choice is to stay on x86 and get an AMD Ryzen processor in a gaming build with a damn good heat sink, which is what my next laptop will most likely be.

Disaster recovery

We could analyze this situation from a computer security perspective.

The largest threat to any computer system is, paradoxically, the people who manage it, so the first measures to take to keep a system available are to protect against fuckups (such as mine) by its administrators, and then to prepare against other internal threats, such as employees, and external ones.

The first hard failures, complete and immediate system shutdowns, began 24 hours before the laptop's ultimate destruction.

In terms of disaster preparation, I make weekly backups to an encrypted hard disk kept offline using a program called borg, which I strongly recommend, so data loss was relatively limited.
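As a rough sketch of what that weekly routine looks like (the repository path, home directory, and retention numbers here are hypothetical, and borg offers far more options than this), assuming the repository was created beforehand with borg init --encryption=repokey:

    #!/usr/bin/env python3
    # weekly_backup.py: create a borg archive, then prune old ones.
    # borg will prompt for the repository passphrase when run interactively.
    import subprocess

    REPO = "/mnt/backup/repo"  # hypothetical path to the offline, encrypted disk

    # Create a new archive; {hostname} and {now} are placeholders
    # that borg itself expands into the archive name.
    subprocess.run(
        ["borg", "create", "--stats", "--compression", "lz4",
         f"{REPO}::{{hostname}}-{{now}}", "/home/jaime"],
        check=True,
    )

    # Keep a bounded history so the backup disk does not fill up.
    subprocess.run(
        ["borg", "prune", "--keep-weekly", "4", "--keep-monthly", "6", REPO],
        check=True,
    )

The important property is not the script itself but the habit: the disk stays offline except while this runs, which keeps the backups out of reach of ransomware and of my own fuckups.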

I was also quite calm about my software projects: everything worth saving is already version controlled in a remote repository.

Restoration procedures could be improved, though; I wasted a whole day configuring the backup laptop.

I have been toying with the idea of using NixOS on my daily driver precisely to avoid this, but a declarative operating system gives up some of the package availability and flexibility that an operating system with access to the AUR provides, and I still depend on apps that do not play well with the Nix way of doing things.

The daily driver has to be a machine that gets shit done.

For some it is a MacBook or a Windows machine; for me, it is a Linux machine with access to the AUR and up-to-date packages.

However, one idea that can ease configuration management without going to the extreme that NixOS does is to think in terms of separate, well-defined environments.

Most developers do this at the project level via Python virtual environments or Docker containers; some use whole virtual machines via Vagrant to manage the dependencies needed to build their projects.
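As a minimal sketch of the virtual environment flavor of that idea (the file names are hypothetical, and the same thing is usually done from the shell with python -m venv), a small script can rebuild a project's environment from a pinned requirements.txt on any machine:

    #!/usr/bin/env python3
    # rebuild_env.py: recreate a project's virtual environment from pinned deps.
    import subprocess
    import venv

    # Create (or wipe and recreate) .venv with pip available inside it.
    venv.EnvBuilder(with_pip=True, clear=True).create(".venv")

    # Install the exact versions recorded in requirements.txt, which
    # lives in version control right next to the code it describes.
    # (.venv/bin/pip is the Linux path; Windows uses .venv\Scripts\pip.)
    subprocess.run(
        [".venv/bin/pip", "install", "-r", "requirements.txt"],
        check=True,
    )

Because the environment is disposable and its definition is version controlled, losing the machine costs only the time it takes to run this again.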

This way you can have the best of both worlds: a flexible machine that can be hacked on to get shit done, alongside well-defined, replicable environments for your projects and applications, which are backed up and version controlled separately.

I would guess that the same philosophy applies just as well to cloud environments, although the scale and costs involved are of course another matter to tackle in another article.

Conclusion

As hard as disaster and loss can be, they are still experiences we can learn from; we just need to be prepared enough that they do not kill us.

Always audit the design of any system you decide to rely on and develop a proportionate backup and recovery procedure; do not let yourself get tricked.

Good backup and recovery mechanisms are not as sexy as a distributed network of intrusion detection systems or a security audit, but for most people they are the best defense available against widespread threats such as ransomware.

whoami

Jaime Romero is a software engineer and cybersecurity expert operating in Western Europe.