Wednesday, 3 October 2018

HexIT Escape Room for IT geeks - escape if you can (Wrocław, Poland)

I'm glad to announce that we have launched an escape room that targets IT people (developers, testers etc.). It has been running for a month now, and about 30 teams (3-4 people each) have already enjoyed it.

Treat yourself and pay us a visit! Based on the reactions of the teams so far, I can guarantee a remarkable experience. You don't need to be a "hackerman" to complete the room, but if you are, you'll finish faster ;-). Mixed teams work too (though at least one person with basic programming skills is pretty much required).



Room location & partnership with Let Me Out.

ul. Bernardyńska 4 (close to Galeria Dominikańska), 2nd floor (map below)

Link to Google Maps


Book here: letmeout.pl (select Wrocław)

I'm the author and creator of the room, but I'm not running the business. The company that operates the room is Let Me Out, which has an excellent portfolio of other escape rooms in many Polish cities and in Brussels.

Room theme
It goes like this:
A certain country is trying to become a nuclear superpower by developing atomic weapons, which destabilizes the region and escalates an international conflict on an unprecedented scale. The world is on the verge of World War III. The only salvation is to infect the secret uranium processing plant with a computer virus. Will a group of programmers be able to prevent nuclear war in 90 minutes?

I received some suggestions that the room should be marketed as "an ordinary escape room with extra IT riddles". And that is actually what I wanted to build: not a desk, a PC and Jira for the players, but a nice mix of a good background story with many different IT riddles.
Solve the riddles while saving the world! :-)

Easter Egg
There are a few easter eggs in the room. One of them will let you listen to a famous song. The code is what comes out of `1900 + 80 + 9`, and you need to enter it properly. You'll know where once you're there ;)

Room name
Fun fact about the name: it incorporates four things:
  • Hex, as a reference to hexadecimal numbers (you'll see some of them ;-))
  • Hex, as uranium industry jargon for uranium hexafluoride
  • Exit, as a nod to the word "escape"
  • IT - information technology

Friday, 17 August 2018

Global app variables in connexion & aiohttp


tl;dr: use pass_context_arg_name and api.subapp

Nowadays, microservice architecture seems to be the default way distributed applications are built. People have also started to treat APIs as first-class citizens. Hence, it's no surprise that projects like Swagger/OpenAPI are gaining popularity on a daily basis.

One of the Python OpenAPI implementations that I discovered recently is Connexion. The advantages of using OpenAPI are obvious: you can decouple the endpoint schema from the app logic and have a single place where the whole API is described. Even the fact that there's a Swagger UI for API users can be quite beneficial.

In the past I looked at different frameworks like django-rest, but nothing seemed as simple as Connexion. I decided to play with it right after discovering that the folks from Zalando had added support for aiohttp (an asynchronous HTTP server) - the framework we use extensively in our projects.

So what's the problem? What is this post about? Although Connexion is great, it is undocumented (or my DuckDuckGo-fu sucks and it is in fact just not well documented) how to glue it together with the way global variables are handled in aiohttp, i.e. using the app as a container for globals. Consider the following snippet:

from aiohttp import web

async def handler(request):
  # this is how the aiohttp creators recommend accessing global variables,
  # e.g. a database handle
  request.app['redis_con'].incr('visits')
  return web.Response(body=b'hello')

Nothing more than an ordinary aiohttp handler that uses the redis_con global. Unfortunately, using globals with Connexion is not that straightforward. Here is what Connexion handlers look like (the following comes from the Connexion docs):

def example(name: str) -> str:
  return 'Hello {name}'.format(name=name)



There's no request parameter! It took me some time to find out how to make Connexion pass the request (the aiohttp context) to handlers. I had to dig into the source code to figure out the following:

def start(redis_con):
    app = connexion.AioHttpApp(__name__, specification_dir='swagger/')
    api = app.add_api('api.yml', pass_context_arg_name='request')
    api.subapp['redis_con'] = redis_con
    app.run()


We're passing the pass_context_arg_name parameter, and it turns out that for aiohttp the context is the request. The unintuitive thing is the subapp part: we need to use it in order to set the global. I found this part in the aiohttp_jinja2.setup function. Now we can use it in handlers like the following.

async def handler(*args, **kwargs):
  kwargs['request'].app['redis_con'].incr('visits')
  return web.Response(body=b'hello')
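
For completeness, here is a minimal sketch of how the whole thing could be wired up in one file. Treat it as an illustration rather than the exact setup I run: the spec directory, the api.yml contents (the handler would have to be referenced by an operationId there), the Redis client and the JSON response are my assumptions; only pass_context_arg_name and api.subapp come from the snippets above.

# hedged sketch, not the exact code from my project
import connexion
from aiohttp import web

async def visits(*args, **kwargs):
    # Connexion passes the aiohttp request under the name configured via
    # pass_context_arg_name, so the "global" set on api.subapp is reachable here
    request = kwargs['request']
    count = request.app['redis_con'].incr('visits')
    return web.json_response({'visits': count})

def start(redis_con):
    app = connexion.AioHttpApp(__name__, specification_dir='swagger/')
    api = app.add_api('api.yml', pass_context_arg_name='request')
    api.subapp['redis_con'] = redis_con  # the global lives on the aiohttp sub-application
    app.run()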

That's all. It seems like an easy thing, but I couldn't find it anywhere online.

Saturday, 21 July 2018

Handling multiple identical USB ethernet adapters (Raspberry PI, udev)

You have to build a simple Ethernet-connected chain of devices and continuously check that it's healthy. To save money and time, you decide to replace individual devices (say, Raspberries) with multiple USB Ethernet adapters. You buy Chinese ones. What could go wrong?

We're building an escape room. There are plenty of them in Wrocław, but ours is special, because it's dedicated to IT folks. Random people would have a lot of trouble solving even the first riddles; these riddles are supposed to be great fun for tech people.

I don't want to spoil the riddles, so let's stick to the technical problem I had at hand. Multiple devices need to be reachable in a specific configuration to solve one of the riddles. It made no sense to keep separate devices whose only purpose was to respond to some ICMP packets (certainly there is an even more low-level solution, but we needed something easy and reliable now). We decided to limit their number and attach several USB Ethernet adapters to each. My colleague bought some Chinese adapters like the one in the picture below, and problems emerged immediately.


BTW, a funny fact: the CE marks on some devices (I'm not sure about this one) may not actually be CE marks but "China Export". You can read more about it here.

Perfect hardware clones!

So what's the problem? Well... when I plugged in the first adapter, I made some configuration changes in Raspbian and was happy that everything worked flawlessly. However, a couple of days later I connected a second adapter to the same device, and that's when the problem surfaced. All of these USB adapters had the same MAC address. To make it even worse, after inspecting what's in /sys, I was sure that all of the USB parameters were identical as well. In other words, these devices were perfect clones; the ROM was the same for all of them! And by the way, one out of the 8 was not working at all.

Why is this a problem? Because if the names clash, the kernel will rename the network interface to something like rename{number}, and there's no reliable way to tell which interface is connected to which cable. Sadly, they also share the same MAC, so if you connect all the adapters to the same switch, funny things will start to happen!

Udev to the rescue

I'm not that deep into Linux, but I immediately knew where to look: udev. I was afraid that there wouldn't be a way to differentiate between the adapters themselves at the udev level, and I was right.

However, a somewhat silly solution is possible (maybe not silly after all: if something is silly but it works, it's not silly ;-)): differentiate by USB ports rather than by the devices themselves.

I started to read the documentation and found that you can create rules based on ports, like the following:

SUBSYSTEM=="net", KERNELS=="1-2:1.0", ATTR{address}=="00:e0:4c:53:44:58"

net is the subsystem we want. The USB port must be provided in the KERNELS parameter (the S at the end is both intentional and crucial). By providing the address attribute you can further narrow the match down to only those Chinese adapters you have on the desk.

Finding out the USB port identifiers proved to be a slightly tricky task. You can do it using the udevadm utility.
I have prepared a diagram for my RPi 3:



Please note that this may be different in your case, because it all depends on:
  • hardware revision
  • firmware version
  • kernel version
  • kernel module versions
Once we know these USB "addresses" we can write the rules (below). I'd like to additionally emphasize two things:
  • you can match adapters using ATTR{address}=="mac-here", but apparently there's no way to change the address this way (ATTR{address}="new-mac" doesn't work)
  • changing the MAC address is still possible (e.g. ifconfig <ifname> hw ether ...) and you can even refer to the interface name you just set, but you must use an absolute path to the executable!

SUBSYSTEM=="net", KERNELS=="1-1.2:1.0", ATTR{address}=="00:e0:4c:53:44:58", NAME="kabelek1", RUN+="/sbin/ifconfig kabelek1 hw ether 00:e0:4c:00:00:01"
SUBSYSTEM=="net", KERNELS=="1-1.4:1.0", ATTR{address}=="00:e0:4c:53:44:58", NAME="kabelek2", RUN+="/sbin/ifconfig kabelek2 hw ether 00:e0:4c:00:00:02"
SUBSYSTEM=="net", KERNELS=="1-1.3:1.0", ATTR{address}=="00:e0:4c:53:44:58", NAME="kabelek3", RUN+="/sbin/ifconfig kabelek3 hw ether 00:e0:4c:00:00:03"
SUBSYSTEM=="net", KERNELS=="1-1.5:1.0", ATTR{address}=="00:e0:4c:53:44:58", NAME="kabelek4", RUN+="/sbin/ifconfig kabelek4 hw ether 00:e0:4c:00:00:04"


And voilà! You are free to connect lots of adapters to a single Raspberry. You still need to keep track of which USB port goes with which Ethernet cable, and you will also need to do something about the cables ;)


This is how my desk looked while I was figuring things out.

To summarize, almost everything can be done, and if something really can't, you can usually work around it. However, I believe this trick is just a palliative. The Chinese adapters can backfire at any time, so if you require reliability, you should look for other hardware.

Tuesday, 2 January 2018

Preconfigured Jenkins cluster in Docker Swarm (proxy, accounts, plugins)

In recent years a lot of popular technologies have been adjusted so they can run in Docker containers. Our industry even coined a new word for it: dockerization. When something is dockerized, we usually expect it to behave like a self-contained app controlled with either command-line switches or environment variables. We also assume that, apart from this kind of customization, the dockerized thing is zero-conf: it will start right away with no further magic spells.

It's just awesome when things work that way. Unfortunately, there are exceptions and Jenkins is one of them. The problem with Jenkins is that even when you start it from within a container, you still need to:
  • open the configuration wizard (it's a web page)
  • prove that you're the guy: pass its challenge by reading some magic file and pasting its content into the configuration wizard
  • configure a proxy, if you're behind one
  • select plugins to be installed during initialization
  • set up an admin account
Pretty bad. It resembles an installation wizard like in Windows. Phew. A couple of weeks ago I was trying to check out how well Jenkins would solve one of our data transformation (ETL) problems, and I was unsure how many times it would be deployed. Hence, I needed to do something about this installation process so it would suck less. All of the building blocks were already on the table: Terraform, Ansible and Docker Swarm. The missing part was a pre-configured, dockerized Jenkins running in the Swarm.

So this post, in a DuckDuckGo-friendly list, explains how to:
  • pre-configure Jenkins with a custom (admin) user account
  • pre-configure Jenkins with a proxy
  • pre-configure Jenkins with a specified set of plugins
  • run the Jenkins master and slaves entirely in Docker Swarm, with Jenkins' own Swarm plugin for automatic master-slave connection establishment
  • allow Jenkins jobs to execute other Docker containers nearby (the daemon's sock trick)





(Image source: http://www.rustypants.net/wp-content/uploads/2008/10/satanspbeach.jpg)


Abandon all hope, ye who enter here.

I remember that one of the C projects (I wasn't sure which one, perhaps something from GNU, maybe by RMS; update: it was xterm) had the comment "abandon all hope, ye who enter here". It also mentioned how many people had ignored the warning and tried to refactor something. I have the same reflections regarding configuring Jenkins without custom Groovy scripts. I was reluctant to learn a new language, but eventually it seemed like the most reasonable way forward.

Of course, all of the following problems can be solved in a troglodyte way too: e.g. you can configure Jenkins by hand, extract the Jenkins home directory, tar-gzip it and reuse it. But that brings a couple of other problems. Also, surprisingly, a fresh Jenkins home weighed about 70 MB in my case. I always thought it was just a bunch of XML files, but apparently it's not that simple. Since the primitive solutions didn't work right away, I decided to stop for a while and try to solve the problem "the right way".

System overview & requirements.

The system is simple: there's one master (a brilliant example of a SPOF, but nobody cares, since the project's future is uncertain anyway) and a number of workers (slaves). We want the workers to register with the master automatically. Unfortunately, this is not possible with a plain JNLP setup, because you need to register the worker on the master before establishing the link. In theory you could do some curl magic, but fortunately there's a plugin that does it for you: Jenkins Swarm (not to be confused with Docker Swarm; it has literally nothing to do with it). The Jenkins Swarm plugin consists of two parts: a plugin for the Jenkins master and a Java JAR for the slaves.
So we're set. Jenkins Swarm will take care of auto-connecting the slaves. Now we must run a dockerized version of these slaves and put them into Docker Swarm. But before we talk about slaves, let's handle the master.

Jenkins master with plugins, proxy, and extra configuration.

Let me paste the Dockerfile and explain it line by line.

FROM jenkins/jenkins:2.89.1-alpine

ARG proxy
ENV http_proxy=$proxy https_proxy=$proxy

USER root
RUN apk update && apk add python3
COPY requirements.txt /tmp/requirements.txt
RUN pip3 install -r /tmp/requirements.txt

USER jenkins

COPY plugins.txt /plugins.txt
RUN /usr/local/bin/install-plugins.sh swarm:3.6 workflow-aggregator:2.5

ENV JAVA_OPTS="-Djenkins.install.runSetupWizard=false"
COPY security.groovy /usr/share/jenkins/ref/init.groovy.d/security.groovy
COPY proxy.groovy /usr/share/jenkins/ref/init.groovy.d/proxy.groovy
COPY executors.groovy /usr/share/jenkins/ref/init.groovy.d/executors.groovy

We must start with some Jenkins base image in order to customize it. In my case that's the slim Alpine Linux variant of version 2.89.1. Then there's a build argument for the proxy. You can ignore this part if you're not behind one.

Before we modify the image, we need to switch to the root user. After we're done, we switch back to the jenkins user for better security (if you wonder how to check the original user without the base image's Dockerfile, the docker history command is your friend). In my case I'm also installing some python3 stuff defined in a requirements.txt dependency file. If you don't want to add any packages to the system, you can skip this entire part too.

Then we approach configuring the plugins. In various places on the Internet you can find advice to use /usr/local/bin/plugins.sh, but believe me, you don't want to do this, as it installs plugins without their dependencies. The newer install-plugins.sh script takes care of the dependencies for you. In our case we're installing two plugins; you might want to install just the essential one, the swarm plugin.

Now, the four nonstandard lines. I believe that setting runSetupWizard to false is self-explanatory. The remaining lines are there for account setup, proxy configuration and executor configuration.

Let's start with setting up the admin account. Groovy, here we go!

#!groovy

import jenkins.model.*
import hudson.security.*
import jenkins.security.s2m.AdminWhitelistRule

def instance = Jenkins.getInstance()

def user = new File("/run/secrets/jenkinsUser").text.trim()
def pass = new File("/run/secrets/jenkinsPassword").text.trim()

def hudsonRealm = new HudsonPrivateSecurityRealm(false)
hudsonRealm.createAccount(user, pass)
instance.setSecurityRealm(hudsonRealm)

def strategy = new FullControlOnceLoggedInAuthorizationStrategy()
instance.setAuthorizationStrategy(strategy)
instance.save()

Jenkins.instance.getInjector().getInstance(AdminWhitelistRule.class).setMasterKillSwitch(false)

I'm no Groovy expert, so don't judge me by the code above; I started with only the knowledge that it runs on the JVM :). It actually looks like a nice managed language. The good part is that, as in Python, the code mostly speaks for itself. The Hudson legacy is visible here as well. I won't go into details: if you want to know where all of this magic comes from, pay a visit to the official docs. Don't forget that you can also use the infamous Jenkins script console. I found Groovy's dump built-in very helpful too.
So the script above will actually set up an admin account, but it doesn't hardwire anything: both the username and the password come from Docker secrets, which let you manage sensitive data in your Swarm cluster nicely.

Now, the second script is for the proxy:

#!groovy

import jenkins.model.*
import hudson.*

def instance = Jenkins.getInstance()
def pc = new hudson.ProxyConfiguration("1.2.3.4", 8080, null, null, "localhost,*.your.intranet.com");
instance.proxy = pc;
instance.save()

There's some magic here too. It sets up the proxy 1.2.3.4:8080 with the specified exceptions, and then applies it to the Jenkins instance (which seems to be a singleton).

And finally, the executors part. I wanted this one so that the master is not used as a worker at all.

import jenkins.model.*
Jenkins.instance.setNumExecutors(0)

Slaves.

Now that the master is ready, let's configure the slaves. Their Dockerfile is as follows.

FROM docker:17.03-rc

ARG proxy
ENV https_proxy=$proxy http_proxy=$proxy no_proxy="localhost,*.your.intranet.com"

RUN apk --update add openjdk8-jre git python3
RUN wget -O swarm-client.jar http://repo.jenkins-ci.org/releases/org/jenkins-ci/plugins/swarm-client/3.3/swarm-client-3.3.jar

ENV http_proxy= https_proxy=
COPY entrypoint.sh /
RUN chmod +x /entrypoint.sh
CMD ["/entrypoint.sh"]

This time the base image is docker, because we want docker installed within this Docker container (so the container can spawn other containers). After setting the proxies (the part that is not mandatory), we install the Java Runtime Environment version 8 and download the swarm-client JAR. I'm using version 3.3, which is accessible through that URL as of today.
Finally, there's an entrypoint that will execute swarm-client and do all the magic, but it heavily relies on a Docker secret named jenkinsSwarm, which should look like the following.

-master http://master_address:8080 -username jenkinsUser -password jenkinsPassword


Here master_address must be resolvable by the slave machines (e.g. via /etc/hosts, Consul or something similar). You should also include the username and password: the same ones that you store in the other Docker Swarm secrets.

If you're using Ansible like I do, it's pretty straightforward to use variables so as not to hardcode the credentials; for instance, ansible-vault can be used for this.

entrypoint.sh itself is almost a one-liner:

#!/bin/sh
mkdir /tmp/jenkins
java -jar swarm-client.jar -labels=docker -executors=1 -fsroot=/tmp/jenkins -name=docker-$(hostname) $(cat /run/secrets/jenkinsSwarm)

It assumes that it's running in the Swarm and can access /run/secrets/jenkinsSwarm (the line that's pasted above).

Gluing it all together.

The building blocks are already in place; now it's time to glue everything together. I don't want to go into details here, because that's not the primary topic of this blog post. If you're interested in how exactly I did everything, please let me know in the comments and I will create a GitHub repo. Let me, however, give you some important hints:
  • if you want a slave to be able to spawn other containers (on the same host on which the slave is running), you must bind-mount the docker.sock file, e.g. like this: "/var/run/docker.sock:/var/run/docker.sock". There's more to this, though! The Docker daemon will not allow the jenkins user to spawn containers, so you must somehow circumvent this problem. I'm circumventing it by adding the jenkins user to the docker group, but this works only because there's a 1:1 mapping between the host and the container.
  • you should have three secrets in the Docker Swarm cluster: jenkinsUser, jenkinsPassword and jenkinsSwarm, holding the username, the password and the swarm-client.jar arguments respectively
  • the machines must be able to communicate; for internal JNLP communication, port 50000/tcp must be open
  • if you set the deployment mode to global in your docker-compose.yml file (if you're using one), you will have as many slaves as there are machines in the cluster, which can be nice
  • if you're going to stick with this solution for a longer period of time, I recommend thinking about scaling horizontally out and in: it should be as simple as adding/removing machines from the cluster, i.e. just one terraform command followed by an ansible-playbook spell.


Hopefully this post helps you set up a Jenkins cluster that simply works. If you'd like to see the code, let me know in the comments!

Friday, 8 December 2017

Airflow Docker with Xcom push and pull

Recently, in one of the projects I'm working on, we started to research technologies that could be used to design and execute data processing flows. The amount of data to be processed is counted in terabytes, hence we were aiming at solutions that can be deployed in the cloud. Solutions from the Apache umbrella like Hadoop, Spark or Flink were on the table from the very beginning, but we also looked at others like Luigi or Airflow, because our use case was neither MapReducible nor stream-based.

Airflow caught our attention and we decided to give it a shot, just to see if we could create a PoC with it*. In order to execute the PoC faster rather than slower, we planned to provision a Swarm cluster for it.

In Airflow you can find a number of so-called operators that allow you to execute actions. There are operators for Bash or Python, but you can also find ones for e.g. Hive. Fortunately, there is also a Docker operator for us.

Local PoC
The PoC started on my laptop, not in the cluster. Thankfully, DockerOperator allows you to pass a URL to the Docker daemon, so moving from the laptop to the cluster comes down to changing roughly one parameter. Nice!
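
To illustrate (a hedged sketch, not our production DAG): the parameter in question is docker_url, which defaults to the local socket; the image names and the Swarm manager hostname below are made up.

from airflow.operators.docker_operator import DockerOperator

# assumes a `dag` object like the one defined later in this post
t_local = DockerOperator(task_id='etl_local', image='my_etl_image', dag=dag,
                         docker_url='unix://var/run/docker.sock')  # default: local daemon

t_remote = DockerOperator(task_id='etl_remote', image='my_etl_image', dag=dag,
                          docker_url='tcp://swarm-manager.example.com:2375')  # hypothetical remote daemon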

If you want to run the Airflow server locally from inside a container, have it running as non-root (you should!), and bind docker.sock from the host into the container, you must create a docker group in the container that mirrors the docker group on your host and then add e.g. the airflow user to this group. That does the trick...

So just running DockerOperator is not black magic. However, if your containers need to exchange data, it gets a little bit trickier.

Xcom push/pull
The push part is simple and documented: just set the xcom_push parameter to True and the last line of the container's stdout will be published by Airflow as if it had been pushed programmatically. This seems to be the natural Airflow way.

Pull is not that obvious, perhaps because it's not documented. You can't just read stdin or anything like that. The way to do it involves connecting two dots:
  • the command parameter can be Jinja2-templated
  • one of the template macros allows you to do an xcom_pull
So you need to prepare your containers in a special way so that they can push/pull. Let's start with a container that pushes something:

FROM debian
ENTRYPOINT echo '{"i_am_pushing": "json"}'


Simple enough. Now the pulling container:

FROM debian
COPY ./entrypoint /
ENTRYPOINT ["/entrypoint"]


The entrypoint script can be whatever you like; it will receive the JSON as $1. The crucial (and easy to miss) requirement for this to work is that ENTRYPOINT must use the exec form. Yes, there are two forms of ENTRYPOINT. If you use the one without the array (the shell form), then parameters will not be passed to the container!

Finally, you can glue things together and you're done. The ti macro allows us to get the data pushed by the other task; ti stands for task_instance.

from datetime import timedelta

from airflow import DAG
from airflow.operators.docker_operator import DockerOperator

# default_args (owner, start_date, retries, ...) defined as usual elsewhere
dag = DAG('docker', default_args=default_args, schedule_interval=timedelta(1))

t1 = DockerOperator(task_id='docker_1', dag=dag, image='docker_1', xcom_push=True)

t2 = DockerOperator(task_id='docker_2', dag=dag, image='docker_2', command='{{ ti.xcom_pull(task_ids="docker_1") }}')

t2.set_upstream(t1)



Conclusion
Docker can be used with Airflow along with the Xcom push/pull functionality. It isn't very convenient and is not well documented, I would say, but at least it works.

If time permits, I'm going to create a PR documenting the pull part. I don't know how that will work out, because the Airflow GitHub project has 237 open PRs right now, and some of them have been there since May 2016!


* the funny thing is that we considered Jenkins too! ;-)

Wednesday, 29 November 2017

Tests stability S09E11 (Docker, Selenium)


If you're experienced in setting up automated testing with Selenium and Docker, you'll perhaps agree with me that it's not the most stable thing in the world. Actually, it's far, far away from any stable island: right in the middle of "the sea of instability".

When you think about failures in automated testing and how they develop as the system grows, it can resemble drugs. Seriously. At the start, occasional failures are ignored: you close your eyes and click "Retry". Innocent. But after some time it snowballs into a problem, and you find yourself wearing a blindfold you can't remember buying.

This post is a small story of how, in one of our small projects, we started with occasional failures and ended up with... well... you'll see. Read on ;).



For the past couple of months I kept thinking that "others have worse setups and live", but today it all culminated: I reached the fourth degree of density and decided to stop keeping quiet.

Disclaimer
In the middle of this post you might start to think that our environment is simply broken. That's damn right. The cloud we're running in is not very stable; sometimes it behaves like it's sulking. There are problems with proxies too. And finally, we add Docker and Selenium to the mixture. I think a testimonial from one of our engineers sums it all up:
if retry didn’t fix it for the 10th time, then there’s definitely something wrong
One more thing must be noted as well: the project I'm referring to is just a side project, an attempt to innovate some process, unsupported by the business whatsoever.

The triggers
I was pressing the "Retry" button yet another time on two of the e2e jobs and saw the following.

// job 1
couldn't stat /proc/self/fd/18446744073709551615: stat /proc/self/fd/23: no such file or directory

// job 2
Service 'frontend' failed to build: readlink /proc/4304/exe: no such file or directory


What the hell is this? We had never seen it before, and now it apparently became commonplace in our CI pipeline (it was the nth retry).


So this big number after /fd/ is the 64-bit representation of -1. Perhaps something in Selenium calls a function that returns an error and then calls the stat syscall, passing -1 as an argument; the function's return value was not checked!
The second error message is most probably related to Docker. Something tries to find the executable behind some PID. Why?

The "Retry" solution did not work this time. Re-deploying the e2e workers didn't help either. I decided it was time to get some insight into what is actually happening and how many failures were caused by the unstable environment.

Luckily we're running on GitLab, which provides a reasonable API. Read on to see what I found; I personally find it hilarious.

Insight into failures
It's extremely easy to make use of the GitLab CI API (thanks, GitLab folks!). I extracted the JSON objects for every job in every pipeline recorded in our project and started playing with the data.
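
For reference, here is a hedged sketch of how such an extraction could look. The GitLab host, project id and token are placeholders; the endpoint used is GET /projects/:id/jobs with plain page-based pagination.

import json
import requests

GITLAB = 'https://gitlab.example.com/api/v4'   # placeholder host
PROJECT_ID = 42                                # placeholder project id
HEADERS = {'PRIVATE-TOKEN': 'xxxxxxxx'}        # a personal access token

jobs, page = [], 1
while True:
    resp = requests.get('{}/projects/{}/jobs'.format(GITLAB, PROJECT_ID),
                        headers=HEADERS,
                        params={'per_page': 100, 'page': page})
    resp.raise_for_status()
    batch = resp.json()
    if not batch:
        break
    jobs.extend(batch)
    page += 1

with open('jobs.json', 'w') as f:
    json.dump(jobs, f)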

The first thing I checked was how many failures there are per particular type of test. The names are anonymized a little because I'm not sure whether this is sensitive data or not. Better safe than sorry!

Fig 1: Successful/failed jobs, per job name
I knew that some tests were failing often, but these results show that in some cases almost 50% of the jobs fail! Insane! BTW, we recently split some of the long-running e2e test suites into smaller jobs, which is visible in the figure.
But one could argue that maybe this is because of bugs in the code. Let's see. To tell, we must analyze the data based on commit hashes: how many commits, for a particular job, were executed multiple times and finished with different statuses. In other words, we look for situations in which the job status varied even though the code did not change.

The numbers for our repository are:
  • number of (commit, job) pairs with at least one success: 23550
  • total number of failures for these pairs: 1484

In other words, the unstable environment was responsible for at least ~6.30% of additional, spurious failures (1484 failures across 23550 (commit, job) pairs that eventually succeeded). It might look like a small number, but if you take into account that a single job can last 45 minutes, it adds up to a lot of wasted time, especially since failure notifications aren't always handled immediately. I also have a hunch that at some point people started to click "Retry" just to make sure the problem wasn't with the environment.
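
For the curious, roughly this kind of aggregation produces numbers like the ones above (a hedged sketch, assuming the jobs.json dump from the earlier snippet; the field names follow the GitLab job JSON, where each job carries its status, name and commit):

import json
from collections import defaultdict

with open('jobs.json') as f:
    jobs = json.load(f)

# group run statuses per (commit, job name) pair
statuses = defaultdict(list)
for job in jobs:
    statuses[(job['commit']['id'], job['name'])].append(job['status'])

# pairs that eventually succeeded, and the failures recorded for them
with_success = {k: v for k, v in statuses.items() if 'success' in v}
failures = sum(v.count('failed') for v in with_success.values())

print(len(with_success))                                       # e.g. 23550
print(failures)                                                # e.g. 1484
print(round(100.0 * failures / len(with_success), 2))          # e.g. ~6.3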

My top 5 picks among all of these failures are below.


hash:job           | #tot | success/fail | users clicking "Retry"
----------------------------------------------------------------
d7f43f9c:e2e-7     | 19   |  ( 1/17)     | user-6,user-7,user-5
2fcecb7c:e2e-7     | 16   |  ( 8/ 8)     | user-6,user-7
2c34596f:other-1   | 14   |  ( 1/13)     | user-8
525203c6:other-13  | 12   |  ( 1/ 8)     | user-13,user-11
3457fbc5:e2e-6     | 11   |  ( 2/ 9)     | user-14


So, for instance: commit d7f43f9c failed on job e2e-7 17 times, and three distinct users tried to make it pass by clicking the "Retry" button over and over. And finally they made it! Ridiculous, isn't it?

And speaking of time, I also checked the jobs that ran for an enormous amount of time. The winners are:

job:status        |  time (hours)
---------------------------------
other-2:failed    |  167.30
other-8:canceled  |  118.89
other-4:canceled  |  27.19
e2e-7:success     |  26.12
other-1:failed    |  26.01


Perhaps these are just outliers; histograms would give a better insight. But even if they are outliers, they're crazy ones.


I have also attempted to detect the reason for each failure, but that is a more complex problem to solve. It requires parsing the logs and guessing which line was the first to indicate an error condition, and then making a second guess: whether the problem originated from the environment or from the code.
Maybe such a task could be handled by the (in)famous machine learning. Actually, there are more things that could be achieved with ML support. The simplest examples are:
  • estimating whether a job will fail
    • also, providing the reason for the failure
    • and, if the failure originated from a faulty environment, what exactly it was
  • estimating the time for the pipeline to finish
  • auto-retrying in case of an env-related failure

Conclusions
Apparently, my e2e test environment has been much more unstable than I ever thought. The lesson learned is that if you get used to solving problems by retrying, you lose the sense of how much trouble you're in.

As with any other engineering problem, you first need to gather data and then decide what to do next. Based on the numbers I have now, I'm planning to implement some ideas to make life easier.

While analyzing the data I had moments when I couldn't stop laughing to myself. But the reality is sad. It started with occasional failures and ended with a continuous problem, and we weren't doing much about it. The problem was not that we were effed in the ass; the problem was that we had started to arrange our place there. Insights will help us get out.

Share your ideas in the comments. If we bootstrap a discussion, I'll do my best to share the code I have on GitHub.