Skip to main content

Your submission was sent successfully! Close

Thank you for signing up for our newsletter!
In these regular emails you will find the latest updates from Canonical and upcoming events where you can meet our team.Close

Thank you for contacting us. A member of our team will be in touch shortly. Close

  1. Blog
  2. Article

robgibbon
on 18 August 2021

How to run Apache Spark on MicroK8s and Ubuntu Core, in the cloud: Part 1


There are some great reasons to use nested virtualisation on compute clouds. Nested virtualisation is when you have Virtual Machines running within Virtual Machines. This approach offers a stronger isolation model than containers: by using a dedicated kernel and IO devices, bad behaviour cannot easily break out and spread. In fact, nested virtualisation is at the heart of Firecracker (which is the foundation for AWS Lambda) and the Kata Containers project. And recent work on streamlining device drivers, kernel footprint and VM image size have meant that the performance and startup times are pretty amazing. But surely the main reason to use nested virtualisation on the cloud, is just because you can!

There are some limitations, though. The biggest limitation is that, while GCP and Azure offer varying levels of support for nested virtualisation, AWS only offers support for it on bare metal instances. There are a couple of solutions that claim to work around that limitation, like the Xen hypervisor running in paravirtualisation mode, which has been extended to minimize instances of kernel traps with the XenBlanket and Xen-Blanket-NG patchsets. But for this series of how-to articles, we’ll run nested virtualisation with the free and open source LXD instead. LXD can run nested, fully virtualised instances on GCP and Azure, and fall back to Linux Containers on AWS and any other clouds that don’t support it yet.

Let’s get started!

Spark up: building a Spark cluster

In this series of blog posts, we’ll build an Apache Spark cluster, running on MicroK8s on Ubuntu Core on the cloud to demonstrate the versatility of this approach. First, we will build up a basic solution locally. Then, in Part 2, we will have a go on the cloud. In Part 3, we’ll put Apache Spark on top. Finally, in Part 4, we’ll build a fully distributed MicroK8s compute cluster.

Ubuntu Core is a nifty new operating system that’s built from first principles with zero trust security in mind. Running MicroK8s on Ubuntu Core obviously brings those advantages of robust computing foundations to Kubernetes. And with MicroK8s, it’s quite literally a snap to set up a K8s cluster.

Apache Spark is the data engineers’ tool of choice for data lifting, transforming, loading, analysing and so much more. Folks are even busy hooking it up to data visualization and exploration tools like Apache SuperSet. By the end of this four-part series, we’ll have a working Spark setup that we can use for interactive data science.

But before we build our Spark cluster, we need to build up the foundational services for the solution. We need Ubuntu Core.

Core to the fore: introducing Ubuntu Core

Ubuntu Core is Canonical’s free and open source next-generation operating system based on snapd. In Ubuntu Core, everything is a snap. Even the kernel is a snap! Ubuntu Core uses snapd’s granular, AppArmor-driven and enforced sandbox permissions system, called confinement. Only snaps that implement the strict confinement model can be installed on Ubuntu Core. That means it’s harder for a rogue process to break out from the sandbox and do bad things across your system and beyond. We’ll be using Ubuntu Core for this project.

Let’s launch an Ubuntu Core VM on our local system using Multipass, install LXD on it, and then blast MicroK8s over the top. We’ll do some configuration tweaking along the way to make it all sing and dance nicely. Use the following commands:

sudo snap install multipass
multipass launch -m 12G -d 60G -n ubu-core core18
multipass shell ubu-core
sudo snap install lxd

That was quite easy! The VM might update itself and reboot a few times, but don’t worry about that. If a package installation is in progress while the reboot takes place, the installation will be rolled back and then resumed.

Lucy in the X with Diamonds: LXD and Linux Containers

LXD is a sophisticated, next-generation cloud management and orchestration engine with hypervisor and Linux Container engine support built in. With LXD, you can run Linux Container workloads, Linux VMs, or other operating systems like FreeBSD or Microsoft Windows if you prefer. Note that, for this first demo, we’ll use Linux Containers, but when we do this on the cloud, we’ll use nested virtualisation to benefit from the enhanced security of a dedicated kernel and virtual IO device sandboxing. It’s all possible, let’s just roll with it.

Let’s set up LXD to run MicroK8s. MicroK8s is the awesome new easy-peasy, lemon squeezy way to deploy Kubernetes. It can run standalone on a workstation, or banded together as a highly available cluster, and it’s a piece of pie to set up, even in HA mode. Oh and it’s also free open source software. For this post, we’re going to run it standalone, but you can learn more about MicroK8s on its website, including how to set it up as a full-on distributed compute cluster.

First, we’ll initialize our LXD environment so that we provision 40GB of space on a ZFS filesystem, mounted as a loopback to a file, with automatic IP addressing and bridged networking. Of course, this is a demo; when we do this for real on the cloud, we’ll allocate a dedicated block device rather than a file, for better performance. Use the following commands:

cat > lxd-init.yaml <<EOF
config: {}
networks:
- config:
    ipv4.address: auto
    ipv6.address: auto
  description: ""
  name: lxdbr0
  type: ""
  project: default
storage_pools:
- config:
    size: 40GB
  description: ""
  name: default
  driver: zfs
profiles:
- config: {}
  description: ""
  devices:
    eth0:
      name: eth0
      network: lxdbr0
      type: nic
    root:
      path: /
      pool: default
      type: disk
  name: default
projects: []
cluster: null
EOF

cat lxd-init.yaml | sudo lxd init --preseed
rm lxd-init.yaml

Now for the next step. We’ll make a Linux Container profile especially tuned for MicroK8s:

sudo lxc profile create microk8s

cat > microk8s.profile <<EOF
config:
  boot.autostart: "true"
  linux.kernel_modules: ip_vs,ip_vs_rr,ip_vs_wrr,ip_vs_sh,ip_tables,ip6_tables,netlink_diag,nf_nat,overlay,br_netfilter,nf_conntrack_ipv4
  raw.lxc: |
    lxc.apparmor.profile=unconfined
    lxc.mount.auto=proc:rw sys:rw cgroup:rw
    lxc.cgroup.devices.allow=a
    lxc.cap.drop=
  security.nesting: "true"
  security.privileged: "true"
  security.syscalls.intercept.bpf: "true"
  security.syscalls.intercept.bpf.devices: "true"
  security.syscalls.intercept.mknod: "true"
  security.syscalls.intercept.setxattr: "true"
description: ""
devices:
  aadisable:
    path: /sys/module/nf_conntrack/parameters/hashsize
    source: /sys/module/nf_conntrack/parameters/hashsize
    type: disk
  aadisable1:
    path: /sys/module/apparmor/parameters/enabled
    source: /dev/null
    type: disk
  aadisable2:
    path: /dev/zfs
    source: /dev/zfs
    type: disk
  aadisable3:
    path: /dev/kmsg
    source: /dev/kmsg
    type: disk
  aadisable4:
    path: /sys/fs/bpf
    source: /sys/fs/bpf
    type: disk
name: microk8s
used_by: []
EOF

cat microk8s.profile | sudo lxc profile edit microk8s
rm microk8s.profile

The next step is to fire up a Linux container and install the MicroK8s snap in it. After that, we’ll do some tweaks to get it running smoothly:

sudo lxc launch -p default -p microk8s ubuntu:20.04 microk8s
sleep 10

sudo lxc exec microk8s -- sudo snap install microk8s --classic
sudo lxc shell microk8s

cat > /etc/rc.local <<EOF
#!/bin/bash

apparmor_parser --replace /var/lib/snapd/apparmor/profiles/snap.microk8s.*
exit 0
EOF

chmod +x /etc/rc.local
systemctl restart rc-local

echo 'L /dev/kmsg - - - - /dev/null' > /etc/tmpfiles.d/kmsg.conf
echo '--conntrack-max-per-core=0' >> /var/snap/microk8s/current/args/kube-proxy
exit

sudo lxc restart microk8s
sleep 15

sudo lxc exec microk8s -- sudo swapoff -a

Drop a bot: testing with Microbot

Alright – we’re all set to deploy some stuff on Kubernetes! Let’s drop a Microbot on there, and make sure that we can see it from outside of the LXD environment:

sudo lxc exec microk8s -- sudo microk8s.kubectl create deployment microbot --image=dontrebootme/microbot:v1

sudo lxc exec microk8s -- sudo microk8s.kubectl expose deployment microbot --type=NodePort --name=microbot-service --port=80

MICROBOT_PORT=$(sudo lxc exec microk8s -- sudo microk8s.kubectl get all | grep service/microbot | awk '{ print $5 }' | awk -F':' '{ print $2 }' | awk -F'/' '{ print $1 }')

UK8S_IP=$(sudo lxc list microk8s | grep microk8s | awk -F'|' '{ print $4 }' | awk -F' ' '{ print $1 }')

sudo snap install curl
curl http://$UK8S_IP:$MICROBOT_PORT/

If you’ve seen some microbot HTML in your terminal, then you’re all good – it’s all working as planned. If not, you might need to check back and review the previous steps carefully to get it running. It’ll be even better when this thing is running on the cloud though!

We’ll get to that in Part 2 of this blog post. Stay tuned…

Related posts


Giulia Lanzafame
10 December 2024

Spark or Hadoop: the best choice for big data teams?

Data Platform Article

I always find the Olympics to be an unusual experience. I’m hardly an athletics fanatic, yet I can’t help but get swept up in the spirit of the competition. When the Olympics took place in Paris last summer, I suddenly began rooting for my country in sports I barely knew existed. I would spend random ...


robgibbon
23 May 2024

Can it play Doom? Running an AI LAN party on a Spark cluster with ViZDoom

AI Article

It’s all about AI these days, so I decided to try and answer the important question: can you make a Spark cluster run AI agents that play a game of Doom, in a multiplayer LAN party? Although I’m no data scientist, I was able to get this to work and I’ll show you how so ...


robgibbon
22 February 2024

Migrating from Cloudera to a modern data hub architecture

Data Platform Article

In the early 2010s, Apache Hadoop captured the imagination of the tech community. A free and powerful open source platform, it gave users a way to process unimaginably large quantities of data, and offered a dazzling variety of tooling to suit nearly every use case – MapReduce for odd jobs like processing of text, audio ...