HowTos Archives - credativ®

If you had the choice, would you rather take Salsa or Guacamole? Let me explain why you should choose Guacamole over Salsa.

In this blog article, we want to take a look at one of the smaller Apache projects out there called Apache Guacamole. Apache Guacamole allows administrators to run a web-based client tool for accessing remote applications and servers. This can include remote desktop systems, applications or terminal sessions. Users can simply access them by using their web browsers; no special client or other tools are required. From there, they can log in and access all pre-configured remote connections that have been specified by an administrator.

Guacamole supports a wide variety of protocols like VNC, RDP, and SSH. This way, users can access anything from remote terminal sessions to full-fledged graphical user interfaces provided by operating systems like Debian, Ubuntu, Windows and many more.

Convert every windowed application to a web application

If we spin this idea further, technically every windowed application that isn’t designed to run as a web application can be transformed into one by using Apache Guacamole. We helped a customer bring their legacy application to Kubernetes, so that users could run it from their web browsers. Sure, reimplementing the application from the ground up so that it follows Cloud Native principles is the preferred solution. As always though, effort, experience and costs may exceed the available time and budget, and in such cases Apache Guacamole provides a relatively easy way to realize such projects.

In this blog article, I want to show you how easy it is to run a legacy windowed application as a web app on Kubernetes. For this, we will use a Kubernetes cluster created by kind and create a Kubernetes Deployment to make kate, a KDE-based text editor, our own web application. It’s just an example, so there might be better applications to transform, but this one is fine to show you the concepts behind Apache Guacamole.

So, without further ado, let’s create our kate web application.

Preparation of Kubernetes

Before we can start, we must make sure that we have a Kubernetes cluster that we can test on. If you already have a cluster, simply skip this section. If not, let’s spin one up by using kind.

kind (Kubernetes in Docker) is a lightweight tool that runs entire Kubernetes clusters inside Docker containers and therefore works on almost every machine. It’s written in Go and can be installed like this:

# For AMD64 / x86_64
[ $(uname -m) = x86_64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.22.0/kind-linux-amd64
# For ARM64
[ $(uname -m) = aarch64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.22.0/kind-linux-arm64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

Next, we need to install some dependencies for our cluster. This includes, for example, Docker and kubectl.

$ sudo apt install docker.io kubernetes-client

Since we create our Kubernetes cluster with kind, we need Docker because the cluster nodes run as Docker containers on the host machine. Installing kubectl allows us to access the Kubernetes cluster after creating it.

Once we have installed those packages, we can start creating our cluster. First, we must define a cluster configuration. It defines which ports are accessible from our host machine, so that we can access our Guacamole application later. Remember, the cluster itself is operated within Docker containers, so we must ensure that we can reach it from our machine. For this, we define the following configuration, which we save in a file called cluster.yaml:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 30000
    hostPort: 30000
    listenAddress: "127.0.0.1"
    protocol: TCP

Hereby, we map the container’s port 30000 to our local machine’s port 30000, so that we can easily access it later on. Keep this port in mind: it is the one we will use in our web browser to access our kate instance.

Ultimately, this configuration is consumed by kind. Besides the port configuration, it also lets you adjust multiple other parameters of your cluster which are not covered here. It’s worth taking a look at kind’s documentation for this.

As soon as you have saved the configuration to cluster.yaml, we can create our cluster:

$ sudo kind create cluster --name guacamole --config cluster.yaml
Creating cluster "guacamole" ...
 ✓ Ensuring node image (kindest/node:v1.29.2) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-guacamole"
You can now use your cluster with:

kubectl cluster-info --context kind-guacamole

Have a question, bug, or feature request? Let us know! https://kind.sigs.k8s.io/#community 🙂

Since we don’t want to run everything in the root context, let’s export the kubeconfig so that we can use it with kubectl as our unprivileged user:

$ sudo kind export kubeconfig \
    --name guacamole \
    --kubeconfig $PWD/config

$ export KUBECONFIG=$PWD/config
$ sudo chown $(logname): $KUBECONFIG

With this done, we can now access our Kubernetes cluster using kubectl. This is our baseline to start migrating our application.
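
To quickly verify that the cluster is reachable, we can list its nodes (the exact node name and version depend on your kind release):

$ kubectl get nodes
NAME                      STATUS   ROLES           AGE   VERSION
guacamole-control-plane   Ready    control-plane   1m    v1.29.2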

Creation of the Guacamole Deployment

In order to run our application on Kubernetes, we need some sort of workload resource. Typically, you would create a Pod, Deployment, StatefulSet or DaemonSet to run workloads on a cluster.

Let’s create the Kubernetes Deployment for our own application. The example below shows the deployment’s general structure. Each container definition will get its own dedicated example afterwards to explain it in more detail.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: web-based-kate
  name: web-based-kate
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web-based-kate
  template:
    metadata:
      labels:
        app: web-based-kate
    spec:
      containers:
      # The guacamole server component that each
      # user will connect to via their browser
      - name: guacamole-server
        image: docker.io/guacamole/guacamole:1.5.4
        ...
      # The daemon that opens the connection to the
      # remote entity
      - name: guacamole-guacd
        image: docker.io/guacamole/guacd:1.5.4
        ...
      # Our own self written application that we
      # want to make accessible via the web.
      - name: web-based-kate
        image: registry.example.com/own-app/web-based-kate:0.0.1
        ...
      volumes:
        - name: guacamole-config
          secret:
            secretName: guacamole-config
        - name: guacamole-server
          emptyDir: {}
        - name: web-based-kate-home
          emptyDir: {}
        - name: web-based-kate-tmp
          emptyDir: {}

As you can see, we need three containers and some volumes for our application. The first two containers are dedicated to Apache Guacamole itself. First, there is the server component, which is the external endpoint for clients to access our web application. It provides the web server as well as the user management and configuration to run Apache Guacamole.

Next to this, there is the guacd daemon. This is the core component of Guacamole, which creates the remote connections to the application based on the configuration stored on the server. The daemon makes the remote connection accessible to the Guacamole server, which then forwards it to the end user.

Finally, we have our own application. It offers a connection endpoint for the guacd daemon using one of Guacamole’s supported protocols and provides the Graphical User Interface (GUI).

Guacamole Server

Now, let’s dive into each container specification, starting with the Guacamole server instance. It handles the session and user management and contains the configuration that defines which remote connections are available.

- name: guacamole-server
  image: docker.io/guacamole/guacamole:1.5.4
  env:
    - name: GUACD_HOSTNAME
      value: "localhost"
    - name: GUACD_PORT
      value: "4822"
    - name: GUACAMOLE_HOME
      value: "/data/guacamole/settings"
    - name: HOME
      value: "/data/guacamole"
    - name: WEBAPP_CONTEXT
      value: ROOT
  volumeMounts:
    - name: guacamole-config
      mountPath: /data/guacamole/settings
    - name: guacamole-server
      mountPath: /data/guacamole
  ports:
    - name: http
      containerPort: 8080
  securityContext:
    allowPrivilegeEscalation: false
    privileged: false
    readOnlyRootFilesystem: true
    capabilities:
      drop: ["all"]
  resources:
    limits:
      cpu: "250m"
      memory: "256Mi"
    requests:
      cpu: "250m"
      memory: "256Mi"

Since it needs to connect to the guacd daemon, we have to provide the connection information for guacd by passing it into the container via environment variables like GUACD_HOSTNAME and GUACD_PORT. In addition, Guacamole would usually be accessible via http://<your domain>/guacamole.

This behavior can be adjusted by modifying the WEBAPP_CONTEXT environment variable. In our case, we don’t want users to have to type /guacamole, but to simply access it via http://<your domain>/.

Guacamole Guacd

Then, there is the guacd daemon.

- name: guacamole-guacd
  image: docker.io/guacamole/guacd:1.5.4
  args:
    - /bin/sh
    - -c
    - /opt/guacamole/sbin/guacd -b 127.0.0.1 -L $GUACD_LOG_LEVEL -f
  securityContext:
    allowPrivilegeEscalation: false
    privileged: false
    readOnlyRootFilesystem: true
    capabilities:
      drop: ["all"]
  resources:
    limits:
      cpu: "250m"
      memory: "512Mi"
    requests:
      cpu: "250m"
      memory: "512Mi"

It’s worth mentioning that you should modify the arguments used to start the guacd container. In the example above, we want guacd to listen on localhost only, for security reasons: all containers within the same pod share the same network namespace and can therefore reach each other via localhost. This said, there is no need to make this service accessible to other services running outside of this pod, so we can limit it to localhost. To achieve this, you need to set the -b 127.0.0.1 parameter, which sets the corresponding listen address. Since you need to overwrite the whole command, don’t forget to also specify the -L and -f parameters. The former sets the log level and the latter keeps the process in the foreground.

Web Based Kate

To finish everything off, we have the kate application which we want to transform to a web application.

- name: web-based-kate
  image: registry.example.com/own-app/web-based-kate:0.0.1
  env:
    - name: VNC_SERVER_PORT
      value: "5900"
    - name: VNC_RESOLUTION_WIDTH
      value: "1280"
    - name: VNC_RESOLUTION_HEIGHT
      value: "720"
  securityContext:
    allowPrivilegeEscalation: false
    privileged: false
    readOnlyRootFilesystem: true
    capabilities:
      drop: ["all"]
  volumeMounts:
    - name: web-based-kate-home
      mountPath: /home/kate
    - name: web-based-kate-tmp
      mountPath: /tmp

Configuration of our Guacamole setup

After having the deployment in place, we need to prepare the configuration for our Guacamole setup. In order to know what users exist and which connections should be offered, we need to provide a mapping configuration to Guacamole.

In this example, a simple user mapping is shown for demonstration purposes. It uses a static mapping defined in an XML file that is handed over to the Guacamole server. In production, you would typically use other authentication methods such as a database or LDAP.

That said, let’s continue with our static mapping. For this, we simply define a Kubernetes Secret which is mounted into the Guacamole server. It defines two configuration files. One is the so-called guacamole.properties, Guacamole’s main configuration file. Next to it, we also define the user-mapping.xml, which contains all available users and their connections.

apiVersion: v1
kind: Secret
metadata:
  name: guacamole-config
stringData:
  guacamole.properties: |
    enable-environment-properties: true
  user-mapping.xml: |
    <user-mapping>
      <authorize username="admin" password="PASSWORD" encoding="sha256">
        <connection name="web-based-kate">
          <protocol>vnc</protocol>
          <param name="hostname">localhost</param>
          <param name="port">5900</param>
        </connection>
      </authorize>
    </user-mapping>

As you can see, we defined one specific user called admin which can use a connection called web-based-kate. In order to access the kate instance, Guacamole uses VNC as the configured protocol. To make this happen, our web application must offer a VNC server port on the other side, so that the guacd daemon can access it and forward the remote session to clients. Keep in mind that you need to replace the string PASSWORD with a proper SHA-256 hash of the password. Such a hash can be generated like this, for example:

$ echo -n "test" | sha256sum
9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08  -

Next, the hostname parameter references the corresponding VNC server of our kate container. Since we start this container alongside our Guacamole containers within the same pod, the Guacamole server as well as the guacd daemon can access the application via localhost. There is no need to set up a Kubernetes Service in front of it, since only guacd accesses the VNC server and forwards the remote session to clients that access Guacamole via their web browsers. Finally, we also need to specify the VNC server port, which is typically 5900 but can be adjusted if needed.

The corresponding guacamole.properties is quite short. By enabling the enable-environment-properties configuration parameter, we make sure that every Guacamole configuration parameter can also be set via environment variables. This way, we don’t need to modify the configuration file each and every time we want to adjust the configuration; we only need to provide updated environment variables to the Guacamole server container.
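
For example, Guacamole’s api-session-timeout property could then be supplied to the server container as an environment variable; property names are uppercased and hyphens are replaced by underscores (the value below is just an illustrative assumption):

env:
  - name: API_SESSION_TIMEOUT
    value: "30"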

Make Guacamole accessible

Last but not least, we must make the Guacamole server accessible to clients. Although the services within the pod can access each other via localhost, the same does not apply to clients trying to access Guacamole from the outside. Therefore, we must make Guacamole’s server port 8080 available to the outside world. This can be achieved by creating a Kubernetes Service of type NodePort. This service forwards each request from a node port to the corresponding container offering the configured target port; in our case, the Guacamole server container offering port 8080.

apiVersion: v1
kind: Service
metadata:
  name: web-based-kate
spec:
  type: NodePort
  selector:
    app: web-based-kate
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 8080
      nodePort: 30000

This specific port is mapped to the node’s port 30000, and we configured the kind cluster to forward its node port 30000 to the host system’s port 30000. This is the port we will use to access Guacamole with our web browsers.

Preparation of the Application container

Before we can start to deploy our application, we need to prepare our kate container. For this, we simply create a Debian container that runs kate. Keep in mind that you would typically use lightweight base images like alpine to run applications like this. For this demonstration however, we use the Debian image since it is easier to spin up, even though in general you only need a small fraction of the functionality provided by this base image. Moreover, from a security point of view, you want to keep your images small to minimize the attack surface and make them easier to maintain. For now however, we will continue with the Debian image.

In the example below, you can see a Dockerfile for the kate container.

FROM debian:12

# Install all required packages
RUN apt update && \
    apt install -y x11vnc xvfb kate

# Add user for kate
RUN adduser kate --system --home /home/kate --uid 999

# Copy our entrypoint in the container
COPY entrypoint.sh /opt

USER 999
ENTRYPOINT [ "/opt/entrypoint.sh" ]

Here you can see that we create a dedicated user called kate (user ID 999), for which we also create a home directory. This home directory is used for all files that kate creates at runtime. Since we set readOnlyRootFilesystem to true, we must make sure to mount some sort of writable volume (e.g. an emptyDir) to kate’s home directory. Otherwise, kate wouldn’t be able to write any runtime data.

Moreover, we have to install the following three packages:

x11vnc: the VNC server that exposes the virtual display
xvfb: the virtual framebuffer X server that provides the display kate renders into
kate: the application itself

These are the only packages we need for our container. In addition, we also need to create an entrypoint script to start the application and prepare the container accordingly. This entrypoint script creates the configuration for kate, starts it in a virtual display by using xvfb-run and provides this virtual display to end users via the VNC server x11vnc. In the meantime, xdriinfo is used to check whether the virtual display came up successfully after starting kate. If it takes too long, the entrypoint script will fail by returning the exit code 1.

By doing this, we ensure that the container does not get stuck in an infinite loop during a failure, and we let Kubernetes restart the container whenever it couldn’t start the application successfully. Furthermore, it is important to check that the virtual display came up prior to handing it over to the VNC server, because the VNC server would crash if the virtual display is not up and running, since it needs something to share. On the other hand, our container will be terminated whenever kate is terminated, because that also terminates the virtual display, which in turn terminates the VNC server and lets the container exit, too. This way, we don’t need to take care of that ourselves.

#!/bin/bash

set -e

# If no resolution is provided
if [ -z "$VNC_RESOLUTION_WIDTH" ]; then
  VNC_RESOLUTION_WIDTH=1920
fi

if [ -z "$VNC_RESOLUTION_HEIGHT" ]; then
  VNC_RESOLUTION_HEIGHT=1080
fi

# If no server port is provided
if [ -z "$VNC_SERVER_PORT" ]; then
  VNC_SERVER_PORT=5900
fi

# Prepare configuration for kate
mkdir -p $HOME/.local/share/kate
echo "[MainWindow0]
"$VNC_RESOLUTION_WIDTH"x"$VNC_RESOLUTION_HEIGHT" screen: Height=$VNC_RESOLUTION_HEIGHT
"$VNC_RESOLUTION_WIDTH"x"$VNC_RESOLUTION_HEIGHT" screen: Width=$VNC_RESOLUTION_WIDTH
"$VNC_RESOLUTION_WIDTH"x"$VNC_RESOLUTION_HEIGHT" screen: XPosition=0
"$VNC_RESOLUTION_WIDTH"x"$VNC_RESOLUTION_HEIGHT" screen: YPosition=0
Active ViewSpace=0
Kate-MDI-Sidebar-Visible=false" > $HOME/.local/share/kate/anonymous.katesession

# We need to define an XAuthority file
export XAUTHORITY=$HOME/.Xauthority

# Define execution command
APPLICATION_CMD="kate"

# Let's start our application in a virtual display
xvfb-run \
  -n 99 \
  -s "-screen 0 ${VNC_RESOLUTION_WIDTH}x${VNC_RESOLUTION_HEIGHT}x16" \
  -f "$XAUTHORITY" \
  $APPLICATION_CMD &

# Let's wait until the virtual display is initialized before
# we proceed. But don't wait infinitely.
TIMEOUT=10
while ! xdriinfo -display :99 nscreens > /dev/null 2>&1; do
  sleep 1
  TIMEOUT=$((TIMEOUT-1))
  if [ "$TIMEOUT" -le 0 ]; then
    echo "Virtual display :99 did not come up in time" >&2
    exit 1
  fi
done

# Now, let's make the virtual display accessible by
# exposing it via the VNC Server that is listening on
# localhost and the specified port (e.g. 5900)
x11vnc \
  -display :99 \
  -nopw \
  -localhost \
  -rfbport $VNC_SERVER_PORT \
  -forever

After preparing those files, we can now build our image and import it into our Kubernetes cluster by using the following commands:

# Do not forget to give your entrypoint script
# the proper permissions to be executed
$ chmod +x entrypoint.sh

# Next, build the image and import it into kind,
# so that it can be used from within the clusters.
$ sudo docker build -t registry.example.com/own-app/web-based-kate:0.0.1 .
$ sudo kind load -n guacamole docker-image registry.example.com/own-app/web-based-kate:0.0.1

The image is imported into kind, so that every workload resource operated in our kind cluster can access it. If you use some other Kubernetes cluster, you would need to upload the image to a registry that your cluster can pull images from.
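
For such clusters, the usual workflow is a registry login followed by a push; a sketch (registry.example.com is a placeholder, authentication depends on your registry):

$ sudo docker login registry.example.com
$ sudo docker push registry.example.com/own-app/web-based-kate:0.0.1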

Finally, we can also apply our previously created Kubernetes manifests to the cluster. Let’s say we saved everything to one file called kubernetes.yaml. Then, you can simply apply it like this:

$ kubectl apply -f kubernetes.yaml
deployment.apps/web-based-kate configured
secret/guacamole-config configured
service/web-based-kate unchanged

This way, a Kubernetes Deployment, Secret and Service are created, which ultimately creates a Kubernetes Pod that we can access afterwards.

$ kubectl get pod
NAME                              READY   STATUS    RESTARTS   AGE
web-based-kate-7894778fb6-qwp4z   3/3     Running   0          10m

Verification of our Deployment

Now, it’s money time! After preparing everything, we should be able to access our web-based kate application with a web browser. As mentioned earlier, we configured kind so that we can access our application via the local port 30000. Every request to this port is forwarded to the kind control-plane node, where it is picked up by the Kubernetes Service of type NodePort, which in turn forwards all requests to our designated Guacamole server container offering the web server for accessing remote applications via Guacamole.
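
Before opening the browser, a quick reachability check from the shell doesn’t hurt (the exact response may vary depending on the Guacamole version):

$ curl -sI http://127.0.0.1:30000/ | head -n 1
HTTP/1.1 200 OK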

If everything works out, you should be able to see the following login screen:

After successfully logging in, the remote connection is established and you should be able to see the welcome screen of kate:

If you click on New, you can create a new text file:

Those text files can even be saved, but keep in mind that they will only exist as long as our Kubernetes Pod exists. Once it gets deleted, the corresponding emptyDir that we mounted into our kate container gets deleted as well, and all files in it are lost. Moreover, the container’s root filesystem is set to read-only, meaning that a user can only write files to the volumes (e.g. the emptyDir) that we mounted into the container.

Conclusion

After seeing that it’s relatively easy to convert every application to a web-based one by using Apache Guacamole, there is only one major question left…

Which do you prefer: Salsa or Guacamole?

Integrating Proxmox Backup Server into Proxmox Clusters

Proxmox Backup Server

In today’s digital landscape, where data reigns supreme, ensuring its security and integrity is paramount for businesses of all sizes. Enter Proxmox Backup Server, a robust solution poised to revolutionize data protection strategies with its unparalleled features and open-source nature.

At its core, Proxmox Backup Server is a comprehensive backup solution designed to safeguard critical data and applications effortlessly in virtualized environments based on Proxmox VE. Unlike traditional backup methods, Proxmox Backup Server offers a streamlined approach, simplifying the complexities associated with data backup and recovery.

One of the standout features of Proxmox Backup Server is its seamless integration with Proxmox Virtual Environment (PVE), creating a cohesive ecosystem for managing virtualized environments. This integration allows for efficient backup and restoration of Linux containers and virtual machines, ensuring minimal downtime and maximum productivity. Without the need for backup clients on each container or virtual machine, this solution offers the ability to back up and restore entire systems as well as single files directly from the filesystem.

Proxmox Backup Server provides a user-friendly interface, making it accessible to both seasoned IT professionals and newcomers alike. With its intuitive design, users can easily configure backup tasks, monitor progress, and retrieve data with just a few clicks, eliminating the need for extensive training or technical expertise.

Data security is a top priority for businesses across industries, and Proxmox Backup Server delivers on this front. Combined with solutions like ZFS, it brings in enterprise filesystem features like encryption at rest and in transit, checksums, snapshots, deduplication and compression. In addition, iSCSI or NFS storage from enterprise storage solutions such as those from NetApp can be integrated.

Another notable aspect of Proxmox Backup Server is its cost-effectiveness. As an open-source solution (as is Proxmox VE), it eliminates the financial barriers associated with proprietary backup software.

Integrating Proxmox Backup Server into Proxmox Clusters

General

This guide expects you to already have at least one Proxmox VE system up and running, as well as a system with a basic installation of Proxmox Backup Server. In this example, the Proxmox Backup Server is installed on a single disk, and the datastore is placed on an additional block device holding the backups. The Proxmox VE and Proxmox Backup Server instances do not need to be in the same network, but they must be reachable by each other. The integration requires administrative access to the datacenter of the Proxmox VE instance(s) and to the Backup Server.

Administration: Proxmox Backup Server

Like the Proxmox VE environment, the Proxmox Backup Server comes along with a very intuitive web frontend. Unlike the web frontend of Proxmox VE, which runs on tcp/8006, the Proxmox Backup Server can be reached on tcp/8007. Therefore, all following tasks will be done on https://<IP-PROXMOX-BACKUP-SERVER>:8007.

After logging in to the web frontend, the dashboard overview welcomes the user.

Adding Datastore / Managing Storage

The initial and major task lies in managing the storage and adding a usable datastore for the virtualization environment that holds the backup data. Therefore, we switch to the Administration chapter and click on Storage / Disks. This provides an overview of the available devices on the Proxmox Backup Server. As already said, this example uses a dedicated block storage device with ZFS to benefit from checksums, deduplication and compression, which can of course also be used with multiple disks (so-called raidz levels) or with other options like directories or NFS shares. Coming back to our example, we can see the empty /dev/sdb device which will be used to store all backup files.

By clicking on ZFS in the top menu bar, a ZFS pool can be created as a datastore. Within this dialog, a name, the raid level, the compression and the devices to use must be defined. As already mentioned, we can attach multiple disks and define a desired raid level. The given example only consists of a single disk, which is selected here. Compression is optional, but using LZ4 is recommended. As a lossless data compression algorithm, LZ4 aims to provide a good trade-off between speed and compression ratio, which is very transparent on today’s systems.

Ensure that the Add as Datastore option is checked (it is by default); this will directly create a usable datastore with the given name. In our example this will be backup01.

Keep in mind that this part is not needed when using an NFS share. Also, do not use ZFS in combination with hardware RAID controllers.
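
As an alternative to the web frontend, a datastore can also be created on the command line; a minimal sketch, assuming the backing storage is mounted at /mnt/datastore/backup01 (the path is an assumption for this example):

proxmox-backup-manager datastore create backup01 /mnt/datastore/backup01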

Adding User for Backup

In the next step, a dedicated user is created that will be used for the datastore permissions and by the Proxmox VE instances for authentication and authorization. This allows even complex setups with different datastores and different users with different access levels (e.g., reading, writing, auditing, …) on different clusters and instances. To keep it simple for this demonstration, just a single user for everything will be used.

A new user is configured by selecting Configuration, Access Control and User Management in the left menu. There, a new user can be created by simply defining a name and a password. The realm should stay on the default Proxmox Backup authentication provider. Depending on the complexity of your naming scheme, you may want to choose descriptive user names. In the given example, the user is called dc01cluster22backup01.
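
The same can be done on the command line; a brief sketch assuming the names from this example (the permission granted in the second command is explained in the next section):

proxmox-backup-manager user create dc01cluster22backup01@pbs
proxmox-backup-manager acl update /datastore/backup01 DatastoreAdmin --auth-id dc01cluster22backup01@pbs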

Adding Permission of User for Datastore

Having already mentioned the possibility of complex setups regarding authentication and authorization, the datastore must be linked to at least one user that can access it. Therefore, we go back to the Datastore section and select the previously created backup01 datastore. In the top menu bar, the permissions can be created and adjusted in the Permissions chapter. Initially, a new one will be created now. Within the following dialog, the datastore or path, the user and the role must be defined:

Path: /datastore/backup01
User: dc01cluster22backup01@pbs
Role: DatastoreAdmin
Propagate: True

To provide a short overview: Proxmox Backup Server ships with a set of predefined roles such as Admin, Audit, DatastoreAdmin, DatastoreBackup, and DatastoreReader; a detailed explanation of each role can be found in the official documentation.

Administration: Proxmox VE

The integration of the backup datastore will be performed from the Proxmox VE instances via the Datacenter. As a result, the Proxmox VE web frontend will now be used for further administrative actions. It runs on tcp/8006; therefore, all following tasks will be done on https://<IP-PROXMOX-VE-SERVER>:8006.

Adding Storage

Integrating the Proxmox Backup Server works the same way as managing and adding shared storage to a Proxmox datacenter.

In the left menu, we choose the active datacenter and select the Storage options. There we can find all natively supported storage options of Proxmox (NFS, SMB/CIFS, iSCSI, ZFS, GlusterFS, …) and finally select Proxmox Backup Server as a dedicated item.

Afterwards, the details for adding this datastore to the datacenter must be entered. The following options need to be defined:

ID: backup22
Server: <FQDN-OR-IP-OF-BACKUP-SERVER>
Username: dc01cluster22backup01@pbs
Password: <THE-PASSWORD-OF-THE-USER>
Enable: True
Datastore: backup01
Fingerprint: <SYSTEM-FINGERPRINT-OF-BACKUP-SERVER>

Optionally, the Backup Retention and Encryption can also be configured before adding the new backup datastore. While the backup retention can also be configured on the Proxmox Backup Server (which is recommended), enabling encryption should be considered. Selecting and activating the encryption is easily done by simply setting it to Auto-generate a client encryption key. Depending on your previous setup, an already present key can also be uploaded and used.
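
As an alternative to the web frontend, the storage can also be added on the command line; a sketch using the example values from above:

pvesm add pbs backup22 --server <FQDN-OR-IP-OF-BACKUP-SERVER> --datastore backup01 \
    --username dc01cluster22backup01@pbs --password <THE-PASSWORD-OF-THE-USER> \
    --fingerprint <SYSTEM-FINGERPRINT-OF-BACKUP-SERVER>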

After adding this backup datastore to the datacenter, it can immediately be used for backups, and the integration is complete.

Conclusion

With the Proxmox Backup Server, Proxmox provides an enterprise backup solution for backing up Linux containers and virtual machines. Supporting features like incremental and fully deduplicated backups by building on different open-source solutions, combined with strong encryption and data integrity, this solution is proof that open-source software can compete with closed-source enterprise software. Together with Proxmox VE, enterprise-grade virtualization environments can be created and managed without missing the typical enterprise feature set. Proxmox VE and the Proxmox Backup Server can also be used together with storage appliances from vendors like NetApp by directly using iSCSI or NFS.

Beyond this simple example, there are of course much more complex scenarios that can be created and should be considered. We are happy to provide you with more information and to assist you in creating such setups. We also provide help for migrating from other products to Proxmox VE setups. Feel free to contact us at any time for more information.

Migrating VMs from VMware ESXi to Proxmox

In response to Broadcom’s recent alterations in VMware’s subscription model, an increasing number of enterprises are reevaluating their virtualization strategies. With heightened concerns over licensing costs and accessibility to features, businesses are turning towards open source solutions for greater flexibility and cost-effectiveness. Proxmox, in particular, has garnered significant attention as a viable alternative. Renowned for its robust feature set and open architecture, Proxmox offers a compelling platform for organizations seeking to mitigate the impact of proprietary licensing models while retaining comprehensive virtualization capabilities. This trend underscores a broader industry shift towards embracing open-source technologies as viable alternatives in the virtualization landscape. As a side note, Proxmox is widely known as a viable alternative to VMware ESXi, but there are also other options available, such as bhyve, which we covered in one of our blog posts.

Benefits of Open Source Solutions

In the dynamic landscape of modern business, the choice to adopt open source solutions for virtualization presents a strategic advantage for enterprises. With platforms like KVM, Xen and even LXC containers, organizations can capitalize on the absence of license fees, unlocking significant cost savings and redirecting resources towards innovation and growth. This financial flexibility empowers companies to make strategic investments in their IT infrastructure without the burden of proprietary licensing costs. Moreover, open source virtualization promotes collaboration and transparency, allowing businesses to tailor their environments to suit their unique needs and seamlessly integrate with existing systems. Through community-driven development and robust support networks, enterprises gain access to a wealth of expertise and resources, ensuring the reliability, security, and scalability of their virtualized infrastructure. Embracing open source virtualization not only delivers tangible financial benefits but also equips organizations with the agility and adaptability needed to thrive in an ever-evolving digital landscape.

Migrating a VM

Prerequisites

To ensure a smooth migration process from VMware ESXi to Proxmox, several key steps must be taken. First, SSH access must be enabled on both the VMware ESXi host and the Proxmox host, allowing for remote management and administration. Additionally, it’s crucial to have access to both systems, facilitating the migration process. Furthermore, establishing SSH connectivity between VMware ESXi and Proxmox is essential for seamless communication between the two platforms. This ensures efficient data transfer and management during migration. Moreover, it’s imperative to configure the Proxmox system or cluster in a manner similar to the ESXi setup, especially concerning networking configurations. This includes ensuring compatibility with VLANs or VXLANs for more complex setups. Additionally, both systems should either run on local storage or have access to shared storage, such as NFS, to facilitate the transfer of virtual machine data. Lastly, before initiating the migration, it’s essential to verify that the Proxmox system has sufficient available space to accommodate the imported virtual machine, ensuring a successful transition without storage constraints.

Activate SSH on ESXi

The SSH server must be activated in order to copy the content from the ESXi system to the new location on the Proxmox server. The copy will later be initiated from the Proxmox server. Therefore, it is necessary that the Proxmox system can establish an SSH connection on tcp/22 to the ESXi system.

Find Source Information about VM on ESXi

One of the challenging matters is finding the location of the virtual machine directory holding the virtual machine disk. The path can be found within the web UI of the ESXi system.

Create a New Empty VM on Proxmox
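
In the web frontend, a new empty VM is created without installation media; note its VM ID for the disk import later. Roughly the same can be sketched on the command line (VM ID 119, the name and the resource sizes are assumptions for this example):

qm create 119 --name pgsql07.gyptazy.ch --memory 4096 --cores 2 \
    --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci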

Copy VM from ESXi to Proxmox

The content of the virtual machine (VM) will be transferred from the ESXi system to the Proxmox system using the open source tool rsync for efficient synchronization and copying. The following commands need to be executed on the Proxmox system, where we create a temporary directory to store the VM’s content:

mkdir /tmp/migration_pgsql07.gyptazy.ch
cd /tmp/migration_pgsql07.gyptazy.ch
rsync -avP root@esx02-test.gyptazy.ch:/vmfs/volumes/137b4261-68e88bae-0000-000000000000/pgsq07.gyptazy.ch/* .

Depending on the file size of the virtual machine and the network connectivity, this process may take some time.

Import VM in Proxmox

Afterwards, the disk is imported using the qm utility, specifying the VM ID (created during the VM creation process), the disk name (which has been copied over) and the destination storage on the Proxmox system where the VM disk should be stored:

qm disk import 119 pgsql07.gyptazy.ch.vmdk local-lvm

Depending on the creation format of the VM or the export format, there may be multiple disk files, which may also be suffixed with _flat. This procedure needs to be repeated for all available disks.

Starting the VM

In the final step, all settings, resources, definitions and customizations of the system should be thoroughly reviewed. Once validated, the VM can be launched, ensuring that all components are correctly configured for operation within the Proxmox environment.

Conclusion

This article only covers one of many possible methods for migrations in simple, standalone setups. In more complex environments involving multiple host nodes and different storage systems like fibre channel or network storage, there are significant differences and additional considerations. Additionally, there may be specific requirements regarding availability and Service Level Agreements (SLAs) to be considered, which can be very specific to each environment. Feel free to contact us for personalized guidance on your specific migration needs at any time. We are also pleased to offer our support in related open source areas such as virtualization (e.g., OpenStack, VirtualBox) and topics pertaining to cloud migrations.

Addendum

On the 27th of March, Proxmox released their new import wizard (pve-esxi-import-tools), which makes migrations from VMware ESXi instances to a Proxmox environment much easier. In an upcoming blog post we will provide more information about the new tooling and the cases where it is most useful, but also cover the corner cases where the new import wizard cannot be used.

SQLreduce: Reduce verbose SQL queries to minimal examples

Developers often face very large SQL queries that raise some errors. SQLreduce is a tool to reduce that complexity to a minimal query.

SQLsmith generates random SQL queries

SQLsmith is a tool that generates random SQL queries and runs them against a PostgreSQL server (and other DBMS types). The idea is that by fuzz-testing the query parser and executor, corner-case bugs can be found that would otherwise go unnoticed in manual testing or with the fixed set of test cases in PostgreSQL’s regression test suite. It has proven to be an effective tool with over 100 bugs found in different areas in the PostgreSQL server and other products since 2015, including security bugs, ranging from executor bugs to segfaults in type and index method implementations. For example, in 2018, SQLsmith found that the following query triggered a segfault in PostgreSQL:

select
  case when pg_catalog.lastval() < pg_catalog.pg_stat_get_bgwriter_maxwritten_clean() then case when pg_catalog.circle_sub_pt(
          cast(cast(null as circle) as circle),
          cast((select location from public.emp limit 1 offset 13)
             as point)) ~ cast(nullif(case when cast(null as box) &> (select boxcol from public.brintest limit 1 offset 2)
                 then (select f1 from public.circle_tbl limit 1 offset 4)
               else (select f1 from public.circle_tbl limit 1 offset 4)
               end,
          case when (select pg_catalog.max(class) from public.f_star)
                 ~~ ref_0.c then cast(null as circle) else cast(null as circle) end
            ) as circle) then ref_0.a else ref_0.a end
       else case when pg_catalog.circle_sub_pt(
          cast(cast(null as circle) as circle),
          cast((select location from public.emp limit 1 offset 13)
             as point)) ~ cast(nullif(case when cast(null as box) &> (select boxcol from public.brintest limit 1 offset 2)
                 then (select f1 from public.circle_tbl limit 1 offset 4)
               else (select f1 from public.circle_tbl limit 1 offset 4)
               end,
          case when (select pg_catalog.max(class) from public.f_star)
                 ~~ ref_0.c then cast(null as circle) else cast(null as circle) end
            ) as circle) then ref_0.a else ref_0.a end
       end as c0,
  case when (select intervalcol from public.brintest limit 1 offset 1)
         >= cast(null as "interval") then case when ((select pg_catalog.max(roomno) from public.room)
             !~~ ref_0.c)
        and (cast(null as xid) <> 100) then ref_0.b else ref_0.b end
       else case when ((select pg_catalog.max(roomno) from public.room)
             !~~ ref_0.c)
        and (cast(null as xid) <> 100) then ref_0.b else ref_0.b end
       end as c1,
  ref_0.a as c2,
  (select a from public.idxpart1 limit 1 offset 5) as c3,
  ref_0.b as c4,
    pg_catalog.stddev(
      cast((select pg_catalog.sum(float4col) from public.brintest)
         as float4)) over (partition by ref_0.a,ref_0.b,ref_0.c order by ref_0.b) as c5,
  cast(nullif(ref_0.b, ref_0.a) as int4) as c6, ref_0.b as c7, ref_0.c as c8
from
  public.mlparted3 as ref_0
where true;

However, just like in this 40-line, 2.2kB example, the random queries generated by SQLsmith that trigger some error are most often very large and contain a lot of noise that does not contribute to the error. So far, manual inspection of the query and tedious editing was required to reduce the example to a minimal reproducer that developers can use to fix the problem.

Reduce complexity with SQLreduce

This issue is solved by SQLreduce. SQLreduce takes as input an arbitrary SQL query which is then run against a PostgreSQL server. Various simplification steps are applied, checking after each step that the simplified query still triggers the same error from PostgreSQL. The end result is a SQL query with minimal complexity.

SQLreduce is effective at reducing the queries from original error reports from SQLsmith to queries that match manually-reduced queries. For example, SQLreduce can effectively reduce the above monster query to just this:

SELECT pg_catalog.stddev(NULL) OVER () AS c5 FROM public.mlparted3 AS ref_0

Note that SQLreduce does not try to derive a query that is semantically identical to the original, or produces the same query result – the input is assumed to be faulty, and we are looking for the minimal query that produces the same error message from PostgreSQL when run against a database. If the input query happens to produce no error, the minimal query output by SQLreduce will just be SELECT.

How it works

We’ll use a simpler query to demonstrate how SQLreduce works and which steps are taken to remove noise from the input. The query is bogus and contains a bit of clutter that we want to remove:

$ psql -c 'select pg_database.reltuples / 1000 from pg_database, pg_class where 0 < pg_database.reltuples / 1000 order by 1 desc limit 10'
ERROR:  column pg_database.reltuples does not exist

Let’s pass the query to SQLreduce:

$ sqlreduce 'select pg_database.reltuples / 1000 from pg_database, pg_class where 0 < pg_database.reltuples / 1000 order by 1 desc limit 10'

SQLreduce starts by parsing the input using pglast and libpg_query which expose the original PostgreSQL parser as a library with Python bindings. The result is a parse tree that is the basis for the next steps. The parse tree looks like this:

selectStmt
├── targetList
│   └── /
│       ├── pg_database.reltuples
│       └── 1000
├── fromClause
│   ├── pg_database
│   └── pg_class
├── whereClause
│   └── <
│       ├── 0
│       └── /
│           ├── pg_database.reltuples
│           └── 1000
├── orderClause
│   └── 1
└── limitCount
    └── 10

Pglast also contains a query renderer that can render back the parse tree as SQL, shown as the regenerated query below. The input query is run against PostgreSQL to determine the result, in this case ERROR: column pg_database.reltuples does not exist.

Input query: select pg_database.reltuples / 1000 from pg_database, pg_class where 0 < pg_database.reltuples / 1000 order by 1 desc limit 10
Regenerated: SELECT pg_database.reltuples / 1000 FROM pg_database, pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ORDER BY 1 DESC LIMIT 10
Query returns: ✔ ERROR:  column pg_database.reltuples does not exist

SQLreduce works by deriving new parse trees that are structurally simpler, generating SQL from them, and running these queries against the database. The first simplification steps work on the top-level node, where SQLreduce tries to remove whole subtrees to quickly find a result. The first reduction tried is to remove LIMIT 10:

SELECT pg_database.reltuples / 1000 FROM pg_database, pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ORDER BY 1 DESC ✔

The query result is still ERROR: column pg_database.reltuples does not exist, indicated by a ✔ check mark. Next, ORDER BY 1 is removed, again successfully:

SELECT pg_database.reltuples / 1000 FROM pg_database, pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ✔

Now the entire target list is removed:

SELECT FROM pg_database, pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ✔

This shorter query is still equivalent to the original regarding the error message returned when it is run against the database. Now the first unsuccessful reduction step is tried, removing the entire FROM clause:

SELECT WHERE 0 < ((pg_database.reltuples / 1000)) ✘ ERROR:  missing FROM-clause entry for table "pg_database"

That query is also faulty, but triggers a different error message, so the previous parse tree is kept for the next steps. Again a whole subtree is removed, now the WHERE clause:

SELECT FROM pg_database, pg_class ✘ no error

We have now reduced the input query so much that it doesn’t error out any more. The previous parse tree is still kept which now looks like this:

selectStmt
├── fromClause
│   ├── pg_database
│   └── pg_class
└── whereClause
    └── <
        ├── 0
        └── /
            ├── pg_database.reltuples
            └── 1000

Now SQLreduce starts digging into the tree. There are several entries in the FROM clause, so it tries to shorten the list. First, pg_database is removed, but that doesn’t work, so pg_class is removed:

SELECT FROM pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ✘ ERROR:  missing FROM-clause entry for table "pg_database"
SELECT FROM pg_database WHERE 0 < ((pg_database.reltuples / 1000)) ✔

Since we have found a new minimal query, recursion restarts at top-level with another try to remove the WHERE clause. Since that doesn’t work, it tries to replace the expression with NULL, but that doesn’t work either.

SELECT FROM pg_database ✘ no error
SELECT FROM pg_database WHERE NULL ✘ no error

Now a new kind of step is tried: expression pull-up. We descend into the WHERE clause, where we replace A < B first by A and then by B.

SELECT FROM pg_database WHERE 0 ✘ ERROR:  argument of WHERE must be type boolean, not type integer
SELECT FROM pg_database WHERE pg_database.reltuples / 1000 ✔
SELECT WHERE pg_database.reltuples / 1000 ✘ ERROR:  missing FROM-clause entry for table "pg_database"

The first try did not work, but the second one did. Since we simplified the query, we restart at top-level to check if the FROM clause can be removed, but it is still required.

From A / B, we can again pull up A:

SELECT FROM pg_database WHERE pg_database.reltuples ✔
SELECT WHERE pg_database.reltuples ✘ ERROR:  missing FROM-clause entry for table "pg_database"

SQLreduce has found the minimal query that still raises ERROR: column pg_database.reltuples does not exist with this parse tree:

selectStmt
├── fromClause
│   └── pg_database
└── whereClause
    └── pg_database.reltuples

At the end of the run, the query is printed along with some statistics:

Minimal query yielding the same error:
SELECT FROM pg_database WHERE pg_database.reltuples

Pretty-printed minimal query:
SELECT
FROM pg_database
WHERE pg_database.reltuples

Seen: 15 items, 915 Bytes
Iterations: 19
Runtime: 0.107 s, 139.7 q/s

This minimal query can now be inspected to fix the bug in PostgreSQL or in the application.

About credativ

The credativ GmbH is a manufacturer-independent consulting and service company located in Moenchengladbach, Germany. With over 22 years of development and service experience in the open source space, credativ GmbH can assist you with unparalleled and individually customizable support. We are here to help and assist you in all your open source infrastructure needs.

Since the successful merger with Instaclustr in 2021, credativ GmbH has been the European headquarters of the Instaclustr Group, which helps organizations deliver applications at scale through its managed platform for open source technologies such as Apache Cassandra®, Apache Kafka®, Apache Spark™, Redis™, OpenSearch®, PostgreSQL®, and Cadence.
Instaclustr combines a complete data infrastructure environment with hands-on technology expertise to ensure ongoing performance and optimization. By removing the infrastructure complexity, we enable companies to focus internal development and operational resources on building cutting edge customer-facing applications at lower cost. Instaclustr customers include some of the largest and most innovative Fortune 500 companies.

Patroni is a clustering solution for PostgreSQL® that is getting more and more popular in the cloud and Kubernetes sector due to its operator pattern and integration with Etcd or Consul. Some time ago we wrote a blog post about the integration of Patroni into Debian. Recently, the vip-manager project, which is closely related to Patroni, has been uploaded to Debian by us. In the following, we will present vip-manager and how we integrated it into Debian.

To recap, Patroni uses a distributed consensus store (DCS) for leader election and failover. The current cluster leader periodically updates its leader key in the DCS. As soon as the key cannot be updated by Patroni for whatever reason, it becomes stale. A new leader election is then initiated among the remaining cluster nodes.

PostgreSQL Client Solutions for High Availability

From the user’s point of view it needs to be ensured that the application is always connected to the leader, as no write transactions are possible on the read-only standbys. Conventional high-availability solutions like Pacemaker utilize virtual IPs (VIPs) that are moved to the primary node in the case of a failover.

For Patroni, such a mechanism did not exist so far. Usually, HAProxy (or a similar solution) is used which does periodic health-checks on each node’s Patroni REST-API and routes the client requests to the current leader.

An alternative is client-based failover (which is available since PostgreSQL 10), where all cluster members are configured in the client connection string. After a connection failure the client tries each remaining cluster member in turn until it reaches a new primary.
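
Such a connection string could look like this in libpq syntax (the host names are assumptions for a hypothetical three-node cluster):

postgresql://pg1,pg2,pg3/postgres?target_session_attrs=read-write

With target_session_attrs=read-write, the client only accepts a connection to a node that allows write transactions, i.e. the current primary.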

vip-manager

A new and comfortable approach to client failover is vip-manager. It is a service written in Go that gets started on all cluster nodes and connects to the DCS. If the local node owns the leader-key, vip-manager starts the configured VIP. In case of a failover, vip-manager removes the VIP on the old leader and the corresponding service on the new leader starts it there. The clients are configured for the VIP and will always connect to the cluster leader.

Debian-Integration of vip-manager

For Debian, the pg_createconfig_patroni program from the Patroni package has been adapted so that it can now create a vip-manager configuration:

pg_createconfig_patroni 11 test --vip=10.0.3.2

Similar to Patroni, we start the service for each instance:

systemctl start vip-manager@11-test

The output of patronictl shows that pg1 is the leader:

+---------+--------+------------+--------+---------+----+-----------+
| Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
+---------+--------+------------+--------+---------+----+-----------+
| 11-test |  pg1   | 10.0.3.247 | Leader | running |  1 |           |
| 11-test |  pg2   | 10.0.3.94  |        | running |  1 |         0 |
| 11-test |  pg3   | 10.0.3.214 |        | running |  1 |         0 |
+---------+--------+------------+--------+---------+----+-----------+

In the journal of pg1 it can be seen that the VIP has been configured:

Jan 19 14:53:38 pg1 vip-manager[9314]: 2020/01/19 14:53:38 IP address 10.0.3.2/24 state is false, desired true
Jan 19 14:53:38 pg1 vip-manager[9314]: 2020/01/19 14:53:38 Configuring address 10.0.3.2/24 on eth0
Jan 19 14:53:38 pg1 vip-manager[9314]: 2020/01/19 14:53:38 IP address 10.0.3.2/24 state is true, desired true

If LXC containers are used, one can also see the VIP in the output of lxc-ls -f:

NAME    STATE   AUTOSTART GROUPS IPV4                 IPV6 UNPRIVILEGED
pg1     RUNNING 0         -      10.0.3.2, 10.0.3.247 -    false
pg2     RUNNING 0         -      10.0.3.94            -    false
pg3     RUNNING 0         -      10.0.3.214           -    false

The vip-manager packages are available for Debian testing (bullseye) and unstable, as well as for the upcoming 20.04 LTS Ubuntu release (focal) in the official repositories. For Debian stable (buster), as well as for Ubuntu 19.04 and 19.10, packages are available at apt.postgresql.org maintained by credativ, along with the updated Patroni packages with vip-manager integration.

Switchover Behaviour

In case of a planned switchover, e.g. pg2 becomes the new leader:

# patronictl -c /etc/patroni/11-test.yml switchover --master pg1 --candidate pg2 --force
Current cluster topology
+---------+--------+------------+--------+---------+----+-----------+
| Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
+---------+--------+------------+--------+---------+----+-----------+
| 11-test |  pg1   | 10.0.3.247 | Leader | running |  1 |           |
| 11-test |  pg2   | 10.0.3.94  |        | running |  1 |         0 |
| 11-test |  pg3   | 10.0.3.214 |        | running |  1 |         0 |
+---------+--------+------------+--------+---------+----+-----------+
2020-01-19 15:35:32.52642 Successfully switched over to "pg2"
+---------+--------+------------+--------+---------+----+-----------+
| Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
+---------+--------+------------+--------+---------+----+-----------+
| 11-test |  pg1   | 10.0.3.247 |        | stopped |    |   unknown |
| 11-test |  pg2   | 10.0.3.94  | Leader | running |  1 |           |
| 11-test |  pg3   | 10.0.3.214 |        | running |  1 |         0 |
+---------+--------+------------+--------+---------+----+-----------+

The VIP has now been moved to the new leader:

NAME    STATE   AUTOSTART GROUPS IPV4                 IPV6 UNPRIVILEGED
pg1     RUNNING 0         -      10.0.3.247          -    false
pg2     RUNNING 0         -      10.0.3.2, 10.0.3.94 -    false
pg3     RUNNING 0         -      10.0.3.214          -    false

This can also be seen in the journals, both from the old leader:

Jan 19 15:35:31 pg1 patroni[9222]: 2020-01-19 15:35:31,634 INFO: manual failover: demoting myself
Jan 19 15:35:31 pg1 patroni[9222]: 2020-01-19 15:35:31,854 INFO: Leader key released
Jan 19 15:35:32 pg1 vip-manager[9314]: 2020/01/19 15:35:32 IP address 10.0.3.2/24 state is true, desired false
Jan 19 15:35:32 pg1 vip-manager[9314]: 2020/01/19 15:35:32 Removing address 10.0.3.2/24 on eth0
Jan 19 15:35:32 pg1 vip-manager[9314]: 2020/01/19 15:35:32 IP address 10.0.3.2/24 state is false, desired false

As well as from the new leader pg2:

Jan 19 15:35:31 pg2 patroni[9229]: 2020-01-19 15:35:31,881 INFO: promoted self to leader by acquiring session lock
Jan 19 15:35:31 pg2 vip-manager[9292]: 2020/01/19 15:35:31 IP address 10.0.3.2/24 state is false, desired true
Jan 19 15:35:31 pg2 vip-manager[9292]: 2020/01/19 15:35:31 Configuring address 10.0.3.2/24 on eth0
Jan 19 15:35:31 pg2 vip-manager[9292]: 2020/01/19 15:35:31 IP address 10.0.3.2/24 state is true, desired true
Jan 19 15:35:32 pg2 patroni[9229]: 2020-01-19 15:35:32,923 INFO: Lock owner: pg2; I am pg2

As one can see, the VIP is moved within one second.

Updated Ansible Playbook

Our Ansible playbook for the automated setup of a three-node cluster on Debian has also been updated and can now configure a VIP if desired:

# ansible-playbook -i inventory -e vip=10.0.3.2 patroni.yml

Questions and Help

Do you have any questions or need help? Feel free to write to info@credativ.com.

There are two ways for a client to authenticate to Icinga2: either with a username and password, or with a client certificate. For automated queries of the Icinga2 API, client certificates are not only advantageous from a security point of view, but also much more practical to implement on the client side.

Unfortunately, the official Icinga2 documentation does not describe the exact certificate creation process. Therefore, here is a short guide:

After installing Icinga2 the API feature has to be activated first:

icinga2 feature enable api

The next step is to configure the Icinga2 node as master; the easiest way to do this is with the node wizard:

icinga2 node wizard

Icinga2 creates the necessary CA certificate that will be used to sign the client certificates we are about to create. Now the client certificate is created:

icinga2 pki new-cert --cn <name> --key <name>.key --csr <name>.csr

The parameter --cn stands for the so-called common name. This is the name used in the Icinga2 user configuration to assign the certificate to the user. Usually the common name is the FQDN; in this scenario, however, the name is freely selectable. The file names can also be freely chosen, but it is recommended to pick names that make clear that the three files belong together.

Now the certificate has to be signed by the Icinga2 CA:

icinga2 pki sign-csr --csr <name>.csr --cert <name>.crt

Finally, the API user must be created in the file "api-users.conf", which is located in the conf.d subfolder of the Icinga2 configuration (/etc/icinga2/conf.d/ by default):

object ApiUser "<username>" {
  client_cn = "<name>"
  permissions = []
}

For a detailed explanation of the user’s assignment of rights, it is worth taking a look at the documentation.

Last but not least, Icinga2 has to be restarted. Afterwards, the user can access the Icinga2 API without entering a username and password, provided the client presents the certificates with the request.
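
To illustrate, an authenticated query against the API could look like this (a sketch; the host name and file names are placeholders, 5665 is the default API port):

# query the Icinga2 API with the client certificate instead of a password;
# ca.crt is the CA certificate created by the node wizard
curl -s --cacert ca.crt --cert <name>.crt --key <name>.key \
  'https://icinga2.example.com:5665/v1/objects/hosts'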

You can read up on the services we provide for Icinga2 right here.

This post was originally written by Bernd Borowski.

One would think that microcode updates are basically unproblematic on modern Linux distributions. This is fundamentally correct. Nevertheless, there are always edge cases in which distribution developers may have missed something.

Using the example of Ubuntu 18.04 LTS "Bionic Beaver" in combination with the XEN hypervisor, this becomes obvious when it comes to processor microcode updates.

Ubuntu delivers updated microcode packages for both AMD and Intel. However, these are apparently not applied to the processor.

XEN Microkernel

The reason for this is not obvious. Under XEN, the host system itself is paravirtualized and, for security reasons, cannot directly influence the CPU. Accordingly, manual attempts to change the current microcode fail.

Therefore, the XEN microkernel has to take care of the microcode patching. Instructed correctly, it will do so at boot time.

Customize command line in Grub

For the XEN kernel to patch the microcode of the CPU, it must on the one hand have access to the microcode files at boot time, and on the other hand it must be instructed to apply them. The latter is achieved through the Grub boot loader configuration, by setting a parameter on the kernel command line.

In the case of Ubuntu 18.04 LTS, the Grub configuration files can be found under /etc/default/.

There you should find the file xen.cfg (under /etc/default/grub.d/). This is of course only the case if the XEN hypervisor package is installed. Open the config file in your editor and look for the variable GRUB_CMDLINE_XEN_DEFAULT. Add the parameter ucode=scan. In the default state, the line in xen.cfg should then look like this:

GRUB_CMDLINE_XEN_DEFAULT="ucode=scan"

Customize Initramfs

In addition to the instruction, the microkernel of the XEN hypervisor also needs access to the respective microcode files as well as the ‘Intel Microcode Tool’, if applicable.

While the microcode packages are usually already installed correctly, the Intel tool may have to be installed via sudo apt-get install iucode-tool. Care must also be taken to ensure that the microcode files actually get into the initial ramdisk. For this purpose, Ubuntu already provides matching scripts.

In the default state, the system tries to select the microcode applicable to the CPU when building the InitramFS. Unfortunately, this does not always succeed, so you might have to intervene manually.

With the command sudo lsinitramfs /boot/initrd.img-4.15.0-46-generic you can, for example, check which contents are stored in the InitramFS named initrd.img-4.15.0-46-generic. If, on an Intel system, AMD microcode shows up but no Intel microcode, the automatic processor detection went wrong when the initial ramdisk was created.

To get this right, you need to look at the files amd64-microcode and intel-microcode in the directory /etc/default. Each of these two config files has an INITRAMFS variable, AMD64UCODE_INITRAMFS and IUCODE_TOOL_INITRAMFS respectively. The valid values are "no", "auto", and "early"; the default is "auto", with which the system attempts the auto-detection mentioned above. If that fails, set the value to "early" in the file matching the manufacturer of your CPU, and to "no" in the other file. If the manufacturer is Intel, you can set the following additional variable in intel-microcode:

IUCODE_TOOL_SCANCPUS=yes

This causes the scripts to perform extended CPU detection, so that only the microcode files matching the CPU are included in the InitramFS. This helps avoid an oversized initial ramdisk.
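
Put together for an Intel system on which the auto-detection failed, the two files could look like this (a sketch; swap the roles for an AMD system):

# /etc/default/intel-microcode - load the Intel microcode early
IUCODE_TOOL_INITRAMFS=early
IUCODE_TOOL_SCANCPUS=yes

# /etc/default/amd64-microcode - skip the AMD microcode on this system
AMD64UCODE_INITRAMFS=no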

Finalize changes

Both the changes to the grub config and the adjustments to the InitramFS must also be finalized. This is done via

sudo update-initramfs -u
sudo update-grub

A subsequent restart of the hypervisor will then let the XEN microkernel apply the microcode patches provided in the InitramFS to the CPU.
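
Whether the hypervisor actually applied an update can be checked in its boot log afterwards, a sketch assuming the xl toolstack is installed:

# search the XEN hypervisor boot messages for microcode-related entries
xl dmesg | grep -i microcode
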
Is it worth the effort?

Adjustments to the microcode of processors are important: CPU manufacturers use them to troubleshoot the "hardware" they sell. These fixes can be very important to maintain the integrity and security of your server system – as we saw last year when the Spectre and Meltdown vulnerabilities were disclosed. Of course, microcode updates can also be seen negatively, since the fixes for "Spectre" and "Meltdown" impose performance losses. Whether or not to integrate the microcode updates is therefore a question of risk versus reward, and there are quite different views on this, which have to be considered in the context of the system's application.

A virtualization host that runs third-party virtual machines has entirely different security requirements than a hypervisor that is deeply embedded in the internal infrastructure and only runs trusted VMs. Between these two extremes, there are, of course, a few shades to deal with.

In this article we will look at the highly available operation of PostgreSQL® in a Kubernetes environment. A topic that is certainly of particular interest to many of our PostgreSQL® users.

Together with our partner company MayaData, we will demonstrate below the application possibilities and advantages of the extremely powerful open source project – OpenEBS.

OpenEBS is a freely available storage management system, whose development is supported and backed by MayaData.

We would like to thank Murat Karslioglu from MayaData and our colleague Adrian Vondendriesch for this interesting and helpful article. This article simultaneously also appeared on OpenEBS.io.

PostgreSQL® anywhere — via Kubernetes with some help from OpenEBS and credativ engineering

by Murat Karslioglu, OpenEBS and Adrian Vondendriesch, credativ

Introduction

If you are already running Kubernetes on some form of cloud whether on-premises or as a service, you understand the ease-of-use, scalability and monitoring benefits of Kubernetes — and you may well be looking at how to apply those benefits to the operation of your databases.

PostgreSQL® remains a preferred relational database, and although setting up a highly available Postgres cluster from scratch might be challenging at first, we are seeing patterns emerging that allow PostgreSQL® to run as a first class citizen within Kubernetes, improving availability, reducing management time and overhead, and limiting cloud or data center lock-in.

There are many ways to run high availability with PostgreSQL®; for a list, see the PostgreSQL® Documentation. Some common cloud-native Postgres cluster deployment projects include Crunchy Data's PostgreSQL Operator, Sorint.lab's Stolon and Zalando's Patroni/Spilo. Thus far we are seeing Zalando's operator as a preferred solution, in part because it seems to be simpler to understand and we've seen it operate well.

Some quick background on your authors:

  • OpenEBS is a broadly deployed OpenSource storage and storage management project sponsored by MayaData.
  • credativ is a leading open source support and engineering company with particular depth in PostgreSQL®.

In this blog, we’d like to briefly cover how using cloud-native or “container attached” storage can help in the deployment and ongoing operations of PostgreSQL® on Kubernetes. This is the first of a series of blogs we are considering — this one focuses more on why users are adopting this pattern and future ones will dive more into the specifics of how they are doing so.

At the end, you can see how to use a StorageClass and a preferred operator to deploy PostgreSQL® with OpenEBS as the underlying storage.

If you are curious about what container attached storage or CAS is, you can read more from the Cloud Native Computing Foundation (CNCF) here.

Conceptually you can think of CAS as being the decomposition of previously monolithic storage software into containerized microservices that themselves run on Kubernetes. This gives the advantages that already led you to run Kubernetes, now applied to the storage and data management layer as well. Of special note is that, like Kubernetes, OpenEBS runs anywhere, so the same advantages below apply whether on-premises or on any of the many hosted Kubernetes services.

PostgreSQL® plus OpenEBS

[Figure: PostgreSQL® with OpenEBS persistent volumes]

Prerequisites:

  • Postgres-Operator (for cluster deployment)
  • Docker installed
  • Kubernetes 1.9+ cluster installed
  • kubectl installed
  • OpenEBS installed

    Install OpenEBS

    1. If OpenEBS is not installed in your K8s cluster, this can be done from here. If OpenEBS is already installed, go to the next step.
    2. Connect to MayaOnline (Optional): Connecting the Kubernetes cluster to MayaOnline provides good visibility of storage resources. MayaOnline has various support options for enterprise customers.

    Configure cStor Pool

    1. If cStor Pool is not configured in your OpenEBS cluster, this can be done from here. As PostgreSQL® is a StatefulSet application, it requires a single storage replication factor. If you prefer additional redundancy you can always increase the replica count to 3.
      During cStor Pool creation, make sure that the maxPools parameter is set to >=3. If a cStor pool is already configured, go to the next step. Sample YAML named openebs-config.yaml for configuring cStor Pool is provided in the Configuration details below.

    openebs-config.yaml

    # Use the following YAML to create a cStor Storage Pool
    # and an associated storage class.
    apiVersion: openebs.io/v1alpha1
    kind: StoragePoolClaim
    metadata:
      name: cstor-disk
    spec:
      name: cstor-disk
      type: disk
      poolSpec:
        poolType: striped
      # NOTE - Appropriate disks need to be fetched using `kubectl get disks`
      #
      # `Disk` is a custom resource supported by OpenEBS with `node-disk-manager`
      # as the disk operator
      # Replace the following with actual disk CRs from your cluster `kubectl get disks`
      # Uncomment the below lines after updating the actual disk names.
      disks:
        diskList:
        # Replace the following with actual disk CRs from your cluster from `kubectl get disks`
        # - disk-184d99015253054c48c4aa3f17d137b1
        # - disk-2f6bced7ba9b2be230ca5138fd0b07f1
        # - disk-806d3e77dd2e38f188fdaf9c46020bdc
        # - disk-8b6fb58d0c4e0ff3ed74a5183556424d
        # - disk-bad1863742ce905e67978d082a721d61
        # - disk-d172a48ad8b0fb536b9984609b7ee653
    ---

    Create Storage Class

    1. You must configure a StorageClass to provision cStor volume on a cStor pool. In this solution, we are using a StorageClass to consume the cStor Pool which is created using external disks attached on the Nodes. The storage pool is created using the steps provided in the Configure StoragePool section. In this solution, PostgreSQL® is a deployment. Since it requires replication at the storage level the cStor volume replicaCount is 3. Sample YAML named openebs-sc-pg.yaml to consume cStor pool with cStorVolume Replica count as 3 is provided in the configuration details below.

    openebs-sc-pg.yaml

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: openebs-postgres
      annotations:
        openebs.io/cas-type: cstor
        cas.openebs.io/config: |
          - name: StoragePoolClaim
            value: "cstor-disk"
          - name: ReplicaCount
            value: "3"       
    provisioner: openebs.io/provisioner-iscsi
    reclaimPolicy: Delete
    ---

    Launch and test Postgres Operator

    1. Clone Zalando’s Postgres Operator.
    git clone https://github.com/zalando/postgres-operator.git
    cd postgres-operator

    Use the OpenEBS storage class

    1. Edit manifest file and add openebs-postgres as the storage class.
    nano manifests/minimal-postgres-manifest.yaml

    After adding the storage class, it should look like the example below:

    apiVersion: "acid.zalan.do/v1"
    kind: postgresql
    metadata:
      name: acid-minimal-cluster
      namespace: default
    spec:
      teamId: "ACID"
      volume:
        size: 1Gi
        storageClass: openebs-postgres
      numberOfInstances: 2
      users:
        # database owner
        zalando:
        - superuser
        - createdb
     
    # role for application foo
        foo_user: []
     
    #databases: name->owner
      databases:
        foo: zalando
      postgresql:
        version: "10"
        parameters:
          shared_buffers: "32MB"
          max_connections: "10"
          log_statement: "all"

    Start the Operator

    1. Run the command below to start the operator
    kubectl create -f manifests/configmap.yaml # configuration
    kubectl create -f manifests/operator-service-account-rbac.yaml # identity and permissions
    kubectl create -f manifests/postgres-operator.yaml # deployment

    Create a Postgres cluster on OpenEBS

    Optional: The operator can run in a namespace other than default. For example, to use the test namespace, run the following before deploying the operator’s manifests:

    kubectl create namespace test
    kubectl config set-context $(kubectl config current-context) --namespace=test
    1. Run the command below to deploy from the example manifest:
    kubectl create -f manifests/minimal-postgres-manifest.yaml

    2. It only takes a few seconds to get the persistent volume (PV) for the pgdata-acid-minimal-cluster-0 up. Check PVs created by the operator using the kubectl get pv command:

    $ kubectl get pv
    NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
    pvc-8852ceef-48fe-11e9-9897-06b524f7f6ea 1Gi RWO Delete Bound default/pgdata-acid-minimal-cluster-0 openebs-postgres 8m44s
    pvc-bfdf7ebe-48fe-11e9-9897-06b524f7f6ea 1Gi RWO Delete Bound default/pgdata-acid-minimal-cluster-1 openebs-postgres 7m14s
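
    You can also take a look at the cStor target pods that OpenEBS starts for each volume – a quick sketch, assuming OpenEBS was installed into the default openebs namespace:

    # each cStor volume is served by its own target pod in the openebs namespace
    kubectl get pods -n openebs | grep pvc-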

    Connect to the Postgres master and test

    1. If it is not already installed, install the psql client:
    sudo apt-get install postgresql-client

    2. Run the command below and note the hostname and host port.

    kubectl get service --namespace default | grep acid-minimal-cluster

    3. Run the commands below to connect to your PostgreSQL® DB and test. Replace the [HostPort] below with the port number from the output of the above command:

    export PGHOST=$(kubectl get svc -n default -l application=spilo,spilo-role=master -o jsonpath="{.items[0].spec.clusterIP}")
    export PGPORT=[HostPort]
    export PGPASSWORD=$(kubectl get secret -n default postgres.acid-minimal-cluster.credentials -o 'jsonpath={.data.password}' | base64 -d)
    psql -U postgres -c 'create table foo (id int)'

    Congrats, you now have the Postgres-Operator and your first test database up and running with the help of cloud-native OpenEBS storage.

    Partnership and future direction

    As this blog indicates, the teams at MayaData / OpenEBS and credativ are increasingly working together to help organizations running PostgreSQL® and other stateful workloads. In future blogs, we’ll provide more hands-on tips.

    We are looking for feedback and suggestions on where to take this collaboration. Please provide feedback below or find us on Twitter or on the OpenEBS slack community.

    Patroni is a PostgreSQL high availability solution with a focus on containers and Kubernetes. Until recently, the available Debian packages had to be configured manually and did not integrate well with the rest of the distribution. For the upcoming Debian 10 “Buster” release, the Patroni packages have been integrated into Debian’s standard PostgreSQL framework by credativ. They now allow for an easy setup of Patroni clusters on Debian or Ubuntu.

    Patroni employs a “Distributed Consensus Store” (DCS) like Etcd, Consul or Zookeeper in order to reliably run a leader election and orchestrate automatic failover. It further allows for scheduled switchovers and easy cluster-wide changes to the configuration. Finally, it provides a REST interface that can be used together with HAProxy in order to build a load balancing solution. Due to these advantages Patroni has gradually replaced Pacemaker as the go-to open-source project for PostgreSQL high availability.

    However, many of our customers run PostgreSQL on Debian or Ubuntu systems and so far Patroni did not integrate well into those. For example, it does not use the postgresql-common framework and its instances were not displayed in pg_lsclusters output as usual.

    Integration into Debian

    In a collaboration with Patroni lead developer Alexander Kukushkin from Zalando, the Debian Patroni package has been integrated into the postgresql-common framework to a large extent over the last months. This was achieved through changes both in Patroni itself and through additional programs in the Debian package. The current version 1.5.5 of Patroni contains all these changes and is now available in Debian "Buster" (testing), allowing an easy setup of Patroni clusters.

    The packages are also available on apt.postgresql.org and thus installable on Debian 9 “Stretch” and Ubuntu 18.04 “Bionic Beaver” LTS for any PostgreSQL version from 9.4 to 11.

    The most important part of the integration is the automatic generation of a suitable Patroni configuration with the pg_createconfig_patroni command. It is run similar to pg_createcluster with the desired PostgreSQL major version and the instance name as parameters:

    pg_createconfig_patroni 11 test
    

    This invocation creates a file /etc/patroni/11-test.yml, using the DCS configuration from /etc/patroni/dcs.yml which has to be adjusted according to the local setup. The rest of the configuration is taken from the template /etc/patroni/config.yml.in which is usable in itself but can be customized by the user according to their needs. Afterwards the Patroni instance is started via systemd similar to regular PostgreSQL instances:

    systemctl start patroni@11-test
    

    A simple 3-node Patroni cluster can be created and started with the following few commands, where the nodes pg1, pg2 and pg3 are considered to be hostnames and the local file dcs.yml contains the DCS configuration:

    
    for i in pg1 pg2 pg3; do ssh $i 'apt -y install postgresql-common'; done
    for i in pg1 pg2 pg3; do ssh $i 'sed -i "s/^#create_main_cluster = true/create_main_cluster = false/" /etc/postgresql-common/createcluster.conf'; done
    for i in pg1 pg2 pg3; do ssh $i 'apt -y install patroni postgresql'; done
    for i in pg1 pg2 pg3; do scp ./dcs.yml $i:/etc/patroni; done
    for i in pg1 pg2 pg3; do ssh $i 'pg_createconfig_patroni 11 test && systemctl start patroni@11-test'; done
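
    For reference, the local dcs.yml distributed above could look like this minimal sketch for an etcd-based setup (the host address is a placeholder for your DCS server):

    # write a minimal DCS configuration for etcd into the local dcs.yml
    cat > dcs.yml <<EOF
    etcd:
      host: 10.0.3.1:2379
    EOF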
    

    Afterwards, you can get the state of the Patroni cluster via

    ssh pg1 'patronictl -c /etc/patroni/11-test.yml list'
    +---------+--------+------------+--------+---------+----+-----------+
    | Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
    +---------+--------+------------+--------+---------+----+-----------+
    | 11-test |  pg1   | 10.0.3.111 | Leader | running |  1 |           |
    | 11-test |  pg2   | 10.0.3.41  |        | stopped |    |   unknown |
    | 11-test |  pg3   | 10.0.3.46  |        | stopped |    |   unknown |
    +---------+--------+------------+--------+---------+----+-----------+
    

    Leader election has happened and pg1 has become the primary. It created its instance with the Debian-specific pg_createcluster_patroni program that runs pg_createcluster in the background. Then the two other nodes clone from the leader using the pg_clonecluster_patroni program which sets up an instance using pg_createcluster and then runs pg_basebackup from the primary. After that, all nodes are up and running:

    +---------+--------+------------+--------+---------+----+-----------+
    | Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
    +---------+--------+------------+--------+---------+----+-----------+
    | 11-test |  pg1   | 10.0.3.111 | Leader | running |  1 |         0 |
    | 11-test |  pg2   | 10.0.3.41  |        | running |  1 |         0 |
    | 11-test |  pg3   | 10.0.3.46  |        | running |  1 |         0 |
    +---------+--------+------------+--------+---------+----+-----------+
    

    The well-known Debian postgresql-common commands work as well:

    ssh pg1 'pg_lsclusters'
    Ver Cluster Port Status Owner    Data directory                 Log file
    11  test    5432 online postgres /var/lib/postgresql/11/test    /var/log/postgresql/postgresql-11-test.log
    

    Failover Behaviour

    If the primary is abruptly shut down, its leader token will expire after a while and Patroni will eventually initiate failover and a new leader election:

    +---------+--------+-----------+------+---------+----+-----------+
    | Cluster | Member |    Host   | Role |  State  | TL | Lag in MB |
    +---------+--------+-----------+------+---------+----+-----------+
    | 11-test |  pg2   | 10.0.3.41 |      | running |  1 |         0 |
    | 11-test |  pg3   | 10.0.3.46 |      | running |  1 |         0 |
    +---------+--------+-----------+------+---------+----+-----------+
    [...]
    +---------+--------+-----------+--------+---------+----+-----------+
    | Cluster | Member |    Host   |  Role  |  State  | TL | Lag in MB |
    +---------+--------+-----------+--------+---------+----+-----------+
    | 11-test |  pg2   | 10.0.3.41 | Leader | running |  2 |         0 |
    | 11-test |  pg3   | 10.0.3.46 |        | running |  1 |         0 |
    +---------+--------+-----------+--------+---------+----+-----------+
    [...]
    +---------+--------+-----------+--------+---------+----+-----------+
    | Cluster | Member |    Host   |  Role  |  State  | TL | Lag in MB |
    +---------+--------+-----------+--------+---------+----+-----------+
    | 11-test |  pg2   | 10.0.3.41 | Leader | running |  2 |         0 |
    | 11-test |  pg3   | 10.0.3.46 |        | running |  2 |         0 |
    +---------+--------+-----------+--------+---------+----+-----------+
    

    The old primary will rejoin the cluster as standby once it is restarted:

    +---------+--------+------------+--------+---------+----+-----------+
    | Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
    +---------+--------+------------+--------+---------+----+-----------+
    | 11-test |  pg1   | 10.0.3.111 |        | running |    |   unknown |
    | 11-test |  pg2   | 10.0.3.41  | Leader | running |  2 |         0 |
    | 11-test |  pg3   | 10.0.3.46  |        | running |  2 |         0 |
    +---------+--------+------------+--------+---------+----+-----------+
    [...]
    +---------+--------+------------+--------+---------+----+-----------+
    | Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
    +---------+--------+------------+--------+---------+----+-----------+
    | 11-test |  pg1   | 10.0.3.111 |        | running |  2 |         0 |
    | 11-test |  pg2   | 10.0.3.41  | Leader | running |  2 |         0 |
    | 11-test |  pg3   | 10.0.3.46  |        | running |  2 |         0 |
    +---------+--------+------------+--------+---------+----+-----------+
    

    If a clean rejoin is not possible due to additional transactions on the old timeline the old primary gets re-cloned from the current leader. In case the data is too large for a quick re-clone, pg_rewind can be used. In this case a password needs to be set for the postgres user and regular database connections (as opposed to replication connections) need to be allowed between the cluster nodes.

    Creation of additional Instances

    It is also possible to create further clusters with pg_createconfig_patroni, one can either assign a PostgreSQL port explicitly via the --port option, or let pg_createconfig_patroni assign the next free port as is known from pg_createcluster:

    for i in pg1 pg2 pg3; do ssh $i 'pg_createconfig_patroni 11 test2 && systemctl start patroni@11-test2'; done
    ssh pg1 'patronictl -c /etc/patroni/11-test2.yml list'
    +----------+--------+-----------------+--------+---------+----+-----------+
    | Cluster  | Member |       Host      |  Role  |  State  | TL | Lag in MB |
    +----------+--------+-----------------+--------+---------+----+-----------+
    | 11-test2 |  pg1   | 10.0.3.111:5433 | Leader | running |  1 |         0 |
    | 11-test2 |  pg2   |  10.0.3.41:5433 |        | running |  1 |         0 |
    | 11-test2 |  pg3   |  10.0.3.46:5433 |        | running |  1 |         0 |
    +----------+--------+-----------------+--------+---------+----+-----------+
    

    Ansible Playbook

    In order to easily deploy a 3-node Patroni cluster we have created an Ansible playbook on Github. It automates the installation and configuration of PostgreSQL and Patroni on the three nodes, as well as the DCS server on a fourth node.

    Questions and Help

    Do you have any questions or need help? Feel free to write to info@credativ.com.

    In this post we describe how to integrate Icinga2 with Graphite and Grafana on Debian stable (jessie).

    What is Graphite?

    Graphite stores performance data over a configurable period. Via a defined interface, services can send metrics to Graphite, which are then stored for the designated period. Typical examples of useful metrics are the CPU utilization or the traffic of a web server. Graphs can be generated from the various metrics via the integrated web interface of Graphite, allowing changes in values to be observed across different time periods. A good example of where this type of trend analysis is useful is monitoring the usage of a hard drive: a trend graph makes it easy to see the growth rate of the footprint and to anticipate when additional storage will become necessary.

    What is Grafana?

    Graphite does have its own web interface, but it is neither attractive nor flexible. This is where Grafana comes in.

    Grafana is a frontend for various storage metrics, supporting Graphite, InfluxDB and OpenTSDB, among others. Grafana offers an intuitive interface through which you can create graphs to represent the metrics and a variety of functions to optimise their appearance and presentation. The graphs can then be summarized in dashboards; you can also opt to display one from a specific host only, through parameterization.

    Installation of Icinga2

    Here we only describe the installation steps that are essential for Graphite. Icinga2 packages for the current version of Debian can be obtained directly from the Debmon project.
    The Debmon project provides up-to-date versions of the various monitoring tools for the different Debian releases, maintained by the official Debian package maintainers. To include these packages, use the following commands:

    # add debmon
    cat <<EOF >/etc/apt/sources.list.d/debmon.list
    deb http://debmon.org/debmon debmon-jessie main
    EOF
    
    # add debmon key
    wget -O - http://debmon.org/debmon/repo.key 2>/dev/null | apt-key add -
    
    # update repos
    apt-get update

    Finally we can install Icinga2:

    apt-get install icinga2
    

    Installation of Graphite and Graphite-Web

    Once Icinga2 is installed, Graphite and Graphite-Web can also be installed.

    # install packages for icinga2 and graphite-web and carbon
    
    apt-get install icinga2 graphite-web graphite-carbon libapache2-mod-wsgi apache2

    Configuration of Icinga2 with Graphite

    Icinga2 must be configured so that all defined metrics are exported to Graphite. The Graphite component that receives this data is called "Carbon". In our sample installation, Carbon runs on the same host as Icinga2 and uses the default port, meaning no further configuration of Icinga2 is necessary – it is enough to enable the export.

    To do this, simply enter the command: icinga2 feature enable graphite
    Subsequently Icinga2 must be restarted: service icinga2 restart

    If the Carbon server is running on a different host or on a different port, the file /etc/icinga2/features-enabled/graphite.conf should be adjusted accordingly. Details can be found in the Icinga2 documentation.

    If the configuration was successful, then after a short time a number of files should appear in "/var/lib/graphite/whisper/icinga". If they don't, you should take a look at the log file of Icinga2 (located at "/var/log/icinga2/icinga2.log").
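
    A quick check on the command line could look like this (using the paths mentioned above):

    # verify that Icinga2 is writing metrics into the Carbon/Whisper storage
    ls -R /var/lib/graphite/whisper/icinga | head
    # watch the Icinga2 log if nothing shows up
    tail -f /var/log/icinga2/icinga2.log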

    Configuration of Graphite – Web

    Grafana uses the frontend of Graphite as its interface to the stored metrics, so the correct configuration of Graphite-web is very important. For performance reasons, we operate Graphite-web as a WSGI module. A number of configuration steps are necessary:

    1. First, we create a user database for Graphite-web. Since we will not have many users, at this point we use SQLite as the backend for our user data. We do this using the following commands which initialize the user database and transfer ownership to the user under which the Web front-end runs:
      graphite-manage syncdb
      chown _graphite:_graphite /var/lib/graphite/graphite.db
    2. Then we activate the WSGI module in Apache: a2enmod wsgi
    3. For simplicity, the web interface should run in a separate virtual host and on its own port. So that Apache listens on this port, we add the line "Listen 8000" to the file "/etc/apache2/ports.conf".
    4. The Graphite Debian package already provides a configuration file for Apache, which is fine to use with slight adaptations: cp /usr/share/graphite-web/apache2-graphite.conf /etc/apache2/sites-available/graphite.conf. In order for the virtual host to also use port 8000, we replace the line:
      <VirtualHost *:80>

      with:

      <VirtualHost *:8000>
    5. We can now activate the new virtual host via a2ensite graphite and restart Apache: systemctl restart apache2
    6. You should now be able to reach Graphite-web at http://YOURIP:8000/. If you cannot, the Apache log files in “/var/log/apache2/” can provide valuable information.
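
    The last step can also be verified from the server itself, a quick sketch:

    # Graphite-web should now respond on port 8000
    curl -sI http://localhost:8000/ | head -n 1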

    Configuration of Grafana

    Grafana is not currently included in Debian. However, the author offers an Apt repository from which you can install Grafana. Even if the repository points to Wheezy, the packages should still function under Debian Jessie.

    The repository is only accessible via https – so first, you need to install https support for apt: apt-get install apt-transport-https

    Next, the repository can be integrated.

    # add repo (package for wheezy works on jessie)
    cat <<EOF >/etc/apt/sources.list.d/grafana.list
    deb https://packagecloud.io/grafana/stable/debian/ wheezy main
    EOF
     
    # add key
    curl -s https://packagecloud.io/gpg.key | sudo apt-key add -
     
    # update repos
    apt-get update

    After this, the package can be installed: apt-get install grafana. For Grafana to run, we still need to enable the service (systemctl enable grafana-server.service) and start it (systemctl start grafana-server).

    Grafana is now accessible at http://YOURIP:3000/. The default user name and password in our example is ‘admin’. This password should, of course, be replaced by a secure password at the earliest opportunity.

    Grafana must then be configured so that it uses Graphite as a data source. For simplicity, the configuration is explained in the following screencast:

    After setting up Graphite as the data source, we can create our first graph. Here is another short screencast to illustrate this:

    Congratulations! You have successfully installed and configured Icinga2, Graphite and Grafana. For subsequent steps, please refer to the documentation of the specific projects.

    To read more about credativ’s work with Debian, please read our Debian blogs.

    Large numbers of Debian installations can be greatly simplified with the use of Preseeding and Netboot. Friedrich Weber, a school student on a work experience placement with us at our German office, has observed the process and captured it in a Howto here:

    Imagine the following situation: you find yourself with ten to twenty brand-new notebooks and the task of installing Debian on them, customised to your own taste. It would hardly be great fun to manually perform the Debian installation and configuration on each notebook.

    This is where Debian Preseed comes into play. The concept is simple: during an installation, whoever performs it is faced with a number of questions (e.g. language, partitioning, packages, boot loader, etc.). With Preseed, all of these questions can be answered in advance; only those not already answered in the Preseed configuration remain for the Debian installer to ask. Ideally, these remaining questions appear at the outset of the installation – typically settings that differ per target system, which the administrator deals with manually – and once they are answered, the rest of the installation runs unattended.

    Preseed is driven by a simple configuration file: preseed.cfg. As outlined above, it contains the answers to the questions asked during installation, in debconf format. Such a file consists of several lines, each of which defines a debconf configuration option – a response to a question – for example:

    d-i debian-installer/locale string de_DE.UTF-8

    The first element of such a line is the name of the package being configured (d-i is an abbreviation for the Debian installer), the second element is the name of the option being set, the third element is the type of the option (here, a string), and the rest is the value of the option.

    In this example, we set the language to German with UTF-8 encoding. You can put lines like this together yourself, or generate them even more easily with the tool debconf-get-selections, which dumps the debconf options that are set on the local system.
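
    For example, the following dumps both the answers recorded during installation and the current debconf database of the local system into a file to pick from (a sketch; debconf-get-selections is shipped in the debconf-utils package):

    # answers recorded by the installer of the local system
    debconf-get-selections --installer > preseed-candidates.txt
    # plus the current debconf database of the installed system
    debconf-get-selections >> preseed-candidates.txt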

    From this output you can pick your desired settings, adjust them if necessary, and copy them into preseed.cfg. Here is an example of a preseed.cfg:

    d-i debian-installer/locale string de_DE.UTF-8
    d-i debian-installer/keymap select de-latin1
    d-i console-keymaps-at/keymap select de
    d-i languagechooser/language-name-fb select German
    d-i countrychooser/country-name select Germany
    d-i console-setup/layoutcode string de_DE
    d-i clock-setup/utc boolean true
    d-i time/zone string Europe/Berlin
    d-i clock-setup/ntp boolean true
    d-i clock-setup/ntp-server string ntp1
     
    tasksel tasksel/first multiselect standard, desktop, gnome-desktop, laptop
    
    d-i pkgsel/include string openssh-client vim less rsync

    In addition to language and timezone settings, these options also select tasks and packages. With this file alone the installation will not run completely unattended, but it is a good start.

    Now onto the question of where Preseed pulls its data from. It is in fact possible to use Preseed with CD and DVD images or USB sticks, but it is generally more comfortable to use a Debian Netboot image, essentially an installer that is started across the network and that can also fetch its Preseed configuration over the network.

    Booting across the network is implemented with PXE and requires a system that can boot from its network card. When it does, the system first obtains an IP address from a DHCP server via broadcast.

    The DHCP server transmits not only a suitable IP address, but also the IP of a so-called boot server. The boot server is a TFTP server that provides a boot loader, which in turn presents the administrator with the desired Debian installer. Via boot options, the Debian installer can additionally be told that Preseed should be used and where to find the Preseed configuration. Here is a snippet of the PXELINUX configuration file pxelinux.cfg/default:

    label i386
    kernel debian-installer/i386/linux
    append vga=normal initrd=debian-installer/i386/initrd.gz netcfg/choose_interface=eth0 domain=example.com locale=de_DE debian-installer/country=DE debian-installer/language=de debian-installer/keymap=de-latin1-nodeadkeys console-keymaps-at/keymap=de-latin1-nodeadkeys auto-install/enable=false preseed/url=http://$server/preseed.cfg DEBCONF_DEBUG=5 -- quiet

    When the user types i386, the kernel debian-installer/i386/linux (found on the TFTP server) is downloaded and run, with a whole load of boot options passed along. The Debian installer allows debconf options to be provided as boot parameters; among other things, this is used to tell the installer where on the network to find the Preseed configuration (preseed/url).

    In order to download this Preseed configuration, the installer must first be brought onto the network, so the necessary network options are handed over as boot parameters (the option for the hostname is deliberately omitted here, as every target system has its own hostname). auto-install/enable would delay the language setup until after the network configuration, so that these settings could also be read from preseed.cfg.

    This is not necessary here, as the language settings are also handed over as kernel options, which ensures that even the network configuration stage already runs in German. The examples and configuration excerpts mentioned here are obviously summarised and shortened. Even so, this blog post should have given you a glimpse into the concept of Preseed in connection with Netboot. Finally, here is a more complete version of preseed.cfg:

    d-i debian-installer/locale string de_DE.UTF-8
    d-i debian-installer/keymap select de-latin1
    d-i console-keymaps-at/keymap select de
    d-i languagechooser/language-name-fb select German
    d-i countrychooser/country-name select Germany
    d-i console-setup/layoutcode string de_DE
    
    # Network
    d-i netcfg/choose_interface select auto
    d-i netcfg/get_hostname string debian
    d-i netcfg/get_domain string example.com
     
    # Package mirror
    d-i mirror/protocol string http
    d-i mirror/country string manual
    d-i mirror/http/hostname string debian.example.com
    d-i mirror/http/directory string /debian
    d-i mirror/http/proxy string
    d-i mirror/suite string lenny
     
    # Timezone
    d-i clock-setup/utc boolean true
    d-i time/zone string Europe/Berlin
    d-i clock-setup/ntp boolean true
    d-i clock-setup/ntp-server string ntp.example.com
     
    # Root-Account
    d-i passwd/make-user boolean false
    d-i passwd/root-password password secretpassword
    d-i passwd/root-password-again password secretpassword
     
    # Further APT-Options
    d-i apt-setup/non-free boolean false
    d-i apt-setup/contrib boolean false
    d-i apt-setup/security-updates boolean true
     
    d-i apt-setup/local0/source boolean false
    d-i apt-setup/local1/source boolean false
    d-i apt-setup/local2/source boolean false
     
    # Tasks
    tasksel tasksel/first multiselect standard, desktop
    d-i pkgsel/include string openssh-client vim less rsync
    d-i pkgsel/upgrade select safe-upgrade
     
    # Popularity-Contest
    popularity-contest popularity-contest/participate boolean true
    
    # Command to be executed after the installation. `in-target` means that the
    # following command is executed in the installed environment, rather than in
    # the installer environment.
    # Here http://$server/skript.sh is downloaded to /tmp, made executable and run.
    d-i preseed/late_command string in-target wget -P /tmp/ http://$server/skript.sh; in-target chmod +x /tmp/skript.sh; in-target /tmp/skript.sh
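
    Before rolling the file out, its syntax can be checked without applying anything (a sketch using debconf-set-selections):

    # check the preseed file for syntax errors only
    debconf-set-selections --checkonly preseed.cfg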

    All Howtos of this blog are grouped together in the Howto category – and if you happen to be looking for Support and Services for Debian you’ve come to the right place at credativ.

    This post was originally written by Irenie White.

    The administration of a large number of servers can be quite tiresome without a central configuration management. This article gives a first introduction into the configuration management tool, Puppet.

    Introduction

    In our daily work at the Open Source Support Center we maintain a large number of servers. Managing larger clusters or setups means maintaining dozens of machines with an almost identical configuration and only slight variations, if any. Without central configuration management, making small changes to the configuration would mean repeating the same step on all machines. This is where Puppet comes into play.
    As with all configuration management tools, Puppet uses a central server which manages the configuration. The clients query the server on a regular basis for new configuration via an encrypted connection. If a new configuration is found, it is imported as the server instructs: the client imports new files, modifies rights, starts services and executes commands, whatever the server says. The advantages are obvious:

    • Each configuration change is done only once, regardless of the actual number of maintained servers. Unnecessary – and pretty boring – repetition is avoided, lucky us!
    • The configuration is streamlined for all machines, which makes it much easier to maintain.
    • A central infrastructure makes it easier to quickly get an overview about the setup – “running around” is not necessary anymore.
    • Last but not least, a central configuration tree enables you to incorporate a simple version control of your configuration: for example, playing back the configuration “PRE-UPDATE” on all machines of an entire setup only takes a couple of commands!

    Technical workflow

    Puppet consists of a central server, called “Puppet Master”, and the clients, called “Nodes”. The nodes query the master for the current configuration. The master responds with a list of configuration and management items: files, services which have to be running, commands which need to be executed, and so on – the possibilities are practically endless:

    • The master can hand over files which the node copies to a defined place – if it does not already exist.
    • The node is asked to check certain file and directory permissions and to correct them if necessary.
    • Depending upon the operating system, the node checks the state of services and starts or stops them. It can also check for installed packages and if they are up to date.
    • The master can force the node to execute arbitrary commands
    • etc.

    Of course, in general all tasks can be fulfilled by handing over files from the master to the client. However, in more complex setups this kind of behaviour is not easily arranged, nor does it simplify the setup. Puppet’s strength is that it facilitates abstract system tasks (restart services, ensure installed packages, add users, etc.), regardless of the actual changed files in the background. You can even use the same statement in Puppet to configure different versions of Linux or Unix.

    Installation

    First, you need the master, the center of all the configuration you want to manage: apt-get install puppetmaster. Puppet expects that all machines in the network have FQDNs – but that should be the case anyway in a well-maintained network.
    Other machines become a node by installing the Puppet client: apt-get install puppet

    Puppet, main configuration

    The Puppet nodes do not need to be configured – they will look for a machine called "puppet" in the local network. As long as that name points to the master, you do not have to do anything else.
    Since the master provides files to the nodes, the internal file server must be configured accordingly. There are different solutions for the internal file server, depending on the needs of your setup. For example, it might be better for your setup to store all files you provide to the nodes on one place, and the actual configuration you provide to the nodes somewhere else. However, in our example we keep the files and the configuration for the nodes close, as it is outlined in Puppet’s Best Practice Guide and in the Module Configuration part of the Puppet documentation. Thus, it is enough to change the file /etc/puppet/fileserver.conf to:

    [modules]
    allow 192.168.0.1/24
    allow *.credativ.de

    Configuration of the configuration – Modules

    Puppet’s way of managing configuration is to use sets of tasks grouped by topic. For example, all tasks related to SSH should go into the module “ssh”, while all tasks related to apache should be placed in the module “apache” and so on. These sets of tasks are called “Modules” and are the core of Puppet – in a perfect Puppet setup everything is defined in modules! We will explain the structure of a SSH module to highlight the basics and ideas behind Puppet’s modules. We will also try to stay close to the Best Practise Guide to make it easier to check back against the Puppet documentation.
    Please note, however, that this example is just that, an example: in a real-world setup the SSH configuration would be a bit more dynamic, but here we focus on simple and easy-to-understand methods.

    The SSH module

    We have the following requirements:

    1. The package open-ssh must be installed and be the newest version.
    2. Each node’s sshd_config file has to be the same as the one saved on the master.
    3. In the event that the sshd_config is changed on any node, the sshd service should be restarted.
    4. The user credativ needs to have certain files in his/her directory $HOME/.ssh.

    To comply with these requirements we start by creating some necessary paths:

    mkdir -p /etc/puppet/modules/ssh/manifests
    mkdir -p /etc/puppet/modules/ssh/files

    The directory “manifests” contains the actual configuration instructions of the module and the directory “files” provides the files we hand over to the clients.
    The instructions themselves are written down in init.pp in the "manifests" directory. The set of instructions to fulfil aims 1-4 is grouped in a so-called "class". For each task, a "class" has one subsection, a type. So in our case we have four types, one for each aim:

    class ssh{
            package { "openssh-server":
                     ensure => latest,
            }
            file { "/etc/ssh/sshd_config":
                    owner   => root,
                    group   => root,
                    mode    => 644,
                    source  => "puppet:///ssh/sshd_config",
            }
            service { ssh:
                    ensure          => running,
                    hasrestart      => true,
                    subscribe       => File["/etc/ssh/sshd_config"],
            }
            file { "/home/credativ/.ssh":
                    path    => "/home/credativ/.ssh",
                    owner   => "credativ",
                    group   => "credativ",
                    mode    => 600,
                    recurse => true,
                    source  => "puppet:///ssh/ssh",
                    ensure  => [directory, present],
            }
    }

    Each type is another task and calls another action on the node:
    package
    Here we make sure that the package openssh-server is installed in the newest version.
    file
    A file on the node is compared with the version on the server and overwritten if necessary. Also, the rights are adjusted.
    service
    Well, as the name says, this deals with services: in our case the service must be running on the node. Also, in case the file /etc/ssh/sshd_config is modified, the service is restarted automatically.
    file
    Here we have again the file type, but this time we do not compare a file, but an entire directory.

    As mentioned above, the files and directories you configured so that the server provides them to the nodes must be available in the directory /etc/puppet/modules/ssh/files/.
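
    For our example this results in the following layout on the master (a sketch; the paths follow from the puppet:///ssh/... sources used in the class above):

    # the sshd_config handed out via puppet:///ssh/sshd_config
    cp /etc/ssh/sshd_config /etc/puppet/modules/ssh/files/sshd_config
    # the directory handed out via puppet:///ssh/ssh for the user's ~/.ssh
    mkdir -p /etc/puppet/modules/ssh/files/ssh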

    Nodes and modules

    We now have three parts: the master, the nodes and the modules. The next step is to tell the master which nodes are related to which modules. First, you must tell the master that this module exists in /etc/puppet/manifests/modules.pp:

    import "ssh"

    Next, you need to modify /etc/puppet/manifests/nodes.pp. This specifies which module is loaded for which node, and which modules should be loaded as default in the event that a node does not have a special entry. The entries for the nodes support inheritance.

    So, for example, to have the module “rsyslog” ready for all nodes but the module “ssh” only ready for the node “external” you need the following entry:

    node default {
       include rsyslog
    }
     
    node 'external' inherits default {
      include ssh
    }

    Puppet is now configured!

    Certificates – secured communication between nodes and master

    As mentioned above, the communication between master and node is encrypted, which implies that you have to verify the partners at least once. This can be done after a node queries the master for the first time. Whenever the master is queried by an unknown node, it does not provide the default configuration but instead puts the node on a waiting list. You can check the waiting list with the command # puppetca --list. To verify a node and incorporate it into the Puppet system, you sign its certificate: # puppetca --sign external.example.com. The entire process is explained in more detail in the Puppet documentation.

    Closing words

    The example introduced in this article is very simple – as I noted, a real world example would be more complex and dynamic. However, it is a good way to start with Puppet, and the documentation linked throughout this article will help the willing reader to dive deeper into the components of Puppet.

    We here at credativ's Open Source Support Center have gained considerable experience with Puppet in recent years and really like the framework. Also, in our day-to-day support and consulting work we see the market growing, as more and more customers are interested in the framework. Right now, Puppet is in the fast lane, and it will be interesting to see how more established solutions like cfengine will react to this competition.


    This post was originally written by Roland Wolters.