Why Automate? Reliability Approaches with the VMware NSX-T API

Nick Schmidt
Dec 1, 2021

I’m sure most IT workers have at least heard of REST APIs, or heard a sales pitch where a vendor insists that while a requested functionality doesn’t exist, you could build it yourself by “using the API”.

Or perhaps you’ve been in discussions where someone tried to hand you a copy of The DevOps Handbook or The Unicorn Project.

They’re right, but software development and deployment methods have completely different guiding values than infrastructure management. Speed of delivery counts for very little in infrastructure, where downtime is typically the only metric it is evaluated on.

We need to transform the industry.

The industry has proven that “value-added infrastructure” is something people want; otherwise, services like AWS, Azure, and Lumen would not be profitable. Our biggest barrier to success right now is the perception of reliability, because there is clearly demand for what we’d call abstraction of infrastructure. We can’t move as slowly as we used to, but we can’t make mistakes either.

Stuck between a rock and a hard place?

I have some good news — everybody’s just figuring this out as they go, and you don’t have to start by replacing all of your day-to-day tasks with Ansible playbooks. Let’s use automation tools to ensure Quality First, Speed Second. Machines excel at comparison operators, allowing an infrastructure administrator to test every possible aspect of infrastructure when executing a change. Here are some examples where I’ve personally seen a need for automation:

  • Large-scale routing changes: if 1,000 routes successfully migrate, and a handful of routes fail, manual checks tend to depend overly (unfairly) on the operator to eyeball the entire lot
      • Check: Before and after routes, export a difference
      • Check: All dynamic routing peers, export a difference
      • Reverse the process if anything fails
  • Certificate renewals
      • Check: If certificate exists
      • Check: If the certificate was uploaded
      • Check: If the certificate has a valid CA chain
      • Check: If the certificate was successfully installed
      • Reverse the process if anything fails
  • Adding a new VLAN or VNI to a fabric
      • Check: VLAN Spanning-Tree topology, export a difference
      • Check: EVPN AFI Peers, export a difference
      • Check: MAC Address Table, export a difference
      • Reverse the process if anything fails
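
The “export a difference” checks above can be sketched in a few lines of Python. This is a minimal illustration, assuming the before and after states have already been collected as lists of strings; the function name and the sample route tables are hypothetical, not taken from any particular platform.

```python
def diff_snapshots(before, after):
    """Return the entries removed and added between two snapshots."""
    before_set, after_set = set(before), set(after)
    return {
        "removed": sorted(before_set - after_set),
        "added": sorted(after_set - before_set),
    }


# Hypothetical route tables captured before and after a change
routes_before = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
routes_after = ["10.0.0.0/24", "10.0.2.0/24", "10.0.3.0/24"]

route_diff = diff_snapshots(routes_before, routes_after)
print(route_diff)
```

An empty “removed” list is the signal that nothing was lost in the change; anything else is a reason to reverse the process.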

The neat thing about this capability is the configuration reversal — API calls are incredibly easy to process in common programming languages (particularly compared to expect) and take fractions of a second to run — so if a tested process (it’s easy to test, too!) does fail, reversion is straightforward. Let’s cover the REST methods before exploring the deeper stuff like gNMI or YANG.

When implementing a REST API call, a client request will have several key components:

  • Headers: Important metadata about your request goes here; the server should adhere to any specification provided in HTTP headers. Whether you’re building API code or just consuming it, I’d recommend reviewing the list of supported fields and settling on a standard set.
  • Resource: This is specified by the Uniform Resource Identifier (URI), the component of the URL after the system is specified. A resource is the “what” of a RESTful interaction.
  • Body: Free-form optional text, this component provides a payload for the API call. It’s important to make sure that the server actually wants it!
  • Web Application Firewalls (WAF) can inspect header, verb, and body to determine if an API call is safe and proper.
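
To make these components concrete, here is a small sketch that assembles (but does not send) a request with Python’s standard library. The hostname, resource path, and payload are placeholders invented for illustration; Content-Type and Accept are standard HTTP header fields, but which ones a given API expects is something to confirm in its documentation.

```python
import json
import urllib.request

# Hypothetical endpoint and payload, for illustration only
url = "https://nsx.example.net/api/v1/example-resource"
payload = {"display_name": "example"}

request = urllib.request.Request(
    url,  # Resource: the "what" of the call
    data=json.dumps(payload).encode("utf-8"),  # Body: the payload
    headers={  # Headers: metadata about the request
        "Content-Type": "application/json",
        "Accept": "application/json",
    },
    method="POST",  # Verb: the type of change
)

# Inspect the assembled request without sending anything
print(request.get_method())
print(request.full_url)
print(request.header_items())
print(request.data)
```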

Verb

In a REST API, it’s important to specify the TYPE of change you intend to make prior to actually invoking it. F5 Administrators will be familiar with this, with actions like tmsh create. We have four major REST verbs: Create, Read, Update, and Delete.

When you use a particular transport, you need to implement these verbs in a method native to that transport. This is significant when using other remote command methods like SSH (tmsh does this), NETCONF, or RESTCONF, each of which implements them differently.

Fortunately for us, HTTP/1.1 seems like it’s been made for this! HTTP has plenty of verbs that match the above. Here’s a brief decoder ring.

  • GET: READ-only request, typically does not include a message body.
      • This will normally use a URI to specify what details you want to grab.
      • Since you’re “getting” information here, typically you’d want to JSON pretty-print the output
  • POST: CREATE request, if you’re making a new object on a remote system a message body is typically required and POST conveniently supports that.
  • POST: READ request, occasionally used when a query requires a message body.
      • URIs don’t always cut it when it comes to remote filtered requests or complex multi-tier queries.
      • Cisco NX-API avoids GET as a READ verb, and primarily uses POST instead with the REST verbs in the body
  • PUT: UPDATE request, is idempotent. Generally does not contain a lot of change safety, as it will implement or fully replace an object.
      • Situations definitely exist that you want to be idempotent, and this is the verb for that.
      • Doesn’t require a body
  • PATCH: MODIFY request, will only modify an existing object.
      • This will take considerably more work to structure, as PATCH can optionally be safely executed, but the responsibility for assembling requests safely in this manner is on the developer.
      • Most API implementations simply use POST instead and implement change safety in the back-end.
  • DELETE: DELETE request, does exactly what it sounds like, it makes a resource disappear.
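
The difference between these verbs, idempotency in particular, is easier to see with a toy example. The class below is an in-memory stand-in written purely for illustration; it is not NSX-T or any real API, and real servers may behave differently.

```python
import itertools


class InMemoryApi:
    """Toy resource store to illustrate verb semantics; not a real API."""

    def __init__(self):
        self.objects = {}
        self._ids = itertools.count(1)

    def get(self, object_id):
        # READ: return a copy, never change state
        return dict(self.objects[object_id])

    def post(self, body):
        # CREATE: every call makes a new object with a server-assigned ID
        new_id = str(next(self._ids))
        self.objects[new_id] = dict(body)
        return new_id

    def put(self, object_id, body):
        # UPDATE: create or fully replace the object at a known ID
        self.objects[object_id] = dict(body)

    def patch(self, object_id, changes):
        # MODIFY: change only the named fields of an existing object
        self.objects[object_id].update(changes)

    def delete(self, object_id):
        # DELETE: remove the object
        del self.objects[object_id]


api = InMemoryApi()

# PUT twice with the same input: still one object (idempotent)
api.put("cert-a", {"name": "a", "owner": "ops"})
api.put("cert-a", {"name": "a", "owner": "ops"})
count_after_put = len(api.objects)

# POST twice with the same input: two new objects (not idempotent)
api.post({"name": "b"})
api.post({"name": "b"})
count_after_post = len(api.objects)

# PATCH touches one field and leaves the rest alone
api.patch("cert-a", {"owner": "netops"})
print(count_after_put, count_after_post, api.get("cert-a"))
```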

Once the rules are set, the execution of a REST call is extremely easy. Here’s an example:

curl -k -u admin https://nsx.lab.engyak.net/api/v1/alarms
Enter host password for user 'admin':
{
  "results" : [ {
    "id" : "3e79618a-c89e-477b-8872-f4c87120585b",
    "feature_name" : "certificates",
    "event_type" : "certificate_expiration_approaching",
    "feature_display_name" : "Certificates",
    "event_type_display_name" : "Certificate Expiration Approaching",
    "summary" : "A certificate is approaching expiration.",
    "description" : "Certificate 5c9565d8-2cfa-4a28-86cc-e095acba5ba2 is approaching expiration.",
    "recommended_action" : "Ensure services that are currently using the certificate are updated to use a new, non-expiring certificate. For example, to apply a new certificate to the HTTP service, invoke the following NSX API POST /api/v1/node/services/http?action=apply_certificate&certificate_id=<cert-id> where <cert-id> is the ID of a valid certificate reported by the GET /api/v1/trust-management/certificates NSX API. Once the expiring certificate is no longer in use, it should be deleted by invoking the DELETE /api/v1/trust-management/certificates/5c9565d8-2cfa-4a28-86cc-e095acba5ba2 NSX API.",
    "node_id" : "37e90542-f8b8-136e-59bc-5dd3b79b122b",
    "node_resource_type" : "ClusterNodeConfig",
    "entity_id" : "5c9565d8-2cfa-4a28-86cc-e095acba5ba2",
    "last_reported_time" : 1637510695463,
    "status" : "OPEN",
    "severity" : "MEDIUM",
    "node_display_name" : "nsx",
    "node_ip_addresses" : [ "10.66.0.204" ],
    "reoccurrences_while_suppressed" : 0,
    "entity_resource_type" : "certificate_self_signed",
    "alarm_source_type" : "ENTITY_ID",
    "alarm_source" : [ "5c9565d8-2cfa-4a28-86cc-e095acba5ba2" ],
    "resource_type" : "Alarm",
    "display_name" : "3e79618a-c89e-477b-8872-f4c87120585b",
    "_create_user" : "system",
    "_create_time" : 1635035211215,
    "_last_modified_user" : "system",
    "_last_modified_time" : 1637510695464,
    "_system_owned" : false,
    "_protection" : "NOT_PROTECTED",
    "_revision" : 353
  }
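
Once a response like this is in hand, machines are better than eyeballs at picking out what matters. The sketch below parses a trimmed copy of the response above (closed off so it is valid JSON) and lists the certificates behind open alarms. The field names come from the output; the helper function name is my own, made up for illustration.

```python
import json

# Trimmed copy of the alarm response, closed off so it parses as valid JSON
response_text = """
{
  "results": [
    {
      "id": "3e79618a-c89e-477b-8872-f4c87120585b",
      "feature_name": "certificates",
      "event_type": "certificate_expiration_approaching",
      "entity_id": "5c9565d8-2cfa-4a28-86cc-e095acba5ba2",
      "status": "OPEN",
      "severity": "MEDIUM"
    }
  ]
}
"""


def open_certificate_alarms(text):
    """Return the alarms that are certificate-related and still open."""
    alarms = json.loads(text)["results"]
    return [
        alarm
        for alarm in alarms
        if alarm["feature_name"] == "certificates" and alarm["status"] == "OPEN"
    ]


expiring = [alarm["entity_id"] for alarm in open_certificate_alarms(response_text)]
print(expiring)
```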

Saving cURL commands by hand gets administratively intensive, so I recommend some method to save and automate custom API calls. Many of the more complex calls will require JSON payloads, variables, and the like.

Planning the Procedure

Here we’ll use the API to resolve the certificate expiration alarm from the output above. I’m going to use my own REST client, found here, because it’s familiar. Let’s write the desired result in pseudo-code first to develop a plan:

  • GET current cluster certificate ID
  • GET certificate store
  • PUT a replacement certificate with a new name
  • GET certificate store (validate PUT)
  • GET certificate ID (to further validate PUT). For idempotency, multiple runs should be supported.
  • POST update cluster certificate
  • GET current cluster certificate ID
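
The plan above alternates each change with a read that validates it. One way to sketch that pattern, with reversal if a check fails, is a small step runner. Everything here is hypothetical scaffolding for illustration: the step structure, the names, and the in-memory “store” and “cluster” stand in for real API calls.

```python
def run_steps(steps):
    """Apply steps in order; undo in reverse order if a verification fails."""
    completed = []
    for step in steps:
        step["apply"]()
        if not step["verify"]():
            # Undo the failed step and everything before it, newest first
            for done in reversed(completed + [step]):
                done["undo"]()
            return False
        completed.append(step)
    return True


# Hypothetical stand-ins for the certificate store and cluster setting
store = {"cert-old"}
cluster = {"certificate_id": "cert-old"}

steps = [
    {
        "name": "upload replacement certificate",
        "apply": lambda: store.add("cert-new"),
        "verify": lambda: "cert-new" in store,
        "undo": lambda: store.discard("cert-new"),
    },
    {
        "name": "apply certificate to cluster",
        "apply": lambda: cluster.update(certificate_id="cert-new"),
        "verify": lambda: cluster["certificate_id"] == "cert-new",
        "undo": lambda: cluster.update(certificate_id="cert-old"),
    },
]

succeeded = run_steps(steps)
print(succeeded, cluster)
```

This sketch only handles failed verifications; a step that raises an exception would need its own handling before being trusted with real infrastructure.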

Let’s Trick Those Rocks

Some general guidelines when scripting API calls:

  • Use a familiar language. An infrastructure engineer’s goal with automation is reliability. Hiring trends and hipster cred don’t matter here. If you do best with a slide rule, use that.
  • Use libraries. An infrastructure engineer’s goal with automation is reliability. Leverage libraries with publicly available testing results.
  • Log and Report: An infrastructure engineer’s goal with automation is reliability. Report every little thing your code does to your infrastructure, and test code thoroughly.
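
For the “Log and Report” guideline, a thin wrapper around each call is one low-effort approach. This is a sketch using Python’s standard logging module; the wrapper name and the fake API call are assumptions for illustration, not part of any library mentioned here.

```python
import logging

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s"
)
log = logging.getLogger("change")


def logged_call(description, func, *args, **kwargs):
    """Run func, logging the attempt, the result, and any failure."""
    log.info("START %s", description)
    try:
        result = func(*args, **kwargs)
    except Exception:
        # Record the traceback, then let the caller decide what to do
        log.exception("FAILED %s", description)
        raise
    log.info("OK %s -> %r", description, result)
    return result


# Hypothetical stand-in for an API call
def fake_get_certificate_id():
    return "cert-old"


certificate_id = logged_call("GET cluster certificate ID", fake_get_certificate_id)
```

Because failures are logged and then re-raised, the log becomes the bread-crumb trail while the calling code still gets to stop or reverse the change.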

From here, it’s important to research the API calls required for this procedure (good thing we have the steps!). For NSX-T, the API Documentation is available here: https://developer.vmware.com/apis/1163/nsx-t

Since we’re writing code for reliability

I’d like to outline a rough idea of where my time investment was for this procedure. I hope it helps because the focus really isn’t on writing code.

  • 50%: Testing and planning testing. I used Jenkins CI for this, and I’m not the most capable with it. This effort reduces over time, but does not reduce in importance! Write your test cases before everything!
  • 30%: Research. Consulting the VMware API docs and official documentation was worth every yoctosecond; avoiding potential problems with planned work is critical (and there were some major caveats with the API implementation).
  • 10%: Updating the parent library, setting up the Python environment. Most of this work is 100% re-usable.
  • 5%: Managing source code, Git branching, basically generating a bread-crumb trail for the implementation for when I don’t remember it.
  • 5%: Actually writing code!

# JSON Parsing tool
import json

# Import Restify Library
from restify.RuminatingCogitation import Reliquary

# Import OS - let's use this for passwords and usernames
# APIUSER = Username
# APIPASS = Password
import os

api_user = os.getenv("APIUSER")
api_pass = os.getenv("APIPASS")

# Set the interface - apply from variables no matter what
cogitation_interface = Reliquary(
    "settings.json", input_user=api_user, input_pass=api_pass
)

# Build Results Dictionary
stack = {
    "old_cluster_certificate_id": False,
    "old_certificate_list": [],
    "upload_result": False,
    "new_certificate_id": False,
    "new_certificate_list": [],
    "new_cluster_certificate_id": False,
}

# GET current cluster certificate ID
stack["old_cluster_certificate_id"] = json.loads(
    cogitation_interface.namshub("get_cluster_certificate_id")
)["certificate_id"]

# GET certificate store
for i in json.loads(cogitation_interface.namshub("get_cluster_certificates"))[
    "results"
]:
    stack["old_certificate_list"].append(i["id"])
# We need to compare lists, so let's sort it first
stack["old_certificate_list"].sort()

# PUT a replacement certificate with a new name
print(cogitation_interface.namshub("put_certificate", namshub_variables="cert.json"))

# GET certificate store (validate PUT)
for i in json.loads(cogitation_interface.namshub("get_cluster_certificates"))[
    "results"
]:
    stack["new_certificate_list"].append(i["id"])
# We need to compare lists, so let's sort it first, then make it the difference between new and old
stack["old_certificate_list"].sort()
stack["new_certificate_list"] = list(
    set(stack["new_certificate_list"]) - set(stack["old_certificate_list"])
)

# Be Idempotent - this may be run multiple times, and should handle it accordingly.
if len(stack["new_certificate_list"]) == 0:
    stack["new_certificate_id"] = input(
        "Change not detected! \nPlease select a certificate to replace with: "
    )
else:
    stack["new_certificate_id"] = stack["new_certificate_list"][0]

# GET certificate ID (to further validate PUT)
print(
    cogitation_interface.namshub(
        "get_cluster_certificate",
        namshub_variables=json.dumps({"id": stack["new_certificate_id"]}),
    )
)

# POST update cluster certificate
print(
    cogitation_interface.namshub(
        "post_cluster_certificate",
        namshub_variables=json.dumps({"id": stack["new_certificate_id"]}),
    )
)
print(
    cogitation_interface.namshub(
        "post_webui_certificate",
        namshub_variables=json.dumps({"id": stack["new_certificate_id"]}),
    )
)

# GET current cluster certificate ID
stack["new_cluster_certificate_id"] = json.loads(
    cogitation_interface.namshub("get_cluster_certificate_id")
)["certificate_id"]

# Show the results
print(json.dumps(stack, indent=4))

Originally published at https://blog.engyak.co.

Nick Schmidt

I am a network engineer based out of Alaska, pursuing various methods of achieving SRE/NRE