Prism Central VM has no IP address

    January 15, 2018
   2 min read
    nutanix

Prism Central v5.5.0.2 was recently released so I grabbed the metadata file and the binary to upgrade my existing Prism Central (which was version 5.5). Following the usual steps I take, I downloaded what I needed and uploaded the binary and the metadata file to the Prism Central VM in order to perform the upgrade. Once the required files were uploaded I clicked the Upgrade button and waited.

It was then that I was told my Prism Central VM didn’t have the required minimum amount of RAM to run the new features.

Prism Central Error

No biggie, I’ll just shut down Prism Central, bump the RAM, and I’ll be done. I went ahead and shut down Prism Central, bumped the RAM in the VM and started it back up again. I waited for a while and tried to access Prism Central but got nowhere. I ran a ping to the IP address of the Prism Central VM but again, no dice. I checked in vSphere and could see that the NIC was attached, so it wasn’t a VMware thing.

I decided to have a look at the Prism Central VM itself to see if I could find out why I couldn’t ping the IP address. Upon running ifconfig at the command line I only got the loopback address. My eth0 NIC wasn’t listed at all.

Prism Central ifconfig

I tried another command to view the IP configuration, ip address, and saw that the eth0 interface was DOWN.

Right, so that’ll do it. Now to have a look at the interface configuration.

I jumped into the /etc/sysconfig/network-scripts/ifcfg-eth0 file to have a look at the config. As soon as the file was open I saw the problem. The IP address, gateway and netmask were all set correctly, so it wasn’t a case of losing the IP details. I did notice, though, that the ONBOOT option was set to no, which basically means the interface stays down when the VM is started.
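
For reference, this is roughly what the relevant part of the file looked like (the address values are placeholders, and I’m going from memory on the exact fields):

     DEVICE=eth0
     BOOTPROTO=none
     IPADDR=xxx.xxx.xxx.xxx
     NETMASK=255.255.255.0
     GATEWAY=xxx.xxx.xxx.xxx
     ONBOOT=no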

I quickly edited the file and changed the ONBOOT option to yes, saved and exited the file, then ran sudo service network restart to restart the network service. As soon as that service restarted my VM was pingable again and the rest of the Prism Central services started up. I checked the ip address and ifconfig output again and the eth0 interface was showing as up and working correctly.
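
In short, the fix boiled down to these few commands on the Prism Central VM (I used vi, but any editor will do; the last command is just to confirm the interface came up):

     sudo vi /etc/sysconfig/network-scripts/ifcfg-eth0   # change ONBOOT=no to ONBOOT=yes
     sudo service network restart                        # restart the network service
     ip address show eth0                                # eth0 should now be UP with its IP assigned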

Once the services started up the Prism Central upgrade continued immediately.

Prism Central Upgrading

I’m not entirely sure why this happened in the first place but I’m glad that it was a simple fix to get Prism Central up and running again so I could perform the upgrade.






Moving Nutanix to a New Data Center

    November 01, 2017
   4 min read
    nutanix

I recently went through the process of moving our Nutanix cluster to a new Data Center. Usually this sort of activity is what people dread, given that traditional 3-tier infrastructure is so large and complex, however seeing as this is a Nutanix cluster I’m talking about, it was a piece of cake!

There were a couple of things I had to be aware of before beginning the move, and a few things I ran into post-move, that I wanted to document for myself and for anyone else undertaking the same activity who runs into similar issues.

The Shutdown

Prior to moving the Nutanix cluster I obviously had to shut down all the VMs currently running on the cluster, so I made sure that was done and also made sure the CVMs weren’t touched (left on).

I now needed to stop the cluster cleanly before the nodes were shut down. With the guest VMs now off I fired up PuTTY and SSH’d into one of my CVMs. Gracefully stopping the cluster was as easy as cluster stop.

Now that the cluster was stopped it was time to shut down the CVMs. It’s apparently okay to shut down the CVMs normally through your hypervisor (in our case ESXi) by going to Power –> Shutdown Guest. However, I preferred to use cvm_shutdown -P now. I know it’s essentially the same thing, but that’s the way I decided to do it. Because we have 7 nodes in our cluster I went on to SSH into the remaining CVMs and shut them down gracefully.
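
To summarise the shutdown side of things, this is roughly the sequence (cluster stop is run once from a single CVM, the shutdown command on each CVM in turn):

     cluster stop            # run once, from any CVM, to stop the cluster services gracefully
     cvm_shutdown -P now     # run on each CVM to shut it down cleanly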

The next step in the Shutdown process was to power down the hosts. I could have done this one of two ways:

  • ESXi
  • IPMI (or iDRAC, or whatever your out of band management is)

Because I was already logged in to each host via vSphere I decided to use vSphere to shut the nodes down.

Now that the hosts were off the shutdown sequence was complete and I could move on to pulling the power and network cables, pulling the nodes out of the rack and moving them to the new DC.

ready to move
Ready for the new home!

The Startup

After racking the cluster in the new DC, with cabling done and IPMI accessible, it was time to start the hosts. I logged in to each host over IPMI and started them up, waited for vSphere to be accessible and made sure the CVMs were started.

Once all 7 CVMs were started I could move on to starting the cluster. Thankfully, again, this is a Nutanix cluster, so starting it was as simple as cluster start. Once all of the CVM services were started I had a functioning cluster. Just to be sure, I kicked off an NCC run on the cluster to verify all was okay: I opened a PuTTY session to a CVM and ran ncc health_checks run_all.
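
The startup side, again roughly, from an SSH session on a CVM (cluster status is my own extra check to confirm every service reports UP before kicking off the health checks):

     cluster start                 # bring the cluster services back up
     cluster status                # confirm all services on all CVMs report UP
     ncc health_checks run_all     # full NCC health check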

With my cluster now back in an operational state it was time to start my vCenter Server (and external PSC) and make sure the cluster was communicating with vCenter correctly. Once vCenter and the Nutanix CVMs were talking I began starting my guest VMs.

The Minor Issue

At the beginning of this article I mentioned that I ran into a minor issue post-move. I say minor because my cluster was functioning correctly, however I didn’t want the issue to go unnoticed. The issue was that the Curator service hadn’t run in the past 24 hours (the cluster was offline for 24 hours while the networking guys fixed an issue with cross connects…), so it was generating some critical alerts. I won’t go into that issue and what I did to fix it here, as I wanted this to be a post about the cluster move. So if you’re interested in how I fixed that Curator alert, check this post.

The End

Usually, I dread moving Data Centers. This time it was so simple and so fast that I honestly wouldn’t mind doing it again. Nutanix made the whole process super streamlined, and had there not been networking issues between the Data Centers, I would have had the cluster moved and back online within 6 hours. As with anything, a little bit of pre-planning goes a long way. Because I knew the process to shut down and start up the Nutanix cluster cleanly prior to undertaking the work, I was prepared and able to do this move in a very short amount of time (network permitting…). If you are undertaking a Data Center move and have a Nutanix cluster that you need to move, familiarize yourself with the process. It’ll save you time as well as ensure you understand what needs to be done.

At the end of it all, Nutanix made my life easier.






Curator Scan Status Failing in Prism

    November 01, 2017
   3 min read
    nutanix

I recently moved our Nutanix Cluster to a new Data Center. After I completed the move and brought the Nutanix cluster back online Prism was generating a Critical alert basically telling me that the Curator Scan hadn’t run in the last 24 hours. Here’s the exact alert:

Curator Service Error

I wasn’t too concerned with this alert because I knew it was triggered because my cluster was off while we moved it. I assumed that in time Curator would run a partial scan (or full scan) again and the alert would go away. However, because I wanted to make sure everything was okay (and to get that reassuring green heart), I did a little digging into the issue.

Before doing anything I logged this as an issue with Nutanix Support. So while I waited for support to get back to me I figured I’d do some investigating.

I figured that in order to get this alert resolved immediately I’d need to start a Curator scan manually, so began my search on how to do that. In my search I stumbled across a post from a few years back on the Nutanix Community Forums where someone was asking exactly what I needed to know: could a manual Curator scan be initiated from the CLI? From reading through the post I found that I needed to open a browser to http://{Curator-Master-CVM-IP}:2010/master/control. That was all well and good, but I had no idea which CVM was the Curator Master.

So I did a little more digging and found that if I open an SSH session to a CVM and enter links http:0:2010 it’ll bring up an ELinks page which tells me which CVM is the Curator Master. Perfect!

eLinks Page

I now had the first piece of the puzzle, the Curator Master CVM. So now I tried to open a web page to http://{Curator-Master-CVM-IP}:2010/master/control and… it didn’t work. I remembered reading a while ago that you could access the Stargate page of a CVM on port 2009, but in order to do that you had to either stop the service or modify iptables on the CVM to allow the connection rather than reject it. So I thought I’d give it a shot.

I SSH’d to my Curator Master CVM and ran:

     sudo su -
     iptables -A WORLDLIST -p tcp -m tcp --dport 2010 -j ACCEPT

where 2010 is the Curator port that I needed to open.
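
A quick way to double-check the rule was added (my own sanity check, not something from the forum post):

     sudo iptables -L WORLDLIST -n | grep 2010     # should show an ACCEPT rule for tcp dpt:2010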

The iptables command was accepted without complaint, and when I tried to access the Curator URL again I was greeted with the below webpage.

Curator Control Page

I was finally getting somewhere. Next, I went ahead and kicked off a partial scan. Once I clicked on the link to ‘Start Partial Scan’ it went to a blank page. I assumed that worked?

I decided to try accessing the root URL of Curator (without the /master/control) and was greeted with a page similar to the ELinks page from above, but this time in my browser, where I could see the status of the scan I had kicked off!

Curator Active Jobs

I waited a while (849 seconds to be exact), refreshed that page again and noticed that my scan had completed!

Curator Jobs Succeeded

Now that the Curator scan was complete I checked Prism again, and the alert was gone. I updated the support ticket and Paul from Nutanix Support gave me a call to do one more health check across my cluster anyway.

Again, I know I didn’t have to do this manually but for my own learning I decided to give it a go.






Moving a Nutanix Block

    August 01, 2017
   1 min read
    nutanix

We recently went through an exercise to move our Nutanix blocks to a new Data Center. I wanted to document this process for myself, as it is very different to shutting down your regular pizza box server and moving the kit. There is a little more to think about and do prior to moving the kit when you are working with Nutanix.

Here is the process I followed which worked well for me.

  1. Shut down any guest VMs on the nodes
  2. Stop the cluster (process described here)
  3. Shut down all nodes in the cluster (in our case these were VMware hypervisors)
  4. Power off the blocks (IPMI or power switch)
  5. Once the blocks are powered down you can safely unplug all cables and unrack/rack your blocks.
  6. When you’ve got all the blocks racked and cabled, power them on. The CVMs will start automatically.
  7. Once all CVMs are online again, SSH into one of them and run: cluster start

Wait for all services to report that they have started:

CVM: xxx.xxx.xxx.xxx UP

     Zeus UP [3704, 3727, 3728, 3729, 3807, 3821] 
     Scavenger UP [4937, 4960, 4961, 4990] 
     SSLTerminator UP [5034, 5056, 5057, 5139] 
     Hyperint UP [5059, 5082, 5083, 5086, 5099, 5108] 
     Medusa UP [5534, 5559, 5560, 5563, 5752] 
     DynamicRingChanger UP [5852, 5874, 5875, 5954] 
     Pithos UP [5877, 5899, 5900, 5962] 
     Stargate UP [5902, 5927, 5928, 6103, 6108] 
     Cerebro UP [5930, 5952, 5953, 6106]
     Chronos UP [5960, 6004, 6006, 6075] 
     Curator UP [5987, 6017, 6018, 6261] 
     Prism UP [6020, 6042, 6043, 6111, 6818] 
     CIM UP [6045, 6067, 6068, 6101] 
     AlertManager UP [6070, 6099, 6100, 6296] 
     Arithmos UP [6107, 6175, 6176, 6344] 
     SysStatCollector UP [6196, 6259, 6260, 6497] 
     Tunnel UP [6263, 6312, 6313] 
     ClusterHealth UP [6317, 6342, 6343, 6446, 6468, 6469, 6604, 6605, 6606, 6607] 
     Janus UP [6365, 6444, 6445, 6584] 
     NutanixGuestTools UP [6377, 6403, 6404]
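
If you want to re-check the services later, rather than scrolling back through the cluster start output, running cluster status from any CVM prints (as far as I recall) the same per-CVM service list:

     cluster status     # shows the state of every service on every CVM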

Now that the Cluster is running you can go ahead and start your Guest VMs.






Restarting a CVM

    July 25, 2017
   1 min read
    nutanix

In the world of Nutanix, Controller VMs (CVMs) are king. They are key to the whole solution. So when the time comes that you need to restart a node or a CVM, you should take a little care and do it properly. You don’t want it all to go bananas on you and leave you with a broken CVM.

It’s probably best to point out here that before you do ANYTHING, call Support. They are there for a reason and are very good at their job.

However, if you want to do this yourself, read on.

Before you reboot the CVM you need to stop it gracefully. Stopping the CVM gracefully allows all services to stop and, in the event this CVM is a leader, lets the cluster elect a new one. This ensures you have no issues with your cluster when you reboot the CVM.

So, go ahead and SSH (or open a console) to your CVM. Log in and get to the command line.

Now all you need to do is: cvm_shutdown -P now

That’s it.

The cvm_shutdown -P now command will gracefully stop all services on the CVM allowing you to reboot the CVM (or the node if you need) cleanly.

Once your CVM is back up you can initiate NCC to run some checks across your cluster to ensure everything is okay.
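
The check I run is the full NCC suite, the same one I used after the Data Center move, from any CVM:

     ncc health_checks run_all     # run the complete set of NCC health checks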

Again, if it’s all too much, or you want to play it on the safe side, call Support. 🙂






Stopping a Nutanix Cluster

    July 20, 2017
   ~1 min read
    nutanix

  1. Shut down all guest VMs
  2. Log on to a Controller VM via SSH
  3. Stop the cluster with cluster stop
  4. Wait for the CVM to report something similar to the below:

CVM: xxx.xxx.xxx.xxx Up, ZeusLeader

 Zeus UP [3167, 3180, 3181, 3182, 3191, 3201]
 Scavenger UP [3334, 3351, 3352, 3353]
 ConnectionSplicer DOWN []
 Hyperint DOWN []
 Medusa DOWN []
 DynamicRingChanger DOWN []
 Pithos DOWN []
 Stargate DOWN []
 Cerebro DOWN []
 Chronos DOWN []
 Curator DOWN []
 Prism DOWN []
 AlertManager DOWN []
 StatsAggregator DOWN []
 SysStatCollector DOWN []
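
Once the output looks like the above (only Zeus and Scavenger still UP, everything else DOWN) the cluster is stopped, and it’s safe to move on to shutting down the CVMs, for example with:

     cvm_shutdown -P now     # run on each CVM to shut it down gracefully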