VMware Flash Read Cache (VFRC) Performance. Does it really make a difference?

I’ve been generally disappointed with the performance of my MS S2D storage array due to the way Microsoft lays writes down onto parity pools. All the SSD caching and NVMe write cache just wasn’t doing enough, and it was time to go a different route. I’ve written about my Datrium install a few times now; for those who are unfamiliar with the solution, the short version is that Datrium takes host-populated flash and builds a dedicated read/write pool on the host. It does an extremely good job of this, and if you would like to know more, check out their website here: https://www.datrium.com/open-convergence-architecture. I like that I am able to populate my host with “cheap” SSD for performance, and then dump the data to cheaper spinning disk for cold storage.

So how can I do this at home without spending a bunch of money? My first thought was Dell’s CacheCade, which I have on the PERC H700 I used for this test. The problem with CacheCade, however, is that it has a hard limitation of 512GB per host, reads above 64KB are not cached, and there is no support for write caching. I then thought about VSAN, and the vFlash caching in that product, however given my lab and limited resources, VSAN would be a no-go as well.

Finally I came to VFRC. Like CacheCade it is a read-side cache only, however unlike CacheCade it will cache all reads, and it can build up to a 32TB pool with a maximum of 400GB of cache per VMDK. The downsides of VFRC are that you have to enable it per VMDK, and you have to set the cache block size to match your workload’s average IO size, otherwise you will degrade the performance of the cache. That said, it is extremely easy to get set up and running, and it is included in your ESXi Enterprise Plus licensing, so you don’t get hit with the additional cost of VSAN licensing (though VSAN cache tiering would provide a significantly better performance experience).
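Since getting the block size right matters so much, it is worth checking both the host’s vFlash resource and your workload’s actual IO size before turning the cache on. A rough sketch of the host-side commands involved (the vflash esxcli namespace exists in ESXi 5.5 and later; treat this as a starting point, and note that the per-VMDK cache itself is enabled in the Web Client under the VM’s disk settings):

```shell
# List SSDs eligible to back the virtual flash resource on this host
esxcli storage vflash device list

# Show caches currently carved out of the vFlash resource
esxcli storage vflash cache list

# Measure a VM's actual IO size distribution with vscsiStats
vscsiStats -l                              # list running VMs and their worldGroup IDs
vscsiStats -s -w <worldGroupID>            # start collecting stats for that VM
vscsiStats -p ioLength -w <worldGroupID>   # print the IO length histogram
```

The peak of the ioLength histogram is the block size you want to feed the VFRC configuration.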

Let’s look at the test VM. The server is a Dell R510 w/ 2x Xeon E5640 CPUs, 64GB of RAM, and the PERC H700 1GB BBWC card. I have 3 WD RE4 2TB drives + 1 Samsung 850 Pro 256GB SSD. The server is running ESXi 6.5 U1, and the test VM has 4 vCPUs, 6GB of vRAM, and the PVSCSI controller.

First up, let’s see what this machine does with a VM running directly on the OCZ RD400 NVMe AIC. This is a thin-provisioned .vmdk living on a VMFS 6 volume.


This test shows that an 8K 70/30 random read/write test against a 5GB file, with 30 minutes of sustained IO, generated 165MBps of throughput, 21,193 IOPS, and an average read latency of 0.3ms. This is fantastic, and exactly what I would expect of an NVMe disk.

So now let’s run exactly the same test against my VFRC volume. This is a RAID 5 of 3x SATA 6Gbps 2TB Western Digital RE4 drives + 50GB of VFRC at an 8K block size.


For a SATA RAID 5 that’s not bad: 3.3MBps, 429 IOPS, and an average read latency of 18ms. This isn’t amazing performance by any means, but again, for a 3-disk SATA array it’s pretty fantastic.

Let’s compare that to a disk with no vFlash Cache.


1.87MBps, 239 IOPS, and 33ms average latency. The VFRC disk delivers nearly 2x the performance of the traditional disk running on the same array.
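Putting the cached and uncached runs side by side, the numbers quoted above work out like this:

```shell
# Cached (VFRC) vs uncached results from the two runs above
awk 'BEGIN {
    printf "IOPS gain:    %.2fx (429 vs 239)\n", 429 / 239
    printf "Latency gain: %.2fx (18ms vs 33ms)\n", 33 / 18
}'
```

So the cache is not just flattering one metric; throughput and read latency both improve by roughly the same factor.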

So as you can see VFRC is a great option for those who have traditional storage, either local or SAN, where a flash tier would normally not be available.

Finally, let’s look at the same workload on my Datrium cluster.

[Screenshot: same test run against the Datrium cluster]

As you can see, this is not going to be a replacement for dedicated flash resources, and Datrium caches much more efficiently than VMware does. However, if you need a quick, cheap fix, VFRC is a great answer.


VMware vSphere: A general system error occurred: Connection refused When starting Virtual Machines.

I had an issue the other day with starting a VM. It would DRS successfully, however it would fail to power on with “A general system error occurred: Connection refused.”


Googling tells me that the culprit is the vmware-vpx-workflow service being stopped. I SSHed into my VCSA and sure enough found that the service was indeed stopped.


So I attempted to start the service, and that failed too.


What the hell? Tailing all the logs in the /var/log/vmware/workflow folder didn’t turn up anything. However, after re-reading the errors during startup I realized: maybe it’s a disk space issue.


Sure enough, our log disk was full. I grew the log VMDK, ran the autogrow command in the VCSA to resize the filesystems, restarted the services, and voilà!
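For anyone hitting the same thing, the rough triage on a VCSA 6.5 shell looks like this (the autogrow script path is what VMware’s KB documents for the 6.5 appliance; verify it on your build):

```shell
# See which VCSA services are running / stopped
service-control --status

# Look for a full partition - /storage/log filling up is the usual culprit
df -h

# After growing the log VMDK in vSphere, expand the underlying filesystem,
# then restart the stopped service
/usr/lib/applmgmt/support/scripts/autogrow.sh
service-control --start vmware-vpx-workflow
```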


After updating the disk size, we were all set and I was able to start VMs without issues.

VMware Paravirtual SCSI adapter: Is it really that much faster?

I asked myself the same question after reading a best-practice guide from Datrium that suggested using the VMware PVSCSI controller instead of the LSI SAS controller that VMware recommends by default when you create a Windows VM.

Out of curiosity I spun up a new Server 2016 VM with 4 cores, 8GB of RAM, and a 100GB drive, hosted on my Datrium storage, to find out how much of a difference there was.

I ran this test during a normal production workload, using Microsoft DiskSpd with a 16K IO size (the current average for my app servers) to see what we would get for results. The specific command I used was:

diskspd.exe -b16K -d1800 -h -L -o2 -t4 -r -w50 -c10G C:\io.dat
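For reference, here is what each of those DiskSpd switches does, per the DiskSpd documentation:

```shell
# DiskSpd switches used in this test:
#   -b16K   16KB block (IO) size
#   -d1800  run for 1800 seconds (30 minutes)
#   -h      disable software caching and hardware write caching
#   -L      measure and report latency statistics
#   -o2     2 outstanding IOs per thread
#   -t4     4 threads per target file
#   -r      random IO, aligned to the block size
#   -w50    50% writes / 50% reads
#   -c10G   create a 10GB test file
diskspd.exe -b16K -d1800 -h -L -o2 -t4 -r -w50 -c10G C:\io.dat
```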

The first run on the VMware LSI SAS controller resulted in this.

Command Line: C:\Users\cjoyce_admin\Downloads\Diskspd-v2.0.17\amd64fre\diskspd.exe -b16K -d1800 -h -L -o2 -t4 -r -w50 -c10G c:\io.dat

Input parameters:

timespan: 1
-------------
duration: 1800s
warm up time: 5s
cool down time: 0s
measuring latency
random seed: 0
path: 'c:\io.dat'
think time: 0ms
burst size: 0
software cache disabled
hardware write cache disabled, writethrough on
performing mix test (read/write ratio: 50/50)
block size: 16384
using random I/O (alignment: 16384)
number of outstanding I/O operations: 2
thread stride size: 0
threads per file: 4
using I/O Completion Ports
IO priority: normal

Results for timespan 1:
*******************************************************************************

actual test time: 1800.00s
thread count: 4
proc count: 4

CPU | Usage | User | Kernel | Idle
-------------------------------------------
0| 8.35%| 1.84%| 6.50%| 91.65%
1| 8.38%| 1.89%| 6.48%| 91.62%
2| 7.78%| 1.79%| 5.99%| 92.22%
3| 7.39%| 1.60%| 5.79%| 92.61%
-------------------------------------------
avg.| 7.97%| 1.78%| 6.19%| 92.03%

Total IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 15150776320 | 924730 | 8.03 | 513.74 | 3.888 | 3.175 | c:\io.dat (10240MB)
1 | 15089106944 | 920966 | 7.99 | 511.65 | 3.904 | 3.289 | c:\io.dat (10240MB)
2 | 15108947968 | 922177 | 8.00 | 512.32 | 3.899 | 3.140 | c:\io.dat (10240MB)
3 | 15109013504 | 922181 | 8.01 | 512.32 | 3.898 | 3.086 | c:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 60457844736 | 3690054 | 32.03 | 2050.03 | 3.897 | 3.173

Read IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 7574110208 | 462287 | 4.01 | 256.83 | 3.274 | 2.741 | c:\io.dat (10240MB)
1 | 7539032064 | 460146 | 3.99 | 255.64 | 3.297 | 2.966 | c:\io.dat (10240MB)
2 | 7562526720 | 461580 | 4.01 | 256.43 | 3.297 | 2.861 | c:\io.dat (10240MB)
3 | 7543046144 | 460391 | 4.00 | 255.77 | 3.293 | 2.613 | c:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 30218715136 | 1844404 | 16.01 | 1024.67 | 3.290 | 2.798

Write IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 7576666112 | 462443 | 4.01 | 256.91 | 4.501 | 3.448 | c:\io.dat (10240MB)
1 | 7550074880 | 460820 | 4.00 | 256.01 | 4.510 | 3.479 | c:\io.dat (10240MB)
2 | 7546421248 | 460597 | 4.00 | 255.89 | 4.501 | 3.289 | c:\io.dat (10240MB)
3 | 7565967360 | 461790 | 4.01 | 256.55 | 4.503 | 3.389 | c:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 30239129600 | 1845650 | 16.02 | 1025.36 | 4.504 | 3.402
%-ile | Read (ms) | Write (ms) | Total (ms)
----------------------------------------------
min | 0.000 | 0.000 | 0.000
25th | 1.360 | 2.258 | 1.709
50th | 2.818 | 3.885 | 3.269
75th | 4.481 | 6.093 | 5.443
90th | 6.259 | 8.370 | 7.195
95th | 7.163 | 9.928 | 8.987
99th | 10.090 | 13.425 | 12.593
3-nines | 23.523 | 30.284 | 27.785
4-nines | 47.191 | 52.535 | 49.878
5-nines | 190.339 | 161.402 | 190.339
6-nines | 534.581 | 534.289 | 534.289
7-nines | 545.593 | 535.040 | 545.593
8-nines | 545.593 | 535.040 | 545.593
9-nines | 545.593 | 535.040 | 545.593
max | 545.593 | 535.040 | 545.593

Overall, not terrible. Now let’s look at what we get when we replace the LSI SAS with a PVSCSI.

Input parameters:

timespan: 1
-------------
duration: 1800s
warm up time: 5s
cool down time: 0s
measuring latency
random seed: 0
path: 'c:\io.dat'
think time: 0ms
burst size: 0
software cache disabled
hardware write cache disabled, writethrough on
performing mix test (read/write ratio: 50/50)
block size: 16384
using random I/O (alignment: 16384)
number of outstanding I/O operations: 2
thread stride size: 0
threads per file: 4
using I/O Completion Ports
IO priority: normal

Results for timespan 1:
*******************************************************************************

actual test time: 1800.00s
thread count: 4
proc count: 4

CPU | Usage | User | Kernel | Idle
-------------------------------------------
0| 7.37%| 1.53%| 5.84%| 92.63%
1| 7.02%| 1.40%| 5.62%| 92.98%
2| 6.35%| 1.25%| 5.10%| 93.65%
3| 6.04%| 1.22%| 4.82%| 93.96%
-------------------------------------------
avg.| 6.70%| 1.35%| 5.35%| 93.30%

Total IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 15667019776 | 956239 | 8.30 | 531.24 | 3.760 | 2.938 | c:\io.dat (10240MB)
1 | 15743369216 | 960899 | 8.34 | 533.83 | 3.741 | 3.011 | c:\io.dat (10240MB)
2 | 15789637632 | 963723 | 8.37 | 535.40 | 3.730 | 2.841 | c:\io.dat (10240MB)
3 | 15788425216 | 963649 | 8.36 | 535.36 | 3.731 | 2.914 | c:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 62988451840 | 3844510 | 33.37 | 2135.84 | 3.740 | 2.926

Read IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 7831814144 | 478016 | 4.15 | 265.56 | 2.660 | 2.405 | c:\io.dat (10240MB)
1 | 7862943744 | 479916 | 4.17 | 266.62 | 2.640 | 2.538 | c:\io.dat (10240MB)
2 | 7904346112 | 482443 | 4.19 | 268.02 | 2.632 | 2.247 | c:\io.dat (10240MB)
3 | 7881277440 | 481035 | 4.18 | 267.24 | 2.631 | 2.557 | c:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 31480381440 | 1921410 | 16.68 | 1067.45 | 2.641 | 2.440

Write IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 7835205632 | 478223 | 4.15 | 265.68 | 4.859 | 3.010 | c:\io.dat (10240MB)
1 | 7880425472 | 480983 | 4.18 | 267.21 | 4.840 | 3.045 | c:\io.dat (10240MB)
2 | 7885291520 | 481280 | 4.18 | 267.38 | 4.831 | 2.946 | c:\io.dat (10240MB)
3 | 7907147776 | 482614 | 4.19 | 268.12 | 4.827 | 2.833 | c:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 31508070400 | 1923100 | 16.69 | 1068.39 | 4.839 | 2.959
%-ile | Read (ms) | Write (ms) | Total (ms)
----------------------------------------------
min | 0.000 | 0.000 | 0.000
25th | 1.189 | 2.947 | 1.810
50th | 1.868 | 4.126 | 3.120
75th | 3.536 | 6.037 | 4.971
90th | 5.392 | 8.026 | 6.924
95th | 6.269 | 9.628 | 8.417
99th | 9.446 | 13.234 | 12.021
3-nines | 22.655 | 32.422 | 28.825
4-nines | 45.679 | 50.249 | 48.554
5-nines | 158.326 | 159.371 | 159.371
6-nines | 475.470 | 427.329 | 427.329
7-nines | 475.711 | 427.338 | 475.711
8-nines | 475.711 | 427.338 | 475.711
9-nines | 475.711 | 427.338 | 475.711
max | 475.711 | 427.338 | 475.711

So overall we see roughly a 4% performance increase across the board. Not groundbreaking numbers, however if you’re trying to squeeze every last drop of performance out of your VMs, this could be a step in the right direction.
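The “roughly 4%” figure comes straight from the Total IO lines of the two runs; a quick check:

```shell
# PVSCSI vs LSI SAS, from the diskspd "Total IO" lines above
awk 'BEGIN {
    printf "IOPS:        +%.1f%%\n", (2135.84 / 2050.03 - 1) * 100
    printf "Avg latency: -%.1f%%\n", (1 - 3.740 / 3.897) * 100
}'
```

The guest CPU usage also dropped (7.97% avg to 6.70% avg), which is the other half of the PVSCSI story: more IO for less CPU.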

Speaking of squeezing every last drop, let’s see what happens when we test against a ReFS-formatted disk.

Command Line: C:\Users\cjoyce_admin\Downloads\Diskspd-v2.0.17\amd64fre\diskspd.exe -b16K -d1800 -h -L -o2 -t4 -r -w50 -c10G E:\io.dat

Input parameters:

timespan: 1
-------------
duration: 1800s
warm up time: 5s
cool down time: 0s
measuring latency
random seed: 0
path: 'E:\io.dat'
think time: 0ms
burst size: 0
software cache disabled
hardware write cache disabled, writethrough on
performing mix test (read/write ratio: 50/50)
block size: 16384
using random I/O (alignment: 16384)
number of outstanding I/O operations: 2
thread stride size: 0
threads per file: 4
using I/O Completion Ports
IO priority: normal

Results for timespan 1:
*******************************************************************************

actual test time: 1800.02s
thread count: 4
proc count: 4

CPU | Usage | User | Kernel | Idle
-------------------------------------------
0| 8.65%| 1.62%| 7.03%| 91.35%
1| 8.69%| 1.49%| 7.20%| 91.31%
2| 7.83%| 1.35%| 6.47%| 92.17%
3| 7.43%| 1.36%| 6.07%| 92.57%
-------------------------------------------
avg.| 8.15%| 1.46%| 6.69%| 91.85%

Total IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 18047041536 | 1101504 | 9.56 | 611.94 | 3.263 | 2.708 | E:\io.dat (10240MB)
1 | 18078842880 | 1103445 | 9.58 | 613.02 | 3.258 | 3.004 | E:\io.dat (10240MB)
2 | 18066751488 | 1102707 | 9.57 | 612.61 | 3.260 | 2.712 | E:\io.dat (10240MB)
3 | 18132910080 | 1106745 | 9.61 | 614.85 | 3.248 | 2.727 | E:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 72325545984 | 4414401 | 38.32 | 2452.42 | 3.257 | 2.791

Read IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 9020080128 | 550542 | 4.78 | 305.85 | 2.762 | 2.399 | E:\io.dat (10240MB)
1 | 9030025216 | 551149 | 4.78 | 306.19 | 2.760 | 2.927 | E:\io.dat (10240MB)
2 | 9041592320 | 551855 | 4.79 | 306.58 | 2.759 | 2.342 | E:\io.dat (10240MB)
3 | 9050865664 | 552421 | 4.80 | 306.90 | 2.752 | 2.479 | E:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 36142563328 | 2205967 | 19.15 | 1225.53 | 2.758 | 2.547

Write IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 9026961408 | 550962 | 4.78 | 306.09 | 3.764 | 2.899 | E:\io.dat (10240MB)
1 | 9048817664 | 552296 | 4.79 | 306.83 | 3.754 | 2.998 | E:\io.dat (10240MB)
2 | 9025159168 | 550852 | 4.78 | 306.03 | 3.762 | 2.954 | E:\io.dat (10240MB)
3 | 9082044416 | 554324 | 4.81 | 307.96 | 3.742 | 2.870 | E:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 36182982656 | 2208434 | 19.17 | 1226.90 | 3.756 | 2.931
%-ile | Read (ms) | Write (ms) | Total (ms)
----------------------------------------------
min | 0.267 | 0.297 | 0.267
25th | 1.252 | 1.773 | 1.403
50th | 2.019 | 3.097 | 2.618
75th | 3.724 | 5.038 | 4.275
90th | 5.581 | 6.998 | 6.240
95th | 6.395 | 8.584 | 7.525
99th | 9.641 | 12.213 | 11.021
3-nines | 20.505 | 26.232 | 23.305
4-nines | 42.971 | 45.559 | 44.280
5-nines | 238.498 | 175.573 | 204.921
6-nines | 502.382 | 359.149 | 435.862
7-nines | 547.128 | 547.124 | 547.128
8-nines | 547.128 | 547.124 | 547.128
9-nines | 547.128 | 547.124 | 547.128
max | 547.128 | 547.124 | 547.128

With a ReFS-formatted disk on top of PVSCSI we see roughly a 15% increase over the NTFS run, and nearly 20% over the original LSI SAS baseline!
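Recomputing from the diskspd totals: the ReFS run’s 2452 total IOPS sits about 15% above the NTFS + PVSCSI run (2136 IOPS) and nearly 20% above the NTFS + LSI SAS baseline (2050 IOPS):

```shell
# ReFS + PVSCSI compared against the two earlier runs
awk 'BEGIN {
    printf "vs NTFS + PVSCSI: +%.1f%%\n", (2452.42 / 2135.84 - 1) * 100
    printf "vs NTFS + LSI:    +%.1f%%\n", (2452.42 / 2050.03 - 1) * 100
}'
```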

So if your applications support it, and you truly want to squeeze every last drop out of your storage, ReFS and PVSCSI is the combination to go with!

How to fix “A general system error occurred: vim.fault.GenericVmConfigFault” when creating or removing snapshots in VMware.

Recently I had a bunch of virtual machines that started generating this error during Veeam backups. I hadn’t bothered to check my snapshots because my daily cleanup job was supposed to be taking care of that for me. Unfortunately, this bit me, and here we are.

As many of you have probably experienced, Veeam doesn’t always clean up after itself when it is finished backing up a VM. Sometimes a file lock or other operation prevents the snap cleanup, and you end up with a huge chain of snapshots.

Normally you would just right-click the VM, select Snapshots, then “Delete All” and be done, however I was getting:

A general system error occurred: vim.fault.GenericVmConfigFault

So how do we fix this? Well it turns out there are two ways.

Resolution 1:

This resolution requires downtime, however it is significantly faster. It does have some caveats I will get to later.

  1. Shut down the Virtual Machine.
  2. SSH to the host where the VM was running.
  3. Change directory to the volume where the guest’s disks are stored.
    cd /vmfs/volumes/volumeguid/machinefolder
    
  4. Find all delta.vmdk files for the VM.
    ls -ltrh | grep delta.vmdk
    
  5. If it looks something like this, you’re good to go. Notice how, of the 5 snapshot deltas, only 0x5.delta.vmdk actually has data; the others are empty.
    [Screenshot: ls output showing one delta with data]
  6. If it looks like this, you’re going to have to use Resolution 2. Notice how multiple snapshot deltas hold data.
    [Screenshot: ls output showing multiple deltas with data]
  7. Create a new folder in the VM’s folder.
    mkdir ./tmp
    
  8. Move all of the EMPTY snapshot deltas to the ./tmp folder. DO NOT DELETE THEM. If the remaining steps do not work, we will have to restore these files.
    mv VM_1-000001* ./tmp
    
  9.  Once all empty snapshots are moved and you are left with your base disk and your last snapshot, open the VMware vSphere Client and create a new VM. Make it identical to the old VM, however DO NOT add a disk. Remove the disk VMware wants to create, select “add existing disk,” and attach the snapshot disk, NOT the base disk, to the VM. Do not select “Power On VM after Creation.”
  10. Once creation is complete you will notice that VMware shows “Virtual machine disks need consolidation.” Right-click the VM and choose to consolidate the disks. Once consolidation is complete, boot the VM and verify functionality. You should be all set!

Resolution 2:

If your VM is not recoverable with Resolution 1, your alternative is to use the VMware Standalone Converter to convert a “powered-on Windows / Linux machine.” So yes, you are doing a P2V migration of a virtual machine to a new virtual machine.

This isn’t the best solution in the world; it’s essentially the same as taking and restoring an OS-based backup instead of a VM-based one. However, it allows your VM to stay online during the entire operation, which could be a consideration for those who have very small maintenance windows and may not be able to do a full consolidation in that window.

Manually adding hosts to Datrium DVX without the vCenter plugin.

I will be doing a full writeup later on my experience with Datrium DVX, but I wanted to make sure that I got this out before I forgot.

DVX 1.1.x does not support vSphere 6.0 U3. I found this out, of course, AFTER I updated my vCenter appliance and the DVX plugin kept erroring and crashing. I needed to add 3 new hosts to my DVX cluster so that we could expand from our POC to our prod environment, but I didn’t want to downgrade vCenter. Datrium has an excellent standalone management console, but the one thing that console can’t do is add new hosts. So how do we do this?

BEFORE WE CONTINUE I HAVE SOME DISCLAIMERS!

  1. This is not the process to add disks to your existing Datrium array. If you already have an array and want to use the dacli to add disks, we will cover that later.
  2. This is NOT a best practice. This is an emergency procedure.
  3. Datrium support is awesome. Just call them and have them work through this with you.

Now that we have that out of the way, if you still want to be self sufficient, and do this on your own here are your steps.

  1. SSH to the ESXi host you would like to add to your DVX cluster.
  2. Install the DVX VIB on that host.
    esxcli software vib install -d http://datrium.mgmt.float.ip/static/esxVibHEAD/index.xml --no-sig-check
    
  3. Source /etc/profile.
    source /etc/profile
    
  4. Exit and re-enter the SSH session using the “exit” command.
  5. Enter the dacli.
    da
    
  6. Run dacli to select all SSDs for use. This will warn you that it will scan all drives in the system and wipe all SSDs. This is true, however it will ignore any SSDs with the VMware OS partitions on them, so if your host is running ESXi on an SSD, you are safe. Press Y when prompted.
    dacli SSD select-all
    
  6a. If no SSDs are found, you will need to wipe the SSDs and rebuild them as GPT disks. I had this issue with my Samsung PM863a 960GB drives; I did not have this issue with my Intel DC P3608 4TB AIC. For each disk you wish to add, run the following command, then once complete run dacli SSD select-all again.
    partedUtil setptbl /vmfs/devices/disks/yourdiskidhere gpt
    
  7. Premount the NFS target.
    premount datrium.data.float.ip
    
  8. In the VMware vCenter Web Client find the Datrium NFS datastore, select Actions, “Mount to new host,” then select the host you just added.

[Screenshots: mounting the Datrium datastore to the new host via the Web Client]

We mount this way because if you use esxcfg-nas to mount the datastore, it will show up as a duplicate, and you will not be able to live migrate VMs between hosts.
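Once mounted, you can confirm from each host that the datastore came up with the same volume name (rather than a duplicate) with:

```shell
# Should show the Datrium datastore with an identical volume name on every host
esxcli storage nfs list
```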

Not too terribly difficult. However, this is a last-ditch, 4am, “it’s gotta be up in 2 hours” type of resolution, and NOT by any means how you should be adding hosts to your cluster.

Automatically clean up VMware snapshots using PowerCLI.

Something that every VMware admin who also uses Veeam has had to deal with at least once is Veeam not cleaning up snapshots properly. That leads to broken snapshot chains, which leads to VMDK corruption, and finally to an admin crying into his or her bourbon realizing that no, the VM isn’t coming back up; no, the snap chain can’t be consolidated and recovered; and no, there haven’t been backups in $n days to recover from.

A process I’ve put into my own environment to prevent this is a simple PowerCLI script that looks for all snapshots over 24 hours old and removes them. VMware recommends, from a performance standpoint, that you never have more than 3 snapshots in a chain and that none should be more than 72 hours old. Personally, I agree completely. Snapshots should only be used before making a big change so you can quickly roll back, not something that you create and then live off of.

This script requires that the VMware PowerCLI modules are installed on the system you run it from.

#################################################################
#   Remove all vSphere snapshots older than 24 hours            #
#   Created by - Cameron Joyce                                  #
#   Last Modified - Jun 19 2017                                 #
#################################################################
# This script uses PowerCLI to remove all snapshots from virtual machines that are more than 24 hours old.

# Load all VMware Modules, and set PowerCLI config.
Get-Module -ListAvailable VM* | Import-Module

# Connect to vSphere vCenter Server.
Try{
    connect-viserver -server your.vmware.server -user administrator@vsphere.local -Password Password -ErrorAction Stop
}
Catch{
    Write-Host "Failed Connecting to VSphere Server."
    Send-MailMessage -From "server@domain.com" -To "user@domain.com" -Subject "Unable to Connect to VSphere to clean snapshots" -Body `
    "The powershell script is unable to connect to host your.vmware.server. Please investigate." -SmtpServer "smtp.server.com"
    Break
}

# Variables
$date = get-date -f MMddyyyy
$logpath = "C:\Scripts\Script_Logs"

# Verify the log folder exists.
If(!(Test-Path $logpath)){
    Write-Host "Log path not found, creating folder."
    New-Item $logpath -Type Directory
}

# List all current snapshots and log them before cleanup.
If((get-snapshot -vm *) -ne $null){
    $snapshotlist = get-snapshot -vm * | select VM, Name, SizeMB, @{Name="Age";Expression={((Get-Date)-$_.Created).Days}}
    Write-Host "Current Snapshots in Dallas vSphere"
    Write-Output $snapshotlist
    Write-Output "Snapshots existing before cleanup" | Out-File $logpath\Snapshots_$date.txt -Append
    Write-Output $snapshotlist | Out-File $logpath\Snapshots_$date.txt -Append
}

# Remove all snapshots older than 24 hours, then log what remains.
If((get-snapshot -vm *) -ne $null){
    get-snapshot -vm * | Where-Object {$_.Created -lt (Get-Date).AddDays(-1)} | Remove-Snapshot -Confirm:$false
    $snapshotlist = get-snapshot -vm * | select VM, Name, SizeMB, @{Name="Age";Expression={((Get-Date)-$_.Created).Days}}
    Write-Host "Current Snapshots in Dallas vSphere after cleanup"
    Write-Output $snapshotlist
    Write-Output "Snapshots existing after cleanup" | Out-File $logpath\Snapshots_$date.txt -Append
    Write-Output $snapshotlist | Out-File $logpath\Snapshots_$date.txt -Append
}
Else{
    Write-Output "No Snapshots to clean up." | Out-File $logpath\Snapshots_$date.txt -Append
}

# Send snapshot log to email.
$emailbody = (Get-Content $logpath\Snapshots_$date.txt | Out-String)
Send-MailMessage -From "server@domain.com" -To "user@domain.com" -Subject "Daily vSphere snapshot cleanup report" -Body $emailbody -SmtpServer "smtp.server.com"

# Exit VIM server session.
Try{
    disconnect-viserver -server your.vmware.server -Confirm:$false -ErrorAction Stop
}
Catch{
    Write-Host "Failed disconnecting from VSphere."
    Send-MailMessage -From "server@domain.com" -To "user@domain.com" -Subject "Disconnection from VSphere Failed" -Body `
    "The powershell script is unable to disconnect from VSphere. Please manually disconnect" -SmtpServer "smtp.server.com"
}

# Cleanup Snapshot logs older than 30 days.
gci -path $logpath -Recurse -Force | Where-Object {!$_.PSIsContainer -and $_.LastWriteTime -lt (Get-Date).AddDays(-30)} | Remove-Item -Force

There is more logic in here for sending email alerts than actual VMware commands, however when this runs automated as a scheduled PowerShell job, it is super helpful to have all the emails.
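To run it unattended, a Windows scheduled task works fine. This sketch assumes the script above is saved as C:\Scripts\Clean-Snapshots.ps1 (a path and filename I’ve made up for illustration):

```shell
# Register the cleanup script as a daily 2am scheduled task
# (hypothetical task name and script path - adjust to your environment)
schtasks /Create /TN "vSphere Snapshot Cleanup" /SC DAILY /ST 02:00 /TR "powershell.exe -NoProfile -ExecutionPolicy Bypass -File C:\Scripts\Clean-Snapshots.ps1"
```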