VMware vSphere: “A general system error occurred: Connection refused” when starting virtual machines.

I had an issue the other day starting a VM. It would place successfully via DRS, but then fail with “A general system error occurred: Connection refused.”

[Screenshot: the power-on task failing with the error]

Googling told me the likely culprit was the vmware-vpx-workflow service being stopped. I SSHed into my VCSA and sure enough found that the service was indeed stopped.

[Screenshot: vmware-vpx-workflow listed as stopped]

So I attempted to start the service, and that failed as well.

[Screenshot: the service failing to start]

What the hell? Tailing all the logs in the /var/log/vmware/workflow folder didn’t turn up anything. However, after re-reading the errors from the start attempt, I realized… maybe it’s a disk space issue.

[Screenshot: disk space errors during the start attempt]

Sure enough, our log disk was full. I grew the log disk’s VMDK, ran the autogrow command in the VCSA to resize the partitions, restarted the services, and voilà!

[Screenshot: the services starting successfully]

After updating the disk size, we were all set and I was able to start VMs without issues.
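For anyone hitting the same thing: a quick df check would have caught this before services started dying. A minimal sketch (the 90% threshold is an arbitrary choice of mine, and the VCSA resize commands in the comments vary by appliance version):

```shell
# Flag any filesystem at or above 90% full; on a VCSA, a full /storage/log
# partition is a common reason vmware-vpx-workflow refuses to start.
df -P | awk 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 >= 90) print $6 " is " $5 "% full" }'

# After growing the log VMDK in vSphere, the appliance-side resize and
# service restart look roughly like this (VCSA 6.0-era commands; check
# the VMware KB for your exact version before running them):
#   vpxd_servicecfg storage lvm autogrow
#   service-control --start vmware-vpx-workflow
```

Dropping something like the df one-liner into a cron job on the appliance gives you a cheap early warning before the workflow service falls over.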

VMware Paravirtual SCSI adapter: Is it really that much faster?

I asked myself the same question after reading a best-practices guide from Datrium that suggested using the VMware PVSCSI controller instead of the LSI SAS controller VMware recommends by default when you create a Windows VM.

Out of curiosity I spun up a new Windows Server 2016 VM (4 cores, 8 GB of RAM, and a 100 GB drive) hosted on my Datrium storage to find out how much of a difference there was.

I ran this test during a normal production workload, using Microsoft DiskSpd with a 16K I/O size (the current average for my app servers) to see what results we would get. The specific command I used was

diskspd.exe -b16K -d1800 -h -L -o2 -t4 -r -w50 -c10G C:\io.dat

That is: 16 KB blocks (-b16K), a 30-minute run (-d1800), software caching and hardware write caching disabled (-h), latency capture (-L), two outstanding I/Os per thread (-o2), four threads (-t4), random access (-r), a 50/50 read/write mix (-w50), and a 10 GB test file (-c10G).

The first run, on the VMware LSI SAS controller, resulted in this:

Command Line: C:\Users\cjoyce_admin\Downloads\Diskspd-v2.0.17\amd64fre\diskspd.exe -b16K -d1800 -h -L -o2 -t4 -r -w50 -c10G c:\io.dat

Input parameters:

timespan: 1
-------------
duration: 1800s
warm up time: 5s
cool down time: 0s
measuring latency
random seed: 0
path: 'c:\io.dat'
think time: 0ms
burst size: 0
software cache disabled
hardware write cache disabled, writethrough on
performing mix test (read/write ratio: 50/50)
block size: 16384
using random I/O (alignment: 16384)
number of outstanding I/O operations: 2
thread stride size: 0
threads per file: 4
using I/O Completion Ports
IO priority: normal

Results for timespan 1:
*******************************************************************************

actual test time: 1800.00s
thread count: 4
proc count: 4

CPU | Usage | User | Kernel | Idle
-------------------------------------------
0| 8.35%| 1.84%| 6.50%| 91.65%
1| 8.38%| 1.89%| 6.48%| 91.62%
2| 7.78%| 1.79%| 5.99%| 92.22%
3| 7.39%| 1.60%| 5.79%| 92.61%
-------------------------------------------
avg.| 7.97%| 1.78%| 6.19%| 92.03%

Total IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 15150776320 | 924730 | 8.03 | 513.74 | 3.888 | 3.175 | c:\io.dat (10240MB)
1 | 15089106944 | 920966 | 7.99 | 511.65 | 3.904 | 3.289 | c:\io.dat (10240MB)
2 | 15108947968 | 922177 | 8.00 | 512.32 | 3.899 | 3.140 | c:\io.dat (10240MB)
3 | 15109013504 | 922181 | 8.01 | 512.32 | 3.898 | 3.086 | c:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 60457844736 | 3690054 | 32.03 | 2050.03 | 3.897 | 3.173

Read IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 7574110208 | 462287 | 4.01 | 256.83 | 3.274 | 2.741 | c:\io.dat (10240MB)
1 | 7539032064 | 460146 | 3.99 | 255.64 | 3.297 | 2.966 | c:\io.dat (10240MB)
2 | 7562526720 | 461580 | 4.01 | 256.43 | 3.297 | 2.861 | c:\io.dat (10240MB)
3 | 7543046144 | 460391 | 4.00 | 255.77 | 3.293 | 2.613 | c:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 30218715136 | 1844404 | 16.01 | 1024.67 | 3.290 | 2.798

Write IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 7576666112 | 462443 | 4.01 | 256.91 | 4.501 | 3.448 | c:\io.dat (10240MB)
1 | 7550074880 | 460820 | 4.00 | 256.01 | 4.510 | 3.479 | c:\io.dat (10240MB)
2 | 7546421248 | 460597 | 4.00 | 255.89 | 4.501 | 3.289 | c:\io.dat (10240MB)
3 | 7565967360 | 461790 | 4.01 | 256.55 | 4.503 | 3.389 | c:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 30239129600 | 1845650 | 16.02 | 1025.36 | 4.504 | 3.402
%-ile | Read (ms) | Write (ms) | Total (ms)
----------------------------------------------
min | 0.000 | 0.000 | 0.000
25th | 1.360 | 2.258 | 1.709
50th | 2.818 | 3.885 | 3.269
75th | 4.481 | 6.093 | 5.443
90th | 6.259 | 8.370 | 7.195
95th | 7.163 | 9.928 | 8.987
99th | 10.090 | 13.425 | 12.593
3-nines | 23.523 | 30.284 | 27.785
4-nines | 47.191 | 52.535 | 49.878
5-nines | 190.339 | 161.402 | 190.339
6-nines | 534.581 | 534.289 | 534.289
7-nines | 545.593 | 535.040 | 545.593
8-nines | 545.593 | 535.040 | 545.593
9-nines | 545.593 | 535.040 | 545.593
max | 545.593 | 535.040 | 545.593

Overall, not terrible. Now let’s look at what we get when we swap the LSI SAS controller for PVSCSI.

Input parameters:

timespan: 1
-------------
duration: 1800s
warm up time: 5s
cool down time: 0s
measuring latency
random seed: 0
path: 'c:\io.dat'
think time: 0ms
burst size: 0
software cache disabled
hardware write cache disabled, writethrough on
performing mix test (read/write ratio: 50/50)
block size: 16384
using random I/O (alignment: 16384)
number of outstanding I/O operations: 2
thread stride size: 0
threads per file: 4
using I/O Completion Ports
IO priority: normal

Results for timespan 1:
*******************************************************************************

actual test time: 1800.00s
thread count: 4
proc count: 4

CPU | Usage | User | Kernel | Idle
-------------------------------------------
0| 7.37%| 1.53%| 5.84%| 92.63%
1| 7.02%| 1.40%| 5.62%| 92.98%
2| 6.35%| 1.25%| 5.10%| 93.65%
3| 6.04%| 1.22%| 4.82%| 93.96%
-------------------------------------------
avg.| 6.70%| 1.35%| 5.35%| 93.30%

Total IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 15667019776 | 956239 | 8.30 | 531.24 | 3.760 | 2.938 | c:\io.dat (10240MB)
1 | 15743369216 | 960899 | 8.34 | 533.83 | 3.741 | 3.011 | c:\io.dat (10240MB)
2 | 15789637632 | 963723 | 8.37 | 535.40 | 3.730 | 2.841 | c:\io.dat (10240MB)
3 | 15788425216 | 963649 | 8.36 | 535.36 | 3.731 | 2.914 | c:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 62988451840 | 3844510 | 33.37 | 2135.84 | 3.740 | 2.926

Read IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 7831814144 | 478016 | 4.15 | 265.56 | 2.660 | 2.405 | c:\io.dat (10240MB)
1 | 7862943744 | 479916 | 4.17 | 266.62 | 2.640 | 2.538 | c:\io.dat (10240MB)
2 | 7904346112 | 482443 | 4.19 | 268.02 | 2.632 | 2.247 | c:\io.dat (10240MB)
3 | 7881277440 | 481035 | 4.18 | 267.24 | 2.631 | 2.557 | c:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 31480381440 | 1921410 | 16.68 | 1067.45 | 2.641 | 2.440

Write IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 7835205632 | 478223 | 4.15 | 265.68 | 4.859 | 3.010 | c:\io.dat (10240MB)
1 | 7880425472 | 480983 | 4.18 | 267.21 | 4.840 | 3.045 | c:\io.dat (10240MB)
2 | 7885291520 | 481280 | 4.18 | 267.38 | 4.831 | 2.946 | c:\io.dat (10240MB)
3 | 7907147776 | 482614 | 4.19 | 268.12 | 4.827 | 2.833 | c:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 31508070400 | 1923100 | 16.69 | 1068.39 | 4.839 | 2.959
%-ile | Read (ms) | Write (ms) | Total (ms)
----------------------------------------------
min | 0.000 | 0.000 | 0.000
25th | 1.189 | 2.947 | 1.810
50th | 1.868 | 4.126 | 3.120
75th | 3.536 | 6.037 | 4.971
90th | 5.392 | 8.026 | 6.924
95th | 6.269 | 9.628 | 8.417
99th | 9.446 | 13.234 | 12.021
3-nines | 22.655 | 32.422 | 28.825
4-nines | 45.679 | 50.249 | 48.554
5-nines | 158.326 | 159.371 | 159.371
6-nines | 475.470 | 427.329 | 427.329
7-nines | 475.711 | 427.338 | 475.711
8-nines | 475.711 | 427.338 | 475.711
9-nines | 475.711 | 427.338 | 475.711
max | 475.711 | 427.338 | 475.711

So overall we see roughly a 4% performance increase across the board. Not a groundbreaking number, but if you’re trying to squeeze every last drop of performance out of your VMs, it’s a step in the right direction.
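To sanity-check that figure, here is the arithmetic against the two “total:” summary lines above (IOPS values copied straight from the diskspd output):

```shell
# Total I/O per second from each run's "total:" summary line.
lsi_iops=2050.03     # LSI SAS controller
pvscsi_iops=2135.84  # PVSCSI controller

# Relative improvement: (new - old) / old * 100
awk -v old="$lsi_iops" -v new="$pvscsi_iops" \
    'BEGIN { printf "PVSCSI vs LSI SAS: +%.1f%% IOPS\n", (new - old) / old * 100 }'
# prints: PVSCSI vs LSI SAS: +4.2% IOPS
```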

Speaking of squeezing every last drop, let’s see what happens when we test against a ReFS-formatted disk.

Command Line: C:\Users\cjoyce_admin\Downloads\Diskspd-v2.0.17\amd64fre\diskspd.exe -b16K -d1800 -h -L -o2 -t4 -r -w50 -c10G E:\io.dat

Input parameters:

timespan: 1
-------------
duration: 1800s
warm up time: 5s
cool down time: 0s
measuring latency
random seed: 0
path: 'E:\io.dat'
think time: 0ms
burst size: 0
software cache disabled
hardware write cache disabled, writethrough on
performing mix test (read/write ratio: 50/50)
block size: 16384
using random I/O (alignment: 16384)
number of outstanding I/O operations: 2
thread stride size: 0
threads per file: 4
using I/O Completion Ports
IO priority: normal

Results for timespan 1:
*******************************************************************************

actual test time: 1800.02s
thread count: 4
proc count: 4

CPU | Usage | User | Kernel | Idle
-------------------------------------------
0| 8.65%| 1.62%| 7.03%| 91.35%
1| 8.69%| 1.49%| 7.20%| 91.31%
2| 7.83%| 1.35%| 6.47%| 92.17%
3| 7.43%| 1.36%| 6.07%| 92.57%
-------------------------------------------
avg.| 8.15%| 1.46%| 6.69%| 91.85%

Total IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 18047041536 | 1101504 | 9.56 | 611.94 | 3.263 | 2.708 | E:\io.dat (10240MB)
1 | 18078842880 | 1103445 | 9.58 | 613.02 | 3.258 | 3.004 | E:\io.dat (10240MB)
2 | 18066751488 | 1102707 | 9.57 | 612.61 | 3.260 | 2.712 | E:\io.dat (10240MB)
3 | 18132910080 | 1106745 | 9.61 | 614.85 | 3.248 | 2.727 | E:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 72325545984 | 4414401 | 38.32 | 2452.42 | 3.257 | 2.791

Read IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 9020080128 | 550542 | 4.78 | 305.85 | 2.762 | 2.399 | E:\io.dat (10240MB)
1 | 9030025216 | 551149 | 4.78 | 306.19 | 2.760 | 2.927 | E:\io.dat (10240MB)
2 | 9041592320 | 551855 | 4.79 | 306.58 | 2.759 | 2.342 | E:\io.dat (10240MB)
3 | 9050865664 | 552421 | 4.80 | 306.90 | 2.752 | 2.479 | E:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 36142563328 | 2205967 | 19.15 | 1225.53 | 2.758 | 2.547

Write IO
thread | bytes | I/Os | MB/s | I/O per s | AvgLat | LatStdDev | file
-----------------------------------------------------------------------------------------------------
0 | 9026961408 | 550962 | 4.78 | 306.09 | 3.764 | 2.899 | E:\io.dat (10240MB)
1 | 9048817664 | 552296 | 4.79 | 306.83 | 3.754 | 2.998 | E:\io.dat (10240MB)
2 | 9025159168 | 550852 | 4.78 | 306.03 | 3.762 | 2.954 | E:\io.dat (10240MB)
3 | 9082044416 | 554324 | 4.81 | 307.96 | 3.742 | 2.870 | E:\io.dat (10240MB)
-----------------------------------------------------------------------------------------------------
total: 36182982656 | 2208434 | 19.17 | 1226.90 | 3.756 | 2.931
%-ile | Read (ms) | Write (ms) | Total (ms)
----------------------------------------------
min | 0.267 | 0.297 | 0.267
25th | 1.252 | 1.773 | 1.403
50th | 2.019 | 3.097 | 2.618
75th | 3.724 | 5.038 | 4.275
90th | 5.581 | 6.998 | 6.240
95th | 6.395 | 8.584 | 7.525
99th | 9.641 | 12.213 | 11.021
3-nines | 20.505 | 26.232 | 23.305
4-nines | 42.971 | 45.559 | 44.280
5-nines | 238.498 | 175.573 | 204.921
6-nines | 502.382 | 359.149 | 435.862
7-nines | 547.128 | 547.124 | 547.128
8-nines | 547.128 | 547.124 | 547.128
9-nines | 547.128 | 547.124 | 547.128
max | 547.128 | 547.124 | 547.128

With a ReFS-formatted disk on top of PVSCSI, total throughput climbs to roughly 2,452 IOPS: about a 15% increase over PVSCSI on NTFS, and nearly a 20% increase over the original LSI SAS baseline!
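The same arithmetic applied to all three runs’ “total:” summary lines (values copied from the output above) puts numbers on the gain:

```shell
lsi=2050.03      # LSI SAS + NTFS
pvscsi=2135.84   # PVSCSI + NTFS
refs=2452.42     # PVSCSI + ReFS

awk -v a="$lsi" -v b="$pvscsi" -v c="$refs" 'BEGIN {
    printf "ReFS vs NTFS (both on PVSCSI): +%.1f%%\n", (c - b) / b * 100
    printf "ReFS+PVSCSI vs the LSI baseline: +%.1f%%\n", (c - a) / a * 100
}'
# prints: ReFS vs NTFS (both on PVSCSI): +14.8%
# prints: ReFS+PVSCSI vs the LSI baseline: +19.6%
```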

So if your applications support it and you truly want to squeeze every last drop out of your storage, ReFS on a PVSCSI controller is the combination to go with!

How to fix “A general system error occurred: vim.fault.GenericVmConfigFault” when creating or removing snapshots in VMware.

Recently I had a bunch of virtual machines that started generating this error during Veeam backups. I hadn’t really been checking my snapshots because my daily cleanup job was supposed to take care of that for me. Unfortunately, this bit me, and here we are.

As many of you have probably experienced, Veeam doesn’t always clean up after itself when it is finished backing up a VM. Sometimes a file lock or other operation prevents the snap cleanup, and you end up with a huge chain of snapshots.

Normally you would just right-click the VM, select Snapshots, then “Delete All” and be done; however, I was getting:

A general system error occurred: vim.fault.GenericVmConfigFault

[Screenshot: the snapshot removal failing with the error]

So how do we fix this? Well it turns out there are two ways.

Resolution 1:

This resolution requires downtime, but it is significantly faster. It does, however, have some caveats, which I will get to below.

  1. Shut down the Virtual Machine.
  2. SSH to the host where the VM was running.
  3. Change directory to the volume where the guest’s disks are stored.
    cd /vmfs/volumes/volumguid/machinefolder
    
  4. Find all delta.vmdk files for the VM:
    ls -ltrh | grep delta.vmdk
    
  5. If it looks something like this, you’re good to go. Notice how, of the 5 snapshot deltas, only 0x5.delta.vmdk actually holds data and the others are empty:
    [Screenshot: ls output showing a single non-empty delta]
  6. If it looks like this, you’re going to have to use Resolution 2. Notice how multiple snapshot deltas hold data:
    [Screenshot: ls output showing several non-empty deltas]
  7. Create a new folder in the VMs folder.
    mkdir ./tmp
    
  8. Move all of the EMPTY snapshot deltas to the ./tmp folder. DO NOT DELETE THEM. If the remaining steps do not work, we have to restore these files.
    mv VM_1-000001* ./tmp
    
  9.  Once all snapshots are moved and you are left with your base disk and your last snapshot, open the VMware vSphere Client and create a new VM. Make it identical to the old VM, but DO NOT add a disk. Remove the disk VMware wants to create and select “add existing disk.” Select the snapshot disk, NOT the base disk, and attach it to the VM. Do not select “Power On VM after Creation.”
  10. Once creation is complete, you will notice VMware report “Virtual machine disks need consolidation.” Right-click the VM and choose to consolidate the disks. Once consolidation is complete, boot the VM and verify functionality. You should be all set!
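Since the screenshots for steps 5 and 6 don’t reproduce here, a rough way to spot which deltas actually hold data is to filter by size. This is a sketch with an assumed 1 MB cut-off (an empty delta still carries some descriptor and grain-table overhead, so the exact threshold depends on your environment):

```shell
# Run from the VM's folder on the datastore, e.g. /vmfs/volumes/<guid>/<vm>.
# Lists only the delta files larger than 1 MB, i.e. the ones holding real data.
find . -maxdepth 1 -name '*delta.vmdk' -size +1M -print
```

If only the most recent delta shows up, Resolution 1 applies; if several do, skip ahead to Resolution 2.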

Resolution 2:

If your VM is not recoverable with Resolution 1, your alternative is to use VMware vCenter Converter Standalone to convert a “powered-on Windows / Linux machine.” So yes, you are doing a P2V-style migration of a virtual machine to a new virtual machine.

This isn’t the best solution in the world; it’s essentially the same as taking and restoring an OS-level backup instead of a VM-level one. However, it allows the VM to stay online during the entire operation, which can matter for those with very small maintenance windows who may not be able to complete a full consolidation in that time.

Automatically clean up VMware snapshots using PowerCLI.

Something every VMware admin who also uses Veeam has dealt with at least once is Veeam not cleaning up snapshots properly. That leads to broken snapshot chains, then VMDK corruption, and finally to an admin crying into his or her bourbon while realizing that no, the VM isn’t coming back up; no, the snap chain can’t be consolidated to recover it; and no, there haven’t been backups in $n days to recover from. A process I’ve put into my own environment to prevent this is a simple PowerCLI script that looks for all snapshots over 24 hours old and removes them. VMware recommends never keeping more than 3 snapshots in a chain, and, from a performance standpoint, never keeping any over 72 hours old. Personally, I agree completely. Snapshots should only be used before making a big change so you can quickly roll back, not something you create and then live off of.

This script requires that the VMware PowerCLI modules are installed on the system you run it from.

#################################################################
#   Remove all vSphere snapshots older than 24 hours            #
#   Created by - Cameron Joyce                                  #
#   Last Modified - Jun 19 2017                                 #
#################################################################
# This script uses PowerCLI to remove all virtual machine snapshots older than 24 hours.

# Load all VMware Modules, and set PowerCLI config.
Get-Module -ListAvailable VM* | Import-Module

# Connect to vSphere vCenter Server.
Try{
    connect-viserver -server your.vmware.server -user administrator@vsphere.local -Password Password
}
Catch{
    Write-Host "Failed Connecting to VSphere Server."
    Send-MailMessage -From "" -To "server@domain.com" -Subject "Unable to Connect to VSphere to clean snapshots" -Body `
    "The powershell script is unable to connect to host your.vmware.server. Please investigate." -SmtpServer "smtp.server.com"
    Break
}

# Variables
$date = get-date -f MMddyyyy
$logpath = "C:\Scripts\Script_Logs"

# Verify the log folder exists.
If(!(Test-Path $logpath)){
    Write-Host "Log path not found, creating folder."
    New-Item $logpath -Type Directory
}

# List all current snapshots and log them before cleanup.
If((get-snapshot -vm *) -ne $null){
    $snapshotlist = get-snapshot -vm * | select VM, Name, SizeMB, @{Name="Age";Expression={((Get-Date)-$_.Created).Days}}
    Write-Host "Current Snapshots in Dallas vSphere"
    Write-Output $snapshotlist
    Write-Output "Snapshots existing before cleanup" | Out-File $logpath\Snapshots_$date.txt -Append
    Write-Output $snapshotlist | Out-File $logpath\Snapshots_$date.txt -Append
}

# Remove snapshots older than 24 hours, then log whatever remains.
If((get-snapshot -vm *) -ne $null){
    get-snapshot -vm * | Where-Object {$_.Created -lt (Get-Date).AddDays(-1)} | Remove-Snapshot -Confirm:$false
    $snapshotlist = get-snapshot -vm * | select VM, Name, SizeMB, @{Name="Age";Expression={((Get-Date)-$_.Created).Days}}
    Write-Host "Current Snapshots in Dallas vSphere after cleanup"
    Write-Output $snapshotlist
    Write-Output "Snapshots existing after cleanup" | Out-File $logpath\Snapshots_$date.txt -Append
    Write-Output $snapshotlist | Out-File $logpath\Snapshots_$date.txt -Append
}
Else{
    Write-Output "No Snapshots to clean up." | Out-File $logpath\Snapshots_$date.txt -Append
}

# Send snapshot log to email.
$emailbody = (Get-Content $logpath\Snapshots_$date.txt | Out-String)
Send-MailMessage -From "server@domain.com" -To "user@domain.com" -Subject "Daily vSphere snapshot cleanup report" -Body $emailbody -SmtpServer "smtp.server.com"

# Exit VIM server session.
Try{
    disconnect-viserver -server your.vmware.server -Confirm:$false
}
Catch{
    Write-Host "Failed disconnecting from VSphere."
    Send-MailMessage -From "server@domain.com" -To "user@domain.com" -Subject "Disconnection from VSphere Failed" -Body `
    "The powershell script is unable to disconnect from VSphere. Please manually disconnect" -SmtpServer "smtp.server.com"
}

# Cleanup Snapshot logs older than 30 days.
gci -path $logpath -Recurse -Force | Where-Object {!$_.PSIsContainer -and $_.LastWriteTime -lt (Get-Date).AddDays(-30)} | Remove-Item -Force

There is more logic in here for sending email alerts than for actual VMware commands, but when this runs unattended as a scheduled PowerShell job, having all the emails is super helpful.