Software Performance Optimizations for F2.48xlarge Instances#
This guide outlines strategies for maximizing performance on
f2.48xlarge instances through effective CPU-to-FPGA mapping and NUMA
optimization. In dual-socket configurations, implementing NUMA-aware
techniques is essential for minimizing latency and maximizing PCIe
bandwidth between CPUs and FPGA accelerators. The optimizations in
this document do not apply to f2.6xlarge and f2.12xlarge
instances because, on those sizes, all CPU resources, memory, NVMe
devices, and FPGA devices are on a single NUMA node.
Quick Start Guide#
To optimize your application's performance on an f2.48xlarge
instance, refer to the Script to Construct an FPGA to NUMA Node and vCPU
Mapping or follow the steps below:
- Determine the FPGA slot numbers using `fpga-describe-local-image`
- Locate the optimal vCPUs for your slot in the mapping table
- Apply CPU pinning using either:
  - `numactl --localalloc --physcpubind <vCPU list> <bash command>`
  - Application-specific CPU affinity settings
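The steps above can be sketched as a small wrapper. This is a minimal sketch and not part of the devkit: the function names `optimal_vcpus_for_slot` and `pin_to_slot` and the `DRY_RUN` variable are hypothetical, and the vCPU lists are copied by hand from the mapping table in this guide rather than queried from the system.

```bash
#!/usr/bin/env bash
# Hypothetical helper: the vCPU lists below are copied from this guide's
# f2.48xlarge mapping table; they are not discovered at run time.
optimal_vcpus_for_slot() {
    case "$1" in
        0) echo "48-55,144-151" ;;
        1) echo "56-63,152-159" ;;
        2) echo "64-71,160-167" ;;
        3) echo "72-79,168-175" ;;
        4) echo "0-7,96-103" ;;
        5) echo "8-15,104-111" ;;
        6) echo "16-23,112-119" ;;
        7) echo "24-31,120-127" ;;
        *) echo "unknown FPGA slot: $1" >&2; return 1 ;;
    esac
}

# Run a command pinned to the optimal vCPUs of an FPGA slot.
# With DRY_RUN=1 (an assumption of this sketch) the numactl command
# is printed instead of executed.
pin_to_slot() {
    local slot="$1"; shift
    local vcpus
    vcpus=$(optimal_vcpus_for_slot "$slot") || return 1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "numactl --localalloc --physcpubind $vcpus $*"
    else
        numactl --localalloc --physcpubind "$vcpus" "$@"
    fi
}
```

For example, `DRY_RUN=1 pin_to_slot 0 ./my_app` (where `./my_app` stands in for your own binary) prints the `numactl` command that would be run for slot 0.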
 
F2 Instance Overview#
The f2.48xlarge instance consists of 2 AMD Milan CPUs in a
dual-socket configuration with 192 vCPUs, 2,048 GiB (2 TiB) of memory,
7,600 GiB across 8 NVMe SSDs, and 8 FPGAs. Each socket directly connects
to 96 vCPUs, 1 TiB of memory, 3,800 GiB across 4 NVMe SSDs, and 4 FPGAs.
The networking interface directly connects to CPU socket 0.
Linux System Tools#
The following Linux tools help verify system configuration:
| Tool | Purpose | Installation/Usage |
|---|---|---|
| `lspci` | View PCI topology | Built-in |
| `lstopo` | Visualize system topology | Provided by the hwloc package |
| `lscpu` | View CPU/NUMA mapping | Built-in |
| `numactl` | Show NUMA configuration | Built-in |
Note: For FPGA visibility, use lstopo --whole-io or
lstopo-no-graphics --whole-io
NUMA Best Practices for F2 Instances#
NUMA (Non-Uniform Memory Access) is a computer memory design where
memory access time depends on the memory location relative to a
processor. In a NUMA system, a processor can access its own local memory
faster than non-local memory (memory local to another processor or
memory shared between processors). The f2.48xlarge instance has two
NUMA nodes, each associated with a CPU socket and all colocated devices:
- Each NUMA node contains:
  - 96 vCPUs
  - 1 TiB of local memory
  - 4 FPGAs
  - 4 NVMe drives
- Memory access characteristics:
  - Local memory access (same NUMA node) results in the lowest latency
  - Remote memory access (different NUMA node) results in higher latency
 
Why NUMA Matters#
NUMA awareness is crucial for performance because:
- Local memory access is significantly faster than remote access 
- PCIe devices (like FPGAs) perform best when the controlling process runs on CPUs in the same NUMA node 
- Memory bandwidth is higher for local access 
- Improper NUMA alignment can cause significant performance degradation 
This is why the vCPU to FPGA mapping table in this guide is important: it identifies the optimal CPU cores for each FPGA device.
Identifying the CPU to FPGA NUMA Mapping#
The bus:device:function (BDF) mapping of FPGA devices is in slot
order. On an f2.48xlarge instance, the lowest BDF hex value will be slot
0 and the highest BDF hex value will be slot 7. The
fpga-describe-local-image command will display this:
```bash
$ sudo fpga-describe-local-image -S 0 -H
Type  FpgaImageSlot  FpgaImageId             StatusName    StatusCode   ErrorName    ErrorCode   ShVersion
AFI          0       No AFI                  cleared           1        ok               0       0x10162423
Type  FpgaImageSlot  VendorId    DeviceId    DBDF
AFIDEVICE    0       0x1d0f      0x9048      0000:9f:00.0
```
The NUMA node for this device can be found in the Linux PCI hierarchy:
```bash
$ cat /sys/bus/pci/devices/0000\:9f\:00.0/numa_node
1
```
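The same sysfs lookup can be repeated for every FPGA at once. The following is a minimal sketch: the function name `list_fpga_numa_nodes` is hypothetical, the vendor and device IDs (0x1d0f and 0x9048) are taken from the command output above, and the optional argument for an alternative sysfs root is an assumption added only so the function can be exercised outside an F2 instance.

```bash
#!/usr/bin/env bash
# Print "<BDF> <numa_node>" for every PCI device whose vendor/device IDs
# match the FPGA IDs shown by fpga-describe-local-image (0x1d0f / 0x9048).
# $1 (optional, assumption of this sketch): alternative sysfs PCI root.
list_fpga_numa_nodes() {
    local root="${1:-/sys/bus/pci/devices}"
    local dev
    for dev in "$root"/*; do
        # Skip entries that do not expose the expected attributes
        [ -r "$dev/vendor" ] && [ -r "$dev/device" ] || continue
        [ "$(cat "$dev/vendor")" = "0x1d0f" ] || continue
        [ "$(cat "$dev/device")" = "0x9048" ] || continue
        printf '%s %s\n' "$(basename "$dev")" "$(cat "$dev/numa_node")"
    done
}
```

Because the shell expands the glob in sorted order, the lines should come out in ascending BDF order, which this guide describes as slot order.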
The vCPU NUMA node mappings can be found with numactl -H. An
f2.48xlarge instance would display the following:
$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 0 size: 1023962 MB
node 0 free: 1021988 MB
node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 1 size: 1023981 MB
node 1 free: 1022619 MB
node distances:
node   0   1
  0:  10  32
  1:  32  10
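To use this output in a script, the vCPU list of a node can be pulled out of the `node <N> cpus:` line. This is a minimal sketch; the function name `node_cpus` is hypothetical, and it assumes the `numactl -H` text format shown above.

```bash
#!/usr/bin/env bash
# Read `numactl -H` text on stdin and print the CPU IDs of one NUMA node,
# one per line. $1 is the NUMA node number.
node_cpus() {
    awk -v node="$1" '$1 == "node" && $2 == node && $3 == "cpus:" {
        for (i = 4; i <= NF; i++) print $i
    }'
}
```

On an f2.48xlarge instance, `numactl -H | node_cpus 1 | wc -l` should report 96 vCPUs, given the output above.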
Ideal vCPU to FPGA Mapping for Optimal PCIe Performance#
Processes can be pinned to particular vCPUs by using Linux tools such as
numactl. The table below lists the optimal vCPUs for each FPGA slot on
an f2.48xlarge instance. “Optimal vCPUs” refers to the 16 vCPUs
(sharing an L3 cache) that are best suited to that slot.
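| FPGA Slot # | NUMA Node | Optimal vCPUs | Example of `numactl` Usage | Colocated vCPUs |
|---|---|---|---|---|
| 0 | 1 | 48-55, 144-151 | `numactl --localalloc --physcpubind 48-55,144-151 <bash command>` | 48-95, 144-191 |
| 1 | 1 | 56-63, 152-159 | `numactl --localalloc --physcpubind 56-63,152-159 <bash command>` | 48-95, 144-191 |
| 2 | 1 | 64-71, 160-167 | `numactl --localalloc --physcpubind 64-71,160-167 <bash command>` | 48-95, 144-191 |
| 3 | 1 | 72-79, 168-175 | `numactl --localalloc --physcpubind 72-79,168-175 <bash command>` | 48-95, 144-191 |
| 4 | 0 | 0-7, 96-103 | `numactl --localalloc --physcpubind 0-7,96-103 <bash command>` | 0-47, 96-143 |
| 5 | 0 | 8-15, 104-111 | `numactl --localalloc --physcpubind 8-15,104-111 <bash command>` | 0-47, 96-143 |
| 6 | 0 | 16-23, 112-119 | `numactl --localalloc --physcpubind 16-23,112-119 <bash command>` | 0-47, 96-143 |
| 7 | 0 | 24-31, 120-127 | `numactl --localalloc --physcpubind 24-31,120-127 <bash command>` | 0-47, 96-143 |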
NOTE: In place of the `--physcpubind <vCPU list>` argument, users
can also pass `--cpunodebind <NUMA node ID>` to bind to all vCPUs on that NUMA node.
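Applications that set CPU affinity themselves often need individual CPU IDs rather than ranges. The following minimal sketch expands a range list from the table into one ID per line; the function name `expand_cpu_list` is hypothetical, and it relies on the `seq` utility being available.

```bash
#!/usr/bin/env bash
# Expand a CPU list such as "48-55,144-151" into one CPU ID per line.
expand_cpu_list() {
    local part start end
    local IFS=','            # split the argument on commas only
    for part in $1; do
        case "$part" in
            *-*) start=${part%-*}; end=${part#*-}; seq "$start" "$end" ;;
            *)   echo "$part" ;;
        esac
    done
}
```

For example, `expand_cpu_list 48-55,144-151 | wc -l` gives a quick check that a slot's optimal vCPU list contains the 16 vCPUs described above.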
Script to Construct an FPGA to NUMA Node and vCPU Mapping#
Execute the following bash command to construct a table that maps the FPGA devices to their NUMA node and colocated vCPUs:
(
    printf "%-8s %-11s %-11s %-13s %-10s %s\n" "SLOT" "VENDOR_ID" "DEVICE_ID" "BDF" "NUMA_NODE" "vCPUs_(Physical,Virtual)"
    sudo fpga-describe-local-image-slots | while read -r dev slot vendor device bdf; do
        numa_node=$(sudo cat /sys/bus/pci/devices/$bdf/numa_node)
        vcpus=$(lscpu -p=CPU,NODE | grep "^[0-9]*,$numa_node$" | cut -d',' -f1)
        # Organize CPUs into ranges
        physical_cpus=$(echo "$vcpus" | awk -v ORS='' '
            function print_range(start, end) {
                if (start == end) return start;
                return start "-" end;
            }
            NR==1 {start=end=$1; prev=$1; next}
            {
                if ($1 != prev+1) {
                    printf "%s,", print_range(start, end);
                    start=$1;
                }
                end=$1;
                prev=$1;
            }
            END {printf "%s", print_range(start, end)}
        ')
        printf "%-8s %-11s %-11s %-13s %-10s %s\n" \
            "$slot" "$vendor" "$device" "$bdf" "$numa_node" "$physical_cpus"
    done
) | column -t
Sample output from an f2.48xlarge instance:
SLOT  VENDOR_ID  DEVICE_ID  BDF           NUMA_NODE  vCPUs_(Physical,Virtual)
0     0x1d0f     0x9048     0000:9f:00.0  1          48-95,144-191
1     0x1d0f     0x9048     0000:a1:00.0  1          48-95,144-191
2     0x1d0f     0x9048     0000:a3:00.0  1          48-95,144-191
3     0x1d0f     0x9048     0000:a5:00.0  1          48-95,144-191
4     0x1d0f     0x9048     0000:ae:00.0  0          0-47,96-143
5     0x1d0f     0x9048     0000:b0:00.0  0          0-47,96-143
6     0x1d0f     0x9048     0000:b2:00.0  0          0-47,96-143
7     0x1d0f     0x9048     0000:b4:00.0  0          0-47,96-143
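The table can also be consumed by other scripts. This is a minimal sketch that assumes the six-column layout shown above is piped in on stdin; the function names `slots_on_node` and `vcpus_for_slot` are hypothetical.

```bash
#!/usr/bin/env bash
# Both functions read the mapping table (including its header row) on stdin.

# Print the FPGA slot numbers that sit on NUMA node $1.
slots_on_node() {
    awk -v node="$1" 'NR > 1 && $5 == node { print $1 }'
}

# Print the colocated vCPU list for FPGA slot $1.
vcpus_for_slot() {
    awk -v slot="$1" 'NR > 1 && $1 == slot { print $6; exit }'
}
```

The value printed by `vcpus_for_slot` is in the range-list form that `numactl --physcpubind` accepts, so it can be passed to that argument directly.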
Frequently Asked Questions (FAQ)#
How can I investigate system performance issues?#
To investigate low performance (high latency or decreased bandwidth)#
- Verify that the process accessing the FPGA is running on the expected vCPU with `top -H -p <pid>`
- Check the NUMA alignment by hand with `numactl --hardware`
- Monitor the PCIe traffic on AMD processors with tools such as amd-uprof
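In addition to watching where threads currently run, it can help to confirm which vCPUs a process is allowed to run on. Linux exposes this in the `Cpus_allowed_list` field of `/proc/<pid>/status`. The following is a minimal sketch; the function name `allowed_cpus` is hypothetical, and it takes the path of the status file as its argument.

```bash
#!/usr/bin/env bash
# Print the Cpus_allowed_list value from a /proc/<pid>/status-style file.
# $1: path to the status file, e.g. /proc/<pid>/status
allowed_cpus() {
    awk -F':[ \t]*' '$1 == "Cpus_allowed_list" { print $2 }' "$1"
}
```

If a process was started with `numactl --physcpubind 48-55,144-151`, `allowed_cpus /proc/<pid>/status` would be expected to print the same list. A wider list suggests the pinning was not applied.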
To investigate inconsistent performance (latency or bandwidth spikes)#
- Ensure no other processes are using the same vCPUs with tools such as `htop`
- Monitor system resources with `sar`
- Verify memory allocation with `numastat`
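As a complement to `numastat`, the per-node page counts of a single process can be read from `/proc/<pid>/numa_maps`, where each mapping carries `N<node>=<pages>` fields. This is a minimal sketch; the function name `pages_per_node` is hypothetical. The totals are page counts, not bytes, because mappings can use different page sizes.

```bash
#!/usr/bin/env bash
# Sum the N<node>=<pages> fields of a /proc/<pid>/numa_maps-style file
# and print one "N<node> <pages>" line per NUMA node.
# $1: path to the numa_maps file, e.g. /proc/<pid>/numa_maps
pages_per_node() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^N[0-9]+=[0-9]+$/) {
                split($i, kv, "=")
                pages[kv[1]] += kv[2]
            }
    }
    END { for (n in pages) print n, pages[n] }' "$1" | sort
}
```

For a process pinned to the vCPUs of one NUMA node with `--localalloc`, most pages would be expected on that node. A large count on the other node points to remote memory access.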
Where can I reach out for additional help?#
For any issues with the devkit documentation or code, please open a GitHub issue with all steps to reproduce.
For questions about F2 instances, please open a re:Post issue with the ‘FPGA Development’ tag.