The Great Split
Abahbob / April 2023 (902 Words, 6 Minutes)
If we want to actually compile any code, we’re going to need to organize things a bit. Let’s start with some naive file splitting.
Deciding where to split
Last time, we cleaned up literal pools. These seem like a reasonable place to try splitting our files. While it’s not a guarantee that that’s where files were split (some files may have no literals, and other files might have literals inserted mid-way), it should work well enough for our general case. We can always manually merge files later.
The following code is mostly courtesy of ChatGPT:
import re

# Change these to match your input and output files
input_file = "repo/asm/code.s"
output_prefix = "test/split_"

# Matches the start of a function and captures its address
func_start_pattern = re.compile(r"\w*_func_start sub_([0-9A-F]+)")
# Matches a literal pool entry, which we treat as the end of a file
boundary_pattern = re.compile(r"_080[0-9A-F]* DCDU \w*")

# Read in the input file
with open(input_file, "r") as f:
    input_data = f.readlines()

start_line = 0
current_file = None  # address of the first function in the current chunk
in_block = False     # True while we are inside a literal pool

for idx, line in enumerate(input_data):
    result = func_start_pattern.search(line)
    if result and current_file is None:
        current_file = result.group(1)

    if boundary_pattern.search(line):
        in_block = True
    elif in_block and line.strip() == "":
        # The first blank line after a literal pool closes the current file
        output_file = output_prefix + current_file + ".s"
        with open(output_file, "w") as f:
            f.write("".join(input_data[start_line:idx]))
        start_line = idx
        current_file = None
        in_block = False

# Write out any functions that follow the last literal pool
if current_file is not None:
    with open(output_prefix + current_file + ".s", "w") as f:
        f.write("".join(input_data[start_line:]))
This creates a lot of .s files, which is great. What’s not great is that now we have a ton of .s files that we’re going to have to manage.
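Before touching the new files, it may be worth checking that the split didn't lose anything. The sketch below assumes the chunks are named `split_<hex address>.s` (as the script above produces) and that joining them in address order should give back the original text; the helper names `address_of`, `reassemble`, and `is_lossless` are made up for this example, and the sample contents are placeholders rather than real assembly.

```python
def address_of(file_name):
    """Return the address encoded in a name such as 'split_800065C.s'."""
    stem = file_name.rsplit(".", 1)[0]      # 'split_800065C'
    return int(stem.split("_", 1)[1], 16)   # 0x800065C


def reassemble(chunks):
    """Join chunk texts in ascending address order.

    `chunks` maps a split file name to that file's text.
    """
    return "".join(chunks[name] for name in sorted(chunks, key=address_of))


def is_lossless(original_text, chunks):
    """True if the chunks, joined in address order, equal the original."""
    return reassemble(chunks) == original_text


# Tiny in-memory example with placeholder contents
original = "func a\n_08000220 DCDU 0x0\n\nfunc b\n_08000330 DCDU 0x0\n"
chunks = {
    "split_8000324.s": "\nfunc b\n_08000330 DCDU 0x0\n",
    "split_8000210.s": "func a\n_08000220 DCDU 0x0\n",
}
print(is_lossless(original, chunks))
```

To run this against the real output, read `repo/asm/code.s` into `original` and each `test/split_*.s` file into `chunks`. It only makes sense before any headers are added to the split files.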
Fixing imports
With all of the functions in one file, they were able to reference each other easily. Now we’ve got to import and export everything. We also need to add in our macros import.
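import os
import re

dir_path = "test"

# Collect every regular file in the split directory
res = []
for path in os.listdir(dir_path):
    if os.path.isfile(os.path.join(dir_path, path)):
        res.append(path)

# Directives are indented because the assembler treats column 1 as a label
HEADER = '''\
\tINCLUDE asm/macros.inc
\tAREA text, CODE
'''

# Matches a reference to a function, e.g. "bl sub_8000324"
sub_pattern = re.compile(r"(sub_[0-9A-F]+)")

for f_name in res:
    file_path = os.path.join(dir_path, f_name)
    with open(file_path, "r") as f:
        base = f.readlines()

    locations = []   # labels defined in this file
    references = []  # symbols this file uses
    for line in base:
        if "func_start" in line:
            continue
        if line.startswith(" "):
            # Indented lines are instructions; record any sub_ reference
            result = sub_pattern.search(line)
            if not result:
                continue
            references.append(result.group(1))
        else:
            # Unindented lines begin with a label
            if " " in line:
                locations.append(line.split()[0].strip())
            else:
                locations.append(line.strip())
            # Literal pool entries that hold a symbol rather than a constant
            if "DCDU" in line and "DCDU 0x" not in line:
                references.append(line.split()[-1])

    # Anything used here but defined elsewhere needs an IMPORT
    imports = sorted(set(references) - set(locations))

    # Rewrite the file with the header and imports in front of the old contents
    with open(file_path, "r+") as f:
        file_data = f.read()
        f.seek(0, 0)
        f.write(HEADER)
        for i in imports:
            f.write(f"\tIMPORT {i}\n")
        f.write(file_data)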
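The core of that script is a set difference: every symbol a file references, minus every label the file defines itself. Here is a small sketch of just that step on a made-up chunk. The function name `find_imports` and the sample lines (including `gUnknown_03000010`) are hypothetical, and it assumes the same layout the script relies on, with instructions indented by spaces and labels starting in column 1.

```python
import re

SUB_REF = re.compile(r"(sub_[0-9A-F]+)")


def find_imports(lines):
    """Return the sorted symbols that are referenced but not defined in `lines`."""
    locations = set()   # labels defined in this chunk
    references = set()  # symbols used by this chunk
    for line in lines:
        if "func_start" in line or not line.strip():
            continue
        if line.startswith(" "):
            # Instruction line: pick up a sub_ reference if there is one
            match = SUB_REF.search(line)
            if match:
                references.add(match.group(1))
        else:
            # Label line, possibly a literal pool entry
            tokens = line.split()
            locations.add(tokens[0])
            if "DCDU" in tokens and not tokens[-1].startswith("0x"):
                references.add(tokens[-1])
    return sorted(references - locations)


sample = [
    "    thumb_func_start sub_8000210\n",
    "sub_8000210\n",
    "    push {lr}\n",
    "    bl sub_8000324\n",
    "    bl sub_8000210\n",
    "    pop {pc}\n",
    "_08000220 DCDU sub_8000400\n",
    "_08000224 DCDU 0x03000000\n",
    "_08000228 DCDU gUnknown_03000010\n",
    "\n",
]
print(find_imports(sample))
# ['gUnknown_03000010', 'sub_8000324', 'sub_8000400']
```

The call to `sub_8000210` doesn't show up in the result because that label is defined in the same chunk, and the `0x03000000` entry is skipped because it's a plain constant.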
Updating our linker script
Now we have to manually include every single file in our scatter_script.txt. There don’t seem to be any wildcards, and the linker does some dynamic shuffling of locations behind the scenes if the sizes aren’t perfect, so let’s just be very explicit about everything.
...
.text2 0x08000210
{
    split_8000210.o
    split_8000324.o
    split_800065C.o
    split_8000914.o
    split_8000BAC.o
    split_8000C7C.o
    split_8000D64.o
...
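Typing that list by hand gets old fast, so here is a rough sketch of generating it instead. It assumes each `split_<hex address>.s` file is assembled to an object file with the same name and a `.o` extension, which matches the entries above; `scatter_entries` is a made-up helper name, and the file names in the example are only for illustration.

```python
def scatter_entries(file_names, indent="    "):
    """Return one scatter-file line per split source, ordered by address."""

    def address_of(name):
        # 'split_800065C.s' -> 0x800065C
        return int(name[len("split_"):-len(".s")], 16)

    sources = [
        name for name in file_names
        if name.startswith("split_") and name.endswith(".s")
    ]
    # Sort numerically so the order follows the addresses in the ROM
    sources.sort(key=address_of)
    return [f"{indent}{name[:-2]}.o" for name in sources]


names = ["split_8000BAC.s", "split_8000210.s", "notes.txt", "split_800065C.s"]
print("\n".join(scatter_entries(names)))
```

In practice `names` would come from `os.listdir("test")`, and the printed lines can be pasted between the braces of the region.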
We’re finally at a point where we can start working on actual decompilation. I’ll leave that for the next post, though. There’s going to be a lot to talk about, so I’ll end this one here.