550 lines
18 KiB
Python
"""Assembler for the Hack computer

Usage:

    python assembler.py file

Loads an assembly file and translates it into machine language for the Hack
computer as specified in project 6 of the nand2tetris course.

# Assembler implementation details
This assembler works in 3 steps:
1. Load and clean the assembly file
2. Construct a symbol table referencing the user-defined labels and variables
3. Translate the asm file to binary code
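As a rough illustration of step 1, the cleaning stage can be sketched on an
in-memory list of lines. The helper name `clean_lines` is hypothetical and is
not a function of this file; it only mirrors the idea of removing whitespace,
trailing comments and empty lines.

```python
def clean_lines(lines):
    # Hypothetical helper: remove every whitespace character, drop anything
    # after "//", and discard lines that end up empty.
    stripped = (''.join(line.split()).split('//')[0] for line in lines)
    return [line for line in stripped if line]


source = ['  @2   // load 2 into A', '', 'D = A', '// only a comment']
print(clean_lines(source))  # prints: ['@2', 'D=A']
```

Steps 2 and 3 then only ever see compact lines such as `@2` or `D=A`.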
The file follows this pattern:
    File parsing (line 147)
    Symbol table (line 207)
    Assembling (line 322)
    ??

# Assembly language specifications:

## Note on registers
The Hack computer has three 16-bit registers: the D-, A- and M-registers.
The D-register stores "data" that can be used as input for the ALU.
The A-register can be used in the same way, but it also has a second role:
the RAM uses it as its address input. Any read/write instruction to the RAM
is therefore performed on the RAM register whose address is the value of the
A-register.
The M-register is that RAM register, the one 'pointed' to by the A-register.
Reading from or writing to M actually reads from or writes to the RAM.
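The relationship between A and M can be pictured with a toy model. This is
only a sketch: the class name `HackRegisters` and its attributes are
hypothetical and do not exist in this assembler.

```python
class HackRegisters:
    # Hypothetical toy model of the D-, A- and M-registers.
    def __init__(self, ram_size=32768):
        self.a = 0
        self.d = 0
        self.ram = [0] * ram_size

    @property
    def m(self):
        # M is whatever RAM cell A currently points to
        return self.ram[self.a]

    @m.setter
    def m(self, value):
        self.ram[self.a] = value


regs = HackRegisters()
regs.a = 16   # like "@16"
regs.m = 1    # like "M=1": writes RAM[16]
regs.a = 17   # M now designates RAM[17]
print(regs.ram[16], regs.m)  # prints: 1 0
```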

## Assembly instructions
There are three types of assembly instructions: A-instructions, C-instructions
and labels. Indentation and blanks are ignored. Comments can only be in-line:
they start with "//", run to the end of the line, and are ignored.

## A-instructions
- `"@" integer` where integer is a number in the range 0 to 32767 (15 bits).
  Sets the A register to contain the specified integer. Ex: @42
- `"@" label` where label is a user-defined label. Sets the A register to
  contain the code address corresponding to the label.
  Labels are upper-cased by convention, with "_" as word separator. Ex: @MAIN
- `"@" variable` where variable is a user-defined variable. Sets the A register
  to contain the RAM address corresponding to the variable. If a variable is
  encountered for the first time, it is automatically assigned an address.
  The address assignment starts at RAM address 16 and increments.
  Variables are lowercased by convention, with "_" as word separator. Ex: @i
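The three forms above can be illustrated with a simplified sketch. The helper
`resolve_a_instructions` is hypothetical: it takes the label addresses as a
ready-made dict, ignores the predefined symbols, and is not how this file
organises the work.

```python
def resolve_a_instructions(instructions, labels):
    # Hypothetical helper. `labels` maps label names to code addresses.
    symbols = dict(labels)
    next_free = 16  # first RAM address handed out to variables
    encoded = []
    for instruction in instructions:
        name = instruction[1:]  # drop the leading "@"
        if name.isdigit():
            value = int(name)
        else:
            if name not in symbols:
                # First time this variable is seen: give it an address
                symbols[name] = next_free
                next_free += 1
            value = symbols[name]
        # 16-bit word; the leading 0 marks an A-instruction
        encoded.append(format(value, '016b'))
    return encoded


for word in resolve_a_instructions(['@42', '@MAIN', '@i', '@j', '@i'],
                                   {'MAIN': 2}):
    print(word)
```

Here `@i` resolves to 16 both times it appears, and `@j` gets 17.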

## C-instructions
`(Dest-code "=")? op-code (";" jump-code)?`
- op-code:
  Only the op-code is mandatory. It represents an instruction to be performed
  by the ALU. Available codes and their associated outputs are:
  - 0 -> the constant 0
  - 1 -> the constant 1
  - -1 -> the constant -1
  - D -> the value contained in the D-register
  - A -> the value contained in the A-register
  - M -> the value contained in the M-register
  - !D -> bit-wise negation of the D-register
  - !A -> bit-wise negation of the A-register
  - !M -> bit-wise negation of the M-register
  - -D -> numerical negation of the D-register using 2's complement
  - -A -> numerical negation of the A-register using 2's complement
  - -M -> numerical negation of the M-register using 2's complement
  - D+1 -> 1 + value of the D-register
  - A+1 -> 1 + value of the A-register
  - M+1 -> 1 + value of the M-register
  - D-1 -> -1 + value of the D-register
  - A-1 -> -1 + value of the A-register
  - M-1 -> -1 + value of the M-register
  - D+A -> value of the D-register + value of the A-register
  - D+M -> value of the D-register + value of the M-register
  - D-A -> value of the D-register - value of the A-register
  - D-M -> value of the D-register - value of the M-register
  - A-D -> value of the A-register - value of the D-register
  - M-D -> value of the M-register - value of the D-register
  - D&A -> bit-wise AND of the values of the D and A registers
  - D&M -> bit-wise AND of the values of the D and M registers
  - D|A -> bit-wise OR of the values of the D and A registers
  - D|M -> bit-wise OR of the values of the D and M registers
- dest-code:
  If specified, should be followed by a "=" character. Available codes are:
  - D -> write the ALU instruction's output to the D-register
  - A -> write the ALU instruction's output to the A-register
  - M -> write the ALU instruction's output to the M-register
  - AD -> write the ALU instruction's output to the A- and D-registers
  - AM -> write the ALU instruction's output to the A- and M-registers
  - MD -> write the ALU instruction's output to the D- and M-registers
  - ADM -> write the ALU instruction's output to the A-, D- and M-registers
- jump-code:
  If specified, should be preceded by a ";" character. The computer is fed
  with a program containing one binary instruction per line. Each of those
  instructions should be seen as having a number, starting at 0 and increasing
  by one. The jump-code lets the computer jump to the instruction whose
  address is contained in the A-register if the result of the current
  operation satisfies a certain condition. Available codes and corresponding
  conditions are:
  - JEQ -> jump if the output is equal to 0
  - JLT -> jump if the output is lower than 0
  - JLE -> jump if the output is lower than or equal to 0
  - JGT -> jump if the output is greater than 0
  - JGE -> jump if the output is greater than or equal to 0
  - JNE -> jump if the output is not 0
  - JMP -> jump whatever the output
- Examples:
      @3         // Set A to 3
      0;JMP      // Unconditional jump to code line 3.
      @42        // Set A to 42
      D=D-A;JEQ  // Set D to D-A. If D-A == 0, jump to code line 42.
      @i         // Point to var i; its RAM address is handled by the assembler
      M=A        // Set the value of i to its own address
      A=A+1      // Point to the RAM address just after i
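To make the format concrete, the sketch below encodes a few of these examples
into 16-bit words. It is a simplified illustration: `encode_c_instruction`,
`COMP` and `JUMP` are hypothetical names, the tables only cover the codes used
here, and the layout assumed is `111`, the A/M switch, six op-code bits, three
destination bits (A, D, M) and three jump bits.

```python
# Subsets of the code tables, just enough for the examples below.
COMP = {'0': '101010', 'A': '110000', 'D-A': '010011', 'A+1': '110111'}
JUMP = {'': '000', 'JEQ': '010', 'JMP': '111'}


def encode_c_instruction(instruction):
    # Hypothetical helper: split "dest=op;jump" into its three parts.
    dest, _, rest = instruction.rpartition('=')
    op_code, _, jump = rest.partition(';')
    # The A/M switch is 1 when the operation reads M instead of A
    switch = '1' if 'M' in op_code else '0'
    op_bits = COMP[op_code.replace('M', 'A')]
    dest_bits = ''.join('1' if reg in dest else '0' for reg in 'ADM')
    return '111' + switch + op_bits + dest_bits + JUMP[jump]


for instruction in ['0;JMP', 'D=D-A;JEQ', 'M=A', 'A=A+1']:
    print(instruction, encode_c_instruction(instruction))
```

Both the destination and the jump part are optional, which is why the helper
falls back to an empty string for each of them.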

## Labels
`"(" LABEL_NAME ")"`
When performing a jump, the target line of code should be put in the
A-register. Setting the line number directly with a `@integer` instruction
is delicate since one has to figure out the line number while ignoring
comments, blank lines, etc. All the addresses also have to be updated if the
beginning of the assembly code is edited afterward.
So the assembly language lets you mark lines with a label using the `(LABEL)`
syntax. The assembler will then automatically adjust any `@LABEL` instruction
to match the desired code line at assembly time.
Example:
    // This code runs a loop 42 times and then stops in an infinite empty loop
    00 @MAIN       // @2
    01 0;JMP
    (MAIN)
    02 @42         // Set D to 42
    03 D=A
    04 @DECREMENT  // @6
    05 0;JMP
    (DECREMENT)
    06 D=D-1       // Decrement D
    07 @END        // @11
    08 D;JEQ       // Go there if D==0
    09 @DECREMENT  // Or continue the loop
    10 0;JMP
    (END)
    11 @END        // Infinite loop to end the program
    12 0;JMP
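The addresses shown in the comments of this example can be recomputed with a
short sketch. The helper `label_addresses` is hypothetical and simplified; it
only shows that a label takes the number of the next real instruction and
does not consume a line number itself.

```python
def label_addresses(lines):
    # Hypothetical helper: map each "(LABEL)" to the number of the
    # instruction that follows it.
    addresses = {}
    counter = 0
    for line in lines:
        if line.startswith('(') and line.endswith(')'):
            addresses[line[1:-1]] = counter
        else:
            counter += 1
    return addresses


program = ['@MAIN', '0;JMP', '(MAIN)', '@42', 'D=A', '@DECREMENT', '0;JMP',
           '(DECREMENT)', 'D=D-1', '@END', 'D;JEQ', '@DECREMENT', '0;JMP',
           '(END)', '@END', '0;JMP']
print(label_addresses(program))
# prints: {'MAIN': 2, 'DECREMENT': 6, 'END': 11}
```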
"""

import sys
from pathlib import Path
import re
#import pdb
# Third-party debugging helper; only needed for the commented-out ic() calls.
#from icecream import ic

##############
# File parsing
##############


def load_asm_file():
    """Read the asm file and preprocess it

    The path to the file is read from `sys.argv[1]`.
    Preprocessing includes:
    - Remove carriage returns and split file on new lines
    - Remove comments and blanks in the lines
    - Remove empty lines
    """
    def read_lines():
        """Read the file given as script argument and split on new lines

        Carriage returns are removed.
        """
        #ic(Path(sys.argv[1]).expanduser().read_text().replace(
        #    '\r', '').split('\n'))
        return Path(sys.argv[1]).expanduser().read_text().replace(
            '\r', '').split('\n')

    def filter_comment_and_blank_in_lines(lines):
        """Remove blanks and trailing comments from each line

        Anything inside a line, after a "//" sequence is a comment.
        """
        def filter_comment_and_blank_in_line(l):
            #ic(re.sub(r'\s', '', l).split('//')[0])
            # Raw string: '\s' is a regex class, not a string escape
            return re.sub(r'\s', '', l).split('//')[0]
        #ic([filter_comment_and_blank_in_line(l) for l in lines])
        return [filter_comment_and_blank_in_line(l) for l in lines]

    def remove_empty_lines(file):
        #ic("remove_empty_lines ",[ l for l in file if len(l) > 0])
        return [l for l in file if len(l) > 0]

    #ic("before remove empty lines")

    #ic(remove_empty_lines(filter_comment_and_blank_in_lines(read_lines())))

    return remove_empty_lines(
        filter_comment_and_blank_in_lines(
            read_lines()))


def is_label(line):
    """Recognise "label" declarations

    A label is a line in the form `"(" LABEL_NAME ")"`
    """
    #ic(line)
    #ic(line.startswith('(') and line.endswith(')'))
    return line.startswith('(') and line.endswith(')')


def extract_label_name(label_declaration):
    """Extract the label name from a label instruction"""
    #ic("extract_label_name ", label_declaration.strip('()'))
    return label_declaration.strip('()')


def is_a_instruction(line):
    """Recognise "A-instructions"

    An A-instruction starts with "@"
    """
    #ic("is_a_instruction ",line.startswith('@'))
    return line.startswith('@')


##############
# Symbol table
##############


def default_symbol_table():
    """Construct a symbol table containing the pre-defined variables

    Those variables are:
    - SP: VM stack-pointer, RAM[0]
    - LCL: VM local variable pointer, RAM[1]
    - ARG: VM function argument pointer, RAM[2]
    - THIS: VM object pointer, RAM[3]
    - THAT: VM array pointer, RAM[4]
    - SCREEN: base address for the screen memory-map, RAM[0x4000]
    - KBD: address of the keyboard memory-map, RAM[0x6000]
    - R0 -> R15: Shortcuts for the first 16 RAM locations
    """
    return {
        key: (value)
        for key, value in {
            **{'@SP': 0,
               '@LCL': 1,
               '@ARG': 2,
               '@THIS': 3,
               '@THAT': 4,
               '@SCREEN': 0x4000,
               '@KBD': 0x6000,},
            **{f'@R{i}': i
               for i in range(16)}
        }.items()}


def inc_p_c(line, program_counter):
    """Increment `program_counter` if `line` is an instruction"""
    #ic("inc_p_c ", line, program_counter )
    if is_label(line):
        return program_counter
    return program_counter + 1


def insert_into(symbol_table, label, value):
    """Return a copy of `symbol_table` with the `label: value` pair added"""
    #ic("insert_into ",{**symbol_table, **{label: value}})
    return {**symbol_table,
            **{label: value}}


def add_label(label, symbol_table, program_counter):
    """Add a label to the symbol table"""
    # Keys are stored with the "@" prefix, so the duplicate check needs it too
    if '@' + label in symbol_table:
        raise ValueError(f'Duplicate attempt at '
                         f'declaring label {label} '
                         f'before line {program_counter+1}')
    #ic("0--- ", symbol_table, '@' + label, program_counter)
    return insert_into(symbol_table, '@' + label, program_counter)


def find_and_add_labels(line, program_counter, symbol_table):
    """Look for a label declaration in `line` and add it to the symbol table"""
    #ic("1 --- ", line, program_counter, symbol_table)
    if is_label(line):
        return add_label(extract_label_name(line), symbol_table,
                         program_counter)
    return symbol_table


def add_user_labels(asm_lines, program_counter, symbol_table):
    """Add the user-defined labels in `asm_lines` to `symbol_table`"""
    #ic("2 --- ", asm_lines,program_counter,symbol_table)
    for line in asm_lines:
        symbol_table = find_and_add_labels(line, program_counter,
                                           symbol_table)
        program_counter = inc_p_c(line, program_counter)
    return symbol_table


def is_int(string):
    """Test if string is an int"""
    try:
        int(string)
        return True
    except ValueError:
        return False


def find_and_add_variables(line, variable_counter, symbol_table):
    """Recognise if line declares a new variable and add it to `symbol_table`

    This function assumes that labels have already been added to the
    symbol-table. So any `@var` instruction where `var` is not in `symbol_table`
    is a new variable.
    """
    if not is_a_instruction(line):
        return variable_counter, symbol_table
    if line in symbol_table:
        return variable_counter, symbol_table
    if is_int(line[1:]):
        return variable_counter, insert_into(symbol_table, line,
                                             int(line[1:]))
    return variable_counter+1, insert_into(symbol_table, line,
                                           variable_counter)


def add_user_variables(asm_lines, variable_counter, symbol_table):
    """Add the user-defined variables to the symbol_table

    This function assumes that labels have already been added to `symbol_table`.
    """
    for line in asm_lines:
        variable_counter, symbol_table = find_and_add_variables(
            line, variable_counter, symbol_table)
    return symbol_table


############
# Assembling
############


def int_to_binary(integer, bits=15):
    """Convert an integer to its binary representation, as a string

    The result is `bits + 1` characters long (16 with the default value),
    most significant bit first.
    """
    if bits < 0:
        return ''
    high_bit_value = 2**bits
    return (
        '1' if integer >= high_bit_value else '0'
    ) + int_to_binary(integer % high_bit_value, bits-1)


def get_dest(c_instruction):
    """Return the destination part of a c-instruction

    C-instruction format:
    `(Dest-code "=")? op-code (";" jump-code)?`
    """
    return c_instruction.split('=')[0] if '=' in c_instruction else ''


def assemble_dest(dest):
    """Convert an assembly destination to its binary counterpart

    The legal assembly destinations are: A, D, M, AD, AM, MD, ADM.
    The binary representation of the destination is:
        X X X
        ^ ^ ^
        | | Write to M
        | | ----------
        | Write to D
        | ----------
        Write to A
    """
    if dest not in ['', "A", "D", "M", "AD", "AM", "MD", "ADM"]:
        raise ValueError(f"Unrecognised c-instruction destination: '{dest}'")
    return (
        ('1' if 'A' in dest else '0') +
        ('1' if 'D' in dest else '0') +
        ('1' if 'M' in dest else '0'))


def get_jump(c_instruction):
    """Return the jump part of a c-instruction

    C-instruction format:
    `(Dest-code "=")? op-code (";" jump-code)?`
    """
    return c_instruction.split(';')[1] if ';' in c_instruction else ''


def assemble_jump(jump):
    """Convert an assembly jump code to its binary counterpart

    The legal jump codes are: JMP, JEQ, JLT, JLE, JGT, JGE, JNE
    The binary representation of the jump is:
        X X X
        ^ ^ ^
        | | Jump if output is greater than 0
        | | --------------------------------
        | Jump if output is equal to 0
        | ----------------------------
        Jump if output is lower than 0
    """
    if len(jump) == 0:
        return '000'
    if jump not in ["JMP", "JEQ", "JLT", "JLE", "JGT", "JGE", "JNE"]:
        raise ValueError(f"Unrecognized jump instruction: {jump}")
    if jump == 'JMP':
        return '111'
    # JNE means "lower or greater", so it sets the outer bits but not the
    # middle one, even though its name contains an "E".
    return (
        str(1 * ('L' in jump or jump == 'JNE')) +
        str(1 * ('E' in jump and jump != 'JNE')) +
        str(1 * ('G' in jump or jump == 'JNE'))
    )


def get_op_code(c_instruction):
    """Return the op-code part of a c-instruction

    C-instruction format:
    `(Dest-code "=")? op-code (";" jump-code)?`
    """
    if '=' in c_instruction:
        return c_instruction.split('=')[1].split(';')[0]
    return c_instruction.split(';')[0]


def assemble_op_code_no_M(op_code):
    """Convert an assembly op code to its binary counterpart

    Note that this method assumes that the A/M switch is made. It
    will thus only recognise operations on the A register. Any "M" has to be
    replaced with "A".
    The legal op-codes are: 0, 1, -1, D, A, !D, !A, -D, -A, D+1,
    A+1, D-1, A-1, D+A, D-A, A-D, D&A, D|A.
    The binary representation of the op-code is:
        X X X X X X
        ^ ^ ^ ^ ^ ^
        | | | | | Flip the output bits
        | | | | | --------------------
        | | | | operation switch (0->`AND`, 1->`+`)
        | | | | -----------------------------------
        | | | flip the A/M input's bits
        | | | -------------------------
        | | Zero the A/M input
        | | ------------------
        | Flip the D input's bits
        | -----------------------
        Zero the D input
    """
    #ic(op_code)
    codes = {
        '0': '101010',
        '1': '111111',
        '-1': '111010',
        'D': '001100',
        'A': '110000',
        '!D': '001101',
        '!A': '110001',
        '-D': '001111',
        '-A': '110011',
        'D+1': '011111',
        'A+1': '110111',
        'D-1': '001110',
        'A-1': '110010',
        'D+A': '000010',
        'D-A': '010011',
        'A-D': '000111',
        'D&A': '000000',
        'D|A': '010101',
    }
    if op_code not in codes:
        raise ValueError(f'Unrecognized op code: {op_code}')
    return codes[op_code]


def assemble_op_code(op_code):
    """Assemble the A/M switch and the op-code"""
    #ic("assemble_op_code ", op_code)
    return ('1' if 'M' in op_code else '0') + \
        assemble_op_code_no_M(op_code.replace('M', 'A'))


def assemble_c_instruction(c_instruction):
    """Assemble a c-instruction

    The binary representation of a c-instruction is
        111 a.ffff.ff dd.d jjj
        ^^^ ^ ^^^^ ^^ ^^ ^ ^^^
        ||| | |||| || || | jump instruction
        ||| | |||| || || | ----------------
        ||| | |||| || Destination instruction
        ||| | |||| || -----------------------
        ||| | Operation instruction
        ||| | ---------------------
        ||| A/M switch (0->A, 1->M)
        ||| -----------------------
        c-instruction marker
    """
    #ic("assemble_c_instruction ", c_instruction)
    return (
        '111' +
        assemble_op_code(get_op_code(c_instruction)) +
        assemble_dest(get_dest(c_instruction)) +
        assemble_jump(get_jump(c_instruction)))


def assemble_line(line, symbol_table):
    """Recognize if a line is a label, A- or C-instruction and assemble it"""
    if is_label(line):
        return ''
    if is_a_instruction(line):
        return int_to_binary(symbol_table[line]) + '\n'
    return assemble_c_instruction(line) + '\n'


def assemble_lines(asm_lines, symbol_table):
    """Assemble every line and concatenate the resulting binary words"""
    return ''.join([
        assemble_line(line, symbol_table)
        for line in asm_lines
    ])


asm_file = load_asm_file()

print(assemble_lines(asm_file,
                     add_user_variables(asm_file, 16,
                                        add_user_labels(asm_file, 0,
                                                        default_symbol_table()))),
      end='')


#print(default_symbol_table())