Assignment 3

Due: 1:25pm, Wed Sept 21st, 2022

Note: Make reasonable assumptions where necessary and clearly state them. Feel free to discuss problems with classmates, but the only written materials that you may consult while writing your solutions are the textbook and lecture slides/videos. Solutions should be uploaded as a single PDF file on Canvas. Show your solution steps so you receive partial credit for incorrect answers and we know you have understood the material. Don't just show us the final answer.

Every homework has an automatic penalty-free 1.5 day extension to accommodate any COVID/family-related disruptions. In other words, try to finish your homework by Wednesday 1:25pm to keep up with the lecture content, but if necessary, you may take until Thursday 11:59pm.

  1. Data Dependences (30 points)

    Consider a 32-bit in-order pipeline that has the following stages. Note the many differences from the examples in class: a stage that converts CISC instructions to micro-ops, one stage to do register reads, one stage to do register writes, four stages to access the data memory, and three stages for the FP-ALU. For the questions below, assume that each CISC instruction is simple and is converted to a single micro-op.

    Common front end:  Fetch -> uOp Convert -> Decode -> Regread
    Int-add path:      IntALU -> Regwrite
    Load/store path:   IntALU -> Datamem1 -> Datamem2 -> Datamem3 -> Datamem4 -> Regwrite
    FP-add path:       FPALU1 -> FPALU2 -> FPALU3 -> Regwrite

    After instruction fetch, the instruction goes through the micro-op conversion stage, a Decode stage where dependences are analyzed, and a Regread stage where input operands are read from the register file. After this, an instruction takes one of three possible paths. Int-adds go through the stages labeled "IntALU" and "Regwrite". Loads/stores go through the stages labeled "IntALU", "Datamem1", "Datamem2", "Datamem3", "Datamem4", and "Regwrite". FP-adds go through the stages labeled "FPALU1", "FPALU2", "FPALU3", and "Regwrite". Assume that the register file has an infinite number of write ports so stalls are never introduced because of structural hazards. How many stall cycles are introduced between the following pairs of successive instructions (i) for a processor with no register bypassing and (ii) for a processor with full bypassing?

    1. Int-add, followed by a dependent Int-add
    2. Load, followed by a dependent FP-add
    3. Load, providing the address operand for a store
    4. FP-add, providing the data operand for a store
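One way to organize the stall counting is sketched below in Python. It is only an illustration on a hypothetical 6-stage pipeline, not the pipeline in this problem, and the helper name `stall_cycles` is made up for this sketch. It assumes one stage per cycle, that the consumer is fetched one cycle behind the producer, that a value becomes available at the end of the producing stage, and that it is needed at the start of the consuming stage. In particular, it assumes a register written in Regwrite cannot be read in Regread in that same cycle. State your own assumptions if they differ.

```python
def stall_cycles(producer_path, produce_stage, consumer_path, consume_stage):
    """Stalls between a producer and the instruction right behind it.

    Assumption: the producer occupies stage index p during cycle p, and the
    consumer (fetched one cycle later) would occupy stage index c during
    cycle c + 1 if it never stalled. The value exists at the end of cycle p
    and is needed at the start of the consuming stage, so the consumer must
    be delayed by max(0, p - c) cycles.
    """
    p = producer_path.index(produce_stage)
    c = consumer_path.index(consume_stage)
    return max(0, p - c)


# Hypothetical paths, chosen only for illustration.
LOAD_PATH = ["Fetch", "Decode", "Regread", "ALU", "Mem", "Regwrite"]
ALU_PATH = ["Fetch", "Decode", "Regread", "ALU", "Regwrite"]

# No bypassing: the value is produced by Regwrite and consumed by Regread.
no_bypass = stall_cycles(LOAD_PATH, "Regwrite", ALU_PATH, "Regread")

# Full bypassing: the value leaves Mem and is forwarded to the ALU input.
full_bypass = stall_cycles(LOAD_PATH, "Mem", ALU_PATH, "ALU")

print(no_bypass, full_bypass)
```

For each pair in the list above, the work is in deciding which stage produces the value and which stage first needs it. The subtraction itself is the easy part.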

  2. Branch delay slot and stalls (30 points)

    Consider the following skeletal code segment, where the branch is taken 90% of the time and not-taken 10% of the time.

    Consider a 10-stage in-order processor, where the instruction is fetched in the first stage, and the branch outcome is known after three stages. Estimate the average CPI of the processor under the following scenarios (assume that all stalls in the processor are branch-related and branches account for 15% of all executed instructions):

    1. On every branch, fetch is stalled until the branch outcome is known.
    2. Every branch is predicted not-taken and the mis-fetched instructions are squashed if the branch is taken.
    3. The processor has two delay slots and the two instructions following the branch are always fetched and executed, and
      1. You are unable to find any instructions to fill the delay slots.
      2. You are able to move two instructions before the branch into the delay slots.
      3. You are able to move two instructions from the taken block into the delay slots.
      4. You are able to move two instructions from the not-taken block into the delay slots.
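The scenarios above all reduce to estimating how many cycles are wasted per branch on average. A minimal Python sketch of that bookkeeping follows, using made-up numbers (20% branches, 60% taken, a 3-cycle penalty) rather than the values in this problem. The function name `average_cpi` and the constants are hypothetical, and the sketch assumes a base CPI of 1 with no other stalls.

```python
def average_cpi(branch_fraction, wasted_cycles_per_branch, base_cpi=1.0):
    """Average CPI when the only stalls are branch-related.

    wasted_cycles_per_branch is the EXPECTED number of lost cycles per
    branch, already weighted by how often the loss actually happens.
    """
    return base_cpi + branch_fraction * wasted_cycles_per_branch


# Illustrative values only; these are not the numbers from the assignment.
BRANCH_FRACTION = 0.2
TAKEN_PROB = 0.6
PENALTY = 3  # cycles lost whenever fetch has to wait or be squashed

# Stall on every branch: the full penalty is paid every time.
cpi_stall_always = average_cpi(BRANCH_FRACTION, PENALTY)

# Predict not-taken: the penalty is paid only when the branch is taken.
cpi_predict_not_taken = average_cpi(BRANCH_FRACTION, TAKEN_PROB * PENALTY)

print(round(cpi_stall_always, 3), round(cpi_predict_not_taken, 3))
```

The delay-slot cases fit the same shape once you decide, for each case, how many slot instructions do useful work on the taken and not-taken paths.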

  3. Deep Pipelines (40 points)

    Consider an unpipelined processor where it takes 8ns to go through the circuits and 0.2ns for the latch overhead. Assume that the Point of Production and Point of Consumption in the unpipelined processor are separated by 4ns. Assume that one-third of all instructions do not introduce a data hazard and two-thirds of all instructions depend on their preceding instruction. What is the throughput of the processor (in BIPS) for (i) an unpipelined processor, (ii) a 10-stage pipeline, and (iii) a 20-stage pipeline?
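The throughput calculation can be sketched in Python as below, again with illustrative numbers (12ns of circuit delay, 0.1ns latch overhead, a 6ns production-to-consumption gap, half of the instructions dependent) instead of the ones in this problem. The function name `throughput_bips` is hypothetical. The sketch assumes the circuit delay is split evenly across stages, that the gap is rounded up to a whole number of stages, and that a dependent instruction stalls for one cycle fewer than that number of stages.

```python
import math


def throughput_bips(circuit_ns, latch_ns, gap_ns, dependent_fraction, stages):
    """Throughput in billions of instructions per second (instructions/ns)."""
    # Each stage holds an equal slice of the circuit delay plus one latch.
    cycle_ns = circuit_ns / stages + latch_ns
    # Number of stages spanned by the production-to-consumption gap.
    # round() guards against floating-point noise before taking the ceiling.
    gap_stages = math.ceil(round(gap_ns * stages / circuit_ns, 9))
    # Assumed model: a back-to-back dependent instruction waits gap_stages - 1.
    stalls = max(0, gap_stages - 1)
    cpi = 1.0 + dependent_fraction * stalls
    return 1.0 / (cpi * cycle_ns)


# Illustrative values only; these are not the numbers from the assignment.
unpipelined = throughput_bips(12.0, 0.1, 6.0, 0.5, stages=1)
four_stage = throughput_bips(12.0, 0.1, 6.0, 0.5, stages=4)

print(round(unpipelined, 4), round(four_stage, 4))
```

Deeper pipelines shorten the cycle time but lengthen the stall per dependent instruction, and the latch overhead stays constant per stage, so throughput does not scale linearly with depth.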