Fast USB with FTDI’s FT232H

I recently implemented fast (30-40 MB/s) USB communication between an FPGA board (Digilent’s CMOD A7, based on a Xilinx Artix-7 FPGA) and a host PC. This uses the FT245 synchronous FIFO mode of an FTDI FT232H or similar. I’m using Verilog (technically SystemVerilog, but for this I’m using very little syntax that is not also valid Verilog), but most of this post is language agnostic.

Looking online while I was implementing this, I found plenty of comments from people who got things working, but also quite a few who ran into problems (often variations on ‘sometimes a byte goes missing’), and very little in the way of concrete advice from the former as to how the latter might actually fix things. Having had some minor stumbling blocks along those lines myself, I wanted to post this summary of my own experience, in the hope that it helps someone else.

For my C64 devkit project, I wanted fast I/O between an FPGA and a host PC, for which USB seemed like the logical choice.

I was originally hoping to use the CMOD A7’s onboard FTDI chip, but it’s wired in such a way as to restrict usage to serial communication only, which severely limits the maximum speed. While I’m sure other USB interface chips are available, I couldn’t find many alternatives that offer simple and fast parallel interfaces. As I didn’t want to implement USB myself in HDL, I decided to add a second FTDI chip on some spare GPIO pins.

In this project I used an FT232H, which is the single port version of the FT2232H used on the CMOD itself. The information below should apply to any chip that implements this protocol, but please compare against the datasheet for the chip in question – I noticed at least some minor variance in the timing constraints between them.

A lot of what I’m going to say in this post is pretty much just a translation of the datasheets and/or application notes provided by FTDI. I found these to be simultaneously very helpful and incredibly frustrating. They do, on the whole, contain everything you need to know, including some important fine details. They are also very terse, leaving a lot to be inferred from careful study when a simple extra sentence or example would be far clearer. This can result in simple misunderstandings, which in turn can lead to exactly the kind of bugs that I and seemingly others have encountered.

There are a variety of protocols supported by the FT232H – some serial, and some parallel. In most of these modes the device controlling the FT232H clocks data in and out of it, which in turn limits the maximum speed as the FT232H buffers and reclocks the data to the speed of the USB interface. For maximum bandwidth FTDI offer a synchronous transfer mode, where the FT232H itself provides the clock signal, and the external device must clock data with reference to that. This clock runs at 60MHz, which at 8 bits per clock gives 480Mb/s (60MB/s) – unsurprisingly, this is the maximum signalling rate of a USB 2.0 high-speed connection.

Obviously some overheads will still be present – data is sent in packets with some synchronisation in between, and USB is not full duplex, so when both sending and receiving there is a small amount of turnaround time. Despite this we can still see speeds of around 40MB/s for a large transfer, significantly faster than is possible with any other mode on offer.

General Tips

Before anything else, let me give some simple advice about things to do and not do before actually trying to write and debug the HDL/software. This stuff might seem obvious, but people sometimes ignore simple and obvious things, and that can come back and bite them. And by people I mean me, because I have been guilty of all these things at some point.

Firstly, the datasheets for the FTDI chips give reference schematics for wiring them up. There are multiple schematics for different use cases, and there are important notes included on them, such as the one that specifies an inline resistor near to the FTDI chip on the clock pin when using FT245 synchronous mode – you’d be forgiven for missing it, as it’s fairly well hidden. If you’re using an off-the-shelf FT232H module, check whether it implements these details correctly. I’ve certainly had one module, from a fairly well known designer/seller of this sort of thing, in which they had clearly dropped anything that seemed redundant to them in order to fit it onto a smaller PCB. This module did not work well in FT245 mode for me, so those components are probably not redundant after all.

Secondly, set up constraints and tests in your HDL development environment of choice. You can get pretty far without doing that, but it will potentially save a lot of time if you just get things set up properly at the start. Iterating on HDL is painfully slow and you don’t want to be chasing a bug that your tools could have simply told you about had they known what the timing for a given signal was, or that you could’ve seen if you had a simulation that properly tried all the weird edge case situations that only happen when you think you’re finished and want to demo the thing.

In the case of the FT232H I couldn’t find any readily available simulation code, so I wrote my own. To ensure I could correctly handle running out of data/buffer-space I gave the testbench FT232H very small limits on how many bytes could be sequentially transferred before it reported that it was out of space. I also put functions into my testbench to fill up my own FIFOs so that I could test scenarios where data was piled up there. Annoyingly, at least on my version of the Xilinx tools, while you can see inside an inferred block-ram in the simulator, you cannot see the contents of an instantiated FIFO. For the purposes of testing I either just had the testbench clock all the data back out, or in some cases I just wired it up to use both ends of a single FIFO such that data arriving from USB would be sent back again verbatim. This latter approach is also synthesisable and yields a USB loopback device – handy for stress-testing the actual hardware without other modules.

Thirdly, watch out for anything that might interfere electrically. At these rates things are a bit less forgiving of noise and jitter than typical microcontroller I/O. I just made sure my board had dedicated power regulation and decoupling, and that the signals were routed as directly as I could. An earlier iteration where I paid less attention to this got pretty glitchy when an entirely unrelated circuit was active. Also not all USB cables are created equal – quite a few seem oriented towards just being ‘charging’ cables where the data lines are connected only just well enough to serve as ID for negotiating the charging current. Get a proper cable that is certified for high speed transfer. If you’re having reliability issues, swap cables and see if it helps. The CMOD itself seems pretty intolerant of poor quality cabling so I was reasonably sure that if a cable worked with that, it was OK.

Implementation

FT245 uses the most control signals of any mode offered by the FT232H chip, but is still fairly simple in principle. The IO pins we need to deal with are as follows (again, see data sheets for canonical names, pin numbers, etc.):

D0 – D7 (In / Out): The 8 data lines, used for both input and output, and thus tri-state.
CLK (Output): The aforementioned clock signal, a 60MHz clock output by the FT232H – all other signals are sampled with reference to the rising edge of this clock.
RD (Input): When low, this signal tells the FT232H that data should be clocked out (i.e. from the USB to the device).
WR (Input): When low, this signal tells the FT232H that data should be clocked in (i.e. from the device to the USB).
OE (Input): This signal tells the FT232H whether the data pins are inputs or outputs (i.e. whether to output data on them when it is low, or leave them tri-stated so they can be inputs when it is high).
TXE (Output): This could be read as ‘transfer buffer empty’ but its actual meaning is a bit more nuanced. The FT232H will only accept data for sending over USB when this signal is low.
RXF (Output): This could be read as ‘receive buffer full’ but it really means that the buffer isn’t empty – in other words, when it is low, at least one byte of data has arrived over USB and can be clocked out.
SIWU (Input): This signal tells the FT232H to send data immediately and not wait for a full buffer or a timeout.

The FT245 protocol is not particularly complicated, but it is somewhat unforgiving – if you make a mistake, bytes will go missing or get sent twice. Note that most signals are inputs – the FT232H expects you to tell it what to do. It handles the USB protocol side of things and provides some FIFOs to keep data flowing, but otherwise it’s dumb as a brick. Chances are that if data isn’t being transferred correctly, either you have a hardware issue (check the stuff I already mentioned), or a bug in your implementation.

While connecting the GPIO pins to the FPGA is generally fairly straightforward, pay special attention to the clock pin. This needs to be brought into a pin on the FPGA which can feed it into a proper clock management unit, and be distributed to other logic with the correct timing. It is not sufficient to just bring the clock in and buffer it, even if it is wired to a clock-capable pin. If you do that, your logic will be clocked at 60MHz, but with a delay relative to the source clock which will likely result in signals changing at entirely the wrong times. Using a clock management unit, the FPGA can regenerate the clock with an appropriate offset, which will result in internal logic clocking correctly at the edge of the external clock as intended. Having your constraints set up correctly will catch issues here.

For the SIWU signal, unless you really really need low latency transfers, just keep it high. I connected it to the FPGA and drove it high there, but you could just pull it up externally.

Given the rather rigid nature of the low level interface it makes sense to implement an FTDI module that communicates with the rest of our system via two FIFOs – one for data coming from USB, and one for data being sent to USB. I implemented a fairly simple state machine that will wait in an idle state until either data is waiting to be received or sent at which point it will go into either a read or write state as appropriate. Both read and write state burst transfer as much data as they can (with a throughput of one byte per cycle) before returning to idle. Read and write states are split into a number of sub-states, in order to pipeline the data.

Receiving was the easier case for me. The FT232H signals that data is available by bringing RXF low. The module responds by bringing OE low (enabling the FTDI to output data) on the next clock edge, and then bringing RD low on the one after that. On every subsequent clock edge data will be available on the data pins until the FT232H has run out. When data runs out, RXF will go high again. So from the module’s perspective it should only clock data into the ‘input’ FIFO when RXF is low, and when it is high, go back to an idle state (which means bringing both OE and RD high again).
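To make that sequence concrete, here is a minimal simulation-only sketch of the FT232H's side of a read. It is written from the description above rather than from FTDI's documentation, so treat it as an illustration under stated assumptions: the module name ft245_rx_model and the BURST and GAP parameters are hypothetical, and datasheet setup, hold and clock-to-output delays are not modelled. It offers an incrementing byte pattern, drives the bus only while OE is low, moves on to the next byte on each rising edge where RD is low, and raises RXF after BURST bytes to imitate running out of data.

```systemverilog
// Toy model of the FT232H's USB-to-device channel (simulation only).
// Hypothetical names and parameters; datasheet timing is NOT modelled.
module ft245_rx_model #(
    parameter int BURST = 3,  // bytes offered before RXF goes high
    parameter int GAP   = 4   // cycles RXF then stays high (>= 1)
)(
    input  logic       clk,
    input  logic       oe_n,   // low: model drives the data bus
    input  logic       rd_n,   // low: current byte is consumed on each rising edge
    output wire        rxf_n,  // low: at least one byte is available
    output wire  [7:0] data
);
    logic [7:0] next_byte = 8'h00;  // incrementing test pattern
    logic       rxf_q     = 1'b0;
    int         left      = BURST;  // bytes remaining in the current burst
    int         wait_cnt  = 0;

    assign rxf_n = rxf_q;
    assign data  = oe_n ? 8'hzz : next_byte;

    always @(posedge clk) begin
        if (!rxf_q) begin
            if (!oe_n && !rd_n) begin
                // The byte on the bus has been taken; present the next one.
                next_byte <= next_byte + 8'd1;
                left      <= left - 1;
                if (left == 1) begin
                    // That was the last byte of the burst: report "no data".
                    rxf_q    <= 1'b1;
                    wait_cnt <= GAP;
                end
            end
        end else if (wait_cnt > 1) begin
            wait_cnt <= wait_cnt - 1;
        end else begin
            // More data has "arrived over USB": start a new burst.
            rxf_q <= 1'b0;
            left  <= BURST;
        end
    end
endmodule
```

Keeping BURST down to two or three bytes means the "RXF went high mid-burst" path is exercised every few cycles in simulation, which is the case most likely to hide a dropped or duplicated byte. Because the pattern simply increments, any gap or repeat in what the receiving logic captures is easy to spot.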

The only slightly tricky part to this is handling the situation of our own FIFO becoming full. Fortunately we have an ‘almost full’ signal, so we can still write to the FIFO when this happens, provided we don’t keep writing for a number of cycles. We only need to write the current byte while we deassert the FT232H’s RD and OE signals and go back to idle.
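The margin that 'almost full' provides can be illustrated with a small behavioural FIFO for simulation. This is a sketch and not the Xilinx primitive: sim_fifo, DEPTH and MARGIN are hypothetical names, it is single-clock for simplicity where the real design uses dual-clock FIFOs, and the MARGIN of four entries is an assumed value chosen to comfortably cover the byte or two still in flight when the flag is first seen. Reads are registered, so read data appears on dout in the cycle after rd_en is asserted.

```systemverilog
// Behavioural single-clock FIFO for simulation (hypothetical, not a Xilinx model).
module sim_fifo #(
    parameter int DEPTH  = 16,
    parameter int MARGIN = 4    // almost_full asserts with MARGIN entries still free
)(
    input  logic       clk,
    input  logic       wr_en,
    input  logic [7:0] din,
    input  logic       rd_en,
    output logic [7:0] dout,
    output wire        empty,
    output wire        almost_full,
    output wire        full
);
    logic [7:0] mem [0:DEPTH-1];
    int wr_ptr = 0;
    int rd_ptr = 0;
    int count  = 0;

    assign empty       = (count == 0);
    assign full        = (count == DEPTH);
    assign almost_full = (count >= DEPTH - MARGIN);

    // Writes to a full FIFO and reads from an empty one are ignored.
    wire do_wr = wr_en && !full;
    wire do_rd = rd_en && !empty;

    always @(posedge clk) begin
        if (do_wr) begin
            mem[wr_ptr] <= din;
            wr_ptr      <= (wr_ptr + 1) % DEPTH;
        end
        if (do_rd) begin
            dout   <= mem[rd_ptr];  // visible in the cycle after rd_en
            rd_ptr <= (rd_ptr + 1) % DEPTH;
        end
        count <= count + (do_wr ? 1 : 0) - (do_rd ? 1 : 0);
    end
endmodule
```

With DEPTH = 16 and MARGIN = 4, almost_full rises once twelve bytes are stored, leaving four free slots for any writes that were already committed before the receiving logic noticed the flag.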

Writing is a bit more involved. In principle it looks simple – we wait for the FTDI to bring TXE low, signalling that it can accept data, and then we clock data into it by putting the data onto the data pins and bringing WR low on the same cycle. This is where I suspect some people come unstuck – the datasheet’s wording for how TXE works is at best a little terse, and at worst misleading. The reality is that the signal tells you whether there was room in the buffer on the previous cycle. So if you try to write data, and then on the next clock edge see that TXE has gone high, that data did not get written! A strange edge case I also saw in testing is that TXE being low while you are not writing doesn’t mean it will stay low. So even the first byte of data might go missing, even if you waited for TXE to be low before starting a write operation.

With that in mind, the resulting logic for writing isn’t all that complex at first glance. Once we put data out and bring WR low, we check every cycle that TXE is low, and only advance to outputting the next byte when it is. It is safe to leave data on the bus and hold WR low even if TXE is high – as soon as it can, the chip will accept the data and bring TXE low again, and it won’t drive the bus itself because we are holding OE high unless we’re trying to read.
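The acceptance rule described above can be captured in a small simulation-only model of the FT232H's write side. This is a sketch based on the behaviour described in this post, not on FTDI's timing diagrams: ft245_tx_model, BURST, GAP, last_byte and accepted are hypothetical names, datasheet timing is not modelled, and neither is the edge case where TXE rises while nothing is being written. A byte counts as written only on a rising edge where WR is low and TXE was still low; anything presented while TXE is high is ignored.

```systemverilog
// Toy model of the FT232H's device-to-USB channel (simulation only).
// Hypothetical names and parameters; datasheet timing is NOT modelled.
module ft245_tx_model #(
    parameter int BURST = 3,  // bytes accepted before TXE goes high
    parameter int GAP   = 4   // cycles TXE then stays high (>= 1)
)(
    input  logic        clk,
    input  logic        wr_n,      // low: device is presenting a byte
    input  logic [7:0]  data,
    output wire         txe_n,     // low: there was room on the previous cycle
    output logic [7:0]  last_byte, // most recently accepted byte
    output logic [31:0] accepted   // number of bytes accepted so far
);
    logic txe_q    = 1'b0;
    int   room     = BURST;  // free slots left in the current burst
    int   wait_cnt = 0;

    assign txe_n = txe_q;

    initial begin
        last_byte = 8'h00;
        accepted  = 32'd0;
    end

    always @(posedge clk) begin
        if (!txe_q) begin
            if (!wr_n) begin
                // Room was available, so this byte really is written.
                last_byte <= data;
                accepted  <= accepted + 32'd1;
                room      <= room - 1;
                if (room == 1) begin
                    // Buffer now full: anything presented next is ignored.
                    txe_q    <= 1'b1;
                    wait_cnt <= GAP;
                end
            end
        end else if (wait_cnt > 1) begin
            wait_cnt <= wait_cnt - 1;
        end else begin
            // Space has freed up again.
            txe_q <= 1'b0;
            room  <= BURST;
        end
    end
endmodule
```

A writer that advances to its next byte on every cycle regardless of TXE will lose a byte against this model each time the buffer fills, whereas one that holds the current byte until it sees TXE low will not. Sending an incrementing pattern makes the difference visible: with a correct writer, last_byte always equals accepted minus one (modulo 256).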

The complexity comes from having to keep this process fed from a FIFO, which in turn has a latency for reading. When we run out of data in the FIFO we can stop writing immediately, but when the FT232H wants us to stop, even if we deassert the FIFO read immediately we have between zero and two bytes of data in flight that need to be buffered and sent when the FT232H has space again. Normally, with a somewhat full FIFO, there will be two bytes: one that is already available on the current cycle but can’t be clocked out because the previous write failed, and one that turns up on the next cycle because the FIFO read signal was asserted on the previous cycle and it’s too late to stop it now. However, we might have fewer bytes in flight if the FIFO was empty or only had a single byte left.

I could use a different FIFO which has no such latency in reading, but at least for my initial implementation I chose to add states to my state machine which buffer any last bytes and ensure they are written out before returning to an idle state.

Conclusion

The module I put together bearing all of the above in mind has proven reliable with no missing bytes and stable fast transfers. It really wasn’t complicated, it just needed some attention to detail, perseverance (and some testing) to understand the documentation, and taking the time to ensure everything was to spec.

I don’t really have a public code repository, but I’m happy to release this module (I’ve not included the simulation code, which is somewhat specific to the rest of my system) under a zlib license, so here it is:

`timescale 1ns / 1ps
//////////////////////////////////////////////////////////////////////////////////
//
// FT245 Synchronous FIFO interface (FT232H, FT2232H, etc.)
// 
// Simple handler to attach an FTDI USB controller to FIFOs for easier usage.
//
// fromUSB and toUSB FIFOs assumed to be (or behave like) Xilinx series-7
// hardware (e.g. Artix-7), with dual-clock and no register stage.
//
// clk should derive from the FT232H generated 60Mhz clock, via appropriate
// clock re-timing and propagation.
// 
//
// Copyright (c) 2021 Jason G. Doig
//
// This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.
//
// Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:
//
// 1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
//
// 2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
//
// 3. This notice may not be removed or altered from any source distribution.
//////////////////////////////////////////////////////////////////////////////////

module FT245Interface(
    input clk,

    output [7:0] fromUSB_fifo_data_in,
    output fromUSB_fifo_wr_enable,
    input fromUSB_fifo_almost_full,
    
    input [7:0] toUSB_fifo_data_out,
    input toUSB_fifo_empty,
    output toUSB_fifo_rd_enable,
    
    inout [7:0] ftdi_data,
    input ftdi_rxf,
    input ftdi_txe,
    output ftdi_oe,
    output ftdi_rd,
    output ftdi_wr,

    input reset
    );
    
    typedef enum reg [3:0] {IDLE,PRE_READ,READ,PRE_WRITE,WRITE,W1,W2,W3,W4} cmdState;
    cmdState curState;

    reg ftdi_oe_buf;
    reg ftdi_rd_buf;
    reg ftdi_wr_buf;
    reg [7:0] ftdi_data_buf;

    reg [7:0] dataArray[2];
    reg [7:0] arrayCount;

    reg [7:0] fromUSB_fifo_data_in_buf;
    reg fromUSB_fifo_wr_enable_buf;
    reg toUSB_fifo_rd_enable_buf;    
    
    assign ftdi_oe = ftdi_oe_buf;
    assign ftdi_rd = ftdi_rd_buf;
    assign ftdi_wr = ftdi_wr_buf;

    assign ftdi_data = ftdi_wr_buf ? 8'bzzzzzzzz : ftdi_data_buf;
    
    assign fromUSB_fifo_wr_enable = fromUSB_fifo_wr_enable_buf;
    assign fromUSB_fifo_data_in = fromUSB_fifo_data_in_buf;    
    assign toUSB_fifo_rd_enable = toUSB_fifo_rd_enable_buf;
    
    initial begin
        curState = IDLE;

        ftdi_oe_buf = 1;
        ftdi_rd_buf = 1;
        ftdi_wr_buf = 1;
        ftdi_data_buf = 0;

        arrayCount = 0;
        dataArray[0] = 0;
        dataArray[1] = 0;

        fromUSB_fifo_data_in_buf = 0;
        fromUSB_fifo_wr_enable_buf = 0;
        toUSB_fifo_rd_enable_buf = 0;
    end
    
    always @(posedge clk) begin
        if(reset) begin
            curState <= IDLE;

            ftdi_oe_buf <= 1;
            ftdi_rd_buf <= 1;
            ftdi_wr_buf <= 1;
            ftdi_data_buf <= 0;

            arrayCount <= 0;
            dataArray[0] <= 0;
            dataArray[1] <= 0;

            fromUSB_fifo_data_in_buf <= 0;
            fromUSB_fifo_wr_enable_buf <= 0;
            toUSB_fifo_rd_enable_buf <= 0;
            
        end else begin
            case(curState)
                IDLE: begin
                    fromUSB_fifo_wr_enable_buf <= 0;
                    toUSB_fifo_rd_enable_buf <= 0;
                    ftdi_wr_buf <= 1;
                    if(ftdi_txe == 0 && toUSB_fifo_empty == 0) begin
                        toUSB_fifo_rd_enable_buf <= 1;
                        curState <= PRE_WRITE;
                    end else if(ftdi_rxf == 0 && fromUSB_fifo_almost_full == 0) begin
                        ftdi_oe_buf <= 0;
                        curState <= PRE_READ;
                    end 
                end

                PRE_READ: begin
                    ftdi_rd_buf <= 0;
                    curState <= READ;
                end
                READ: begin
                    if(ftdi_rxf == 0) begin
                        fromUSB_fifo_data_in_buf <= ftdi_data;
                        fromUSB_fifo_wr_enable_buf <= 1;
                    end else begin
                        fromUSB_fifo_wr_enable_buf <= 0;
                    end
                    if(ftdi_rxf == 1 || fromUSB_fifo_almost_full == 1) begin
                        ftdi_rd_buf <= 1;
                        ftdi_oe_buf <= 1;
                        curState <= IDLE;
                    end
                end

                PRE_WRITE: begin
                    curState <= WRITE;
                    arrayCount <= 0;
                end
                WRITE: begin                    
                    if(ftdi_txe == 0) begin
                        ftdi_data_buf <= toUSB_fifo_data_out;
                        ftdi_wr_buf <= 0;
                        if(toUSB_fifo_empty) curState <= W4;
                    end else begin
                        ftdi_wr_buf <= 1;
                        dataArray[0] <= toUSB_fifo_data_out;
                        arrayCount <= 1;
                        toUSB_fifo_rd_enable_buf <= 0;
                        if(toUSB_fifo_empty) curState <= W2;
                        else curState <= W1;
                    end
                end

                W1: begin
                    dataArray[1] <= toUSB_fifo_data_out;
                    arrayCount <= 2;
                    curState <= W2;
                end

                W2: begin
                    ftdi_wr_buf <= 0;
                    if(ftdi_txe == 0 && ftdi_wr_buf == 0) begin
                        ftdi_data_buf <= dataArray[0];
                        if(arrayCount == 2) curState <= W3;
                        else curState <= W4;
                    end                    
                end

                W3: begin
                    if(ftdi_txe == 0) begin
                        ftdi_data_buf <= dataArray[1];
                        curState <= W4;
                    end                    
                end

                W4: begin
                    if(ftdi_txe == 0) begin
                        ftdi_wr_buf <= 1;
                        curState <= IDLE;                        
                    end
                end

                default: curState <= IDLE;
            endcase
        end
    end
    
endmodule