
My colleagues and I are working with one of our PCIe-based products, and we've discovered that some kind of platform/chipset dependency is preventing interrupts from being delivered to our Linux kernel driver (rapafp). One older version of the product, which we have to continue to support in the field, was retrofitted from an older PCI design. It has several FPGAs, one of which has a 66MHz, 32-bit PCI interface that connects to a Texas Instruments XIO2000(A)/XIO2200A PCI Express-to-PCI bridge. I've been researching this for days without getting anywhere. We have considered hardware problems with our own device, but swapping out multiple cards makes no difference.

Reference system that works

We have a system running RHEL6.5 that works great, so we're using that as a reference. Below is some info about the platform. I don't know what level of detail you will need, and I don't want to write a spammy question. Please let me know what else would be useful to provide and how (inline in the question, pastebin, etc.).

From uname -a:

Linux DL-2-107.localdomain 2.6.32-431.el6.i686 #1 SMP Fri Nov 22 00:26:36 UTC 2013 i686 i686 i386 GNU/Linux

From /proc/interrupts:

           CPU0       CPU1       
...
 16:  609672457 1344098703   IO-APIC-fasteoi   uhci_hcd:usb3, pata_jmicron, rapafp    

Info from dmesg:

rapafp driver version 3.3.0.5
rapafp: Requesting IRQ 16
TSI: rapafp0 (BusID 2:0:0) is RAPTOR 4000 @ 2048x2048
TSI: rapafp1 (BusID 2:0:0) is RAPTOR 4000 @ 1280x1024

From lspci:

# lspci -t
-[0000:00]-+-00.0
           +-01.0-[01-02]----00.0-[02]----00.0

00:01.0 PCI bridge: Intel Corporation 82Q35 Express PCI Express Root Port (rev 02) (prog-if 00 [Normal decode])
01:00.0 PCI bridge: Texas Instruments XIO2000(A)/XIO2200A PCI Express-to-PCI Bridge (rev 03) (prog-if 00 [Normal decode])
02:00.0 Display controller: Tech-Source Device 0042

CPU installed is: model name : Intel(R) Core(TM)2 CPU E8400 @ 3.00GHz

Some BIOS info from dmidecode:

Vendor: Phoenix Technologies, LTD
Version: 6.00 PG
Release Date: 12/12/2008

Note that the driver was never written with fasteoi in mind, so it never makes any end-of-interrupt calls. Nevertheless, it works flawlessly on that machine.

Systems that can't get interrupts to our driver

We have two systems with problems receiving interrupts. One is running RHEL6.5 (2.6.32-431.el6.i686), and the other is RHEL7.4 (3.10.0-693.17.1.el7.x86_64).

The RHEL6 system is able to get interrupts to our driver, but only intermittently. This is likely due to the kernel connecting the device to an edge-triggered interrupt line (despite the driver requesting otherwise!) and the driver not being written to be compatible with edge-triggering.

The RHEL7 system isn't able to get interrupts to our driver at all. Our current objective is to port the driver to RHEL7, so I'll focus on that machine. The two problem hosts are very similar to each other and differ from the reference system. The main differences that matter are kernel version, 32-bit vs. 64-bit, and possibly BIOS. Some system info follows.

From uname -a:

Linux rhel74.techsource.com 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

/proc/interrupts:

10:          0          0   IO-APIC-edge      rapafp

From dmesg:

[321790.744110] raptor_attach: irq_set_irq_type(10,8) succeeded!
[321790.744111] raptor_attach: calling request_irq.
[321790.744239] raptor_attach: request_irq(10) succeeded!
[321790.744240] raptor_attach: done
[321790.744342] TSI: rapafp0 (BusID 2:0:0) is RAPTOR 4000 @ 2048x2048
...
[321807.840300] PCI Config Register dump:
[321807.840405]  vendor id              0x1227  
[321807.840508]  device id              0x43
[321807.840611]  command register       0x202   
[321807.840715]  status register        0x2a0
[321807.840818]  revision id            0x0     
[321807.840921]  programming class code 0x0
[321807.841025]  sub-class code         0x80    
[321807.841129]  basic class code       0x3
[321807.841232]  header type            0x0     
[321807.841335]  base register 0        0xbfff0008
[321807.841439]  base register 1        0xa0000008      
[321807.841542]  base register 2        0xb8000008
[321807.841645]  base register 3        0x0     
[321807.841749]  base register 4        0xbffc0008
[321807.841852]  base register 5        0x0     
[321807.841955]  Cardbus CIS Pointer    0x0
[321807.842059]  Subsystem Vendor ID    0x1227  
[321807.842162]  Subsystem ID           0x43
[321807.842266]  ROM base register      0x0     
[321807.842369]  interrupt line         0xa
[321807.842472]  interrupt pin          0x1     
[321807.842576]  minimum grant          0x0
[321807.842679]  maximum grant          0x0
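For cross-checking the dump against lspci, the command and status words can be decoded by hand. The following is a minimal userspace sketch, not driver code. The bit positions are the standard PCI configuration-space ones, but the macro, struct, and function names are made up for this illustration and are not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Command register (config offset 0x04). Names are local to this sketch. */
#define CMD_IO            0x0001u  /* bit 0: I/O space decode */
#define CMD_MEM           0x0002u  /* bit 1: memory space decode */
#define CMD_MASTER        0x0004u  /* bit 2: bus mastering */
#define CMD_FAST_B2B      0x0200u  /* bit 9: fast back-to-back enable */
#define CMD_INTX_DISABLE  0x0400u  /* bit 10: shown by lspci as DisINTx */

/* Status register (config offset 0x06). */
#define STS_INTX_PENDING  0x0008u  /* bit 3: shown by lspci as INTx */
#define STS_CAP_LIST      0x0010u  /* bit 4: capability list present */
#define STS_66MHZ         0x0020u  /* bit 5: 66MHz capable */
#define STS_FAST_B2B      0x0080u  /* bit 7: fast back-to-back capable */
#define STS_DEVSEL_MASK   0x0600u  /* bits 10:9: DEVSEL timing */
#define STS_DEVSEL_SHIFT  9

struct cmd_flags { bool io, mem, master, fast_b2b, intx_disabled; };
struct sts_flags { bool intx_pending, cap_list, mhz66, fast_b2b; unsigned devsel; };

struct cmd_flags decode_command(uint16_t cmd)
{
    struct cmd_flags f = {
        .io            = (cmd & CMD_IO) != 0,
        .mem           = (cmd & CMD_MEM) != 0,
        .master        = (cmd & CMD_MASTER) != 0,
        .fast_b2b      = (cmd & CMD_FAST_B2B) != 0,
        .intx_disabled = (cmd & CMD_INTX_DISABLE) != 0,
    };
    return f;
}

struct sts_flags decode_status(uint16_t sts)
{
    struct sts_flags f = {
        .intx_pending = (sts & STS_INTX_PENDING) != 0,
        .cap_list     = (sts & STS_CAP_LIST) != 0,
        .mhz66        = (sts & STS_66MHZ) != 0,
        .fast_b2b     = (sts & STS_FAST_B2B) != 0,
        .devsel       = (sts & STS_DEVSEL_MASK) >> STS_DEVSEL_SHIFT,
    };
    return f;
}

const char *devsel_name(unsigned devsel)
{
    static const char *const names[] = { "fast", "medium", "slow", "reserved" };
    assert(devsel < 4);  /* the field is only two bits wide */
    return names[devsel];
}

/* Print both registers using the same +/- flag style as lspci -vv. */
void print_decoded(uint16_t cmd, uint16_t sts)
{
    struct cmd_flags c = decode_command(cmd);
    struct sts_flags s = decode_status(sts);

    printf("Control: I/O%c Mem%c BusMaster%c FastB2B%c DisINTx%c\n",
           c.io ? '+' : '-', c.mem ? '+' : '-', c.master ? '+' : '-',
           c.fast_b2b ? '+' : '-', c.intx_disabled ? '+' : '-');
    printf("Status: Cap%c 66MHz%c FastB2B%c DEVSEL=%s INTx%c\n",
           s.cap_list ? '+' : '-', s.mhz66 ? '+' : '-',
           s.fast_b2b ? '+' : '-', devsel_name(s.devsel),
           s.intx_pending ? '+' : '-');
}
```

Decoding the dumped values 0x202 and 0x2a0 this way gives memory decode on, bus mastering off, DisINTx clear, and no interrupt pending, which agrees with what lspci reports for 02:00.0.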

Info from lspci:

# lspci -t
-[0000:00]-+-00.0
           +-01.0-[01-02]----00.0-[02]----00.0

00:00.0 Host bridge: Intel Corporation 82X38/X48 Express DRAM Controller (rev 01)
        Subsystem: Holco Enterprise Co, Ltd/Shuttle Computer Device 3111
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ >SERR- <PERR- INTx-
...
00:01.0 PCI bridge: Intel Corporation 82X38/X48 Express Host-Primary PCI Express Bridge (rev 01) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 24
...
01:00.0 PCI bridge: Texas Instruments XIO2000(A)/XIO2200A PCI Express-to-PCI Bridge (rev 03) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
...
02:00.0 Display controller: Tech-Source Device 0043
        Subsystem: Tech-Source Device 0043
        Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B+ DisINTx-
        Status: Cap- 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 10

Solutions attempted

I attempted a sequence of fixes. The first thing I did was go through the interrupt handling code and rewrite it to be friendlier to an edge-triggered interrupt line, but that had no effect. Other things I tried include:

  • There had been no call to pci_enable_device, so I added that. No effect.
  • I noticed that our call to request_irq was using legacy flags starting with SA_, so I replaced them with the newer ones starting with IRQF_. I tried all sorts of combinations of flags. IRQF_TRIGGER_RISING, IRQF_TRIGGER_FALLING, IRQF_TRIGGER_HIGH, IRQF_TRIGGER_LOW, combinations of those, with and without IRQF_SHARED, etc. None of these had any impact on IRQ delivery, what was reported by /proc/interrupts, or the bridge configurations reported by lspci. Nevertheless, request_irq never returned any error codes.
  • I tried calling enable_irq and irq_set_irq_type (set_irq_type on older kernels). No matter what I passed to them, there was no effect. No error codes returned.

Eventually I noticed that the PCI bridge 00:01.0 had legacy interrupts disabled (DisINTx+). I went hunting for some kind of pre-existing function that would traverse the bridge hierarchy and fix up interrupts on all of the bridges, but I couldn't find anything. So I decided to experiment.

First, I wrote my own function that would ascend the bridge hierarchy:

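/*
 * Walk from dev up through every upstream bridge and clear the
 * Interrupt Disable bit (DisINTx) in each PCI command register.
 * Returns the number of devices whose DisINTx bit was set before the
 * call and is clear afterwards. pTspci is currently unused.
 */
static int raptor_enable_intx(struct pci_dev *dev, TspciPtr pTspci) {
    int num_en = 0;
    u16 cmd, old_cmd;

    while (dev) {
        /* Save the original command word so only real changes are counted. */
        pci_read_config_word(dev, PCI_COMMAND, &old_cmd);
        pci_intx(dev, true);
        /* Read back to confirm that the bit actually cleared. */
        pci_read_config_word(dev, PCI_COMMAND, &cmd);
        if (cmd & PCI_COMMAND_INTX_DISABLE) {
            printk(KERN_INFO "raptor_enable_intx: Could not clear DisINTx for device %s\n", pci_name(dev));
        } else {
            printk(KERN_INFO "raptor_enable_intx: Successfully cleared DisINTx for device %s\n", pci_name(dev));
            if (old_cmd & PCI_COMMAND_INTX_DISABLE)
                num_en++;
        }

        /* Returns NULL once dev sits on a root bus, which ends the loop. */
        dev = pci_upstream_bridge(dev);
    }
    return num_en;
}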

The main effect was to make the machine hang, although not right away. I tried calling request_irq both before and after raptor_enable_intx. IIRC, one order had no effect, while the other caused the delayed hang.

I also found pci_common_swizzle, with some comments about it being required by the PCI standard, so I call it after the function above and then call request_irq. With these changes, the system hangs immediately on insmod.
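For context, the conventional PCI-to-PCI bridge swizzle is plain arithmetic on the interrupt pin and the device number. To the best of my understanding, the kernel helper only computes a translated pin and does not write to any hardware register. The following is a userspace sketch of that arithmetic under that assumption; swizzle_pin and swizzle_path are hypothetical names for this illustration, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Translate an interrupt pin across one bridge.
 * pin:  1..4, meaning INTA..INTD on the secondary side of the bridge.
 * slot: device number of the device that drives that pin.
 * Returns the pin as seen on the primary side of the bridge.
 */
uint8_t swizzle_pin(uint8_t slot, uint8_t pin)
{
    assert(pin >= 1 && pin <= 4);
    return (uint8_t)((((pin - 1) + slot) % 4) + 1);
}

/*
 * Apply the translation once per bridge crossed. slots[] holds the
 * device number at each level, starting with the endpoint and moving
 * upward, with one entry for every bus that is not the root bus.
 */
uint8_t swizzle_path(const uint8_t *slots, size_t n, uint8_t pin)
{
    size_t i;

    for (i = 0; i < n; i++)
        pin = swizzle_pin(slots[i], pin);
    return pin;
}
```

In the topology shown by lspci -t, the endpoint 02:00.0 and the bridge 01:00.0 both have device number 0, so the slot list is {0, 0} and pin A stays pin A all the way to the root port. If that reading is right, the swizzle result alone would not change which line the device ends up on here.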

Of course, I realize that iterating through the bridges and forcing PCI_COMMAND_INTX_DISABLE off is a disgusting hack, and I wouldn't be surprised if it's that or the swizzle that causes the system hang.

Anyhow, so I'm lost and baffled here. Does anyone know what I'm doing wrong? How am I supposed to get that system bridge to allow legacy interrupts to pass through?

Thanks in advance for the help!

Timothy Miller
  • I posted a similar question to LKML: https://lkml.org/lkml/2018/3/26/590 – Timothy Miller Mar 27 '18 at 14:38
  • Could you check bit 3 of the status register in configuration space, which indicates the interrupt status, at the moment you believe an interrupt should be delivered? – Chris Tsui Mar 28 '18 at 15:08
  • Try this just to debug and to eliminate strange BIOS problems with resource allocation. Do "lspci -tv" and find the PCIe switch/bridge sitting where your device is. Then go to the shell and do echo 1 > /sys/bus/pci/devices//remove (or similar). This will make your device and bridge disappear from "lspci". Now do echo 1 > /sys/bus/pci/rescan (IIRC). This will make the Linux PCI core reallocate all resources again. Make sure your device and bridge reappear. It may get you around broken BIOSes. I don't know if it will fix your issue, but this is how I would go about debugging. – Chaitanya Lala Mar 29 '18 at 18:27
  • @ChaitanyaLala I actually got it all working. The solution was to just traverse up the bridge hierarchy and make sure that DisINTx is clear all the way up. After that, when I call pci_enable_device, it actually changes the IRQ line for my device to a level-triggered one, but only in software. So then I write the new IRQ line into the PCI config space, and then request_irq will work, and I get interrupts. The big question is, why is this not happening automatically? I don't think it's very "platform independent" to go hacking PCI config values, so what am I doing wrong? – Timothy Miller Mar 29 '18 at 20:30
  • @TimothyMiller Sounds like something to do with bridge initial configuration. Try the step I talked about in my previous comment. It will clearly tell us if this is a BIOS issue or not. – Chaitanya Lala Mar 30 '18 at 01:20
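The status-bit check suggested in the comments can be done from userspace on a raw copy of the device's configuration header, for example bytes read from the device's sysfs config file. The following is a minimal sketch assuming a buffer of at least 64 bytes in the standard little-endian layout. The struct and function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Standard type 0/1 header offsets. */
#define CFG_COMMAND   0x04
#define CFG_STATUS    0x06
#define CFG_INT_LINE  0x3c
#define CFG_INT_PIN   0x3d
#define CFG_HDR_LEN   0x40

struct intx_state {
    bool    pending;   /* status bit 3: the device wants to interrupt */
    bool    disabled;  /* command bit 10: DisINTx is set */
    uint8_t line;      /* value written by firmware or the OS */
    uint8_t pin;       /* 0 = none, 1..4 = INTA..INTD */
};

/* Assemble a little-endian 16-bit register from two config bytes. */
static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Returns false if the buffer is missing or too short to hold the header. */
bool read_intx_state(const uint8_t *cfg, size_t len, struct intx_state *out)
{
    if (cfg == NULL || out == NULL || len < CFG_HDR_LEN)
        return false;

    out->pending  = (le16(cfg + CFG_STATUS) & 0x0008u) != 0;
    out->disabled = (le16(cfg + CFG_COMMAND) & 0x0400u) != 0;
    out->line     = cfg[CFG_INT_LINE];
    out->pin      = cfg[CFG_INT_PIN];
    return true;
}

/* The device drives INTx only if it has a pin, is pending, and is not masked. */
bool intx_can_assert(const struct intx_state *s)
{
    return s->pin != 0 && s->pending && !s->disabled;
}
```

If pending is set on the endpoint while nothing arrives at the handler, that points at routing or masking somewhere upstream rather than at the card. If pending never becomes set, the device itself is not raising the interrupt.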
