New Driver for Realtek RTL8111

Mieze · June 1, 2013

The main reason I'm choosing SMB over all the other protocols (Netatalk/NFS, etc.) is that it works on all the devices in my (home) network.

It is also the protocol I already know. (That doesn't mean I don't want to learn about others.)

I'm still thinking of adding other sharing protocols to my NAS, but I don't want the network to be 'loaded' with unneeded traffic/broadcasts.

I also don't know what impact it would have on my NAS because, well, it is not the most powerful machine.

There is no need to worry. Most SoC-based NAS boxes are even less powerful, and they run SAMBA and Netatalk simultaneously. Two years ago, when I started experimenting with Netatalk, I was using a WD MyBook Live (800MHz single-core PPC, 256MB RAM). As the box runs Debian, I began to customize it. In the end I got Netatalk, SAMBA and OpenLDAP working without any performance issues (~90MB/sec read/write throughput), and with more RAM I could have added even more features.

Mieze

Mrengles · June 1, 2013

Hello everyone,

I know there has been talk about poor performance with SMB, but has anyone noticed poor refresh rates using this driver with ARD (Apple Remote Desktop) as a server?

I haven't noticed any errors in the log messages, nor have I tried the debug version. I still need to do more testing and will post my results once it is done.

I just wanted to get someone else's take or experience with Apple Remote Desktop and this driver.

Thanks,

Robert (mrengles)

genzai · June 1, 2013

I just wanted to get someone else's take or experience with Apple Remote Desktop and this driver.

FWIW, I use ARD almost daily from my Z77X-UP5 (Intel NIC) to my Z68MX (Realtek with this driver) and it works very well. There are no refresh issues, and ARD performance is not noticeably different from when I was using the lnx2mac driver. In this case the Realtek machine is sharing its desktop to the Intel machine over my LAN, if that matters. I rarely do the opposite direction. The only time I had refresh issues on my LAN was when I discovered ARD had decided to connect to my Mac mini server over IPv6, which I think was actually going over the internet and back.

Hope that helps,

g

Mieze · June 1, 2013

Hello everyone,

I know there has been talk about poor performance with SMB, but has anyone noticed poor refresh rates using this driver with ARD (Apple Remote Desktop) as a server?

I haven't noticed any errors in the log messages, nor have I tried the debug version. I still need to do more testing and will post my results once it is done.

Hello mrengles!

I use the same configuration for my homeserver (10.8.3 Server) which usually runs without display connected. On the login screen the screen refresh is sometimes slow and incomplete. When I'm logged in the refresh is fast but I get artifacts (black frames or rectangles) from time to time, but when a display is connected to the machine (HD4000/DVI), ARD works flawlessly. There are no artifacts or slow refreshes.

All machines are connected via Gigabit Ethernet.

The problem is not related to any particular NIC or driver. I saw this with the lnx2mac driver I used when I set up the machine last year. Later I added an Intel 82574L card using Apple's driver and disabled onboard LAN. The problem persisted. Two months ago I switched over to the Realtek NIC using my driver, but nothing has changed.

Mieze

Mieze · June 1, 2013

I just wanted to let you know about the results of my latest tests with regard to the SMB performance issue.

SMB throughput when communicating with another Mac via SMB has been significantly improved so that it is on a par with Apple's Broadcom driver in both directions.
When communicating with Win7 machines performance is also good.
With Windows Server 2008 R2 (64bit) performance is even better than with Win7 in both directions.
Communication with WinXP hosts hasn't improved at all and is still lousy.

The strange thing is that Apple's Broadcom driver shows the same weakness when exchanging data with WinXP machines. The performance is as bad as with my driver. It looks like certain Windows versions trigger the issue?

Mieze

Edited June 3, 2013 by Mieze

RehabMan · June 3, 2013

I just wanted to let you know about the results of my latest tests with regard to the SMB performance issue.

SMB throughput when communicating with another Mac via SMB has been significantly improved so that it is on a par with Apple's Broadcom driver in both directions.

When communicating with Win7 machines performance is also good.

With Windows Server 2008 R2 (64bit) performance is even better than with Win7 in both directions.

Communication with WinXP hosts hasn't improved at all and is still lousy.

The strange thing is that Apple's Broadcom driver shows the same weakness when exchanging data with WinXP machines. The performance is as bad as with my driver. It looks like certain Windows versions trigger the issue?

Mieze

No changes to achieve this?? FYI: My WHS2011 is basically a Windows 2008 R2...

Mieze · June 3, 2013

No changes to achieve this?? FYI: My WHS2011 is basically a Windows 2008 R2...

I've changed interrupt mitigate to 0xaf54 because listing large directory trees using ls -alR with AFP was a little bit slower with 0xaf83. I haven't checked the previous value 0xaf83 with Server 2008 R2 because it's clear that 0xaf54 is closer to the optimum.

By the way I added a config option to Info.plist to set the interrupt mitigate value without rebuild. It's quite straightforward, you'll only have to convert the hex number to a decimal and put it in.
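As a small illustration of the conversion step, the sketch below turns the two interrupt mitigate values mentioned in this thread into the decimal numbers that would be entered in Info.plist. The helper name `hexToDecimal` is made up for this example, and the 16-bit range check is an assumption based only on the fact that both values quoted here fit in 16 bits. The name of the plist key is not given in this post, so it is left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of the driver): parse a hex string such as
// "0xaf54" and return the decimal value to type into Info.plist.
inline unsigned long hexToDecimal(const std::string &hex)
{
    // Base 16 lets std::stoul accept an optional "0x" prefix.
    unsigned long value = std::stoul(hex, nullptr, 16);

    // Assumption: the setting is 16 bits wide, like the values in this thread.
    assert(value <= 0xFFFF);
    return value;
}
```

With this helper, `hexToDecimal("0xaf54")` yields 44884 and `hexToDecimal("0xaf83")` yields 44931.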

Let's summarize the changes since the last official release, version 1.0.4:

- Support for TCP/IPv6 and UDP/IPv6 checksum offload added (can be disabled in Info.plist).
- Maximum size of the scatter-gather list has been increased from 24 to 40 segments to resolve performance issues with TSO4 when offloading large, highly fragmented packets.
- TSO4 can be disabled in Info.plist without a rebuild.
- Statistics gathering has been improved to deliver more detailed information (resource shortages, transmitter resets, transmitter interrupt count).
- The interrupt mitigate setting has been changed to improve performance with SMB and to reduce CPU load.
- Configuration option added to allow for user-defined interrupt mitigate settings without a rebuild (see above).

You are encouraged to test this release candidate thoroughly, in particular with IPv6. As I don't have an IPv6-enabled internet connection, my tests are limited to the LAN, but so far I have no evidence of any problems with TCP/IPv6 and UDP/IPv6 checksum offload. This version has been running perfectly on my home server for 2 days now, and if there are no unexpected problems I'm planning to make it the next stable release, version 1.1.0.

Known issues:

- There are still performance problems with regard to SMB in certain configurations. My tests indicate that Apple's Broadcom driver shows the same behavior with those configurations. Obviously it's a more general problem that is not limited to my driver.
- RTL8111C: WoL does not work.

Mieze

@nozyczek: This version uses 40 for kMaxSegs. Please test it in order to see if this is sufficient.

RealtekRTL8111-V1.1.0-RC1.zip

Edited June 3, 2013 by Mieze

Cyberdog ! · June 4, 2013

Hello Mieze

I'm French, my English is bad.

My mobo is a Gigabyte GA-H55M-S2 with LGA1156 and a Core i3, Ethernet is Realtek 8111E.

After trying a lot of versions from http://lnx2mac.blogs...osx-driver.html with 10.8.3, I had no WOL and no WOD.

So

- cleaned up my DSDT = nothing

- cleared the cache = nothing

- tried different versions = nothing

I tried your driver and EVERYTHING WORKS with my 10.8.3.

Very Very Very Very Very good work

Thank you, Thank you, Thank you, Thank you


leslieking · June 4, 2013

Are you sure that the permissions are set to root:wheel 755? I managed to load the driver even from desktop provided the permissions are correct.

Mieze

Yes, I did that. Same issue. System profiler shows no Ethernet adapter while booting from USB flash drive.

I created the USB drive using this guide, then moved your kext into /Extra/Extensions. All other kexts (FakeSMC, NullCPUPowerManagement, VoodooPS2Controller) work.

RehabMan · June 4, 2013

Yes, I did that. Same issue. System profiler shows no Ethernet adapter while booting from USB flash drive.

I created the USB drive using this guide, then moved your kext into /Extra/Extensions. All other kexts (FakeSMC, NullCPUPowerManagement, VoodooPS2Controller) work.

You probably have another driver in /Extra/Extensions that is conflicting. A lot of USB tools install a rollback IONetworkFamily.kext that includes a lot of network drivers in the Contents/PlugIns directory, for example.

RehabMan · June 4, 2013

I've changed interrupt mitigate to 0xaf54 because listing large directory trees using ls -alR with AFP was a little bit slower with 0xaf83. I haven't checked the previous value 0xaf83 with Server 2008 R2 because it's clear that 0xaf54 is closer to the optimum.

By the way I added a config option to Info.plist to set the interrupt mitigate value without rebuild. It's quite straightforward, you'll only have to convert the hex number to a decimal and put it in.

Let's summarize the changes since the last official release, version 1.0.4:

Support for TCP/IPv6 and UDP/IPv6 checksum offload added (can be disabled in Info.plist).

Maximum size of the scatter-gather-list has been increased from 24 to 40 segments to resolve performance issues with TSO4 when offloading large packets which are highly fragmented.

TSO4 can be disabled in Info.plist without rebuild.

Statistics gathering has been improved to deliver more detailed information (resource shortages, transmitter resets, transmitter interrupt count).

The interrupt mitigate setting has been changed to improve performance with SMB and to reduce CPU load.

Configuration option added to allow for user defined interrupt mitigate settings without rebuild (see above).

You are encouraged to test this release candidate thoroughly, in particular with IPv6. As I don't have an IPv6 enabled internet connection my tests are limited to LAN but so far I have no evidence for any problems with TCP/IPv6 and UDP/IPv6 checksum offload. This version is running perfectly on my home server for 2 days now and if there are no unexpected problems I'm planning to make this version the next stable release, version 1.1.0.

Known issues:

There are still performance problems with regard to SMB in certain configurations. My tests indicate that Apple's Broadcom driver shows the same behavior with those configurations. Obviously it's a more general problem that is not limited to my driver.

RTL8111C: WoL does not work.

Mieze

@nozyczek: This version uses 40 for kMaxSegs. Please test it in order to see if this is sufficient.

OK... results with this version: My reads from server have now improved to 4-5MB/sec with net.inet.tcp.delayed_ack=3 (the default). If I change that to net.inet.tcp.delayed_ack=0 I can get average 20MB/sec with peaks into 30MB/sec (which is better than I've ever seen with this driver). Writes, however, are now slowed down to 2-3MB/sec (either setting for delayed_ack), but as I stated before the 'better' write performance seemed random, so maybe I just haven't been lucky lately.

Hopefully, some day I'll have more time to chase this down more fully or another machine to test with, but for now that's all I have...

Mieze · June 4, 2013

OK... results with this version: My reads from server have now improved to 4-5MB/sec with net.inet.tcp.delayed_ack=3 (the default). If I change that to net.inet.tcp.delayed_ack=0 I can get average 20MB/sec with peaks into 30MB/sec (which is better than I've ever seen with this driver). Writes, however, are now slowed down to 2-3MB/sec (either setting for delayed_ack), but as I stated before the 'better' write performance seemed random, so maybe I just haven't been lucky lately.

Hopefully, some day I'll have more time to chase this down more fully or another machine to test with, but for now that's all I have...

Last night I had the idea to play with the LAN connection settings on the XP machine (MacBook Pro late 2006, Marvell Yukon NIC, as client) in order to improve SMB performance. Although I didn't have time for extensive experiments, only 20 minutes, the results are promising. Disabling the QoS Packet Scheduler for the connection boosted read throughput so that I got decent reads for the first time. I was able to copy a 2GB file from the server to the XP client in less than a minute. Unfortunately, write speed seems to be unaffected by this change.

I also tried varying the NIC interrupt mitigate settings but didn't get any conclusive results, except for the fact that they do influence SMB performance.

I know that you can't apply these results directly to WHS 2011, but I think it might be worth giving it a try.

Mieze

nozyczek · June 7, 2013

@nozyczek: This version uses 40 for kMaxSegs. Please test it in order to see if this is sufficient.

Mieze,

RealtekRTL8111-V1.1.0-RC1 looks great! iperf shows stable ~941 both ways. Impressive!

Awesome job!
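For context on why a stable ~941 is a strong result, here is a back-of-the-envelope sketch. It assumes the iperf figure is in Mbit/s on a gigabit link (the post does not state the unit), a 1500-byte MTU, IPv4 and TCP headers of 20 bytes each, and the 12-byte TCP timestamp option. All names in the snippet are made up for the example.

```cpp
#include <cassert>

// Assumed frame layout (not stated in the thread):
// TCP payload per frame = MTU - IPv4 header - TCP header - timestamp option.
constexpr long long kTcpPayloadBytes = 1500 - 20 - 20 - 12;   // 1448

// Bytes each frame occupies on the wire = MTU + Ethernet header (14)
// + frame check sequence (4) + preamble (8) + inter-frame gap (12).
constexpr long long kWireBytes = 1500 + 14 + 4 + 8 + 12;      // 1538

// Best-case TCP goodput in kbit/s for a given line rate in kbit/s.
constexpr long long maxGoodputKbps(long long lineRateKbps)
{
    return lineRateKbps * kTcpPayloadBytes / kWireBytes;
}

// On a 1 Gbit/s link this comes out at roughly 941 Mbit/s.
static_assert(maxGoodputKbps(1000000) == 941482, "about 941 Mbit/s");
```

Under these assumptions, ~941 in both directions would mean the link is running at essentially the theoretical ceiling for TCP over gigabit Ethernet.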

RehabMan · June 7, 2013

Last night I had the idea to play with the LAN connection settings on the XP machine (MacBook Pro late 2006, Marvell Yukon NIC, as client) in order to improve SMB performance. Although I didn't have time for extensive experiments, only 20 minutes, the results are promising. Disabling the QoS Packet Scheduler for the connection boosted read throughput so that I got decent reads for the first time. I was able to copy a 2GB file from the server to the XP client in less than a minute. Unfortunately, write speed seems to be unaffected by this change.

I also tried varying the NIC interrupt mitigate settings but didn't get any conclusive results, except for the fact that they do influence SMB performance.

I know that you can't apply these results directly to WHS 2011, but I think it might be worth giving it a try.

Mieze

I'm making some progress. I have fixed the slow receive/read problem. By looking at slice's code and the original Linux driver code and a *lot* of experimentation, study, code review, etc, I have boiled the receive problem down to differences in your version of interruptOccurred.

Here is the new version:


void RTL8111::interruptOccurred(OSObject *client, IOInterruptEventSource *src, int intrCount)
{
    /* Loop budget. Declared outside the loop, and named differently from
       the (unused) intrCount parameter, so it can be checked afterwards. */
    int count;

    /* Mask all interrupts while we service them. */
    WriteReg16(IntrMask, 0x0000);

    for (count = 20; count > 0; count--) {

        /* Read interrupt status to determine work */
        UInt16 status = ReadReg16(IntrStatus);
        status &= (intrMask | TxDescUnavail);
        /* Clear interrupt status with work done this iteration */
        WriteReg16(IntrStatus, status);

        /* hotplug/major error/no more work/shared irq */
        if ((status == 0xFFFF) || !(status & intrMask))
            break;

        if (status & SYSErr) {
            pciErrorInterrupt();
            break;
        }

        /* Seems redundant, but it's in the 8168 code... */
        if ((status & TxOK) && (status & TxDescUnavail)) {
            WriteReg8(TxPoll, NPQ); /* set polling bit */
        }

        /* Rx interrupt */
        if (status & (RxOK | RxDescUnavail | RxFIFOOver))
            rxInterrupt();

        /* Tx interrupt */
        if (status & (TxOK | TxErr /*| TxDescUnavail*/))
            txInterrupt();

        if (status & LinkChg)
            checkLinkStatus();

        /* Check if a statistics dump has been completed. */
        if (needsUpdate && !(ReadReg32(CounterAddrLow) & CounterDump))
            updateStatitics();
    }

    /* The budget only reaches zero if every pass found more work. */
    if (0 == count) {
        IOLog("Ethernet [RealtekRTL8111]: max count reached in interrupt service.\n");
    }

    /* Write clean mask */
    WriteReg16(IntrMask, intrMask);
}

Now I get about 51MB/sec with a Finder copy from my SMB server to the SSD on the laptop. I think the critical change is that IntrStatus is now written closer (in time) to the read. Probably the loop is helpful too...
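To make the read-then-acknowledge ordering concrete, here is a minimal, hardware-free sketch of a bounded service loop. It is not the driver's code: `MockIntrStatus`, `serviceInterrupts` and the bit values are invented for the example, and the register is assumed to behave as write-1-to-clear. The point it illustrates is that when the status is acknowledged right after it is read, an event that arrives while the handler is still working sets the bit again and is picked up on the next pass instead of being lost.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Illustrative bit values only (assumed, not taken from the driver headers).
constexpr std::uint16_t kRxOK = 0x0001;
constexpr std::uint16_t kTxOK = 0x0004;

// Hypothetical stand-in for a write-1-to-clear interrupt status register.
struct MockIntrStatus {
    std::uint16_t bits = 0;

    std::uint16_t read() const { return bits; }

    // Acknowledging clears exactly the bits that are written.
    void acknowledge(std::uint16_t mask)
    {
        bits = static_cast<std::uint16_t>(bits & ~mask);
    }

    // Simulates the NIC signalling a new event.
    void raise(std::uint16_t mask)
    {
        bits = static_cast<std::uint16_t>(bits | mask);
    }
};

// Runs at most `budget` passes and returns how many of them found work.
inline int serviceInterrupts(MockIntrStatus &reg, std::uint16_t intrMask, int budget,
                             const std::function<void(std::uint16_t)> &handle)
{
    int passes = 0;

    while (budget-- > 0) {
        std::uint16_t status = static_cast<std::uint16_t>(reg.read() & intrMask);

        if (status == 0)
            break;                  // nothing left to do

        reg.acknowledge(status);    // acknowledge right after the read ...
        handle(status);             // ... then do the slow work
        ++passes;
    }
    return passes;
}
```

If the handler raises `kRxOK` once while it is processing the first event, the loop performs two working passes and leaves the register clean. A handler that keeps raising events would be cut off after `budget` passes, which corresponds to the "max count reached" log message in the code above.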

File copy to the server is about 7 to 10MB/sec which is better too. And now I can reproduce the effect of doing copies simultaneously increasing the write speed. If during the copy of a large file to the server (the write case here), I simultaneously start copying a large file from the server to the laptop (the read case above) throughput on the copy to the server jumps to ~49MB/sec (read speed remains stable at ~51MB/sec). If I stop the copy from the server to the laptop, the copy to the server slows back down.

This could be affected by the interrupt mitigate value, so I'll play with that too. With a lot of received packets happening, it is more likely more interrupts are generated, and since all types of interrupts are processed each interrupt...

I'm going to experiment with the code some more on the transmit side. Since slice's version doesn't have this issue, I think I'll experiment copying the output packets to memory as one descriptor. This is the only major difference I can see between your driver and slice's version... You send output packets as potentially fragmented/chained dma descriptors, whereas slice's driver always sends a packet as a single dma descriptor.

BTW, I have fixed all the bugs in slice's version as far as not negotiating speed 1000, and other weird happenings when the cable is unplugged (there was a garbage local being used). So at this point, I'm looking to fix your driver instead of trying to add checksum offload to slice's driver.

I'll update status here as I figure out more...

P.S. Sorry about the poor indenting in the code above. It should look ok when you paste it with xcode. This site is really brain dead when it comes to stripping leading spaces from code blocks.

Mieze · June 7, 2013

Now I get about ~51MB/sec with a Finder copy from my SMB server to the SSD on the laptop. I think the critical change is that the IntrStatus is written closer (in time) to the read. Probably the loop is helpful too...

If clearing the interrupt status register earlier in the loop helps, it would be no problem for me to change this, provided it doesn't cause any unwanted side effects. Whether the loop has any effect could easily be determined by adding a statistics variable that records the maximum number of iterations the loop has reached. However, as the txInterrupt() and rxInterrupt() functions handle as many finished descriptors as are available, i.e. all received / transmitted packets, I doubt that it will go higher than 1 or 2, but of course this depends on your exact system configuration. If the driver's thread gets preempted in between, there might be more runs.
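The statistics variable suggested here could look roughly like the following sketch. `LoopStats` and its members are hypothetical names; in the driver the value would presumably be kept next to the other statistics counters, but that placement is an assumption.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical bookkeeping for the interrupt service loop: remembers the
// largest number of passes any single invocation needed.
struct LoopStats {
    int maxPasses = 0;        // worst case seen so far
    unsigned long calls = 0;  // number of handler invocations recorded

    void record(int passes)
    {
        assert(passes >= 0);  // a pass count can never be negative
        maxPasses = std::max(maxPasses, passes);
        ++calls;
    }
};
```

Calling `record()` once at the end of each handler invocation and inspecting `maxPasses` after a long transfer would show whether the loop ever runs more than once or twice.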

File copy to the server is about 7 to 10MB/sec which is better too. And now I can reproduce the effect of doing copies simultaneously increasing the write speed. If during the copy of a large file to the server (the write case here), I simultaneously start copying a large file from the server to the laptop (the read case above) throughput on the copy to the server jumps to ~49MB/sec (read speed remains stable at ~51MB/sec). If I stop the copy from the server to the laptop, the copy to the server slows back down.

Have you played with the settings on the Windows machine? I'm pretty sure that the slow writes are triggered by it because my test results with different setups show that this is the only consistent factor.

This could be affected by the interrupt mitigate value, so I'll play with that too. With a lot of received packets happening, it is more likely more interrupts are generated, and since all types of interrupts are processed each interrupt...

Keep an eye on system load in general and on smbd in particular. top is a good helper to find out what the machine is doing. A high load might be the result of too many interrupts but in case it's idling most of the time during a write operation it might be waiting for an answer from the other endpoint that isn't coming. The latest release counts transmitter interrupts and puts the value into the ethernet statistics so that you can check it in IORegistryExplorer and after uncommenting the last line in rxInterrupt() you'll also get the number of receiver interrupts.

I'm going to experiment with the code some more on the transmit side. Since slice's version doesn't have this issue, I think I'll experiment copying the output packets to memory as one descriptor. This is the only major difference I can see between your driver and slice's version... You send output packets as potentially fragmented/chained dma descriptors, whereas slice's driver always sends a packet as a single dma descriptor.

No, the most important difference is concurrency which widely affects timing. I let the NIC calculate checksums and segment large TCP packets so that the network stack is less involved in defining the exact timing because there is a lot of work still going on after outputPacket() returned. TSO acts on packets of up to 64KB which means that one call of outputPacket() could result in the transmission of more than 40 ethernet packets.
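The "more than 40 ethernet packets" figure can be checked with a little arithmetic. The sketch below assumes a maximum segment size of 1460 bytes (1500-byte MTU minus 20-byte IPv4 and 20-byte TCP headers, no options) and, for simplicity, treats the whole 64KB as TCP payload. The function and constant names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>

// Assumed MSS: 1500-byte MTU minus 20-byte IPv4 and 20-byte TCP headers.
constexpr std::size_t kAssumedMss = 1460;

// Number of Ethernet frames the NIC has to emit for one large TSO packet.
constexpr std::size_t framesForTsoPayload(std::size_t payloadBytes,
                                          std::size_t mss = kAssumedMss)
{
    return (payloadBytes + mss - 1) / mss;   // integer division, rounded up
}

// A single 64KB packet handed to the NIC becomes 45 frames on the wire.
static_assert(framesForTsoPayload(64 * 1024) == 45, "64KB needs 45 frames");
static_assert(framesForTsoPayload(1460) == 1, "exactly one full segment");
static_assert(framesForTsoPayload(1461) == 2, "one byte over needs a second frame");
```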

On the receiver side packets come in with checksum verification already done which means that they will be handled much faster. Also keep in mind the side effects of checksum calculation by the CPU. It's not limited to consumption of cycles but it also churns up the cache affecting other tasks too. Microsoft has done excellent research on that topic (see NDIS docs).

Mieze

Mieze,

RealtekRTL8111-V1.1.0-RC1 looks great! iperf shows stable ~941 both ways. Impressive!

Awesome job!

Thank you very much for the tests! I will push the latest version to github next week and update the binaries.

Mieze

RehabMan · June 8, 2013

If clearing the interrupt status register earlier in the loop helps, it would be no problem for me to change this, provided it doesn't cause any unwanted side effects. Whether the loop has any effect could easily be determined by adding a statistics variable that records the maximum number of iterations the loop has reached. However, as the txInterrupt() and rxInterrupt() functions handle as many finished descriptors as are available, i.e. all received / transmitted packets, I doubt that it will go higher than 1 or 2, but of course this depends on your exact system configuration. If the driver's thread gets preempted in between, there might be more runs.

I'll do a test w/ only one loop just to see if the loop is helping. Both slice's driver and the Linux driver have this loop and that's why I added it.

Have you played with the settings on the Windows machine? I'm pretty sure that the slow writes are triggered by it because my test results with different setups show that this is the only consistent factor.

I did but they didn't make a difference. On top of that, this performance problem is only present with your driver, which makes me think it is something in the driver (like it is [was] with the receive side...)

Keep an eye on system load in general and on smbd in particular. top is a good helper to find out what the machine is doing. A high load might be the result of too many interrupts but in case it's idling most of the time during a write operation it might be waiting for an answer from the other endpoint that isn't coming. The latest release counts transmitter interrupts and puts the value into the ethernet statistics so that you can check it in IORegistryExplorer and after uncommenting the last line in rxInterrupt() you'll also get the number of receiver interrupts.

Thanks... I'll check it out. Last time I looked for 'smbd' I could not find it.

No, the most important difference is concurrency which widely affects timing. I let the NIC calculate checksums and segment large TCP packets so that the network stack is less involved in defining the exact timing because there is a lot of work still going on after outputPacket() returned. TSO acts on packets of up to 64KB which means that one call of outputPacket() could result in the transmission of more than 40 ethernet packets.

At this point there are a lot of differences, and I don't know yet which one is causing my issue, but I hope to figure it out. I think I'll also add some code to track how many segments might be in the mbuf passed to outputPacket and how large they are. Slice's code assumes there are 1608 bytes or fewer in each mbuf_t passed to outputPacket (that's how much memory is allocated for each tx DMA descriptor buffer, and the code doesn't check for overflow). Is your driver's outputPacket treated differently for some reason, so that you must deal with larger mbuf_t packets?

On the receiver side packets come in with checksum verification already done which means that they will be handled much faster. Also keep in mind the side effects of checksum calculation by the CPU. It's not limited to consumption of cycles but it also churns up the cache affecting other tasks too. Microsoft has done excellent research on that topic (see NDIS docs).

I agree that checksum offload is a great feature to have and I can see where providing dma pointers directly into the network stack buffers (mbuf_t) should be of great advantage. But not when I get only 10MB/sec on a gig connection.

I appreciate the feedback... I'm learning as I go...

Mieze · June 8, 2013

I'll do a test w/ only one loop just to see if the loop is helping. Both slice's driver and the Linux driver have this loop and that's why I added it.

With Linux it makes sense because the interrupt handler runs at interrupt level, but as I already stated earlier in this thread, OS X is different in that respect.

I did but they didn't make a difference. On top of that, this performance problem is only present with your driver, which makes me think it is something in the driver (like it is [was] with the receive side...)

No, the problem also exists with Apple's Broadcom driver. Check out the reports about bad SMB performance of Apple users.

At this point there are a lot of differences, and I don't know yet which one is causing my issue, but I hope to figure it out. I think I'll also add some code to track how many segments might be in the mbuf passed to outputPacket and how large they are. Slice's code assumes there are 1608 bytes or fewer in each mbuf_t passed to outputPacket (that's how much memory is allocated for each tx DMA descriptor buffer, and the code doesn't check for overflow). Is your driver's outputPacket treated differently for some reason, so that you must deal with larger mbuf_t packets?

First, Slice's driver doesn't have to handle physical segments because it copies every packet to/from a physically contiguous DMA buffer. Second, as long as you don't use TSO or jumbo frames, there won't be any packets (mbufs) larger than 1518 bytes. Third, you won't receive any multi-segment mbufs unless the driver tells the network stack in getFeatures() that it can handle them.
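The 1518-byte limit and the 1608-byte buffer quoted for slice's driver can be related with a few constants. This is only a sketch of the arithmetic: the names are invented, the breakdown of 1518 into payload, header and frame check sequence is the standard Ethernet one, and the 9018-byte jumbo frame is just an assumed example size.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kMtu = 1500;        // standard Ethernet payload size
constexpr std::size_t kEthHeader = 14;    // destination, source, EtherType
constexpr std::size_t kFcs = 4;           // frame check sequence
constexpr std::size_t kMaxFrame = kMtu + kEthHeader + kFcs;

// Buffer size per tx descriptor quoted in this thread for slice's driver.
constexpr std::size_t kQuotedTxBuffer = 1608;

constexpr bool fitsInBuffer(std::size_t frameBytes, std::size_t bufferBytes)
{
    return frameBytes <= bufferBytes;
}

static_assert(kMaxFrame == 1518, "largest frame without TSO or jumbo frames");
static_assert(fitsInBuffer(kMaxFrame, kQuotedTxBuffer), "standard frames fit");
static_assert(!fitsInBuffer(9018, kQuotedTxBuffer), "an assumed 9000-byte-MTU jumbo frame does not");
```

So a fixed 1608-byte buffer is safe only as long as neither TSO nor jumbo frames are enabled, which matches the second point above.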

As of now I haven't seen much debug data from you so that I barely know what is going on.

Edit: Are you aware that this could trigger a feedback loop? Letting the NIC poll the transmitter descriptor ring will cause the TxDescUnavail bit in the interrupt status register to be set again when all descriptors have been finished, and I don't see TxDescUnavail being cleared at any point.

    /* Seems redundant, but it's in the 8168 code... */
    if ((status & TxOK) && (status & TxDescUnavail)) {
        WriteReg8(TxPoll, NPQ); /* set polling bit */
    }

Mieze

Edited June 8, 2013 by Mieze

RehabMan · June 8, 2013

With linux it makes sense because the interrupt handler runs at interrupt level but as I already stated earlier in this thread OS X is different with regard to that point.

Yes, and I thought that too about setting the IntrMask to zero and re-enabling it at the end -- shouldn't be necessary with a workloop-based interrupt, right? After all, this OS X interrupt handler is not a "real" interrupt handler (it executes in a kernel thread, part of the workloop, after the actual interrupt has been handled; the real interrupt handler just triggers the thread/workloop). But I tried removing it and it caused all kinds of problems... and I still don't understand why.

No, the problem also exists with Apple's Broadcom driver. Check out the reports about bad SMB performance of Apple users.

Doesn't happen w/ lnx2mac or slice's, so it is something I want to keep looking at before I blame Apple completely.

First, Slice's driver doesn't have to handle physical segments because it copies every packet to/from a physical contiguous DMA buffer. Second, as long as you don't use TSO or jumbo frames there won't be any packets (mbufs) larger than 1518 Bytes. Third, you won't receive any multisegment mbufs unless the driver tells the network stack in getFeature() that it can handle them.

Sounds like something more to play with. I notice slice's driver doesn't implement getFeatures so it must be getting base class implementation.

As of now I haven't seen much debug data from you so that I barely know what is going on.

There really isn't anything to see. I've run the debug version w/ DebugLog modified so the output from the driver can be easily identified, and there is almost nothing of interest.

Edit: Are you aware that this could trigger a feedback loop? Letting the NIC poll the transmitter descriptor ring will cause the TxDescUnavail bit in the interrupt status register to be set again when all descriptors have been finished, and I don't see TxDescUnavail being cleared at any point.

    /* Seems redundant, but it's in the 8168 code... */
    if ((status & TxOK) && (status & TxDescUnavail)) {
        WriteReg8(TxPoll, NPQ); /* set polling bit */
    }

Thanks for the heads up, but I don't think it will because TxDescUnavail is not set in IntrMask. And, actually, I think it is cleared by this code:

    /* Read interrupt status to determine work */
    UInt16 status = ReadReg16(IntrStatus);
    status &= (intrMask | TxDescUnavail);
    /* Clear interrupt status with work done this iteration */
    WriteReg16(IntrStatus, status);

But this code is not really necessary for the 'receive fix'. I'm working on fixing the transmit side too and accidentally left it in before I posted the code for you. It was something in slice's code, so I thought it was worth a try (since it is xmit-related).

I think just moving the write to IntrStatus closer to the read helps me on the receive side. Maybe you could test it on your side to see if it causes any issues with your devices. I'll keep working on the xmit problem.

Mieze · June 8, 2013

Yes, and I thought that too about setting the IntrMask to zero and re-enabling it at the end -- shouldn't be necessary with a workloop based interrupt, right?. Since this OS X interrupt handler is not a "real" interrupt handler (it executes in a kernel thread, part of workloop, after the actual interrupt has been handled; real interrupt handler just triggers the thread/workloop). But I tried removing it and it caused all kinds of problems... and I still don't understand why.

Interrupt mask has to be cleared in order to clear bits in interrupt status properly. I already figured that out during my tests a long time ago.

There really isn't anything to see. I've run the debug version w/ DebugLog modified so the output from the driver can be easily identified, and there is almost nothing of interest.

Maybe there isn't anything, but the missing hint could be in there. In the meantime it would be interesting to find out what the machine is doing during send operations. What does top say? By the way, the watchdog timer routine is very useful for getting statistics data every second. How many interrupts are there? Have you created a packet dump with Wireshark?

Thanks for the heads up, but I don't think it will because TxDescUnavail is not set in IntrMask. And, actually, I think it is cleared by this code:

    /* Read interrupt status to determine work */
    UInt16 status = ReadReg16(IntrStatus);
    status &= (intrMask | TxDescUnavail);
    /* Clear interrupt status with work done this iteration */
    WriteReg16(IntrStatus, status);

But this code is not really necessary for the 'receive fix'. I'm working on fixing the transmit side too and accidentally left it in before I posted the code for you. It was something in slice's code, so I thought it was worth a try (since it is xmit-related).

I think just moving the write to IntrStatus closer to the read helps me on the receive side. Maybe you could test it on your side to see if it causes any issues with your devices. I'll keep working on the xmit problem.

Bits in the interrupt status register might get set even if the corresponding bit in the interrupt mask register is cleared. Although they don't cause an interrupt when they are masked, they prevent your loop from exiting when the work is done, which means that your loop boils down to busy waiting, something that could be achieved more easily.

Mieze

RehabMan · June 8, 2013

Interrupt mask has to be cleared in order to clear bits in interrupt status properly. I already figured that out during my tests a long time ago.

It is a little strange, as you would think the mask and status would be independent. But I'm sure Realtek expected these tasks to be performed at actual interrupt time instead of later... Probably this is documented by Realtek, but probably only under NDA...

Bits in the interrupt status register might get set even if the corresponding bit in the interrupt mask register is cleared. Although they don't cause an interrupt when they are masked, they prevent your loop from exiting when work is done which means that your loop boils down to busy waiting which could be achieved easier.

Yes, they may/will get set, but they won't generate interrupts and the code will not stay in that loop for bits that aren't in the mask. See:

    /* hotplug/major error/no more work/shared irq */
    if ((status == 0xFFFF) || !(status & intrMask))
        break;
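
To make the exit test above concrete, here is a minimal standalone sketch of that check run against a few status values. The bit values and the name `shouldExitLoop` are assumptions made for illustration only; they are not taken from the driver or from a Realtek datasheet.

```cpp
#include <cstdint>

// Placeholder bit values for this sketch (assumptions, not datasheet values).
constexpr uint16_t kRxOK = 0x0001;
constexpr uint16_t kTxOK = 0x0004;
constexpr uint16_t kTxDescUnavail = 0x0080;

// Hypothetical helper: mirrors the masking and the break condition quoted above.
// Returns true when the service loop would stop for this raw status value.
bool shouldExitLoop(uint16_t rawStatus, uint16_t intrMask) {
    // Keep only the bits the loop looks at: the mask plus TxDescUnavail.
    uint16_t status = static_cast<uint16_t>(rawStatus & (intrMask | kTxDescUnavail));
    // Exit on an all-ones read or when no bit enabled in the mask is left.
    return (status == 0xFFFF) || !(status & intrMask);
}
```

With `intrMask = kRxOK | kTxOK`, a raw status of only `kTxDescUnavail` makes the helper return true, so under these assumptions a bit outside the mask does not keep the loop running on its own, while `kTxDescUnavail | kRxOK` returns false because a masked-in bit is still pending.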

But all this is neither here nor there. This is just experimental code to try and determine the root cause of the problem.

Here's an interesting experiment I did:

void RTL8111::interruptOccurred(OSObject *client, IOInterruptEventSource *src, int count)
{
    WriteReg16(IntrMask, 0x0000);

    for (int count = 1/*kMaxInterruptWork*/; count > 0; count--) {

        /* Read interrupt status to determine work */
        UInt16 status = ReadReg16(IntrStatus);
        status &= (intrMask | TxDescUnavail);

        /* hotplug/major error/no more work/shared irq */
        if ((status == 0xFFFF) || !(status & intrMask))
            break;

        if (status & SYSErr) {
            pciErrorInterrupt();
            break;
        }

        /* Seems redundant, but it's in the 8168 code... */
        ////if ((status & TxOK) && (status & TxDescUnavail)) {
        ////    WriteReg8(TxPoll, NPQ); /* set polling bit */
        ////}

        /* Tx interrupt */
        if (status & (TxOK | TxErr | TxDescUnavail)) {
            txInterrupt();
            /* !!!!! EXPERIMENTAL !!!!! */
            if (kNumTxDesc != txNumFreeDesc)
                status &= ~(TxOK | TxErr | TxDescUnavail);
        }

        /* Clear interrupt status with work done this iteration */
        WriteReg16(IntrStatus, status);

        /* Rx interrupt */
        if (status & (RxOK | RxDescUnavail | RxFIFOOver))
            rxInterrupt();

        if (status & LinkChg)
            checkLinkStatus();

        /* Check if a statistics dump has been completed. */
        if (needsUpdate && !(ReadReg32(CounterAddrLow) & CounterDump))
            updateStatitics();
    }

    if (0 == count) {
        IOLog("Ethernet [RealtekRTL81111]: max count reached in interrupt service.\n");
    }

    /* Write clean mask */
    WriteReg16(IntrMask, intrMask);
}

Basically, just as an experiment, I left the xmit related status bits uncleared if there was still work pending in the xmit descriptors... With this, I get ~40MB/sec writes from laptop to server. Excessive CPU usage along with it, of course, but it also 'fixes' the write performance problem. Perhaps that gives us some clues. It certainly gives me some further ideas to try.
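
The effect of leaving the xmit bits uncleared can be sketched with a small software model. This assumes that IntrStatus behaves as a write-1-to-clear register, which is how the acknowledge write in the code above reads but is not confirmed in this thread. `StatusRegModel`, `acknowledge` and the bit values are hypothetical names and placeholders for this sketch only.

```cpp
#include <cstdint>

// Placeholder bit values for this sketch (assumptions, not datasheet values).
constexpr uint16_t kRxOK = 0x0001;
constexpr uint16_t kTxOK = 0x0004;
constexpr uint16_t kTxErr = 0x0008;
constexpr uint16_t kTxDescUnavail = 0x0080;
constexpr uint16_t kTxBits = kTxOK | kTxErr | kTxDescUnavail;

// Hypothetical model of a write-1-to-clear status register (an assumption).
struct StatusRegModel {
    uint16_t bits = 0;
    // Hardware side: an event sets its bit.
    void raise(uint16_t b) { bits = static_cast<uint16_t>(bits | b); }
    uint16_t read() const { return bits; }
    // Driver side: every 1 written clears the matching bit, 0 leaves it alone.
    void write(uint16_t v) { bits = static_cast<uint16_t>(bits & ~v); }
};

// Acknowledge handled events. While Tx descriptors are still in flight the Tx
// bits are removed from the value written back, so they stay set in the register.
uint16_t acknowledge(StatusRegModel &reg, bool txWorkPending) {
    uint16_t status = reg.read();
    if (txWorkPending)
        status = static_cast<uint16_t>(status & ~kTxBits);
    reg.write(status);
    return reg.read();  // bits still pending after the acknowledge
}
```

In this model, raising `kRxOK | kTxOK` and acknowledging with `txWorkPending == true` leaves `kTxOK` set, so the next status read still reports Tx work. That would be consistent with the handler being re-entered until the descriptors drain, which is the extra CPU usage described above.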

It kind of says to me "too much interrupt mitigation on the receive side." SMB stack may be waiting on acks that are late to arrive?

Also, I'm kind of wondering if the chip doesn't appreciate having the interrupt status query/clear so late (well after asserting IRQ). I may experiment with installing a real interrupt handler to clear status earlier in the process (saving the status [cumulative bitwise-or] for later inspection by the workloop based interrupt handler, of course).
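
A minimal sketch of the "cumulative bitwise-or" idea, written as plain C++ rather than driver code: one side ORs each status value into a shared word as early as possible, and the deferred handler later takes the accumulated bits and resets the word in one step. The class name `PendingStatus` and its methods are hypothetical. In a real IOKit driver the early side would probably live in the filter routine of an IOFilterInterruptEventSource, which is not shown here.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper that carries status bits from an early (primary)
// interrupt context to a deferred, workloop-style handler.
class PendingStatus {
public:
    // Early side: merge freshly read status bits into the pending set.
    void accumulate(uint16_t status) {
        pending_.fetch_or(status, std::memory_order_release);
    }

    // Deferred side: fetch everything accumulated so far and reset to zero
    // atomically, so no bit set in between is lost or seen twice.
    uint16_t take() {
        return pending_.exchange(0, std::memory_order_acquire);
    }

private:
    std::atomic<uint16_t> pending_{0};
};
```

Usage would look like calling `accumulate(status)` right after the status register has been read and acknowledged, and `take()` at the top of the deferred handler in place of the register read. Whether the chip actually behaves better with an earlier acknowledge is exactly the open question of the experiment.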

I won't have time to work on this for the next couple of days, but will resume when I can...

Mieze · June 8, 2013

It is a little strange, as you would think the mask and status would be independent. But I'm sure Realtek expected these tasks to be performed at actual interrupt time instead of later... Probably this is documented by Realtek, but probably only under NDA...

There are 25 versions of the RTL8111, each with its own bugs and quirks. As Realtek's products are targeted to the mass market it might also be a matter of cost not to eliminate design errors.

Basically, just as an experiment, I left the xmit related status bits uncleared if there was still work pending in the xmit descriptors... With this, I get ~40MB/sec writes from laptop to server. Excessive CPU usage along with it, of course, but it also 'fixes' the write performance problem. Perhaps that gives us some clues. It certainly gives me some further ideas to try.

We know now conclusively that busy waiting resolves the issue.

It kind of says to me "too much interrupt mitigation on the receive side." SMB stack may be waiting on acks that are late to arrive?

... so that network statistics and Wireshark should show a huge number of retransmitted packets during SMB writes while the CPU would be idling most of the time. Running ping for a minute should give you a rough estimate of the packet round-trip time.

Also, I'm kind of wondering if the chip doesn't appreciate having the interrupt status query/clear so late (well after asserting IRQ). I may experiment with installing a real interrupt handler to clear status earlier in the process (saving the status [cumulative bitwise-or] for later inspection by the workloop based interrupt handler, of course).

This would have an impact on every protocol, but so far we only have an SMB performance issue.

Mieze

Mieze · June 8, 2013

I've pushed version 1.1.0 to GitHub and updated the prebuilt binaries. There have been no changes since RC1. I decided to put the binaries into the download section of this site. Please see: http://www.insanelymac.com/forum/files/category/5-lan-and-wireless/

Mieze

nozyczek · June 11, 2013

Mieze,

I just ran 1.1.0 under 10.9 dp1. Everything seems to be working OK. I will do performance testing, WOL etc when I find a moment.

nozyczek

nozyczek · June 11, 2013

LOL, just got this popup

Mieze · June 11, 2013

Mieze,

I just ran 1.1.0 under 10.9 dp1. Everything seems to be working OK. I will do performance testing, WOL etc when I find a moment.

nozyczek

Thanks for the test! By the way, Realtek has just updated the Linux sources the driver is based on. I'll merge in that new code, version 8.036.00, as soon as possible. My plans for the future also include:

Try to find a solution for the WoL issue with the RTL8111C.
Add support for TCP/IPv6 segmentation offload (TSO6). After reverse engineering the Win7 driver I found out how it has to be done but still haven't found some time to test my theory.

Mieze
