How we tamed Node.js event loop lag: a deepdive

40 points by mifydev a year ago

This is not about “taming lag” as suggested by the title, which implies some form of failure on node’s part.

They accidentally wrote synchronous O(n^2) code that hogged the CPU, blocking the event loop, then fixed it. But that doesn’t sound as adventurous…

Otherwise a solid example of using observability tools to debug a live issue.

williamdclt a year ago

While I don’t think the article is very advanced, it’s really not about the root cause. The O(n^2) code isn’t the subject (they don’t even show the fix, as it’s not really interesting).
It’s about how to systematically detect and debug the problem. In Node that’s not a trivial thing to do. That has value.
- moralestapia a year ago
  
  >In Node that’s not a trivial thing to do.
  Depends on your code best practices, I've found it way easier than other platforms I've used (C++, Python). Even without explicit interrupts and such.
  - williamdclt a year ago
    
    I might be missing something, what code practices do you have in mind that help with CPU profiling?
    
    moralestapia a year ago
    
    Things that help:
    * No anonymous functions/lambdas (unless they're extremely trivial, but then you probably don't need that in a function at all)
    * Avoid recursion
    * Functions do one simple thing and return
    * Only one return at the end of the function body
    * No uncaught exceptions
    * Functions that "await" and functions that "compute" are separate ones
    * Avoid 3rd party libraries unless it is absolutely necessary
    * Write code for a single thread model, scale horizontally with cluster, things like pm2 or just running several node processes
    Never had to write them down, tbh, there's definitely many more. This could apply to every language, though, not just JS. You could borrow some practices from realtime computing, as well.
    If you do most of these things, debugging with console.trace is straightforward. You can also use one of the flamegraph tools out there for the profiling part.
    The only thing you can't control (but you can examine to some extent) is the GC, but in practice that's never been an issue for me since the GC that comes with V8 is really good.
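A few of the practices listed above (named functions, "await" and "compute" kept separate, a single return at the end) can be sketched as follows. This is only an illustration: `loadLines`, `countDuplicates`, and `reportDuplicates` are hypothetical names, not code from the article.

```javascript
const { readFile } = require('node:fs/promises');

// "Await" function: does I/O only, no heavy computation.
async function loadLines(path) {
  const text = await readFile(path, 'utf8');
  return text.split('\n');
}

// "Compute" function: synchronous and named, so it shows up
// under its own name in stack traces and flamegraphs.
function countDuplicates(lines) {
  const seen = new Set();
  let duplicates = 0;
  for (const line of lines) {
    if (seen.has(line)) {
      duplicates += 1;
    } else {
      seen.add(line);
    }
  }
  return duplicates;
}

// Thin named wrapper that composes the two.
async function reportDuplicates(path) {
  const lines = await loadLines(path);
  const duplicates = countDuplicates(lines);
  return duplicates;
}
```

With this split, a CPU profile attributes time to `countDuplicates` rather than to an anonymous callback buried inside an async chain.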
- bluelightning2k a year ago
  
  Very well said!
  I'm a pretty sophisticated node dev and I haven't heard of the technique of logging spans whenever the event loop blocks for 100 ms (similar to how you get warnings by default in frameworks/chrome for going over 100 ms on event handlers).
  Obviously simple to run a setInterval and compare wall clock, but I would have no idea how to detect the actual issue.
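The setInterval-and-wall-clock approach mentioned above might look like the sketch below. `startLagMonitor` and its option names are hypothetical, and the 100 ms threshold is just the figure quoted in this thread.

```javascript
const { performance } = require('node:perf_hooks');

// Reports how late a repeating timer fires compared to its schedule.
// A timer that fires late means something held the event loop.
function startLagMonitor({ intervalMs = 50, thresholdMs = 100, onLag = console.warn } = {}) {
  let last = performance.now();
  const timer = setInterval(() => {
    const now = performance.now();
    const lag = now - last - intervalMs; // time beyond the expected interval
    last = now;
    if (lag > thresholdMs) {
      onLag(lag);
    }
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring
  return () => clearInterval(timer);
}
```

This only shows that the loop was blocked and for roughly how long, not which code blocked it; finding the culprit still needs a CPU profile or tracing. Node also ships `monitorEventLoopDelay` in `node:perf_hooks`, which records similar delays into a histogram.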
gcau a year ago

You misunderstand. CPU hogging causes event loop delay (lag), they are "taming" that by fixing that cpu-hogging code. There is no implication nodejs itself is the cause of the lag.

dexwiz a year ago

Says event loop in the title, but the real culprit is a non-paginated endpoint with a nested loop. Pagination or guard rails are basic things for customer-facing features. Any time you design a service for X items, some will try it with 10x-1000x items. Be ready for that.
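A guard rail of the kind described above can be as small as clamping the page size before any work is done. This is a minimal sketch: `parsePageParams`, the `limit`/`offset` parameter names, and the default values are all assumptions for illustration.

```javascript
// Clamp client-supplied pagination values to a safe range.
function parsePageParams(query, { defaultLimit = 50, maxLimit = 200 } = {}) {
  const requestedLimit = Number.parseInt(query.limit, 10);
  const requestedOffset = Number.parseInt(query.offset, 10);

  // Missing or non-numeric input falls back to defaults.
  const limit = Number.isNaN(requestedLimit)
    ? defaultLimit
    : Math.min(Math.max(requestedLimit, 1), maxLimit);
  const offset = Number.isNaN(requestedOffset) ? 0 : Math.max(requestedOffset, 0);

  return { limit, offset };
}
```

A request asking for 100000 items then gets at most `maxLimit`, so a single call cannot feed an unbounded list into a nested loop.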

corytheboyd a year ago

And every single time a product person will challenge the need for pagination, or any other limits that make your systems actually scalable. Sigh.

williamdclt a year ago

I’m a bit confused by the monitoring described. Event loop lag is insidious because it doesn’t affect only the slow part of your app, it affects everything: one small part of a request takes seconds, making every concurrent request take seconds. Generally, I found that when the event loop is having lag issues, you can’t really trust much of your application monitoring (OTel spans are very long, but it’s actually just waiting for the event loop). How then did you find the root causes of these lag issues?

As an aside, it’s a bit weird to create a span to mark that something happened, OTel events are made for that

Aeolun a year ago

Having a span for your application waiting for some amount of time makes sense though? Are you talking about something else?

jauntywundrkind a year ago

There's an easy but terrible fix for event loop lag. By all means, limit your work, do hard stuff like these folks did. But. If you just want to stop the suffering, all you need to do is yield periodically!

  if (n % 1000 === 0) await require('node:timers/promises').setImmediate()

If you sleep in your async functions, other work can flow. Just unblock the other work.

Node calls this partitioning your work. https://nodejs.org/en/learn/asynchronous-work/dont-block-the...

Maybe don't use this specific library but it's pretty easy to rebuild everyday array iterators (forEach, reduce, map) to automatically yield every n iterations. https://www.npmjs.com/package/nice-loops
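A hand-rolled version of such a yielding iterator could look like the sketch below. `mapWithYield` and `yieldEvery` are hypothetical names and this is not the API of the package linked above; the only real API used is `setImmediate` from `node:timers/promises`.

```javascript
const { setImmediate: yieldToEventLoop } = require('node:timers/promises');

// Like Array.prototype.map, but hands control back to the event loop
// every `yieldEvery` items so other callbacks can run in between.
async function mapWithYield(items, fn, yieldEvery = 1000) {
  const results = new Array(items.length);
  for (let i = 0; i < items.length; i++) {
    results[i] = fn(items[i], i);
    if ((i + 1) % yieldEvery === 0) {
      await yieldToEventLoop();
    }
  }
  return results;
}
```

Usage would be something like `const doubled = await mapWithYield(bigArray, (x) => x * 2);`. The total CPU time stays the same or gets slightly worse; the gain is only that other requests are no longer starved while the loop runs.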

moonlion_eth a year ago

How we turned our amateur code into click bait

cpursley a year ago

tl;dr: should have used Erlang/Elixir as well as our customers.

mplewis a year ago

How does Erlang fix this?
- cpursley a year ago
  
  You can do async right in the runtime that takes advantage of all threads.

sgarland a year ago

[flagged]

hombre_fatal a year ago

I use emojis all the time in logs when I want messages to stand out against a wall of text.
Let's try to evolve beyond disparaging those who have different preferences than you.
- sgarland a year ago
  
  I maintain the systems that handle the logs, and I look at them far more than the people inserting emoji into them. I want to see clear, precise, and concise RFC5424 logs. Why should I try to interpret whether exploding-head emoji is higher or lower in severity than sobbing emoji, when standardized severity levels have been defined?
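For reference, the standardized severity levels mentioned above are the eight numeric codes defined by RFC 5424, combined with a facility into the `<PRI>` value at the start of each line. The sketch below is illustrative: `priority` and `formatSyslogLine` are hypothetical helpers, and the process ID, message ID, and structured data fields are left as the nil value `-`.

```javascript
// RFC 5424 severity codes: lower number means more severe.
const SEVERITY = Object.freeze({
  emergency: 0,
  alert: 1,
  critical: 2,
  error: 3,
  warning: 4,
  notice: 5,
  informational: 6,
  debug: 7,
});

// PRI is facility * 8 + severity.
function priority(facility, severityName) {
  const severity = SEVERITY[severityName];
  if (severity === undefined) {
    throw new RangeError(`unknown severity: ${severityName}`);
  }
  return facility * 8 + severity;
}

// Layout: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID STRUCTURED-DATA MSG
function formatSyslogLine({ facility, severity, timestamp, hostname, appName, message }) {
  return `<${priority(facility, severity)}>1 ${timestamp} ${hostname} ${appName} - - - ${message}`;
}
```

With the facility `local0` (code 16), an `error` line starts with `<131>`, which a log pipeline can filter on without interpreting the message text.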
bluelightning2k a year ago

I absolutely picture Richard Hendricks when reading this comment.
Made me smile.
Aeolun a year ago

I think it's a generational thing? I use way more emoji in chat than anyone older, and I assume the same thing extends to logs.
- sgarland a year ago
  
  I use them in Slack and iMessage, no problem. I’m just massively against seeing them in the terminal, or logs.
  - Aeolun a year ago
    
    Yeah, I’m not a fan of it either, I’m just wondering if maybe the younger generation is.