My First Co-op Experience at NationGraph
How I saw (and almost conquered) the chaos of the Internet
2025-09-09
The Internet is a messy place. Apart from all the cat pictures, it’s full of forgotten or unmaintained websites, vulnerable servers held together by duct tape, and malformed HTML.
This is the story of how I dealt with this mess as part of my first co-op at NationGraph and everything I learned, from UNIX file descriptors to public sector procurement to… traffic light control systems?

A beautiful view of Toronto from the NationGraph office
The Work
I was mainly responsible for creating systems that gather and normalize huge amounts of data related to the SLED market (State, Local, and EDucation) in the United States. Spoiler alert: my infrastructure ended up aggregating about 23TB of data weekly (a little more than 3TB every day). After going through a whole lot of post-processing, this data would be used directly to create insights for customers, helping them sell to the government.
Technically, this task was challenging (in the positive sense). I needed a way to methodically discover and learn from hundreds of thousands of different websites, each with a completely arbitrary structure. The saving grace was, unsurprisingly, AI agents. In theory, with a little prompt magic and good pre-processing, a sufficiently good large language model should be able to extract the information you want from a web page.
But, uh oh, what’s that? Stuff breaks? Websites are messy? AI is expensive? Sounds like trouble…
In the end, a lot of my work was oriented towards solving those three problems. The main goals for my systems were, in this order:
- Don’t break. No matter if a server freezes, slows down, goes down, changes, disappears, bans you, lies, serves you 3KB or 300GB of data, don’t break. No memory leaks, no OOM errors, no hangs, no filling up the entire disk. Prepare for everything to happen — because everything will happen.
- Gather as much (good) data as possible. You can gather data from 40% of all target websites? Good, make it 80%. You made it to 80%? Great, make it 95%.
- Don’t drain the bank. We had a saying that goes: “Everything you want is just one more LLM call away”. It’s easy to solve everything by throwing an AI call at it, but, at this scale, it’s expensive. How can we avoid large recurring costs by potentially accepting a higher initial cost? Where can we replace expensive LLM calls with heuristics or something else in-house?
Toronto Office
One word: foooooood.

Steak frites for lunch, provided by the company. What more could you ask for?
The Lessons
The number of things I learned during this internship is immeasurable, and I feel as though I’m slowly maturing as a budding software engineer.
Although my lessons are numerous, I will keep it to only technical lessons here, since it’s my blog and I get to write whatever I want.
Inlined Code
“The fly-by-wire flight software for the Saab Gripen (a lightweight fighter) went a step further. It disallowed both subroutine calls and backward branches, except for the one at the bottom of the main loop. Control flow went forward only. […] No bug has ever been found in the ‘released for flight’ versions of that code.”
— Henry Spencer, as cited by John Carmack
Ever since hearing this anecdote in Eskil Steenberg’s great talk about C programming style, where he advocates for long, procedural functions for their readability and simplicity, I’ve been more self-conscious about the call depth of the software I write.
During my internship, I was able to exercise and embrace this programming principle for the first time (although not the extreme, “no subroutine calls allowed” version), and I found it quite appropriate. It made me question each level of abstraction I created (and, as it turns out, a lot of them are unnecessary). Focusing on keeping code procedural felt very natural given that I was essentially building a data pipeline, and it helped me clean up my projects, which, in turn, made my offboarding smoother.
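To illustrate the style (this is a toy, not the actual pipeline): a page-processing step written procedurally reads top to bottom, with every stage visible in one place instead of hidden behind a stack of small helpers.

```python
import re

def process_page(raw_html: bytes) -> list[str]:
    """One flat, procedural pass over a fetched page.

    Each stage is a few inline lines rather than its own subroutine,
    so the whole transformation can be read top to bottom."""
    # Stage 1: decode, tolerating the malformed encodings real servers send.
    text = raw_html.decode("utf-8", errors="replace")

    # Stage 2: crude tag stripping (a real pipeline would use an HTML parser).
    text = re.sub(r"<[^>]+>", " ", text)

    # Stage 3: split into lines and drop the empty ones.
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line]

    # Stage 4: keep only lines long enough to plausibly be content.
    return [line for line in lines if len(line) >= 3]
```

Because control flow only goes forward, there is exactly one place to look when a stage misbehaves — which matters a lot when the input is the open Internet.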

Some beautiful flowers from my time in Toronto, because I write too much about technical stuff
Other Lessons
Here are some other random things I learned in no particular order:
- Web servers lie. They can put pretty much whatever they want in HTTP response headers, for example. Yes, this happened to me.
- Open socket connections count as file descriptors, and there’s a limit on how many file descriptors can exist at any given time. Yes, this also happened to me.
- The City of Toronto operates 2534 traffic lights. They run on three incompatible control systems: TransSuite, SCOOT, and SCATS. I fell into this rabbit hole while investigating Open Data Portals as a source of data, where I found this dataset from the City of Toronto.
- CPython caches all integers from -5 to 256. For example, all instances of the number 42 within a CPython runtime refer to the same 42. Here’s a funny result of this (entered line by line in the interactive interpreter — inside a single script, constant folding can make the 257 case return True as well):
a = 256
b = 256
a is b # >> True — both names point at the one cached 256 object
a = 257
b = 257
a is b # >> False — 257 is outside the cache, so each line builds a new object
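On the file-descriptor point above: the limit is easy to inspect — and, up to the hard cap, raise — from within a process. A small sketch using the standard-library `resource` module (Unix only):

```python
import resource

# Every open socket, file, and pipe consumes one descriptor; once the soft
# limit is hit, further connections fail with "Too many open files" (EMFILE).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# An unprivileged process may raise its own soft limit up to the hard limit.
if soft < hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

For a crawler holding thousands of sockets open at once, checking this at startup beats discovering it from a cryptic EMFILE mid-run.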
Conclusion
I had a great time at NationGraph. I worked with some of the smartest people I know in an office with a great work culture (arcade bar Thursday nights, anyone?). I leave now feeling like a more mature human and developer.
Now, off to a study term at uni!