<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Alastair Rushworth]]></title><description><![CDATA[Alastair Rushworth]]></description><link>https://newsletter.alastairrushworth.com</link><image><url>https://newsletter.alastairrushworth.com/img/substack.png</url><title>Alastair Rushworth</title><link>https://newsletter.alastairrushworth.com</link></image><generator>Substack</generator><lastBuildDate>Mon, 04 May 2026 05:49:45 GMT</lastBuildDate><atom:link href="https://newsletter.alastairrushworth.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Alastair Rushworth]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[alastairrushworth@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[alastairrushworth@substack.com]]></itunes:email><itunes:name><![CDATA[Alastair Rushworth]]></itunes:name></itunes:owner><itunes:author><![CDATA[Alastair Rushworth]]></itunes:author><googleplay:owner><![CDATA[alastairrushworth@substack.com]]></googleplay:owner><googleplay:email><![CDATA[alastairrushworth@substack.com]]></googleplay:email><googleplay:author><![CDATA[Alastair Rushworth]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[How much AI do we need, really?]]></title><description><![CDATA[It&#8217;s been said that if all AI development paused and went no further than it has right now, there are still years of innovation and disruption and growth to come just by integrating what we have into products.]]></description><link>https://newsletter.alastairrushworth.com/p/how-much-ai-do-we-need-really</link><guid 
isPermaLink="false">https://newsletter.alastairrushworth.com/p/how-much-ai-do-we-need-really</guid><dc:creator><![CDATA[Alastair Rushworth]]></dc:creator><pubDate>Sat, 13 Dec 2025 09:37:58 GMT</pubDate><content:encoded><![CDATA[<p>It&#8217;s been said that if all AI development paused and went no further than it has right now, there are still years of innovation and disruption and growth to come just by integrating what we have into products. The way the frontier labs and their investors are spending implies that the payoff of developing AI to some notional AGI end state is far greater than whatever value they could get in the shorter term by focussing more on product. Yes, I know ChatGPT is a product, and maybe it will become the operating system of everything in the future. But the bet being made doesn&#8217;t seem to be that, based on where labs are spending their money.</p><p>What I think is wrong about all of this AGI stuff is that most problems that actually need solving - and that someone will pay you to solve with a product - require a fixed amount of intelligence. Sometimes that&#8217;s low, sometimes that&#8217;s high. But there&#8217;s going to be a point past which developing the AI further doesn&#8217;t matter, because you can already do the thing automatically, after which it&#8217;s really about cost and convenience.</p><p>I don&#8217;t think this is just a lack of imagination on my part on how different the AGI future might end up being. Humans in 10 years will still need food, homes, recreation, financial products and access to information. All of these things currently operate in a clunky analogue world and doubtless will change a lot with AI, but none of these things are 4d Vulcan chess problems either. There will be a point past which the benefit from AI enhancement will be mostly saturated.</p><p>The problem here is that the labs don&#8217;t really have any durable advantage on the technology. 
Yes, they&#8217;re well funded and make splashy breakthroughs, but typically an open weight model will match their performance within 6-12 months. This is great for the product people who are building things, because you just need to focus on building and wait for the cost of the intelligence level you need to be competed away in the cash furnace.</p><p>If you agree with this thesis and are wondering why this isn&#8217;t being discussed more, I think the reason is simply that a lot of the next generation of products are still being built, and we don&#8217;t yet know what level of intelligence is really needed. It may be that for many of them we already have it; for others, maybe it&#8217;s a year or two away. But the point is that these thresholds are real, and they make it harder to justify the investment case for continuing to scale AI training.</p>]]></content:encoded></item><item><title><![CDATA[Custom RSS feeds]]></title><description><![CDATA[I&#8217;ve been building blaze.email for a couple of years now.]]></description><link>https://newsletter.alastairrushworth.com/p/custom-rss-feeds</link><guid isPermaLink="false">https://newsletter.alastairrushworth.com/p/custom-rss-feeds</guid><dc:creator><![CDATA[Alastair Rushworth]]></dc:creator><pubDate>Thu, 02 Oct 2025 12:16:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!zWVw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebe3522-c15c-4b59-beb9-78bd4a62b67c_1447x1039.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve been building <a href="https://blaze.email?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=custom-rss-feeds">blaze.email</a> for a couple of years now. Basically it&#8217;s an email digest that curates links to interesting tech blog posts from the past week, across a few different popular topics (gen AI, obvs). Part of doing this requires automatically parsing, categorising and creating text embeddings for each post so they can be screened for relevance.
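</p><p>As a rough sketch (this isn&#8217;t the actual pipeline code, and all the names and the threshold are made up), screening posts for relevance with embeddings boils down to comparing each post&#8217;s embedding against a topic vector and keeping the close matches:</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def screen_posts(post_embeddings, topic_centroid, threshold=0.75):
    """Return indices of posts whose embedding is close enough to the topic."""
    return [i for i, emb in enumerate(post_embeddings)
            if cosine_similarity(emb, topic_centroid) >= threshold]

topic = [1.0, 0.0, 0.0]           # stand-in for a topic centroid embedding
posts = [[0.9, 0.1, 0.0],         # close to the topic
         [0.0, 1.0, 0.0]]         # unrelated
print(screen_posts(posts, topic))  # -> [0]
```

<p>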
I&#8217;ve indexed about 1.3M blog posts over time and have been looking for ways to make all of this more useful, so I started by creating a basic blog search engine at <a href="https://blognerd.app?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=custom-rss-feeds">blognerd.app</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zWVw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebe3522-c15c-4b59-beb9-78bd4a62b67c_1447x1039.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zWVw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebe3522-c15c-4b59-beb9-78bd4a62b67c_1447x1039.png 424w, https://substackcdn.com/image/fetch/$s_!zWVw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebe3522-c15c-4b59-beb9-78bd4a62b67c_1447x1039.png 848w, https://substackcdn.com/image/fetch/$s_!zWVw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebe3522-c15c-4b59-beb9-78bd4a62b67c_1447x1039.png 1272w, https://substackcdn.com/image/fetch/$s_!zWVw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebe3522-c15c-4b59-beb9-78bd4a62b67c_1447x1039.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zWVw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebe3522-c15c-4b59-beb9-78bd4a62b67c_1447x1039.png" width="1447" height="1039" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ebe3522-c15c-4b59-beb9-78bd4a62b67c_1447x1039.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1039,&quot;width&quot;:1447,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:186341,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://alastairrushworth.substack.com/i/175098609?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400edcda-962b-480a-917c-179412502a85_1447x1039.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zWVw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebe3522-c15c-4b59-beb9-78bd4a62b67c_1447x1039.png 424w, https://substackcdn.com/image/fetch/$s_!zWVw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebe3522-c15c-4b59-beb9-78bd4a62b67c_1447x1039.png 848w, https://substackcdn.com/image/fetch/$s_!zWVw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebe3522-c15c-4b59-beb9-78bd4a62b67c_1447x1039.png 1272w, https://substackcdn.com/image/fetch/$s_!zWVw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebe3522-c15c-4b59-beb9-78bd4a62b67c_1447x1039.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Answering all the important questions</figcaption></figure></div><p>So far so standard, quite a few places already let you do this. A few distinguishing features are that you can</p><ul><li><p>search similar posts or similar blogs by clicking the grey pills in the results</p></li><li><p>turn any content search into an RSS feed</p></li><li><p>search RSS feeds directly and find similar authors</p></li></ul><p><strong>Custom RSS</strong></p><p>The big new thing here is the ability to customise feeds. It&#8217;s somewhat inspired by <a href="https://en.wikipedia.org/wiki/Yahoo_Pipes?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=custom-rss-feeds">Yahoo! pipes</a> which among other things let you mash together feeds from the web using a graphical interface. 
Here&#8217;s how it looks on <a href="https://blognerd.app?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=custom-rss-feeds">blognerd.app</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PTHe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3721a25-7034-4590-b7ce-ff2a59915398_2870x1604.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PTHe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3721a25-7034-4590-b7ce-ff2a59915398_2870x1604.png 424w, https://substackcdn.com/image/fetch/$s_!PTHe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3721a25-7034-4590-b7ce-ff2a59915398_2870x1604.png 848w, https://substackcdn.com/image/fetch/$s_!PTHe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3721a25-7034-4590-b7ce-ff2a59915398_2870x1604.png 1272w, https://substackcdn.com/image/fetch/$s_!PTHe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3721a25-7034-4590-b7ce-ff2a59915398_2870x1604.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PTHe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3721a25-7034-4590-b7ce-ff2a59915398_2870x1604.png" width="1456" height="814" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3721a25-7034-4590-b7ce-ff2a59915398_2870x1604.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:505124,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://alastairrushworth.substack.com/i/175098609?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3721a25-7034-4590-b7ce-ff2a59915398_2870x1604.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PTHe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3721a25-7034-4590-b7ce-ff2a59915398_2870x1604.png 424w, https://substackcdn.com/image/fetch/$s_!PTHe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3721a25-7034-4590-b7ce-ff2a59915398_2870x1604.png 848w, https://substackcdn.com/image/fetch/$s_!PTHe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3721a25-7034-4590-b7ce-ff2a59915398_2870x1604.png 1272w, https://substackcdn.com/image/fetch/$s_!PTHe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3721a25-7034-4590-b7ce-ff2a59915398_2870x1604.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It&#8217;s pretty simple. You draw a flowchart by dragging nodes onto the canvas. Every flowchart needs to start with at least one &#8216;Input&#8217;, which specifies a search query, and each flow needs to end with a single RSS output. You can add keyword filter steps to add things you&#8217;re interested in or remove things you aren&#8217;t. I think turning searches into feeds is powerful on its own, but I really like the ability to tweak RSS feeds and add as many different topic sources as you like.</p><p>I&#8217;m leaving it completely open while I develop it, as I&#8217;m really keen to get any ideas or feedback on it.
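</p><p>To make the flowchart model concrete, here&#8217;s a toy sketch of what the nodes do under the hood - hypothetical helper names, not blognerd&#8217;s actual code: inputs are merged, keyword filters keep or drop entries, and whatever survives becomes the output feed:</p>

```python
def merge_inputs(*entry_lists):
    """Combine several input feeds, de-duplicating entries by link."""
    seen, merged = set(), []
    for entries in entry_lists:
        for e in entries:
            if e["link"] not in seen:
                seen.add(e["link"])
                merged.append(e)
    return merged

def keyword_filter(entries, include=None, exclude=None):
    """Keep entries whose title mentions any 'include' term and none
    of the 'exclude' terms (case-insensitive)."""
    out = []
    for e in entries:
        title = e["title"].lower()
        if include and not any(k.lower() in title for k in include):
            continue
        if exclude and any(k.lower() in title for k in exclude):
            continue
        out.append(e)
    return out

# Two 'Input' nodes feeding one filter node, then the RSS output.
gen_ai = [{"title": "Fine-tuning LLMs", "link": "a"},
          {"title": "Prompt marketing tricks", "link": "b"}]
rust = [{"title": "Async Rust patterns", "link": "c"}]

feed = keyword_filter(merge_inputs(gen_ai, rust), exclude=["marketing"])
print([e["title"] for e in feed])  # -> ['Fine-tuning LLMs', 'Async Rust patterns']
```

<p>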
Please give me a shout if you have any thoughts - you can find me on <a href="https://fosstodon.org/@alastairmrushworth?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=custom-rss-feeds">Mastodon</a> or <a href="https://bsky.app/profile/alastairrushworth.bsky.social?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=custom-rss-feeds">Bluesky</a>.</p>]]></content:encoded></item><item><title><![CDATA[Learning in the open, GPT5, AI news]]></title><description><![CDATA[In September I&#8217;ll be going freelance, working in the AI space.]]></description><link>https://newsletter.alastairrushworth.com/p/learning-in-the-open-gpt5-ai-news</link><guid isPermaLink="false">https://newsletter.alastairrushworth.com/p/learning-in-the-open-gpt5-ai-news</guid><dc:creator><![CDATA[Alastair Rushworth]]></dc:creator><pubDate>Fri, 15 Aug 2025 14:26:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RgvY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b94ff1-5db9-4645-8b1b-157c3a965773_1332x1044.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In September I&#8217;ll be going freelance, working in the AI space. Although I&#8217;ve kind of been working in the general area of traditional ML and data science for some years, there&#8217;s a lot to catch up on, so I&#8217;m taking some time to upskill, read and think. I&#8217;m pretty excited about this: learning is something I enjoy, though I never quite get the time to do enough of it. Which is why I&#8217;m starting this substack - more of a scratchpad than a traditional newsletter, where I&#8217;m going to try to capture interesting things that I learn as I learn them. It&#8217;s also a way to log things and stay accountable. I expect it will be raw and unpolished, which in a world of increasingly polished and beige AI prose I think is a good thing - bad writing is the new good, sort of.
My aim is to write authentically and honestly, primarily for myself, but I hope the struggle and journey resonate with others!</p><h4><strong>GPT-5 release</strong></h4><p>GPT-5 was released on Thursday 7th August - OpenAI&#8217;s official <a href="https://openai.com/index/introducing-gpt-5/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=learning-in-the-open-gpt5-ai-news">announcement posts don&#8217;t say too much</a>. Simon Willison&#8217;s write-up <a href="https://simonwillison.net/2025/Aug/7/gpt-5/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=learning-in-the-open-gpt5-ai-news">&#8220;GPT-5: Key characteristics, pricing and model card&#8221;</a> is a great place to start. The biggest changes seem to be a simplified range of models (naming was out of hand before with o3, o3-mini, 4o, 4.1-nano etc.), and <a href="https://platform.openai.com/docs/pricing?latest-pricing=standard&amp;utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=learning-in-the-open-gpt5-ai-news">very keen pricing on input tokens</a>. GPT-5 is half the price of GPT-4o, and gpt-5-nano is 5c / M input tokens (!).
Not new to this release, but often overlooked: batch pricing for asynchronous, non-time-sensitive tasks <a href="https://platform.openai.com/docs/pricing?latest-pricing=batch&amp;utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=learning-in-the-open-gpt5-ai-news">is even cheaper</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RgvY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b94ff1-5db9-4645-8b1b-157c3a965773_1332x1044.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RgvY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b94ff1-5db9-4645-8b1b-157c3a965773_1332x1044.png 424w, https://substackcdn.com/image/fetch/$s_!RgvY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b94ff1-5db9-4645-8b1b-157c3a965773_1332x1044.png 848w, https://substackcdn.com/image/fetch/$s_!RgvY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b94ff1-5db9-4645-8b1b-157c3a965773_1332x1044.png 1272w, https://substackcdn.com/image/fetch/$s_!RgvY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b94ff1-5db9-4645-8b1b-157c3a965773_1332x1044.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RgvY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b94ff1-5db9-4645-8b1b-157c3a965773_1332x1044.png" width="380" height="297.8378378378378"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8b94ff1-5db9-4645-8b1b-157c3a965773_1332x1044.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1044,&quot;width&quot;:1332,&quot;resizeWidth&quot;:380,&quot;bytes&quot;:125107,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://alastairrushworth.substack.com/i/170664207?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b94ff1-5db9-4645-8b1b-157c3a965773_1332x1044.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RgvY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b94ff1-5db9-4645-8b1b-157c3a965773_1332x1044.png 424w, https://substackcdn.com/image/fetch/$s_!RgvY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b94ff1-5db9-4645-8b1b-157c3a965773_1332x1044.png 848w, https://substackcdn.com/image/fetch/$s_!RgvY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b94ff1-5db9-4645-8b1b-157c3a965773_1332x1044.png 1272w, https://substackcdn.com/image/fetch/$s_!RgvY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b94ff1-5db9-4645-8b1b-157c3a965773_1332x1044.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There&#8217;s sort of a catch with the pricing - by default these are all reasoning models, and the number of output reasoning tokens can be sizable. I flipped my gpt-4o-mini summarisation processes over to gpt-5-nano, and noticed a huge uptick in output tokens, which more than wiped out the per token savings and resulted in a doubling of costs! Updating the OpenAI python package and adding reasoning={"effort": "minimal"} seemed to make things more comparable to the non-reasoning equivalents.</p><p>Although the changes are great, and lower costs are awesome, it&#8217;s not the step up in capability or performance I&#8217;d expected with a major version bump. 
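</p><p>For reference, the effort cap mentioned above looks something like this with the Responses API in the OpenAI python package - the model choice and prompt here are just placeholders, not my actual summarisation job:</p>

```python
import os

# Capping reasoning effort keeps output-token counts (and therefore
# cost) much closer to the old non-reasoning models.
params = {
    "model": "gpt-5-nano",
    "reasoning": {"effort": "minimal"},
    "input": "Summarise this post in two sentences: ...",
}

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # requires a recent openai package
    client = OpenAI()
    response = client.responses.create(**params)
    print(response.output_text)
```

<p>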
The Register commented that <a href="https://www.theregister.com/2025/08/13/gpt_5_cost_cutting/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=learning-in-the-open-gpt5-ai-news">OpenAI's GPT-5 looks less like AI evolution and more like cost cutting</a> - I&#8217;m not sure I&#8217;d go this far, but I wonder if it does point to a potential slowing in what can be achieved under the current training paradigm.</p><h4>Other AI news</h4><p><a href="https://docs.anthropic.com/en/docs/build-with-claude/context-windows?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=learning-in-the-open-gpt5-ai-news#1m-token-context-window">Claude 4 1M token context in beta</a> - initially for tier 4 users and more costly per token, it seems aimed at packing more context into code agents for larger repos. I hope we see this rolling out in Claude Code / your favourite agent soon. As a side note, it got me wondering about the large context <a href="https://cloud.google.com/blog/products/ai-machine-learning/the-needle-in-the-haystack-test-and-how-gemini-pro-solves-it?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=learning-in-the-open-gpt5-ai-news">needle-in-a-haystack</a> problem, which may now be essentially solved, but perhaps doesn&#8217;t get discussed as much as it should.</p><p><a href="https://github.blog/news-insights/company-news/goodbye-github/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=learning-in-the-open-gpt5-ai-news">Thomas Dohmke resigned as GitHub CEO 'to become a founder again'</a>. I thought this was interesting; Dohmke always has insightful things to say about the future of software development, even though GitHub Copilot has been lagging a bit lately on code agents.
It&#8217;ll be interesting to see both how this pans out for GitHub as it <a href="https://www.theverge.com/news/757461/microsoft-github-thomas-dohmke-resignation-coreai-team-transition?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=learning-in-the-open-gpt5-ai-news">moves closer to its parent Microsoft</a> and for whatever Dohmke decides to build next.</p><p><a href="https://developers.googleblog.com/en/introducing-gemma-3-270m/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=learning-in-the-open-gpt5-ai-news">DeepMind releases Gemma 3 270M</a>, a very small and very capable open weight model. I&#8217;ve been meaning to do something interesting with a small model for a while; this one might have come along at the right time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QPXN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febf15384-0c46-420d-a109-8aa41b08e9f0_4001x2251.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QPXN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febf15384-0c46-420d-a109-8aa41b08e9f0_4001x2251.jpeg 424w, https://substackcdn.com/image/fetch/$s_!QPXN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febf15384-0c46-420d-a109-8aa41b08e9f0_4001x2251.jpeg 848w, https://substackcdn.com/image/fetch/$s_!QPXN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febf15384-0c46-420d-a109-8aa41b08e9f0_4001x2251.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!QPXN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febf15384-0c46-420d-a109-8aa41b08e9f0_4001x2251.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QPXN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febf15384-0c46-420d-a109-8aa41b08e9f0_4001x2251.jpeg" width="576" height="324" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ebf15384-0c46-420d-a109-8aa41b08e9f0_4001x2251.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:576,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Gemma 3 270M&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Gemma 3 270M" title="Gemma 3 270M" srcset="https://substackcdn.com/image/fetch/$s_!QPXN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febf15384-0c46-420d-a109-8aa41b08e9f0_4001x2251.jpeg 424w, https://substackcdn.com/image/fetch/$s_!QPXN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febf15384-0c46-420d-a109-8aa41b08e9f0_4001x2251.jpeg 848w, https://substackcdn.com/image/fetch/$s_!QPXN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febf15384-0c46-420d-a109-8aa41b08e9f0_4001x2251.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!QPXN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febf15384-0c46-420d-a109-8aa41b08e9f0_4001x2251.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h4>Other things I read this week</h4><p><a href="https://zed.dev/blog/why-llms-cant-build-software?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=learning-in-the-open-gpt5-ai-news">Why LLMs can't build software</a> by Conrad Irwin. 
I loved the succinct framing of building software as the difference between two mental models: one of how the software currently works and what it does, and one of what it <em>should</em> do. For LLMs this is essentially a problem of context gathering and management, which is currently limited by context windows (see above about Claude&#8217;s recent change) and product design. Something that LLM products don&#8217;t do but should is ask more questions after a user prompt, to clarify their understanding of what you&#8217;re asking. Something I added to my Claude preferences some time ago was a request that it should ask for clarification when required.</p><p><a href="https://www.seangoedecke.com/model-on-a-mbp/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=learning-in-the-open-gpt5-ai-news">What's the strongest AI model you can train on a laptop in five minutes?</a> is a terrific article that provides some interesting intuitions about the relationship between corpus complexity, model architecture and the number of tokens you can train per second. Probably the best thing I read this week. 
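As a rough illustration of the kind of intuition involved (this is my own back-of-envelope, not numbers from the article): a common rule of thumb is that training a transformer costs roughly 6 FLOPs per parameter per token, which caps how many tokens a five-minute laptop budget can cover.

```python
# Back-of-envelope only: all numbers below are illustrative assumptions,
# not benchmarks from the linked article.
params = 2_000_000       # a tiny 2M-parameter model
laptop_flops = 5e12      # assume ~5 TFLOP/s of usable laptop compute
seconds = 5 * 60         # the five-minute budget

# Rule of thumb: training costs ~6 FLOPs per parameter per token
flops_per_token = 6 * params
tokens = laptop_flops * seconds / flops_per_token
print(f"~{tokens:.1e} trainable tokens")
```

Under these assumptions that works out to the order of 10^8 tokens, which is why corpus choice and architecture matter so much at this scale.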
&#128076;</p><p>A fun <a href="https://github.com/anthropics/claude-code/issues/3382?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=learning-in-the-open-gpt5-ai-news">github issue about Claude&#8217;s &#8216;you&#8217;re absolutely right!&#8217;</a> sycophancy.</p>]]></content:encoded></item><item><title><![CDATA[On using code copilots in data science]]></title><description><![CDATA[I&#8217;ve lost track of how long I&#8217;ve been using Github&#8217;s Copilot via the VSCode extension but it&#8217;s well over 18 months at this point.]]></description><link>https://newsletter.alastairrushworth.com/p/on-using-code-copilots-in-data-science</link><guid isPermaLink="false">https://newsletter.alastairrushworth.com/p/on-using-code-copilots-in-data-science</guid><dc:creator><![CDATA[Alastair Rushworth]]></dc:creator><pubDate>Sat, 29 Jun 2024 00:00:00 GMT</pubDate><content:encoded><![CDATA[<p>I&#8217;ve lost track of how long I&#8217;ve been using <a href="https://github.com/features/copilot?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=on-using-code-copilots-in-data-science">Github&#8217;s Copilot</a> via the <a href="https://marketplace.visualstudio.com/items?itemName=GitHub.copilot&amp;utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=on-using-code-copilots-in-data-science">VSCode extension</a> but it&#8217;s well over 18 months at this point. 
It&#8217;s been absolutely game-changing for my productivity and enjoyment of writing code, but I&#8217;m often surprised to find that other data scientists haven&#8217;t tried it (or an equivalent tool like <a href="https://www.qodo.ai/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=on-using-code-copilots-in-data-science">codium</a>, <a href="https://www.tabnine.com/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=on-using-code-copilots-in-data-science">tabnine</a> and <a href="https://www.cursor.com/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=on-using-code-copilots-in-data-science">cursor.sh</a>). I&#8217;m writing this to explain why I think it&#8217;s so important, and why it should be considered an essential part of a data scientist&#8217;s toolkit. If you don&#8217;t use a code copilot tool, I hope this provides a perspective on what you might be missing. If you do already, hopefully it resonates with your experiences.</p><h2>Writing code is slow</h2><p>The physical act of typing is an order of magnitude slower than the time it takes to decide what you intend to write. While typing, you need to remember how to use tools correctly to execute your intent (<em>how does pandas <code>.pivot()</code> work again?</em>), probably google some syntax (&#128128; <em>Stack Overflow</em>) and fix bugs or mistakes (<em>ugh, forgot to <code>.reset_index</code></em>). There&#8217;s also exploratory work and data analysis, iterating and refactoring (<em>that didn&#8217;t work, let&#8217;s try something else</em>), docstrings, comments, editing YAML files, yada yada. To say nothing of starting a new project, firing up your favourite modules and writing a first implementation of something.</p><p>All of this is just slow, compared to the time it takes to imagine what you are going to do. Ok, I know some of you can touch type, have black-belts in vim shortcuts and retain code docs in your photographic memories. 
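To make the <code>.pivot()</code> / <code>.reset_index()</code> dance concrete, here is a minimal sketch (toy data and column names invented for illustration):

```python
import pandas as pd

# Toy long-format data, invented for illustration
df = pd.DataFrame({
    "month": ["2024-01", "2024-01", "2024-02", "2024-02"],
    "metric": ["sales", "cost", "sales", "cost"],
    "value": [100, 60, 120, 70],
})

# pivot(): spread the 'metric' values into columns, one row per month
wide = df.pivot(index="month", columns="metric", values="value")

# ...and the classic follow-up fix: bring 'month' back as a column
wide = wide.reset_index()
```

This is exactly the kind of low-stakes, high-frequency boilerplate that a copilot autocompletes well.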
But a majority of data science development time is spent punching out pretty standard code, and happily, this is exactly the type of code that copilots are excellent at anticipating with very good suggestions.</p><p>A slight philosophical segue. I think there&#8217;s an important distinction to be made between <em>typing code</em>, which is something you do with your fingers (and your web browser), and <em>realising ideas with software</em>, which is something you mostly do with your brain. I&#8217;m not sure you can do one without the other, and the distinction isn&#8217;t totally crisp &#8212; but bear with me.</p><h2>Code is cheap, thinking is expensive</h2><p>Something we don&#8217;t talk about enough is just how much of the code we write ends up being thrown away. Projects change and sometimes die, ideas evolve and so does our code. But many (many) written lines of development, debugging, and EDA code are ephemeral and never even seen by another person.</p><p>It&#8217;s a mistake to think of discarded code as waste; it&#8217;s an essential part of development. As many authors have observed, <a href="https://alistapart.com/article/writing-is-thinking/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=on-using-code-copilots-in-data-science">writing is thinking</a>, and writing code is no different. This is most obvious to me in the case of <a href="https://medium.com/@alastairmrushworth/exploratory-data-analysis-whats-the-point-56c73d33ec73?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=on-using-code-copilots-in-data-science">EDA, where we learn by coding, thinking and iterating</a>, and much of this code doesn&#8217;t see the light of day. 
There&#8217;s more in the linked article, but I don&#8217;t think of EDA as an &#8216;analysis stage&#8217;, as it&#8217;s often framed, but more as a process of thinking and investigation.</p><p>Anyways, finished code is a digital manifestation of an idea, and writing the code is a form of thinking that leads to that manifestation. There&#8217;s a ton of value in the thinking part &#8212; in general, more thinking should result in better ideas and finished software. Time is always a regularising factor on how much of this type of work can take place. Copilots act as an accelerant and multiplier that makes delivering better ideas easier, faster and maybe even delightful.</p><h2>So what gives?</h2><p>Why the resistance to this type of tool? There are a few reasons I&#8217;ve observed. The main one, I think, is that it&#8217;s simply passed a lot of people by &#8212; the last 2 years have passed in a haze of loud AI hype and chatbots. During the same period, code copilots have gone from being interesting, cool toys to something completely game-changing. You only need to scan some of the comments on Hacker News from when GitHub Copilot first launched to get a sense of what a step change it was.</p><p>Another reasonable objection is the risk of wrong / hallucinated code inadvertently getting pushed and causing issues. To anyone worried about this, I suggest trying one of the major copilots out for a while; the risk is much lower than one might imagine. The workflow prevents this to an extent &#8212; copilots are more like very smart autocomplete, where you have full control over whether a code suggestion is accepted or not. 
It&#8217;s not at all like blindly copy-pasting large code generations from ChatGPT (though I believe this also has a place, but that&#8217;s for another article).</p><p>Speaking from personal experience, I used to be a bit precious about my code, and definitely felt defensive about the idea of copilots when I first started playing with <a href="https://www.tabnine.com/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=on-using-code-copilots-in-data-science">tabnine</a> (which has been around longer than most). I think this attitude is fairly common, and it&#8217;s probably reinforced by the emphasis many companies put on squeaky clean SWE practices when hiring. The myth of the 10x engineer polyglot who can write the full stack and do deep specialist development definitely doesn&#8217;t help either. There&#8217;s probably a lot more to it than that, but I hope you know what I&#8217;m talking about. I think all of these cultural threads add up to a kind of jealousy over our hard-won skills that results in a reflexive rejection of tools that might displace them. I sometimes need to remind myself that over sufficiently long time scales, much of our knowledge of syntax will be made redundant anyways &#8212; in 10 years&#8217; time, I expect that half of the modules I routinely use now will have changed or been updated beyond recognition. The bottom line is that it&#8217;s good to care about code, but don&#8217;t let it get in the way of trying new ways to do it.</p><h2>Wrap up: give it a go</h2><p>My best advice would be to try out a copilot. I really like GitHub&#8217;s, because it integrates seamlessly with VSCode and it really hasn&#8217;t missed a beat since I first subscribed. 
It&#8217;s absolutely the easiest $10 I spend each month.</p>]]></content:encoded></item><item><title><![CDATA[Why finding good tech blogs is hard]]></title><description><![CDATA[The internet is a very big place and discovering good things is still an unsolved problem.]]></description><link>https://newsletter.alastairrushworth.com/p/why-finding-good-tech-blogs-is-hard</link><guid isPermaLink="false">https://newsletter.alastairrushworth.com/p/why-finding-good-tech-blogs-is-hard</guid><dc:creator><![CDATA[Alastair Rushworth]]></dc:creator><pubDate>Fri, 21 Jun 2024 00:00:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a5954b0e-ca63-4b67-bfaa-3e908e9cd6cd_1400x383.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The internet is a very big place and discovering good things is still an unsolved problem. Some of the best writing lives in the fringes of the navigable internet, in independent blogs that are difficult to surface unless you already know what you are looking for. I&#8217;ve spent a lot of time working on this topic, and believe it&#8217;s not obvious why the services that purport to solve this problem don&#8217;t (and likely never will).</p><p>Firstly, an important point of clarification: the title of this post is misleading, because I don&#8217;t think we should ever attempt to draw a sharp line between &#8216;good&#8217; and &#8216;bad&#8217; content, or to create a service to serve up the &#8216;best&#8217; content. I think of <em>quality</em> as being a statistic that&#8217;s only defined over a <em>distribution</em> of content (in the statistical sense of achieving some content diversity over some unseen axis). 
I&#8217;ll expand on this point by explaining a few ways in which the content we end up reading is inevitably drawn from a very specific type of distribution that&#8217;s far from ideal.</p><h2>On filters</h2><blockquote><p>In order to read a thing on the internet, a decision was made that you should see that thing and not something else.</p></blockquote><p>It&#8217;s important to recognise that wherever we find content, on social media, news sites or search engines, someone or something decided you should see it. Maybe you made the decision because you rolled your own tool. Maybe an algorithm did it. Maybe you read something that was trending on Hacker News. It&#8217;s not intrinsically a bad thing, and doesn&#8217;t always imply intent to manipulate you. The internet is simply too big for this not to be the case. But it&#8217;s crucial to realise that we never have unfettered access to all of the internet&#8217;s content, or even an unbiased sample of any subset of it, and this has consequences.</p><p>All of that might seem a bit obvious, but it feels necessary to assert before we go into detail about how these filters express themselves in different venues. What I try to do in the following sections is give an opinionated (but hopefully relatable) perspective on the experience of discovering content across a few popular services, and hopefully shed a little light on an old problem.</p><p>I think a simple way to think of the problem is as a Venn diagram showing relevance and quality as separate, but overlapping traits. 
Again, I don&#8217;t mean to imply that there&#8217;s an objective way of crisply measuring the quality of a single piece of content; this is just a simplified way of thinking about content in the aggregate.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!obsx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32133e04-b5f1-43c8-8d4b-a7d1ff4834a2_1400x383.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!obsx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32133e04-b5f1-43c8-8d4b-a7d1ff4834a2_1400x383.jpeg 424w, https://substackcdn.com/image/fetch/$s_!obsx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32133e04-b5f1-43c8-8d4b-a7d1ff4834a2_1400x383.jpeg 848w, https://substackcdn.com/image/fetch/$s_!obsx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32133e04-b5f1-43c8-8d4b-a7d1ff4834a2_1400x383.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!obsx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32133e04-b5f1-43c8-8d4b-a7d1ff4834a2_1400x383.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!obsx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32133e04-b5f1-43c8-8d4b-a7d1ff4834a2_1400x383.jpeg" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/32133e04-b5f1-43c8-8d4b-a7d1ff4834a2_1400x383.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;drawing&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="drawing" title="drawing" srcset="https://substackcdn.com/image/fetch/$s_!obsx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32133e04-b5f1-43c8-8d4b-a7d1ff4834a2_1400x383.jpeg 424w, https://substackcdn.com/image/fetch/$s_!obsx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32133e04-b5f1-43c8-8d4b-a7d1ff4834a2_1400x383.jpeg 848w, https://substackcdn.com/image/fetch/$s_!obsx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32133e04-b5f1-43c8-8d4b-a7d1ff4834a2_1400x383.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!obsx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32133e04-b5f1-43c8-8d4b-a7d1ff4834a2_1400x383.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">A simple way to think about the performance of content recommendation.</figcaption></figure></div><p>By <strong>relevance</strong> I mean the set of content that is associated with your interests. On the other side is quality &#8212; the set of content that is good or worth reading. 
<strong>Quality</strong> is tricky to define, but it&#8217;s easy to describe what it isn&#8217;t: SEO blog posts, bland listicles, AI content farming etc. This is the low-quality stuff we&#8217;d rather avoid.</p><p>Our aim is to find as much content as possible that is both high-quality and relevant. Of course, every person will have a slightly different Venn diagram, particularly in the relevance set, but I think it&#8217;s also true that what might be quality for me may not be for you.</p><p>I&#8217;ve also added a relevance &#8216;halo&#8217; to represent the fuzzy set of content that might not be quite as relevant to your interests, but that, if of sufficient quality, you&#8217;d still read. This is one of the most important parts of this diagram in my opinion; I&#8217;ve put it there to highlight a core problem in most recommendation systems: it&#8217;s implicitly assumed that personalisation is more important than all else. I think this is too simplistic &#8212; for example, if something is of good enough quality (or important enough), then I don&#8217;t care whether it&#8217;s relevant. In other words, content is king. The breadth of each person&#8217;s halo varies, but you get the idea &#8212; we don&#8217;t always just want more of the same.</p><h2>Search engines</h2><p>Let&#8217;s get the obvious out of the way. Google and Bing are the only real players, and as custodians of indexes of the entire internet, are in principle well placed to serve content from any niche. However, we know already that this doesn&#8217;t work out in practice, at all. 
<a href="https://arstechnica.com/gadgets/2024/01/google-search-is-losing-the-fight-with-seo-spam-study-says/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=why-finding-good-tech-blogs-is-hard">Google has been struggling with SEO spam in recent years</a> and this struggle is being compounded by the <a href="https://futurism.com/ai-garbage-destroying-google-results?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=why-finding-good-tech-blogs-is-hard">rise of AI content</a>. The bottom line is that getting relevant content from a search engine is easy, getting good quality requires a lot of patience. Moving on&#8230;</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fTBw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19829a24-c971-41d6-a2a7-1e81ff160814_1400x521.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fTBw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19829a24-c971-41d6-a2a7-1e81ff160814_1400x521.jpeg 424w, https://substackcdn.com/image/fetch/$s_!fTBw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19829a24-c971-41d6-a2a7-1e81ff160814_1400x521.jpeg 848w, https://substackcdn.com/image/fetch/$s_!fTBw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19829a24-c971-41d6-a2a7-1e81ff160814_1400x521.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!fTBw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19829a24-c971-41d6-a2a7-1e81ff160814_1400x521.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fTBw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19829a24-c971-41d6-a2a7-1e81ff160814_1400x521.jpeg" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/19829a24-c971-41d6-a2a7-1e81ff160814_1400x521.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;drawing&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="drawing" title="drawing" srcset="https://substackcdn.com/image/fetch/$s_!fTBw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19829a24-c971-41d6-a2a7-1e81ff160814_1400x521.jpeg 424w, https://substackcdn.com/image/fetch/$s_!fTBw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19829a24-c971-41d6-a2a7-1e81ff160814_1400x521.jpeg 848w, https://substackcdn.com/image/fetch/$s_!fTBw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19829a24-c971-41d6-a2a7-1e81ff160814_1400x521.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!fTBw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19829a24-c971-41d6-a2a7-1e81ff160814_1400x521.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Search engines: huge breadth, but the average quality in search results is incredibly 
low.</figcaption></figure></div><h2>Social media</h2><p>I&#8217;m not here to criticise social media; <a href="https://www.jaronlanier.com/tenarguments.html?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=why-finding-good-tech-blogs-is-hard">that&#8217;s been very well covered by others</a>. They are excellent tools for communication and for communities to organise and interact. It&#8217;s unavoidable that the financial objectives of social media companies result in incentive structures and consequent behaviours that do not maximise utility and well-being for users. To be specific, there are three particular limitations on the <em>experienced</em> user content diet.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RgI8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2cb3336-7f27-4e13-aabd-379ceec3d609_1400x549.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RgI8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2cb3336-7f27-4e13-aabd-379ceec3d609_1400x549.jpeg 424w, https://substackcdn.com/image/fetch/$s_!RgI8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2cb3336-7f27-4e13-aabd-379ceec3d609_1400x549.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RgI8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2cb3336-7f27-4e13-aabd-379ceec3d609_1400x549.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!RgI8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2cb3336-7f27-4e13-aabd-379ceec3d609_1400x549.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RgI8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2cb3336-7f27-4e13-aabd-379ceec3d609_1400x549.jpeg" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2cb3336-7f27-4e13-aabd-379ceec3d609_1400x549.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;drawing&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="drawing" title="drawing" srcset="https://substackcdn.com/image/fetch/$s_!RgI8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2cb3336-7f27-4e13-aabd-379ceec3d609_1400x549.jpeg 424w, https://substackcdn.com/image/fetch/$s_!RgI8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2cb3336-7f27-4e13-aabd-379ceec3d609_1400x549.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RgI8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2cb3336-7f27-4e13-aabd-379ceec3d609_1400x549.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!RgI8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2cb3336-7f27-4e13-aabd-379ceec3d609_1400x549.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Social media: Highly personalised, with a sprinkling of gems, but variety and coverage are low.</figcaption></figure></div><p><strong>Problem 1: The popular masks the niche.</strong> What content you do discover is likely already popular with people similar to you. Almost by definition, niche content isn&#8217;t going to be popular enough to propagate over the network, and so most of what you could discover will be missed simply because it is niche.</p><p><strong>Problem 2: User content is finite.</strong> In order for others to discover something, someone else must first share it. The internet is a very big place, and on any given social site, a lot of content that could be shared likely isn&#8217;t being shared there to begin with.</p><p><strong>Problem 3: Personalisation is a trap.</strong> Part of the joy of discovery is finding something in a new area, on a challenging topic or from vibrant new authors. Statistically, such posts might look to &#8216;the algorithm&#8217; like they are outside your preferences and a worse bet for recommendation. It&#8217;s obvious why this makes sense for the social media company &#8212; if they attempted to serve greater diversity, they&#8217;d risk lower average satisfaction with recommended content, and a degradation in their headline engagement metrics.</p><h2>RSS readers</h2><p>Curating a flow of content via RSS feeds has long been a go-to for power users. 
For those that don&#8217;t know, you can subscribe to <a href="https://en.wikipedia.org/wiki/RSS?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=why-finding-good-tech-blogs-is-hard">almost any blog via an RSS feed</a> which updates when new posts are published. Typically you&#8217;d use a client like <a href="https://feedly.com/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=why-finding-good-tech-blogs-is-hard">feedly</a> to manage your feeds and read posts. This allows you to keep up to date with any number of blogs you like to read.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ko0i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f30a1a5-cd77-40b4-9094-ed323342d237_1400x554.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ko0i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f30a1a5-cd77-40b4-9094-ed323342d237_1400x554.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ko0i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f30a1a5-cd77-40b4-9094-ed323342d237_1400x554.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ko0i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f30a1a5-cd77-40b4-9094-ed323342d237_1400x554.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ko0i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f30a1a5-cd77-40b4-9094-ed323342d237_1400x554.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ko0i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f30a1a5-cd77-40b4-9094-ed323342d237_1400x554.jpeg" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f30a1a5-cd77-40b4-9094-ed323342d237_1400x554.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;drawing&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="drawing" title="drawing" srcset="https://substackcdn.com/image/fetch/$s_!ko0i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f30a1a5-cd77-40b4-9094-ed323342d237_1400x554.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ko0i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f30a1a5-cd77-40b4-9094-ed323342d237_1400x554.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ko0i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f30a1a5-cd77-40b4-9094-ed323342d237_1400x554.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ko0i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f30a1a5-cd77-40b4-9094-ed323342d237_1400x554.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>With RSS readers, you won&#8217;t ever need to read any spam or low quality content (unless a site you 
subscribe to publishes some), and you can ensure everything is relevant by being selective with the feeds you choose. Sounds great, but there are a couple of big drawbacks.</p><p><strong>Problem 1. You have to do discovery yourself.</strong> Clients like Feedly really don&#8217;t help much with deciding <em>what</em> to subscribe to. It&#8217;s fine if you just want to follow a few major news sites; you can find those quickly. It&#8217;s much harder if you want to cover a broad swathe of independent writers and to discover new ones.</p><p><strong>Problem 2. Your feed reader is an unwieldy firehose with lots of irrelevant content.</strong> A creator might have a number of topics they like to publish articles on, but maybe you are interested in only one of them. If you subscribe to lots of blogs, you quickly end up with an explosion of articles that you need to screen by scrolling through manually.</p><h2>Honourable mentions</h2><p>There are some alternative search engines that tackle this problem head-on, and I&#8217;d be remiss not to mention them. <a href="https://help.kagi.com/kagi/company/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=why-finding-good-tech-blogs-is-hard">Kagi</a> is a subscription search engine service that works a bit like Google but has a stronger emphasis on higher quality, small web content. I&#8217;m a Kagi subscriber and I&#8217;ve found it refreshing and often more efficient than using Google. <a href="https://exa.ai/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=why-finding-good-tech-blogs-is-hard">Exa.ai</a> uses embeddings to search the index. Most interesting to me personally is <a href="https://search.marginalia.nu/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=why-finding-good-tech-blogs-is-hard">marginalia.nu</a> which is free and specifically focusses on non-commercial, independent content. 
(Self-plug alert&#8230;) I&#8217;ve been working on <a href="https://blaze.email/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=why-finding-good-tech-blogs-is-hard">blaze.email</a> for over a year now, which offers a search engine and automated newsletter digests for tech content.</p><p>Each of these offers improvements over the bigger players, though I don&#8217;t believe any truly solves the quality problem (yet).</p><h2>A final thought: an analogy with food</h2><p>I&#8217;m certain someone thought of this before I did, but there&#8217;s a useful analogy to be made between the information we consume online and the food we eat. It&#8217;s almost a clich&#233; that good health comes from a <em>balanced and varied diet</em>. Something very similar applies to our information diet &#8212; we require a level of diversity in what we read &#8212; diversity interpreted in the broadest sense of variety and heterogeneity.</p><p>What this might mean practically is that an ideal feed might appear less &#8216;palatable&#8217; than the type built on engagement on a social media site, including articles that are longer and more challenging. In my view, the palatability is mostly a UI problem for an enterprising content company to solve. The UIs of most social sites are extremely basic, which is something they get away with by inflaming users with content that is designed to hit the brain stem with some reptile energy. But can we imagine a site, interface or application that rewards thinking critically and consuming from a broader range of outlets? 
Yeah, of course it&#8217;s possible.</p><p>That&#8217;s not to say you shouldn&#8217;t enjoy the occasional shitpost, so long as it&#8217;s consumed responsibly within a balanced and varied diet.</p>]]></content:encoded></item><item><title><![CDATA[Exploratory Data Analysis: what’s the point?]]></title><description><![CDATA[Exploratory data analysis or EDA is one of the most important but difficult to codify parts of the data science toolkit.]]></description><link>https://newsletter.alastairrushworth.com/p/exploratory-data-analysis-whats-the-point</link><guid isPermaLink="false">https://newsletter.alastairrushworth.com/p/exploratory-data-analysis-whats-the-point</guid><dc:creator><![CDATA[Alastair Rushworth]]></dc:creator><pubDate>Tue, 12 May 2020 00:00:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/72f35c3d-52e5-4598-9982-64d11eeff847_1024x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Exploratory data analysis or EDA is one of the most important but difficult to codify parts of the data science toolkit. True exploratory analysis is without a sharply definable objective and evades being formalised into a set of clear steps. Despite this, EDA is used in at least a few very typical ways that connect to downstream tasks like data cleaning and hypothesis generation. But perhaps most importantly, it&#8217;s an integral part of how we learn to frame our thinking as data scientists. 
This post attempts to offer some perspective on the less-discussed ways in which EDA develops our contextual understanding of a data analysis.</p><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aONG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5445ef3c-9825-437e-a7ca-cbf073865f42_1024x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aONG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5445ef3c-9825-437e-a7ca-cbf073865f42_1024x768.png 424w, https://substackcdn.com/image/fetch/$s_!aONG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5445ef3c-9825-437e-a7ca-cbf073865f42_1024x768.png 848w, https://substackcdn.com/image/fetch/$s_!aONG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5445ef3c-9825-437e-a7ca-cbf073865f42_1024x768.png 1272w, https://substackcdn.com/image/fetch/$s_!aONG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5445ef3c-9825-437e-a7ca-cbf073865f42_1024x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aONG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5445ef3c-9825-437e-a7ca-cbf073865f42_1024x768.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5445ef3c-9825-437e-a7ca-cbf073865f42_1024x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;drawing&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="drawing" title="drawing" srcset="https://substackcdn.com/image/fetch/$s_!aONG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5445ef3c-9825-437e-a7ca-cbf073865f42_1024x768.png 424w, https://substackcdn.com/image/fetch/$s_!aONG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5445ef3c-9825-437e-a7ca-cbf073865f42_1024x768.png 848w, https://substackcdn.com/image/fetch/$s_!aONG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5445ef3c-9825-437e-a7ca-cbf073865f42_1024x768.png 1272w, https://substackcdn.com/image/fetch/$s_!aONG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5445ef3c-9825-437e-a7ca-cbf073865f42_1024x768.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><h2>EDA for checking, validation and cleaning</h2><p>Let&#8217;s get the obvious stuff out of the way first. Where a rough analysis plan is already in place, and some data has been assembled to support the analysis, a type of EDA serves to identify potential issues that might require remedial work before progressing. 
This is probably the most common type of exploratory analysis and is more closely linked to the goals of data cleaning than pure analysis and insight. This is a big topic and I won&#8217;t attempt an exhaustive list here, but instead will describe a few of the most common tasks.</p><p>The most common check is for the correctness of column types. Depending on the data source, different issues might arise here, but you&#8217;ll be familiar with at least some of these. Integers incorrectly encoded as strings, dates encoded as strings, unordered categories encoded as integers. Sometimes a column that should be numeric has the very occasional string entry. There are as many causes as there are issues: perhaps you didn&#8217;t specify the correct schema when you read the data; or the data are encoded in an ambiguous way that results in an inappropriate type; or maybe some earlier data manipulation induced an unintended problem.</p><p>We often check the prevalence of missing values and their dependence on other important features &#8212; usually because a lot of analysis methods do not handle missing values natively. Some columns may be totally unusable if they are mostly missing. Remedies here might include dropping or transforming columns, imputing missing values, or choosing an algorithm that handles missingness out of the box.</p><p>Distribution, shift and relevance: it is important to inspect the distribution of values in each column &#8212; and consider whether these look how we&#8217;d expect (where we have an expectation). Do the distributions covary, especially with time (data are almost never consistent with stationarity with respect to time)? Thinking about distributional shift is crucial for making decisions around which window of data is most important or relevant for addressing a specific question. 
It might expose or confirm trends and temporal patterns that downstream analysis needs to be aware of.</p><p>Measuring pairwise association provides some basic insights into how columns covary and might help reveal columns that are collinear or even identical and could therefore be removed without detriment. It might help uncover some of the overall structure in the data or indicate collections of related columns. Pairwise association measures, like Pearson correlation coefficients, are overused in this context and are limited to providing only a linear and unconditional view of pairwise association. Nevertheless, a lot of insight can be gleaned from this type of analysis if you know what to look for.</p><p>These types of techniques provide a first look at the data and answer important questions about quality, formatting and overall dependence structure. These steps can usually be carried out by the data analyst without any external support, and are generally well supported with easy-to-use code wrappers. These are absolutely essential steps and it&#8217;s possible to learn quite a lot about the data by applying them and thinking carefully about the results. But it&#8217;s very important to recognise that there is a limit to how much can be understood with this type of analysis. 
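</p><p>As a minimal sketch of the checks above (the example frame and its column names are entirely made up for illustration), the type, missingness and association checks might look something like this in pandas:</p>

```python
import pandas as pd

# Hypothetical example frame; in practice this would be your own data.
df = pd.DataFrame({
    "age": ["34", "51", "not recorded", "29"],  # numeric values stored as strings
    "signup_date": ["2020-01-03", "2020-02-11", None, "2020-03-05"],
    "spend": [120.5, 80.0, None, 64.2],
})

# 1. Column types: 'age' should be numeric and 'signup_date' a date.
print(df.dtypes)

# Coerce types, turning unparseable entries into NaN/NaT instead of failing.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# 2. Prevalence of missing values in each column.
print(df.isna().mean())

# 3. Pairwise (linear, unconditional) association between numeric columns.
print(df.select_dtypes("number").corr())
```

<p>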
There&#8217;s a lot more to EDA.</p><h2>Grokking the data with EDA</h2><blockquote><p>When you claim to &#8220;grok&#8221; some knowledge or technique, you are asserting that you have not merely learned it in a detached instrumental way but that it has become part of you, part of your identity.</p></blockquote><p><a href="http://www.catb.org/~esr/jargon/html/G/grok.html?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=exploratory-data-analysis-whats-the-point">&#8216;grok&#8217;, </a><em><a href="http://www.catb.org/~esr/jargon/html/G/grok.html?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=exploratory-data-analysis-whats-the-point">The Jargon File</a></em></p><p>I&#8217;ve totally made up the heading, but I think it&#8217;s by far the most important role of EDA and is mostly what the rest of this post is about. There is a sort of myth of the data analyst as a robotic processor of data, who is detached and passive. The reality is completely the opposite: better data analysis will always come from an analyst with a deep understanding of the data and the processes that generated it. EDA has a crucial role in turning a data frame from a contextless collection of bytes into a meaningful representation of a physical process, transitioning the analyst from the passive processor to an expert with a deeply internalised understanding of an area. This end state is intangible and qualitative because it happens completely in your own head. Consequently, this part of the EDA will be a creative and personal journey that is supported by a continuing internal conversation that probes and revisits your understanding of the broader context.</p><h2>Building a data narrative</h2><p>The data frame you have in front of you for analysis is an incomplete and encoded representation of some real world process. Part of your role as an analyst is to solve problems and generate insights that respect the story of how the data were generated. 
For want of a better description, let&#8217;s call this story the data narrative. Part of this narrative might be the sequencing of events that lead to each data record coming into existence, part of it might be the data&#8217;s lineage in terms of the processing, joins and wrangling required to produce the data frame you end up with. If you are already an expert in the area you are working in, this narrative may already be ingrained in you. The data narrative completely frames the work you do, how you interpret every insight or modelled output, and most importantly, the credibility with which you can influence your audience.</p><p>The data narrative is a complex form of metadata and is almost never part of the data frame. If you are fortunate, your organisation might keep clear and accessible documentation and data dictionaries that will be a huge first step to piecing together this narrative. However, it is often more typical that analysts are neither domain experts nor well-provided with nice documentation. In this case, the narrative is something that must be synthesised through detective work, drawing on a combination of data analysis and the experience of domain experts. This is, of course, much easier said than done.</p><h2>The role of asking questions</h2><blockquote><p>Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. 
When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.</p></blockquote><p><em><a href="https://r4ds.had.co.nz/exploratory-data-analysis.html?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=exploratory-data-analysis-whats-the-point">R for Data Science</a></em><a href="https://r4ds.had.co.nz/exploratory-data-analysis.html?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=exploratory-data-analysis-whats-the-point">, Hadley Wickham (2021)</a></p><p>EDA can&#8217;t happen in a computational vacuum. To do this well, we need to be alternating between interrogating the data and asking ourselves if what we find is consistent with our internal understanding of the data&#8217;s narrative. <em>Does what I see make sense to me? Would I feel comfortable explaining it to someone else?</em></p><p>In large organisations, you might also be speaking with a domain expert to help with this (if that person isn&#8217;t you), though they needn&#8217;t be an internal expert if the data come from outside the organisation. If you don&#8217;t yet have direct access to such a person, demand that you do &#8212; this person will supercharge your eventual analysis and will often be the difference between success and failure of the entire project. In the beginning, take lots of time to let experts talk more broadly about the data, as they understand all of the salient dependencies, anomalies and gotchas that will save you a lot of time in the long run. Take time to use simple data analysis to carefully confirm what you&#8217;re told. 
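</p><p>As a quick, hypothetical illustration (both the frame and the expert&#8217;s claim are invented for the example), confirming what you&#8217;re told can be as simple as a one-line aggregation:</p>

```python
import pandas as pd

# Suppose a domain expert tells you that refunds cluster at weekends.
# A simple groupby lets you see whether the data express that claim.
orders = pd.DataFrame({
    "weekday": ["Mon", "Sat", "Sun", "Tue", "Sat"],
    "refunded": [False, True, True, False, False],
})
orders["is_weekend"] = orders["weekday"].isin(["Sat", "Sun"])

# Refund rate on weekends vs weekdays.
print(orders.groupby("is_weekend")["refunded"].mean())
```

<p>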
The key here is not to check for correctness, but to grow your understanding of the data: it&#8217;s important to remember that it&#8217;s one thing to be told something about the data narrative, but it&#8217;s much more meaningful to use your own analysis to see it expressed in the data.</p><p>As your understanding deepens and the analysis progresses, you&#8217;ll continue to find new patterns and structure in the data. Keep revisiting your understanding of the data narrative, and check whether what you are seeing is consistent with that. As your understanding of the data narrative matures, the gaps will come into focus: consider creative ways to use the data or ask a relevant question to close the gap. The relationship between internalised data narrative and data exploration is a two-way street.</p><p>Take time to talk your findings over with another data scientist. The key here is to aim to communicate your understanding of the data narrative without getting too mired in the technical details of the data. The process of preparing a narrative that you can explain to a colleague will help to consolidate what you&#8217;ve learned and quickly expose gaps. A fresh set of eyes will nearly always raise further questions or force you to think of your data from a different perspective.</p><h2>Do we even have the right data?</h2><p>An important byproduct of the process of building a better data narrative is that your understanding of which questions are most important or relevant will improve. A crucial question to keep revisiting is whether the data you have is sufficient to address the most important questions. Are there additional data sources that you could draw upon to enrich or improve the analysis? Are the columns you already have in your data frame defined correctly, or should they really be specified differently? 
It&#8217;s typical that data sets are assembled before anyone knows exactly how the data will be used, and it can pay dividends to constantly revisit the question of whether the data contains everything needed to answer a particular question. Many problems in data science are much more easily solved by gathering the right data (or more of it) than by using fancier techniques.</p><h2>Data hygiene and data splitting</h2><p>If you frequently fit predictive models, you&#8217;ll be aware of the risks of overfitting and the need to reserve partitions of the data to check that your findings truly generalise to unseen data. The same is true for the iterative types of EDA discussed in this article. The more detailed your analysis is, the higher the risk that insights gleaned in your EDA are false discoveries (aka statistical flukes). It is important that the confirmatory part of your analysis (prediction accuracy measurement or hypothesis testing) occurs on a different piece of data to your EDA.</p><p>A related problem that frequently arises in machine learning projects is where EDA is run as a preliminary step before creating training and test splits. If the result of EDA influences your model choices (it nearly always will if done properly), then you&#8217;ve potentially reduced your test set&#8217;s ability to measure true out-of-sample error. 
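</p><p>A minimal sketch of carving off a holdout before any exploration (pure pandas; the frame here is a hypothetical stand-in for real data):</p>

```python
import pandas as pd

# Toy frame standing in for your real data.
df = pd.DataFrame({"x": range(100), "y": [i % 2 for i in range(100)]})

# Reserve 20% as an untouched test set before any EDA; fix the seed
# so the split is reproducible.
explore_df = df.sample(frac=0.8, random_state=42)
test_df = df.drop(explore_df.index)

# All exploration happens on explore_df; test_df is used only at the
# confirmatory stage (model evaluation or hypothesis testing).
print(len(explore_df), len(test_df))
```

<p>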
So before you do anything, create a hygienic environment for your EDA by splitting your data, so that you don&#8217;t accidentally leak information from your test set into your model.</p><h2>Creativity and the pitfalls of the data frame API</h2><blockquote><p>This was the tendency of jobs to be adapted to tools, rather than adapting tools to jobs.</p></blockquote><p><a href="https://en.wikipedia.org/wiki/Silvan_Tomkins?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=exploratory-data-analysis-whats-the-point">Silvan Tomkins</a>, <em>Computer Simulation of Personality: Frontier of Psychological Theory (1963)</em></p><p>Almost all data analysis now begins with some form of data frame &#8212; a tabular data format with columns of mixed types, where each row is a record. In Python and R, data analysis tooling has coalesced around the data frame object, which has been a huge convenience and productivity boost for the analyst. I wouldn&#8217;t for a second dispute that this has been a positive development, but there is a risk here that EDA, because of the ease and uniformity of use of the tooling, becomes an exercise in applying boilerplate code. This creates a hidden creativity trap where the analysis can become narrowed by the range of uses supported by a particular set of tools. While such tools are extremely powerful when they are genuinely supporting you to develop your understanding of the data narrative, it&#8217;s important to avoid becoming too reliant on any single tool.</p><p>My experience is that it&#8217;s good to have familiarity with tooling at multiple levels of abstraction. Extremely high level interfaces to auto-generate certain types of exploratory analysis are very handy, and big time savers when they provide just what you need. 
However, the majority of EDA is more creative in nature, and becoming expert with data manipulation tools like dplyr and pandas in combination with graphical tools like matplotlib and ggplot2 provides much finer control and fewer restrictions on your creativity.</p><p>The main point here is that exploratory data analysis can&#8217;t and shouldn&#8217;t be automated, because it is a process to support a human (you) in learning, and to do that well, there are few shortcuts.</p><h2>Closing thought: an analogy with critical reading and literary analysis</h2><p>Like all good blog posts, my thinking on EDA began on Twitter. In the process, <a href="https://x.com/kierisi?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=exploratory-data-analysis-whats-the-point">Jesse Mostipak</a> <a href="https://x.com/kierisi/status/1350812374165565440?s=20&amp;utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=exploratory-data-analysis-whats-the-point">made a great point</a> that teaching EDA effectively might share similarities with <a href="https://guides.library.harvard.edu/sixreadinghabits?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=exploratory-data-analysis-whats-the-point">the way students are taught to interrogate literary texts</a>. I&#8217;d never considered EDA this way, but the analogy resonated strongly with me, and much of my thinking in this post owes a lot to being sent off in this direction, &#128591; thanks Jesse! There&#8217;s a lot to unpack in the analogy, and I have no training in critical reading so I can&#8217;t speak with any authority on that subject. 
Nevertheless, it seems that interrogating a text has broad similarities to EDA in the sense that both are driven by the goal of developing a deep understanding.</p><p><a href="https://fs.blog/how-to-read-a-book/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=exploratory-data-analysis-whats-the-point">This article by the Farnam Street blog</a> summarises four levels of critical reading, originally proposed by Mortimer Adler. The final, most analytical form of interrogation, called syntopical or comparative reading, hits on some of the themes I&#8217;ve discussed already:</p><blockquote><p>This task is undertaken by identifying relevant passages, translating the terminology, framing and ordering the questions that need answering, defining the issues, and having a conversation with the responses.</p><p>The goal is not to achieve an overall understanding of any particular book, but rather to understand the subject and develop a deep fluency.</p><p>This is all about identifying and filling in your knowledge gaps.</p></blockquote><p><em><a href="https://fs.blog/how-to-read-a-book/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=exploratory-data-analysis-whats-the-point">Farnam Street blog</a></em><a href="https://fs.blog/how-to-read-a-book/?utm_source=alastair&amp;utm_medium=email&amp;utm_campaign=exploratory-data-analysis-whats-the-point">, How to Read a Book: The Ultimate Guide by Mortimer Adler</a></p><p>Sounds familiar, doesn&#8217;t it? Asking questions (of yourself), contextualising and framing, closing knowledge gaps and achieving fluency are all key parts of a successful EDA. What I&#8217;m most excited about here is that we can draw on the analytical framework of an existing and well-established discipline, as scaffolding to think about how we can make improvements to the way we teach and practise EDA. Again, full credit to Jesse for this idea.</p>]]></content:encoded></item></channel></rss>