How to Ask Programming Questions: The Importance of Defining Data Over Code


Category : General

Introduction

In the previous post Data Driven Code Design, I explained why data is fundamental in programming. Data represents the problem, while code is the solution. In every programming challenge, data comes in, our code transforms it, and then data comes out. This transformation is the essence of problem-solving in programming.

In this post, I will discuss the best way to ask programming questions. This applies not only to asking questions in forums, websites, or Discord servers but also to how you frame questions for yourself when tackling a programming problem. I will also provide a code example that demonstrates a poorly framed question. This example will highlight the importance of clearly defining the problem rather than focusing solely on the solution.

Ultimately, the problem always revolves around data: the data we have and the data we want. The code simply serves as the means to achieve that transformation.

The Wrong Question

When asking programming-related questions, many people focus too much on the solution rather than the problem itself. This is understandable: after spending significant time working on a piece of code, it’s natural to become narrowly focused on its implementation rather than stepping back to analyze the broader issue.

This often leads to an XY problem. In this situation, a person is trying to solve problem X, but they assume that solution Y is the best approach. Instead of asking about X, they ask how to implement Y. If Y is straightforward to implement, there’s no issue. However, if it proves difficult, the focus shifts to solving Y, which may not even be necessary. When a solution becomes too complex, it’s often a sign that we should revisit problem X and explore alternative, simpler solutions.

In this analogy, X represents the data: the initial input and the desired output. Y represents the code, the transformation applied to the data.

Being specific about the code we’ve written, explaining what it does, and modifying it to fit our intended transformation is useful. However, this approach can also narrow our focus. Instead of seeking a solution to the real problem, we end up refining a specific implementation of a possible solution. This shift often results in a question that is less about the problem itself and more about a particular coding approach.

If a solution is difficult to implement, asking how to fix or optimize it is often the wrong question. Even if we provide details about how and where the code will be used, these details mean little without first defining the real problem. The key is to clearly state:

  • Why the code exists
  • What data it takes as input
  • What data it should produce as output

The Right Question

The right question defines the data we have at the start and the data we want at the end. The code we have written is nice to have, because it shows one possible solution we have tried, but it is not the important part.

A good question should always start and end with data: what we had at the beginning, what we got at the end, and what we expected to get. Describing that data as specifically as we can is what makes the question detailed and specific. A detailed description of the code we have written, although useful, still leaves the question abstract and generic, and forces anyone trying to answer it to make educated guesses about the data the code is meant to handle.

This pattern is common in programming forums. Even when a question is detailed about the code, if it lacks information about the data, responses will either be based on assumptions or require clarification with the following standard questions:

  • What is your input?
  • What was the result?
  • What was the expected result?

Or, framed in terms of data:

  • What data was provided as input?
  • What data was produced as output?
  • What data should have been the output?
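As a minimal sketch, the three data questions above can be captured in a small record that travels with the question. The type and member names (QuestionData, IsSolved, Describe) and the sample sorting data are hypothetical, invented purely for illustration:

```csharp
using System;

// Hypothetical helper: each property mirrors one of the three data questions.
public sealed record QuestionData(string Input, string ActualOutput, string ExpectedOutput)
{
    // The problem is solved only when the produced data matches the expected data.
    public bool IsSolved => ActualOutput == ExpectedOutput;

    // Renders the three pieces of data in the order a reader would ask for them.
    public string Describe() =>
        $"Input:    {Input}\nActual:   {ActualOutput}\nExpected: {ExpectedOutput}";
}

public static class QuestionDemo
{
    // Made-up sample: a sort that returned its input unchanged.
    public static QuestionData Sample() =>
        new QuestionData(
            Input: "[3, 1, 2]",
            ActualOutput: "[3, 1, 2]",
            ExpectedOutput: "[1, 2, 3]");
}
```

Calling Console.WriteLine(QuestionDemo.Sample().Describe()) prints the three lines that any answerer would otherwise have to ask for.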

To make this discussion more concrete, I will present an example of a poorly framed question. Although entirely fictional, it illustrates how even a detailed question about code can be incomplete if it fails to describe the data. Without that crucial context, answers will vary widely depending on the assumed input data.

Example of a Bad Question

Let’s suppose someone is wondering whether to perform a particular operation synchronously or asynchronously when their primary concern is speed.

Since this question is quite generic, they clarify that they are not just referring to asynchronous execution but also to running the code in parallel across multiple CPU cores. However, this is still somewhat vague, so they provide a specific code snippet, which is quite simple:

Interlocked.Increment(ref value);

They further elaborate by stating that this will be the only operation running. For those unfamiliar with Interlocked, it is a static class that provides atomic operations on variables shared between threads. This not only guarantees thread safety but also ensures that, in this particular example, the same code executes whether it runs synchronously or in parallel, as Interlocked.Increment maps to an atomic low-level machine operation. The Increment method simply increases a number by one.

At this point, we have all the details about the code being used. However, to eliminate any ambiguity, a full example is still provided. Now, the question becomes: which approach, synchronous or parallel execution, will yield better performance? Which method will execute faster, given the following implementation?

private static void IncrementInterlocked(ref int value)
{ 
    Interlocked.Increment(ref value);
}

This code will be used to increase two numbers by one. Because the primary concern is speed, they ask which of the following performs better: this version, which executes synchronously:

public void InvokeSynchronouslyWithInterlocked()
{
    IncrementInterlocked(ref _value1);
    IncrementInterlocked(ref _value2);
}

or this one, which executes in parallel:

public void InvokeWithInterlocked()
{
    Parallel.Invoke(
        () => IncrementInterlocked(ref _value1),
        () => IncrementInterlocked(ref _value2)
    );
}

The person asking the question even describes the execution conditions in detail: There will be no issue with available CPU cores, the parallel code will always have dedicated cores to run on the target machine, and the synchronous code will execute on a core within the same machine. We are not discussing different hardware; in fact, we are considering an average desktop PC, where we can be certain that if parallel execution is used, there will always be available cores to run the tasks concurrently.

Despite this level of detail, the question is still poorly framed. While the person asking believes they have been as specific and thorough as possible, this is only true for the answer, that is, the provided code. But what about the actual question? What are _value1 and _value2? Without this crucial information, the question remains incomplete.

The Corrected Question

With the details provided so far, any answer would be incorrect:

  • If someone claims the parallel code will be faster, they are wrong.
  • If they claim the parallel code will be slower, they are also wrong.
  • If they claim both approaches will take roughly the same time, they are still wrong.

The truth is, we need empirical evidence to determine the correct answer. The best way to address performance-related questions is through benchmarking, which is far more reliable than theoretical reasoning. However, for a benchmark to provide meaningful results, it must be run with the correct data, because ultimately, data is always the problem we need to solve.

To make the operation long enough to measure reliably, I wrapped the Interlocked.Increment call in a for loop:

private static void IncrementInterlocked(ref int value)
{
    for (int i = 0; i < 100_000_000; i++)
    {
        Interlocked.Increment(ref value);
    }
}

Then I created two benchmarks:

[Benchmark]
public void InvokeSynchronouslyWithInterlocked()
{
    IncrementInterlocked(ref _values[0]);
    IncrementInterlocked(ref _values[Index]);
}
    
[Benchmark]
public void InvokeWithInterlocked()
{
    Parallel.Invoke(
        () => IncrementInterlocked(ref _values[0]),
        () => IncrementInterlocked(ref _values[Index])
    );
}

The _values array is our data, and I have defined it like this for the benchmark:

private readonly int[] _values = new int[100];

The Index parameter lets us run each benchmark twice, with two different values:

[Params(1, 99)] public int Index;

Each benchmark will run twice: once incrementing the first and second elements of the array (Index = 1) and once incrementing the first and last elements (Index = 99).

As you may have guessed by now, the results of these two benchmarks will differ significantly depending on which elements are being modified. When incrementing the first and second elements, the parallel code is approximately 3× slower than the synchronous code. In contrast, when incrementing the first and last elements, the parallel code is about 2× faster.

This demonstrates the importance of data over code, not just in benchmarking, where different inputs are necessary for accurate results, but in every problem where we seek a solution. This phenomenon isn’t exclusive to arrays; it can occur in various scenarios. However, to fully grasp why it happens, we first need to understand the underlying cause. As this post emphasizes, a detailed description of the data (its type, representation, and storage) is crucial.
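For readers who want to try the comparison without a benchmarking library, here is a rough sketch using Stopwatch. It is not the benchmark used for the numbers above and is far less rigorous (no warm-up, no statistics). The class name FalseSharingSketch, the method names, and the reduced iteration count are assumptions made for this illustration:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class FalseSharingSketch
{
    // Assumption: fewer iterations than the 100_000_000 in the post, so the sketch finishes quickly.
    public const int Iterations = 10_000_000;

    private static void IncrementInterlocked(ref int value, int iterations)
    {
        for (int i = 0; i < iterations; i++)
        {
            Interlocked.Increment(ref value);
        }
    }

    // Increments element 0 and element `index` one after the other on the calling thread.
    public static TimeSpan RunSynchronously(int[] values, int index, int iterations)
    {
        var stopwatch = Stopwatch.StartNew();
        IncrementInterlocked(ref values[0], iterations);
        IncrementInterlocked(ref values[index], iterations);
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    // Increments element 0 and element `index` as two actions passed to Parallel.Invoke.
    public static TimeSpan RunInParallel(int[] values, int index, int iterations)
    {
        var stopwatch = Stopwatch.StartNew();
        Parallel.Invoke(
            () => IncrementInterlocked(ref values[0], iterations),
            () => IncrementInterlocked(ref values[index], iterations));
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    // Prints timings for the two data layouts from the post: neighbours (1) and far apart (99).
    public static void Demo()
    {
        foreach (int index in new[] { 1, 99 })
        {
            TimeSpan sync = RunSynchronously(new int[100], index, Iterations);
            TimeSpan parallel = RunInParallel(new int[100], index, Iterations);
            Console.WriteLine(
                $"Index {index}: sync {sync.TotalMilliseconds:F0} ms, parallel {parallel.TotalMilliseconds:F0} ms");
        }
    }
}
```

Call FalseSharingSketch.Demo() from your program's entry point. The absolute numbers will depend on the machine, so treat them only as a rough indication of the direction of the effect.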

Why This Happens

The reason the benchmark results vary based on element positioning lies in how modern hardware transfers data from memory to the cache. For performance reasons, memory is moved to the cache in fixed-size chunks known as cache lines, which are typically 64 bytes on current desktop processors. You can read more about this concept in CPU cache Operation.

When two cores increment two consecutive array elements that reside in the same cache line, they cannot modify that memory simultaneously. Each core must take exclusive ownership of the whole cache line before writing to it, so the line is handed back and forth between the cores on every increment. This effect is commonly called false sharing, and the constant hand-off at the hardware level makes the entire operation slower than simply executing it on a single core.
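The arithmetic behind this can be sketched in a few lines. The helper below is hypothetical (CacheLineMath and its members are names invented for this illustration) and assumes the 64-byte cache line mentioned above. It deliberately makes no assumption about how the array is aligned in memory, so it only answers whether two int elements could share a line:

```csharp
using System;

public static class CacheLineMath
{
    // Assumption: 64-byte cache lines, as described in the text.
    public const int CacheLineSize = 64;

    // Distance in bytes between two elements of an int[].
    public static int ByteDistance(int indexA, int indexB) =>
        Math.Abs(indexA - indexB) * sizeof(int);

    // Elements at least one full cache line apart can never share a line,
    // whatever the array's alignment. Closer elements may share one.
    public static bool CanShareCacheLine(int indexA, int indexB) =>
        ByteDistance(indexA, indexB) < CacheLineSize;
}
```

For the benchmark's data, CanShareCacheLine(0, 1) is true because the elements are only 4 bytes apart, while CanShareCacheLine(0, 99) is false because they are 396 bytes apart.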

This issue is not exclusive to arrays. Arrays are a clear example because they occupy contiguous memory regions, but the same behavior occurs if integers are stored as consecutive fields in a struct, in which case, the parallel code will also be slower:

public struct Numbers
{
    public int First;
    public int Second;
}

However, this issue usually does not occur when the integers are fields in different classes, as they are typically stored in separate memory locations that are more than 64 bytes apart.
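To illustrate how much the representation matters, here is a hypothetical variant of the Numbers struct that keeps both fields in one value but places them 64 bytes apart using an explicit layout. The names PaddedNumbers and PaddedNumbersInfo are invented for this sketch, the 64-byte figure is the cache line size assumed above, and whether this layout actually helps in a given program would still need to be measured:

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical layout: the two fields are 64 bytes apart, so they cannot
// fall into the same 64-byte cache line regardless of alignment.
[StructLayout(LayoutKind.Explicit, Size = 128)]
public struct PaddedNumbers
{
    [FieldOffset(0)]
    public int First;

    [FieldOffset(64)]
    public int Second;
}

public static class PaddedNumbersInfo
{
    // Byte offset of Second from the start of the struct.
    public static int OffsetOfSecond() =>
        Marshal.OffsetOf<PaddedNumbers>(nameof(PaddedNumbers.Second)).ToInt32();

    // Total size of the struct, including the padding.
    public static int SizeInBytes() => Marshal.SizeOf<PaddedNumbers>();
}
```

The trade-off is memory: this struct occupies 128 bytes to hold 8 bytes of data, which is exactly the kind of detail about data that a question should state.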

Conclusion

Does this mean everyone should have deep knowledge of low-level hardware implementations? Of course not. If they did, they wouldn’t need to ask the question in the first place. What it does mean, however, is that everyone should recognize the importance of data when asking the right kind of question.

No amount of detailed code descriptions can compensate for a lack of information about the data. A well-structured question should clearly define:

  • The data being used
  • How the data is represented and stored
  • The expected transformation and desired outcome

Without these details, anyone attempting to answer the question can only make educated guesses about the data. The person asking may get lucky, and someone might guess correctly, but they might not. Instead of leaving things to chance, respect your time. A precise and detailed description of the data you are working with, along with the data you want to produce, will lead to faster, more accurate solutions.

Thank you for reading, and if you have any questions or comments you can use the comments section, or contact me directly via the contact form or by email. Also, if you don’t want to miss any of the new blog posts, you can subscribe to my newsletter or the RSS feed.

