29 Mar 2020

Time series forecasting projects

What are the general steps of a time series forecasting project?

Using a tool such as pandas-profiling, the dataset provided by the client is profiled and a variety of summary statistics are computed. For numerical values, these include the min, mean, median, max, quartiles, number of samples, number of zeros, and number of missing values. Other types of data have their own set of properties computed.

These summary statistics let you get a quick overview of the data. You will want to look for missing values to assess whether there's a problem with the provided data. Sometimes a missing value implies that you should reuse the prior value that was set. Sometimes it means that the data isn't available, which can be an issue and may require some form of data imputation down the road.
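To make the numerical summary concrete, here is a minimal sketch in plain Python. It is not how pandas-profiling works internally; the function name `profile_numeric` and the use of `None` to mark a missing value are assumptions made for illustration.

```python
import statistics

def profile_numeric(values):
    """Summarize a numeric column. None marks a missing value (an assumption)."""
    present = [v for v in values if v is not None]
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(present, n=4)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "zeros": sum(1 for v in present if v == 0),
        "min": min(present),
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.mean(present),
        "max": max(present),
    }

summary = profile_numeric([3, 0, None, 7, 5, None, 0, 9])
print(summary["missing"], summary["zeros"])
```

A high `missing` or `zeros` count relative to `count` is usually the first thing worth raising with whoever provided the data.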

Common things to look for in time series data are gaps in the data (periods where no data has been recorded), the trend/seasonality/residual decomposition of each time series, the autocorrelation and partial autocorrelation plots, the distribution of values grouped by a certain period (by month, by week, by day, by day of the week, by hour), and line/scatter plots of values grouped by the same periods.
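Two of these checks are easy to sketch without any library: finding gaps and computing the autocorrelation at a given lag. The helper names below are hypothetical, and the code assumes the series is supposed to be sampled at a fixed step.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected_step):
    """Return (before, after) pairs whose spacing exceeds expected_step."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > expected_step]

def autocorrelation(values, lag):
    """Sample autocorrelation of values at the given lag (lag >= 0)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    covariance = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance

# Daily data with March 4th and 5th missing.
days = [datetime(2020, 3, d) for d in (1, 2, 3, 6, 7)]
gaps = find_gaps(days, timedelta(days=1))
print(gaps)
```

Plotting `autocorrelation(values, lag)` for a range of lags gives the autocorrelation plot mentioned above; peaks at regular lags hint at seasonality.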

Data is rarely clean and ready to be consumed. Cleaning it can mean many things: removing invalid values, converting invalid or out-of-range values into a valid range, and splitting cells that hold multiple values into separate cells (e.g., "10 cm" split into "10" and "cm").
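As a rough sketch of two of these cleaning steps, the code below splits a cell such as "10 cm" into a number and a unit, and clamps a value into a valid range. The function names and the accepted cell format (a number followed by a unit made of letters) are assumptions; real data will usually need more cases.

```python
import re

def split_value_and_unit(cell):
    """Split a cell such as "10 cm" into (10.0, "cm"); None if it does not match."""
    match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", cell)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)

def clamp(value, low, high):
    """Bring an out-of-range value back into [low, high]."""
    return max(low, min(high, value))

print(split_value_and_unit("10 cm"))
print(clamp(120, 0, 100))
```

Returning `None` for cells that do not match keeps invalid values visible, so you can decide later whether to remove or impute them.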

A variety of transformations can be applied to the cleaned data: data imputation (filling in missing values using the available data), applying a function to the data such as a power, log or square root transform, differencing (computing the difference with the prior value), converting time-zone-aware datetimes to timestamps, etc.
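Each of these transformations is only a few lines. The sketch below uses forward fill as the imputation strategy, which is one option among many, and assumes `None` marks a missing value; all function names are made up for illustration.

```python
import math
from datetime import datetime, timedelta, timezone

def forward_fill(values):
    """Replace each None with the last seen value (leading Nones stay None)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def difference(values):
    """First-order differencing: each value minus the prior one."""
    return [b - a for a, b in zip(values, values[1:])]

def log_transform(values):
    """Natural log transform; only valid for strictly positive values."""
    return [math.log(v) for v in values]

def to_timestamp(dt):
    """Convert a time-zone-aware datetime to seconds since the Unix epoch."""
    return dt.timestamp()

utc_midnight = datetime(2020, 3, 29, tzinfo=timezone.utc)
same_instant = datetime(2020, 3, 29, 2, tzinfo=timezone(timedelta(hours=2)))
print(to_timestamp(utc_midnight) == to_timestamp(same_instant))
```

Note that differencing shortens the series by one value, so anything aligned with it (timestamps, other features) has to drop its first entry too.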

Common feature generation transformations are then applied, such as computing lagged values of variables, moving averages/medians, exponential moving averages, extracting the latest min/max, counting the number of peaks encountered so far, etc. Feature generation is where you create additional information for your model to consume, in the hope that it provides some signal the model can make use of.
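Here is a small sketch of three of these features in plain Python. The function names are hypothetical, and the exponential moving average is seeded with the first value, which is one common convention rather than the only one.

```python
def lagged(values, lag):
    """Value observed `lag` steps earlier; None where no such value exists."""
    return [values[i - lag] if i >= lag else None for i in range(len(values))]

def moving_average(values, window):
    """Trailing mean over the last `window` values; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out

def exponential_moving_average(values, alpha):
    """EMA with smoothing factor alpha in (0, 1], seeded with the first value."""
    out = []
    for v in values:
        out.append(v if not out else alpha * v + (1 - alpha) * out[-1])
    return out

series = [1, 2, 3, 4]
print(lagged(series, 1))
print(moving_average(series, 2))
```

All three only look backwards in time. A feature that uses values from after the moment of prediction would leak the future into the model and make validation results look better than they are.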

Before attempting to find a good model for the problem at hand, you want to start with simple/naive models to establish a baseline. The naive time series model simply uses the latest observed value as its prediction for the future.
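The naive model fits in a few lines. The function name `naive_forecast` and the `horizon` parameter (how many future steps to predict) are assumptions made for this sketch.

```python
def naive_forecast(history, horizon):
    """Predict the next `horizon` values by repeating the latest observed value."""
    if not history:
        raise ValueError("history must contain at least one value")
    return [history[-1]] * horizon

print(naive_forecast([3, 5, 8], horizon=3))
```

Any model you try afterwards should beat this on the same splits, otherwise its added complexity is not paying for itself.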

With a baseline established, you can now run a variety of experiments, which generally means trying different models on the same dataset while evaluating them the same way (same training/validation splits). In time series, we do cross-validation by creating train/validation splits where the validation split (i.e., the samples in the validation set) occurs temporally after the training split. Each cross-validation split represents a different point in time at which the models are trained and evaluated.
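One way to build such splits is an expanding window, where the training set grows at each split and the validation set is the block of samples that follows it. This is a sketch of that one scheme (a sliding window is another option), and the function and parameter names are assumptions. Samples are assumed to be sorted by time.

```python
def expanding_window_splits(n_samples, n_splits, validation_size):
    """Yield (train_indices, validation_indices) with validation after training."""
    for k in range(n_splits):
        validation_end = n_samples - (n_splits - 1 - k) * validation_size
        validation_start = validation_end - validation_size
        if validation_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        train = list(range(validation_start))
        validation = list(range(validation_start, validation_end))
        yield train, validation

splits = list(expanding_window_splits(n_samples=10, n_splits=3, validation_size=2))
for train, validation in splits:
    print(train, validation)
```

Unlike the shuffled k-fold splits used for non-temporal data, no validation sample here is ever older than a training sample, so the model is never evaluated on the past using knowledge of the future.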

After you've completed a few experiments you'll have a variety of results to analyze. You will want to look at your primary performance metric, which is generally defined as an error metric you are trying to minimize. Examples of error metrics are MAE, MSE, RMSE, MAPE, SMAPE, WAPE, MASE. Performance is evaluated on your validation data (out-of-sample) and lets you have an idea of how the model will perform on data it hasn't seen during training, which closely replicates the situation you will encounter in production.
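For reference, here is a plain Python sketch of a few of these metrics using their standard definitions. The function names are mine, and MAPE and WAPE are returned as percentages, which is a convention choice.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the data."""
    return math.sqrt(mse(actual, predicted))

def mape(actual, predicted):
    """Mean absolute percentage error (%); undefined if an actual value is 0."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def wape(actual, predicted):
    """Weighted absolute percentage error (%): total error over total actuals."""
    total_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    return 100 * total_error / sum(abs(a) for a in actual)

actual = [10, 20, 30]
predicted = [12, 18, 33]
print(mae(actual, predicted), rmse(actual, predicted))
```

MAPE breaks down when actual values are zero or close to zero, which is one reason WAPE or MASE are often preferred for series with many small values.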

With the primary metric computed for each model, you can pick the model that produced the lowest error across many cross-validation train/validation splits.
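A minimal sketch of that selection step, assuming you have stored one validation error per split for each model. The model names and error values below are invented for illustration, and averaging across splits is one reasonable way to aggregate (the median is another).

```python
from statistics import mean

def select_model(errors_by_model):
    """Return the name of the model with the lowest mean error across splits."""
    return min(errors_by_model, key=lambda name: mean(errors_by_model[name]))

# Hypothetical validation errors, one value per cross-validation split.
errors = {
    "naive": [4.0, 5.0, 6.0],
    "moving_average": [3.0, 3.5, 4.0],
    "hypothetical_model": [2.0, 6.0, 5.5],
}
best = select_model(errors)
print(best)
```

Here `hypothetical_model` wins on the first split but loses on average, which is why a single split is not enough to choose a model.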

Once the model has been selected, it is packaged to be deployed. This generally means something as simple as pickling the model object and loading it in the remote environment so it can be used to make predictions.
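The round trip looks roughly like this with Python's `pickle` module. To keep the sketch self-contained, the "model" is a hypothetical dictionary of fitted parameters standing in for a real model object, and `predict` is a made-up helper.

```python
import pickle

def predict(model, horizon):
    """Forecast with the stand-in model: repeat its stored last value."""
    return [model["last_value"]] * horizon

# Training side: a stand-in for a fitted model object (an assumption).
model = {"kind": "naive", "last_value": 8}
payload = pickle.dumps(model)  # bytes you would write to a file and ship

# Serving side: load the bytes back and predict.
restored = pickle.loads(payload)
print(predict(restored, horizon=2))
```

With a real model object, the remote environment also needs the same class definitions and compatible library versions installed. Only unpickle files you trust, since loading a pickle can execute arbitrary code.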

There are two modes of forecasting:

  • Offline: Data used for forecasting is collected during a period of time and then a scheduled task uses this newly available data to create new forecasts. This is generally used for systems with large amounts of data where the forecasts are not needed in real-time, such as forecasting tomorrow's stock price, the minimum and maximum temperature, the volume of stocks that will be sold during the week, etc.
  • Online: Data used for forecasting is given to the model and predictions are expected to be returned within a short time frame, on the order of less than a second to a minute.

Raw data is transformed and feature engineered, then given to the model to produce the forecast.

28 Mar 2020

Writing with simple vocabulary

Why should I write using simple, frequently used words?

Using simple language will allow more people to understand your message.

Using simple words to explain ideas that are more involved makes it easier to understand those ideas.

It's also easier to identify errors in reasoning when you're expressing yourself with simple language.

Using rare words does not mean that you are more intelligent or have smart thoughts. It simply means you're trying to conceal yourself by using words others may not understand.

As with writing software, you should aim to keep your writing simple. It makes things easier for readers, who don't have to spend their time working out what you're trying to say.

Writing such articles is very difficult. For example, this article was written using only the 5000 most common words according to Wiktionary.

27 Mar 2020

R&D developer

How is being an R&D developer different from being a developer?

R&D developers are not focused on shipping. While most developers will work as hard as possible to ship whatever they are building to their customers so that they can get paid, R&D developers focus on delivering answers to questions asked by their clients. This focus on intangible deliverables will frustrate many developers.

Because R&D developers focus on answering questions and not on building products, it is very common for the code they write never to make it into production. If it does, it will generally be a catastrophe.

Code quality and maintenance are not considered a priority because code is expected to be abandoned once the questions have been answered and the solution has been proven useful.

R&D, as the name implies, is about quickly finding solutions to problems (research), building a solution (development) and demonstrating the value of the solution. This process is a lot more iterative than building software with (somewhat) clear requirements from the start. Given the novelty of what gets built, it is critical to get feedback early and to act on this feedback. This means that the development horizon (how far ahead things are planned) is very short. As such, you are unlikely to be able to say what you will be working on next month.

Regular development is about applying existing solutions to clients' problems. R&D is about finding those solutions and turning them into mainstream solutions.

26 Mar 2020

Adding habits to your life

How can I effectively and consistently add habits into my life?

I use the Loop Habit Tracker (an Android app) to track any new habit I want to have and keep. Its purpose is two-fold: to remind me through notifications that I need to do something and to let me observe how consistent I am with the habit.

When adding new habits, I've found I was more successful when I created transition habits, that is, when I started with something easily achievable and similar to the habit I want to have, then slowly moved it closer and closer to that habit. For example, say I want to jog 20 minutes daily. Since I've never jogged consistently in the past, I should start with 1 minute instead of 20 and do it consistently. After a week of jogging 1 minute per day, I can increase the habit to 2 minutes. With each week that goes by, the amount of jogging increases while the habit is in its formation phase.

It may take up to 20 weeks to reach 20 minutes consistently every day, but I prefer that to trying to jog 20 minutes right from the start and giving up after a few times because my body is not accustomed to such effort.

The same approach can be applied to mental efforts. If you're not used to spending hours of focused effort on a task, trying to do it right away is likely to be very difficult. But if you slowly transition from doing none of it, to doing a little bit, then more and more, until you reach your target, something that initially appeared impossible becomes manageable.

As you add more and more habits into your life, it may become difficult to keep doing all of them regularly without missing them. That is why an application such as Loop Habit Tracker will help you remember to do the habits you want to have.

25 Mar 2020

Tech lead

Do you need a tech lead in your team?

Let's start with definitions of the tech lead role.

A Tech Lead is a software engineer, responsible for leading a development team, and responsible for the quality of its technical deliverables.

Source: https://www.thekua.com/atwork/2014/11/the-definition-of-a-tech-lead/

  • Guiding the project technical vision;
  • Analyzing risks and cross-functional requirements;
  • Coaching less experienced people;
  • Bridging communication between stakeholders and the team.

Source: http://vvgomes.com/we-dont-need-tech-leads/

  • Lead with company values
  • Deliver value to customers
  • Keep the dream alive

Source: https://hackernoon.com/whats-the-role-of-a-tech-lead-7725b47104b7

I am of the opinion that the distribution of responsibility is likely the best way to get resilience in your system. But with it comes the cost of delays before eventual consistency.

Thus my position is that whether or not to have a tech lead depends on the situation of your team.

Do you need to make quick decisions? Either have a tech lead for that or limit the amount of time allocated for a group of individuals to make decisions.

Do you need accountability? Either have a tech lead who is accountable or have important decisions assessed as a group and the results of each decision written down with the names of those who participated in it.

Do you need to have a technical vision? Either have a tech lead responsible for defining that vision with the team or have the team work as a whole to define this vision.

Tech leads should have a high-level overview of the pieces that need to be built and an idea of how to get there and when. Without a tech lead, this would require coordination between individuals with different opinions about those topics.

I work in AI, and this problem makes me think of having a single model (tech lead) vs an ensemble model (group of contributors). If your single model generally predicts the same thing your ensemble model would predict, then the single model is more efficient. On the other hand, if there's no single model that can perform as well as the ensemble, then you should go with the ensemble model.