- Start the longest tests first. For example, if you have 2 tests, a short one and a long one, the long one is the sequential bottleneck. If you have many additional short tests, they will run in parallel with the long one. With 2 workers, one is dedicated to the long test while the other processes the short tests. The total execution time is then at least the duration of the long test, and it only exceeds it if the short tests together take longer than the long test.
- If we do not schedule the longest tests first, we run the risk of running many short tests on a worker and only then starting the longest test, which delays the end of the run.
- As you accumulate many short tests this issue is mitigated. However, you still need to distribute your longest tests across separate workers as much as possible.
- If you know the exact duration of a test (or an approximation of it, given a desired quantum), determine if you can solve the scheduling problem in a reasonable amount of time such that solving the scheduling and running the scheduled tests takes less time than running the tests.
- From the standpoint of execution, we will prefer to have the best response time possible (i.e., run the shortest tasks first to get most of the tests executed as soon as possible).
- If you have many short and long tests, it may make sense to run all the short tests first and keep all the long tests for the end, ignoring the heuristic that suggests running the longest tests first.
- Another important consideration when running tests is shared environments/resources between tests. If those take a non-negligible amount of time to set up/tear down and could be reused, it may be beneficial to group such tests on the same worker in order to avoid paying the setup/teardown cost more than once.
- Setup/teardown duration should be recorded and associated with each test.
- Some systems have different types of setup/teardown: global, per class/module/scope. Knowing whether the test system will need to re-run a set of setup/teardown functions may help with scheduling the tests.
- While we may be collecting test duration data, it is important to be aware that this data may end up being tainted if used in a continuous integration system. What this means is that we may sometimes record durations that are out of the ordinary due to a bug in the code, or because the tests execute on different hardware.
- To mitigate this issue we may use different approaches, such as using the median of the recorded values (which is more robust to outliers than the mean), using a percentile (such as the 95th) to determine the expected duration, recording the duration of successful tests separately from failed ones, recording tests in separate buckets defined by the user (e.g., based on the CPU spec used to run the test), etc.
- If the number/list of tests hasn't changed since the last run, it would also be beneficial to store the computed schedule so that it is only computed once and reused many times, which is a common use case. The only time you may not want to do this is if computing the schedule is very cheap and represents a negligible amount of time in the overall testing process.
- By default pytest does not store any prior test duration so we would have to estimate that all tests are of equal duration.
- As we run tests, we may start to collect information about the duration of parameterized tests. This may help us determine whether the parameterized test has the same duration over different sets of parameters and serve as a basis for estimating the remaining parameter combinations of this test.
- A system similar to the cache provided by pytest may be used to record the duration of tests.
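
The mitigation ideas above (median, percentile, recording only successful runs, user-defined buckets) can be sketched as a small in-memory store. This is a minimal sketch, not pytest's API: the class name `DurationHistory`, its methods, and the default duration of 1 second are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median, quantiles


class DurationHistory:
    """Hypothetical in-memory store of test durations (in seconds)."""

    def __init__(self, default=1.0):
        # With no history, every test is assumed to take `default` seconds.
        self.default = default
        # (test_id, bucket) -> durations of passing runs only.
        self._samples = defaultdict(list)

    def record(self, test_id, seconds, passed, bucket="default"):
        # Failed runs are skipped so that a bug does not taint the estimate.
        if passed:
            self._samples[(test_id, bucket)].append(seconds)

    def estimate(self, test_id, bucket="default", percentile=None):
        samples = self._samples.get((test_id, bucket), [])
        if not samples:
            return self.default
        if percentile is None or len(samples) < 2:
            # The median is robust to the occasional outlier.
            return median(samples)
        if not 1 <= percentile <= 99:
            raise ValueError("percentile must be between 1 and 99")
        # n=100 yields 99 cut points; index p-1 is the p-th percentile.
        return quantiles(samples, n=100, method="inclusive")[percentile - 1]


history = DurationHistory()
for seconds in [0.9, 1.0, 1.1, 1.2, 30.0]:  # 30.0 is an outlier
    history.record("test_a", seconds, passed=True)
history.record("test_a", 500.0, passed=False)  # ignored: failed run

print(history.estimate("test_a"))                 # median of passing runs
print(history.estimate("test_a", percentile=95))  # pessimistic estimate
print(history.estimate("test_unknown"))           # falls back to the default
```

The bucket key is left as a free-form string (for example a CPU model) so that durations recorded on different hardware are never mixed.
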
- Scheduler: responsible for taking a set of tests with metadata and scheduling them (i.e., determining in which order they will be executed).
- test: the unit of code to execute.
- test duration: the duration of a test. Test duration may be represented as a distribution since tests generally do not complete in an exact and fixed amount of time.
- test unit: a collection of tests that we can treat as the same test (e.g., parameterized tests).
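
As an illustration of the "longest tests first" heuristic from these notes, here is a minimal sketch of a scheduler that always hands the next test to the least loaded worker. The function name `schedule`, its parameters, and the example durations are hypothetical, and the durations are assumed to be known estimates in seconds.

```python
import heapq


def schedule(durations, n_workers, longest_first=True):
    """Greedily assign tests to workers.

    durations: dict mapping test name -> estimated duration in seconds.
    Returns (assignments, makespan), where assignments[w] is the ordered
    list of tests for worker w and makespan is the estimated total time.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    # Min-heap of (accumulated load, worker index).
    heap = [(0.0, worker) for worker in range(n_workers)]
    assignments = [[] for _ in range(n_workers)]
    sign = -1 if longest_first else 1
    # Ties are broken by name so the schedule is deterministic.
    ordered = sorted(durations.items(), key=lambda kv: (sign * kv[1], kv[0]))
    for name, seconds in ordered:
        load, worker = heapq.heappop(heap)  # least loaded worker
        assignments[worker].append(name)
        heapq.heappush(heap, (load + seconds, worker))
    makespan = max(load for load, _ in heap)
    return assignments, makespan


tests = {"long": 10, "s1": 2, "s2": 2, "s3": 2, "s4": 2}
print(schedule(tests, n_workers=2))                       # makespan 10.0
print(schedule(tests, n_workers=2, longest_first=False))  # makespan 14.0
```

With longest-first ordering, one worker runs only `long` while the other runs the four short tests, so the run lasts as long as the long test. With shortest-first ordering, the long test starts after two short ones on the same worker and the run takes 4 seconds longer.
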
- Do you understand what is at stake?
- Do you know what already exists?
- Can you make an informed decision?
- What did you consider during your decision?
- What are your unknowns (things you'd need to know before you can make a decision)?
- Do you have experience with the problem which needs a decision?
- Indicate who has the most expertise to make the decision.
- You should be allowed to cast a vote if the decision will impact you. If it doesn't (i.e., you're not a stakeholder), your vote will not be counted towards any option.
Let's start by saying I'm not suggesting you read a full book per day. What I'm suggesting is to read at least a few pages of a book per day, reading a variety of books over the course of a week.
For a while I used to start a book and finish it before starting another one. I'd allow myself to read a fiction book and a technical book at the same time, but not more than that. The idea was that by reading more than one of each my brain would have trouble with context and information retention.
I've recently decided to switch this approach. The main reason was that I found myself spending too much time reading articles online that I thought didn't bring me much value over time. I always thought books were more valuable, but their biggest problem was that they required a good amount of time investment for the value to kick in.
Just like there are two strategies in learning systems, exploration and exploitation, I decided that leaning more on the exploration side might be more useful. In learning theory, "exploration" refers to trying new things or information, while "exploitation" involves deepening knowledge in known areas. Instead of spending hours on the same book over a short period of time (1-3 months), I would instead read bits of many books at once.
Here are the benefits I've observed through this approach:
It's easier to identify similar sources. I would read a few books on a similar topic, and of course they would all cite the same sources. The difference between processing all those books in parallel instead of sequentially is that you notice the pattern of reuse more clearly. When reading the books sequentially, what happens is that this type of information decays over time. We start to forget what the last book was referring to, so that the next book appears to have new references.
Identifying similar ideas speeds up reading. As you identify the same ideas in different books, instead of reading the arguments carefully in each book, you can read the best argument thoroughly and quickly scan the others for additional information.
You are exposed to more variety. Some people get topic fatigue, which is that you get bored of reading on the same topic. Reading on different topics avoids this issue while also stimulating you to think about many topics. This is a great way to sometimes make connections between unrelated topics.
Overall I've been very satisfied with this experiment and I've been doing it for over 4 months now. I highly recommend it if you have a large list of books you haven't started yet. See my article How to prioritize which book to read to help you organize your reading.
Tesseract (an open source OCR engine) supports a TSV format as output. I looked online for some documentation about the columns but couldn't find anything, so I looked at the source code.
Here is a summary description of each column, what they represent, and the range of valid values they can have.
- level: hierarchical layout (a word is in a line, which is in a paragraph, which is in a block, which is in a page), a value from 1 to 5
- 1: page
- 2: block
- 3: paragraph
- 4: line
- 5: word
- page_num: when provided with a list of images, the number of the file; when provided with a multi-page document, the page number; starting from 1
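- block_num: block number within the page, starting from 1 (0 on page-level rows)
- par_num: paragraph number within the block, starting from 1 (0 on rows above the paragraph level)
- line_num: line number within the paragraph, starting from 1 (0 on rows above the line level)
- word_num: word number within the line, starting from 1 (0 on rows above the word level)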
- left: x coordinate in pixels of the text bounding box top left corner, starting from the left of the image
- top: y coordinate in pixels of the text bounding box top left corner, starting from the top of the image
- width: width of the text bounding box in pixels
- height: height of the text bounding box in pixels
- conf: confidence value, from 0 (no confidence) to 100 (maximum confidence), -1 for all levels except 5
- text: detected text, empty for all levels except 5
Here is an example of the TSV format output, for reference.
| level | page_num | block_num | par_num | line_num | word_num | left | top | width | height | conf | text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1024 | 800 | -1 | |
| 2 | 1 | 1 | 0 | 0 | 0 | 98 | 66 | 821 | 596 | -1 | |
| 3 | 1 | 1 | 1 | 0 | 0 | 98 | 66 | 821 | 596 | -1 | |
| 4 | 1 | 1 | 1 | 1 | 0 | 105 | 66 | 719 | 48 | -1 | |
| 5 | 1 | 1 | 1 | 1 | 1 | 105 | 66 | 74 | 32 | 90 | The |
| 5 | 1 | 1 | 1 | 1 | 2 | 205 | 67 | 143 | 40 | 87 | (quick) |
| 5 | 1 | 1 | 1 | 1 | 3 | 376 | 69 | 153 | 41 | 89 | [brown] |
| 5 | 1 | 1 | 1 | 1 | 4 | 559 | 71 | 105 | 40 | 89 | {fox} |
| 5 | 1 | 1 | 1 | 1 | 5 | 687 | 73 | 137 | 41 | 89 | jumps! |
| 4 | 1 | 1 | 1 | 2 | 0 | 104 | 115 | 784 | 51 | -1 | |
| 5 | 1 | 1 | 1 | 2 | 1 | 104 | 115 | 96 | 33 | 91 | Over |
| 5 | 1 | 1 | 1 | 2 | 2 | 224 | 117 | 60 | 32 | 89 | the |
| 5 | 1 | 1 | 1 | 2 | 3 | 310 | 117 | 224 | 39 | 88 | $43,456.78 |
| 5 | 1 | 1 | 1 | 2 | 4 | 561 | 121 | 136 | 42 | 92 | <lazy> |
| 5 | 1 | 1 | 1 | 2 | 5 | 722 | 123 | 70 | 32 | 92 | #90 |
| 5 | 1 | 1 | 1 | 2 | 6 | 818 | 125 | 70 | 41 | 89 | dog |
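
To make the column descriptions concrete, here is a minimal sketch that reads the TSV with Python's standard `csv` module and keeps only word rows (level 5). The function name `read_words` and the `min_conf` parameter are hypothetical. The sketch assumes the output starts with a header row naming the columns listed above, and it parses `conf` as a float in case the value is not an integer.

```python
import csv
import io


def read_words(tsv_text, min_conf=0.0):
    """Return the word-level rows of a Tesseract TSV as dictionaries."""
    # QUOTE_NONE keeps quote characters in the detected text untouched.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        if int(row["level"]) != 5:
            continue  # page, block, paragraph and line rows carry no text
        conf = float(row["conf"])
        if conf < min_conf:
            continue
        left, top = int(row["left"]), int(row["top"])
        words.append({
            "text": row["text"] or "",
            "conf": conf,
            # (left, top, right, bottom) in pixels from the top left corner
            "box": (left, top, left + int(row["width"]), top + int(row["height"])),
        })
    return words


SAMPLE = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t1024\t800\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t105\t66\t74\t32\t90\tThe\n"
    "5\t1\t1\t1\t1\t2\t205\t67\t143\t40\t87\t(quick)\n"
)

for word in read_words(SAMPLE, min_conf=89):
    print(word["text"], word["box"])  # The (105, 66, 179, 98)
```
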
What do I look for in a resume?
Let's start with the things I don't look at in a resume for a position in which experience is expected:
- Your university: I couldn't care less where you've studied. While having a university degree may sometimes tell me you've been serious enough to go through the pain of university, I also know it's possible to go through university without acquiring any knowledge.
- Your grades: It's great that you have A+ in so many classes. However, high grades do not always translate into being an effective worker. Furthermore, most of the applicants will also have high grades, which makes it a noisy/useless signal. Do understand that I also do not have the time to fact-check your grades, so you might as well have written you had a perfect score in every class. Be careful with this, as some people will see it as something to probe you on during the initial interviews, and this could backfire on you.
- Your extra-curricular activities: Unless they are relevant to the position you are applying for, they don't matter to me. I am not filtering for people I could do things with outside of work.
- The list of all your publications: I work in a scientific field, and while for some people publications are badges of success, I see a list of articles as filler in a resume. If I want to know all the articles you've published, I can look them up on Google Scholar. Instead, focus on listing the areas of research you're interested in and indicating how many papers in those areas you've published.
Here are the things I look for:
- 2-3 pages: If you cannot summarize your accomplishments in 3 pages or less, then you don't know how to summarize. I don't want to know everything you've done in your professional life. I don't want to know every single paper you've published, every conference/workshop you've attended, every grant you've received, every honor you've been awarded.
- Relevant work experience: If you've held the same position at a different company, that will generally be a good thing. It means you already have prior experience in the field and you have seen how another company has accomplished what you might do at your new job. It means it will be easier for you to ramp up and you may require less supervision/support during that period.
- List what you did: If you only list the title/position you had and the company, I have no clue what you did there. You might as well not have worked there. Clearly list the big tasks/milestones you've worked on and what your contribution was.
- List clear and quantitative accomplishments: "Increased sales by 200%", "Largely reduced operation costs" may sound great, but without the ability to compare against something, those accomplishments do not mean much.
- List technologies you've used: When hiring, it is common to want new recruits to already have some experience with the tools that are used at the company, especially if you need them to ramp up quickly. This is even more important when the set of tools used in your industry is common enough, as it will communicate how in touch with the field you are.
- Proper ordering of the sections of your resume: It's a little thing, but the ordering of the sections in your resume will communicate a lot to me. It will let me know whether you know how to prioritize, which is a critical skill. This point goes hand in hand with the "2-3 pages" item, as they both show that you are able to critically assess the content you produce.
- What you studied in university: I expect people who apply to the positions I filter for to come from a certain set of domains. This is generally not a very important criterion, but it gives me a better idea of your professional career.
- When you finished your degree: This is used to determine how recent your education is. I consider professional experience once the degree is completed, not while it is being completed.
- Free of grammatical and syntactical errors: Make sure your resume doesn't have major grammatical or syntactical mistakes. Those communicate a lack of seriousness and professionalism that I would expect in your future communication with others in the company. If your resume was initially written in a different language, make sure it is thoroughly translated.
- GitHub account: If you list one, expect me to look at it. If you don't contribute much (fewer than 20 contributions per year), then it's simply better not to list it.
- Personal website: If you list one, I will look at it as well. I have a background in web development, so I will use it as an additional way to evaluate you. Make sure it is online. A personal website that is down or for which the domain expired will lose you points.