2022-04-28

What prompted me to start this section was an interview/conversation with Mr. Mao (HR at Jump Trading). Since I will probably need to introduce my previous work many times in the future, writing it down as a document saves me the time and energy of rebuilding the structure and re-elaborating the content each time.

I would like to divide my introduction into three parts: undergraduate life, first job, and current job.

My undergraduate life

I first came across CUDA in my second year, when I enrolled in a course named CUDA Programming and Practice and learned the CUDA C programming language and GPU architecture through a matrix multiplication task. I was amazed by the parallelism of hundreds to thousands of threads inside a kernel function, and I found myself intrigued by the concept of parallelism. However, since there weren't many opportunities for me to use a GPU, I turned my interest to using the CPU to assist my daily work: the Python multiprocessing library to accelerate dataset pre-processing in data analysis, OpenMP directives to parallelize computation, the Message Passing Interface (MPI) to launch multiple processes for specific problems, and so on.

My time with the Beihang Supercomputing Team was both memorable and rewarding. I did everything from setting up our ad-hoc cluster, installing the operating system, and compiling monolithic programs such as weather forecasting software, to learning about heterogeneous computing. Thanks to my experience and skills, I was voted team leader in my junior year and led our team in the ASC Student Supercomputer Challenge 20-21. Our efforts, from early preparation to the tense on-site presentation, were rewarded with the First Prize and the Application Innovation Award. I feel incredibly lucky to have discovered my passion for HPC and large-scale distributed parallel acceleration during my college years.

In the ASC competition, I was responsible for a CPU optimization problem: PRESTO, short for PulsaR Exploration and Search TOolkit. A pulsar is the remnant of a core-collapse supernova explosion: a rapidly spinning neutron star that spews energy out into space, so that we can receive its radio signals with radio telescopes such as FAST, located in Guizhou province in our country. We used the Intel VTune profiler to find the hotspots in the program and the Valgrind toolkit to highlight its critical path. We applied three main optimizations:

  1. Use AVX-512 intrinsics on Intel CPUs to vectorize the data preparation.

  2. Unroll the data preparation for-loop and inline a function that is invoked millions of times.

  3. Scale out from a single machine to a cluster with MPI.
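
As a rough illustration of the third item: the source does not say how the work was divided among MPI processes, so the following Python sketch only shows one common pattern, a static block decomposition of independent work items across ranks. The name `block_range` and the item counts are hypothetical; in a real MPI program the rank and the number of ranks would come from the communicator rather than from a loop.

```python
def block_range(n_items, n_ranks, rank):
    """Return the half-open range [start, stop) of items owned by `rank`.

    Hypothetical helper for illustration; not taken from PRESTO.
    """
    base, extra = divmod(n_items, n_ranks)
    # The first `extra` ranks each take one additional item, so block
    # sizes differ by at most one.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


if __name__ == "__main__":
    # Assumed example: 10 independent work items spread over 4 ranks.
    for rank in range(4):
        print(rank, block_range(10, 4, rank))
```

Each rank computes its own range locally, so no communication is needed to agree on who processes what.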

What I learned from this work is: "Never stop looking for potential opportunities, and never let the given dataset restrict you." The final on-site competition gave us a new dataset that went through different branches of some key modules, bypassing the parts we had optimized.

Apart from the technical side, I also gained some experience in managing and organizing a team. We have to admit that human beings are born lazy, a tiny number of exceptions aside. Therefore, everyone needs some impetus or stimulus to keep moving forward, whether internal (your goal, your dream, your interest) or external (peer pressure, deadlines, OKRs, KPIs, your manager, a mortgage, etc.). Honestly, I didn't manage our team well at first: everybody just attended the weekly meeting routinely, and we behaved as if it were a part-time job. So how could I change this? From my experience as a teaching assistant in the Object Oriented Programming course, the simplest and most effective approach is dining together. Over dinner we can build our relationships without thinking about work, and because the setting is private, people naturally lower their guard. When the atmosphere of a team becomes harmonious, members feel embarrassed if they contribute less than others or block the progress of the whole team. What's more, assigning specific and measurable tasks is also key to healthy development. Why do people often lose their way even when they know the goal? Maybe the goal is too far away to reach in one step, which leads to incorrect workload estimates as well as reluctance to take the first step. I believe that is also why OKRs have become prevalent in recent years: it is the reachable key results that guide our way and help us improve step by step.

The most significant and effective way, though, is to let members experience the tense atmosphere of the final stage as substitutes. I was supposed to attend the competition in my junior year, but it was postponed because of the COVID-19 pandemic, so my first time taking part was also my last. The atmosphere is quite difficult to describe; you have to be at the scene to feel it. Your competitors are all around you, you can clearly see what they are doing, and you are devoured by the peer pressure. At that moment, the adrenaline, the sense of responsibility to the team, and the desire to succeed are all magnified. Your body doesn't feel tired at all; it is filled with energy. During the two consecutive days of the contest we ate only breakfast, and we did not feel hungry either.

Last but not least, I also came to recognize the power of bravery, or rather confidence. Since we needed to give an oral presentation in English to explain our work, I was appointed to deliver it because of my overall mastery of our work and my stage presence. I have a lot of presentation experience, since I always push myself to be the team leader in group work. To persuade others, you need to persuade yourself first. Don't be timid or self-deprecating. Be confident in your work and take pride in it.

My first job

Why did I give up the chance to pursue a master's degree at Beihang University? There are two main reasons: I want to save my master's degree for a different country, and I have already found my interest. First, it is difficult to pursue a second master's degree if you want to go abroad for further study after finishing one, while a master's program is a relatively easy way to get a visa and explore the world. Second, I want to concentrate on the HPC or machine learning field instead of taking a bunch of diverse courses at university. I got offers from self-driving companies, streaming companies, and an online education unicorn. I finally chose the online education unicorn for several reasons.

  1. The team leader is an alumnus of Beihang University, so I felt a close connection with him.

  2. He came up with some interesting ideas for building the AI platform, such as hybrid deployment of online and offline clusters to maximize the utilization of our resources, and elastic training strategies to assist cluster orchestration.

  3. He has rich experience in this field, and I really hoped to gain practical industry experience from him.

  4. I was among the first group of users of the online education unicorn. I used their products in senior high school and genuinely benefited from them, which convinced me that they are doing good for society and the education field.

Well, something unexpected may happen at any time. Under the policy of alleviating academic burdens on students, online education products were restricted to a large extent, and I met my first layoff three months after starting my career. I have to say I did learn something, thanks to my mentor. Learning in practice is really an efficient way to learn. The first project was a natural language processing one, and we achieved a 3.5x speedup by optimizing the following parts:

  • In dataset partitioning, change the distribution strategy from giving each worker an equal number of sentences to giving each worker a nearly equal number of words, to mitigate load imbalance.

  • Apply mixed precision training in forward propagation by replacing the 32-bit TF32 with 16-bit FP16 in the actual computation, while keeping a full-precision copy of the parameters for the gradient update.

  • In backward propagation, vectorize the gradient computation by flattening the 3-D tensor into a 1-D vector, to maximize throughput and reduce the number of kernel launches.

  • During the parameter update in the optimizer, employ the Sharded Distributed Data Training strategy introduced by Microsoft to scatter the parameters across all workers instead of keeping them all together, further reducing redundant computation across workers.
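
The first item in the list above, balancing workers by word count rather than sentence count, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the project's actual code: the function names `partition_by_word_count` and `word_loads` are hypothetical, words are counted by whitespace splitting, and a greedy longest-first heuristic stands in for whatever strategy was really used.

```python
import heapq


def partition_by_word_count(sentences, num_workers):
    """Assign sentences to workers so that total word counts are nearly equal.

    Greedy heuristic (an assumption, not the original implementation):
    take sentences from longest to shortest and always give the next one
    to the worker with the smallest load so far.
    """
    # Min-heap of (total_words_assigned, worker_index).
    heap = [(0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)
    shards = [[] for _ in range(num_workers)]
    for sentence in sorted(sentences, key=lambda s: len(s.split()), reverse=True):
        load, worker = heapq.heappop(heap)
        shards[worker].append(sentence)
        heapq.heappush(heap, (load + len(sentence.split()), worker))
    return shards


def word_loads(shards):
    """Total number of whitespace-separated words per shard."""
    return [sum(len(s.split()) for s in shard) for shard in shards]


if __name__ == "__main__":
    # One long sentence and eight one-word sentences, split over two workers.
    sentences = ["a b c d e f g h"] + ["x"] * 8
    print(word_loads(partition_by_word_count(sentences, 2)))
```

Splitting the same nine sentences by count alone (five and four) would give loads of 12 and 4 words, so one worker would sit idle while the other finishes.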
