Data Engineer Interview Experience - Walmart

Walmart is the world's largest company by revenue, and was named the largest corporation on the Fortune Global 500 list for the 11th straight year.

I worked as an SDE-2 (Data Engineer-II) at Meesho & previously worked at Morgan Stanley.

I have around 2.5 years of experience (including internship).

I got a referral on LinkedIn for the Data Engineer position at Walmart Bengaluru Division.

The Walmart Global Tech process consists of five technical rounds and one HR round for the Data Engineer-3 role.

The recruitment process was as follows:

Round 1: Preliminary Round (Screening Round): Telephonic Round

The first round lasted for around 45 minutes.

In this phase, I provided a comprehensive overview of my prior projects, focusing on my experience with tools like Mixpanel, Kafka, Spark, and Datahub Spark Lineage, along with ETL concepts.

I also discussed the Data Model I developed during Experimentation (A/B testing) on Presto architecture.

The recruiters also asked questions like "Why do you want to work at Walmart?"

Round 2: Technical Interview 1 (Coding & DSA Round): 1 hour 30 minutes

I received a call from HR, informing me that I had successfully passed the screening round and was selected for a technical discussion.

This interview, conducted by a Senior Data Engineer at Walmart, lasted approximately 1 hour and 30 minutes.

The interview primarily revolved around the following topics: 👇

1. Medium-level Data Structures and Algorithms (DSA) questions.
2. Challenging SQL-based questions.
3. Python coding questions.
4. Big Data concepts.
5. Spark, Kubernetes, and Airflow architectural questions.
6. Cloud Computing concepts.
7. Software Development Life Cycle (SDLC).
8. Agile methodology, with a particular emphasis on the Scrum framework at a high level.
9. Questions about DevOps strategy at a basic level.
10. Continuous Integration/Continuous Deployment (CI/CD) pipelines.
11. NoSQL databases.
12. AWS services-based scenarios.
13. Medium-level data structure questions, including arrays, stacks, linked lists, and trees.

Now let's move on to the kind of questions asked.

I was presented with two DSA questions during the interview (sketches of both follow below): 👇

1. Find the minimum number of coins that make a given amount of change.
2. Given a linked list and a value x, partition the linked list around the value x in such a way that all nodes with values less than x come before nodes with values greater than or equal to x. If x is present within the list, it should be positioned after the elements less than x. The partition element x can be placed anywhere in the "right partition"; it does not need to be placed between the left and right partitions.
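
For reference, a minimal sketch of the classic dynamic-programming approach to the first question (function names and sample denominations are illustrative):

```python
# Minimum coins to make `amount`, assuming an unlimited supply of each
# denomination; returns -1 when the amount cannot be formed.
def min_coins(coins: list[int], amount: int) -> int:
    INF = float("inf")
    dp = [0] + [INF] * amount  # dp[a] = fewest coins that sum to a
    for a in range(1, amount + 1):
        for c in coins:
            if c <= a and dp[a - c] + 1 < dp[a]:
                dp[a] = dp[a - c] + 1
    return dp[amount] if dp[amount] != INF else -1

print(min_coins([1, 2, 5], 11))  # 3 (5 + 5 + 1)
```

And a sketch of the linked-list partition: build a "less than x" chain and a "greater than or equal to x" chain, then stitch them together, which naturally leaves x anywhere in the right partition:

```python
class Node:
    def __init__(self, val, nxt=None):
        self.val, self.next = val, nxt

def partition(head: Node, x: int) -> Node:
    less = less_head = Node(0)     # dummy heads simplify the stitching
    right = right_head = Node(0)
    while head:
        if head.val < x:
            less.next = less = head
        else:
            right.next = right = head  # nodes equal to x land here
        head = head.next
    right.next = None                  # terminate the right chain
    less.next = right_head.next        # stitch left chain -> right chain
    return less_head.next
```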

The interview also included SQL-based questions such as: 👇

📌 How to find the nth highest salary for each department, with a window function and without one.

Given an employee table with attributes empId, empSalary, and empDeptId, and a department table with attributes deptId, depName, and CourseOffered, I was asked to write an SQL query on the notepad to find the employee with the highest salary in each department using window functions.

I used the dense_rank window function to construct the SQL queries.

I was also asked to explain the reason for using dense_rank instead of the rank function.
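
For reference, a minimal PySpark sketch of such a query (the table and data are illustrative). The key point on dense_rank: it assigns consecutive ranks with no gaps after ties, so "nth highest salary" maps cleanly to rank value n, whereas rank skips values after ties and can miss some n entirely.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nth-highest-salary").getOrCreate()

# Illustrative employee data: (empId, empSalary, empDeptId)
rows = [(1, 90000, 10), (2, 90000, 10), (3, 80000, 10),
        (4, 70000, 20), (5, 60000, 20)]
spark.createDataFrame(rows, ["empId", "empSalary", "empDeptId"]) \
     .createOrReplaceTempView("employee")

n = 1  # n = 1 gives the highest salary per department
spark.sql(f"""
    SELECT empDeptId, empId, empSalary
    FROM (
        SELECT e.*,
               DENSE_RANK() OVER (PARTITION BY empDeptId
                                  ORDER BY empSalary DESC) AS rnk
        FROM employee e
    ) ranked
    WHERE ranked.rnk = {n}
""").show()
```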

Some questions on Spark optimisation with Hadoop concepts followed, such as: 👇

📌 How Airflow on Kubernetes works using the Pod concept (a sketch of this pattern follows below).

📌 How the Airflow scheduler works with the worker machines and the webserver.

📌 Difference between a container Deployment and a StatefulSet deployment in K8s.

📌 Explain how Kubernetes manages fault tolerance.
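
For context, the Airflow-on-Kubernetes pattern runs each task in its own pod; here is a hedged sketch (the image, names, and namespace are placeholders, and the provider import path varies by version):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

# Each task launches a dedicated pod: the scheduler queues the task, the
# executor creates the pod, and Kubernetes reschedules pods on node failure.
with DAG(dag_id="pod_demo", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    run_job = KubernetesPodOperator(
        task_id="spark_submit",
        name="spark-submit-pod",
        namespace="default",
        image="my-registry/spark-job:latest",   # placeholder image
        cmds=["spark-submit", "/app/job.py"],   # placeholder command
    )
```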

📌 You have a Spark job that is taking longer than expected to complete.

What steps would you take to identify and troubleshoot performance bottlenecks?

📌 You have a Spark cluster with limited resources.

How would you allocate resources and configure the cluster for optimal performance?
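
For context, this usually comes down to right-sizing executors and shuffle parallelism rather than accepting defaults; an illustrative (not prescriptive) sketch, where every number is an assumption:

```python
from pyspark.sql import SparkSession

# Right-size executor count/cores/memory and shuffle partitions to fit
# the cluster; the actual values depend entirely on the hardware at hand.
spark = (SparkSession.builder
         .appName("tuned-job")
         .config("spark.executor.instances", "4")
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "4g")
         .config("spark.sql.shuffle.partitions", "64")
         .getOrCreate())
```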

📌 Write code for uploading Parquet files to an S3 bucket using the boto3 library (as I had worked on AWS).

I wrote the code for this using Python and the boto3 library on a notepad.
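
A minimal sketch along those lines (the bucket and paths are placeholders; credentials are assumed to come from the environment):

```python
import os

import boto3

s3 = boto3.client("s3")

def upload_parquet_dir(local_dir: str, bucket: str, prefix: str) -> None:
    """Upload every .parquet file under local_dir to s3://bucket/prefix/."""
    for name in os.listdir(local_dir):
        if name.endswith(".parquet"):
            s3.upload_file(
                Filename=os.path.join(local_dir, name),
                Bucket=bucket,
                Key=f"{prefix}/{name}",
            )

upload_parquet_dir("./output", "my-data-bucket", "events/dt=2023-09-01")
```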

📌 How does Airflow store logs in the S3 bucket, and how does the backend database of Airflow play an essential role?

Overall, the interview covered Spark optimisation, Kubernetes, Airflow, and Big Data concepts, along with explanations related to my specific projects.

Round 3: Technical Interview 2 (Data Modeling/System Design with Big Data Concepts): 1 hour 45 minutes

Again, I received a call from HR confirming that I had passed the first technical round and was selected for another technical discussion.

This round was conducted by a Staff Data Engineer from Walmart and lasted approximately 1 hour and 45 minutes.

The conversation kicked off with a System Design task: designing the Mixpanel system, an event-driven system I was familiar with from my work with Mixpanel at Meesho.

I started by using draw.io to illustrate how the Mixpanel system operates, detailing how events are captured from various clients, including the Android app, web app, and iOS app.

During the System Design phase, I encountered several questions such as: 👇

📌 How does the load balancer work in Mixpanel?

📌 How are the requests handled?

Let's suppose you open the Presto URL in Chrome; the request goes to DNS for IP address resolution, then to the load balancer and the target gateway, finally reaching the Presto Coordinator. I was asked to provide a detailed explanation of each step.

📌 The interviewer asked me to write a custom API using Spring Boot, writing only the service & controller classes with Spring Boot & the Java API.

📌 Some questions on Spark coding.

I was asked to write code to read data from Delta Lake (an S3 bucket) & run an upsert command: update rows that already exist based on the primary key & insert them if they don't.

I wrote the code using DataFrames.
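
A minimal sketch of such an upsert with the Delta Lake API (the paths and the "id" key are illustrative; this assumes the delta-spark package is configured on the cluster):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-upsert").getOrCreate()

updates = spark.read.parquet("s3://my-bucket/incoming/")  # new batch

target = DeltaTable.forPath(spark, "s3://my-bucket/delta/events")
(target.alias("t")
       .merge(updates.alias("s"), "t.id = s.id")  # match on the primary key
       .whenMatchedUpdateAll()                    # update rows that exist
       .whenNotMatchedInsertAll()                 # insert rows that don't
       .execute())
```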

📌 Questions on Spark optimisations included skewed joins, broadcast joins, CBO & repartition vs. coalesce (a quick sketch of two of these follows below).
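
For reference (data is illustrative): broadcasting the small side of a join avoids shuffling the large table, and coalesce reduces the partition count without a full shuffle while repartition performs one.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("join-optimisations").getOrCreate()

facts = spark.range(1_000_000).withColumnRenamed("id", "key")
dims = spark.createDataFrame([(0, "a"), (1, "b")], ["key", "label"])

# Broadcast join: ship the small table to every executor, so the large
# side is never shuffled.
joined = facts.join(F.broadcast(dims), "key")

narrowed = joined.coalesce(8)               # merge partitions, no shuffle
rebalanced = joined.repartition(64, "key")  # full shuffle, even spread
```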

📌 After that, I got questions on Spark Tungsten & the Catalyst Optimiser.

📌 Shifting to Java & Advanced Java, the questions revolved around Java collections, such as the Collection interface, Map, LinkedList design & garbage collection.

Java Coding Questions & OOPS Concepts: 👇

📌 I was asked to write Java code to run garbage collection using the GC collector thread.

📌 This included explaining the concept of multithreading, followed by writing code for synchronisation using synchronised threads.

📌 Some questions on serialisation vs. deserialisation.

📌 Explain the use case of the transient keyword in Java.

Questions on System Design concepts & synchronisation: 👇

📌 What is a semaphore variable? How do you prevent deadlock in a system?

📌 The interviewer asked me to complete the semaphore code to achieve synchronisation.

So I wrote a semaphore in Java:

```java
import java.util.LinkedList;
import java.util.Queue;

// A binary semaphore: P() acquires the semaphore or blocks the calling
// process, and V() releases it, waking one queued process if any is
// waiting.
class Semaphore {
    enum Value { ZERO, ONE }

    // Process is a stand-in for whatever unit of execution the
    // scheduler manages.
    interface Process {
        void sleep();   // block this process
        void wakeup();  // unblock this process
    }

    private final Queue<Process> q = new LinkedList<>();
    private Value value = Value.ONE;

    // P (wait): take the semaphore if it is free, otherwise queue up.
    public synchronized void P(Process p) {
        if (value == Value.ONE) {
            value = Value.ZERO;
        } else {
            q.add(p);
            p.sleep();
        }
    }

    // V (signal): free the semaphore, or hand it to a waiting process.
    public synchronized void V() {
        if (q.isEmpty()) {
            value = Value.ONE;
        } else {
            q.remove().wakeup();
        }
    }
}
```

This is the code I submitted.

The last questions they asked were on ETL concepts & data warehouse concepts, general questions such as: 👇

📌 What is the difference between a Snowflake and a Star schema?

📌 How would you design a data warehouse from scratch if you have new requirements?

Here, I explained my experience with Snowflake and Databricks implementation at Morgan Stanley.

📌 Explain normalisation concepts & SCD Type 2 with an example (an illustrative example follows below).
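
For SCD Type 2, the standard pattern works like this (the data below is purely illustrative): instead of overwriting a changed attribute, the current row is closed out and a new current row is inserted, preserving the full history.

| custId | city      | effective_from | effective_to | is_current |
|--------|-----------|----------------|--------------|------------|
| 101    | Pune      | 2021-01-01     | 2022-06-30   | false      |
| 101    | Bengaluru | 2022-07-01     | null         | true       |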

📌 How do you onboard the Delta Lake catalog to Presto?

📌 Why is Agile preferred over the waterfall model?

I explained this with a focus on the Agile framework (Scrum), covering sprints, the Jira board, and the iterative approach in detail.

Round 4: Techno-Managerial Interview (Managerial Round): 1 hour 10 minutes

Moving on to round 4, the interview started with my introduction, my expertise, and the technical skill set I had worked with.

Most of the questions were based on Data Modeling, Databricks, Datahub, PySpark, and architecture design (ETL design) topics.

1-2 questions were asked on batch processing & stream processing using Spark.

The interviewer asked me to explain my project on Mixpanel and how I created the data model using delta tables to simplify the creation of raw tables.

I gave a detailed explanation of the work I did at Meesho, explaining how we sourced data.

I also explained the complete data pipeline I set up on Databricks to take silver (Mixpanel) data & run a multi-task job to create aggregated tables based on business requirements.

I was also asked about my contributions to open-source projects.

I discussed Datahub and the Spark Lineage build (which helps to find the source & destination tables for a Spark application).

I explained how I created a Spark JAR with Spark listeners and the Spline package.

There were questions related to cost optimisation as well, such as: 👇

📌 Can you share an example of a project you worked on that had a significant impact on your organisation?

📌 How did you contribute to cost optimisation initiatives while working with cloud technologies?

📌 Could you describe a specific cost optimisation strategy you implemented in the cloud and its results?

I was also asked how I capture event logs and user activities on Databricks, especially regarding cluster creation and job execution.

The interviewer also asked questions on Spark monitoring & Spark performance management.

I explained all the answers in detail, using practical examples.

Additionally, there were some questions on JIRA and Scrum, like how I would manage multiple tasks using Agile methodology.

Round 5: Director Round (Behavioral & Technical Round): 45 minutes

This interview was conducted by a Director at Walmart.

It lasted for about 45–60 minutes.

Firstly, I was asked to introduce myself.

We delved into my experience with the Meesho and Morgan Stanley projects. Specifically, I explained the Datahub Spark Lineage Tenant project and my role and responsibilities as a Data Engineer at Meesho.

I was also asked to explain my research paper "Web Crawler for Ranking of Websites Based on Web Traffic and Page Views," which I had published at international conferences of IEEE & Springer.

Some of the questions were related to the core principles and values of Walmart and my inspirations.

Then he asked questions related to team management & leadership qualities.

I was mainly asked situation-based questions, such as "Tell me about a time when you faced a challenging situation at work and how you handled it."

The conversation then shifted to my technical expertise.

He went through my resume and asked some questions related to Presto vs. Spark (as both use a distributed architecture), Databricks, AWS & Delta Lake concepts with data governance.

Here are some of the questions I remember:

📌 What is the Avro file format & what is its significance in delta tables?

📌 Difference between the underlying architectures of Presto vs. Spark.

📌 Can Presto work with near real-time data (a streaming data source)?

📌 How did I develop the Datahub using open-source projects such as Spline & Datahub?

📌 What do I think about data uncertainty?

During this round, I also mentioned a few of my achievements, including being a Gold Medalist from Uttarakhand state in my B.Tech.

My responses left a positive impression on the Director, and I received positive feedback for my performance in this round.

Round 6: HR Round (General Discussion & Salary Discussion)

This round lasted for approximately 30 minutes.

During the interview, I was asked about my experience with Big Data projects, my hobbies, as well as my strengths and weaknesses.

The interviewer also inquired about my family background, previous interview experiences, and my ultimate life goals.

Towards the end of the interview, I was asked, "Why should we hire you?" and "What inspires you to join Walmart?"

Additionally, there was a discussion about salary with the HR representative.

The following day, I received positive feedback from HR.

Fortunately, I was selected for the position of Senior Data Engineer (Data Engineer-3) at Walmart.

Finally, I had the opportunity to join my dream company, Walmart, which is also the world's number one Fortune 500 company. 😊

If you're also looking to crack top product-based companies or break into the data field, connect with me on a free 1:1 call so that we can work together on your goal.


Recommended Readings:

Data Engineer Interview Experience at JP Morgan & Chase

How to become a Data Scientist at Microsoft?