Just to be clear: data used in this presentation is fake to preserve privacy.
You can see online the Jupyter Notebook that accompanies this blog post, as well as the slides.
1 - Get all the data
You probably have your data scattered across a few different places. For instance:
- From your e-commerce platform provider. If you’re using Shopify, you can export that data as a CSV (comma separated value) from the admin, or you can automate the process using the Shopify API. Examples of useful data to have : list of customers, discount codes, abandoned checkouts, orders, etc.
- From Google Analytics. You can download the data as CSV or automate the process using the API. This is a whole subject in and of itself, and Vanessa Sabino has already covered the subject brilliantly.
- From logs: If you are hosting your own e-commerce solution, you can log you visitors sessions and use those to understand their behaviour on your site.
- From spreadsheets : Maybe you have a spreadsheet with your products cost and retail price. It’s useful to use that information in conjunction with other sources (like your real orders) to get new information (like your profits).
For data exploration, I really like Jupyter (formerly iPython Notebook). The notebook form makes it easy to tell a story and keep track of everything we try. You can view the notebook for this presentation on Gist.
Let’s tackle two business questions.
Q1 - Which color should I use for my next t-shirt design?
I have my own online store where I sell t-shirts that I design for physicists and their friends (I’m a physicist myself). I have way too many colors on offer for each design. I did that on purpose, to give people the choice and learn which colors are the most popular with my customers. Now it’s time to narrow it down.
First, I load the data from the CSV file I downloaded from my Shopify store.
Then, I locate which column is useful to me, in this case it’s “Lineitem Name”. It’s a string (i.e. text) that looks like this : “Shine on you crazy diamond - M / Cream / Men”. The first part is the product name, then the 3 variant options separated by slashes. But there’s a twist : not all lineitem names have variant options, and of those who do, not all of them have color.
Python is good at handling exceptions with “try / except”, so we can write a function that will take in the lineitem name and return a list of variants (when variants exist), then another function that takes that list of variants and returns the color (if there’s a color in there).
We can chain those functions together, and do a final count of the values to get our most popular colors : sea foam and eggplant.
I like this example because it shows how Python can be useful for parsing text, in this case, extracting the colors from the lineitem description, even when the color is not always in the same position
Q2 - Is there a problem with my shipping rates?
I recently wrote a whole article about his problem for the Shopify e-commerce university blog : How Nijala tweaked their shipping strategy to win more sales. Here’s the gist of it : potential customers can abandon their cart anytime during the checkout process. In particular, the ones leaving at the shipping information page are usually disappointed there is no free shipping, shipping rates are too expensive, or there is no shipping offered to their location. Therefore, it’s a good proxy to diagnose shipping related problems.
We want to see the proportion of shoppers who leave at the shipping method page. In my example (which is a typical Shopify checkout), the only possible page after the shipping page is the payment page, so the number of shoppers who drop out of the checkout funnel at the shipping page is the difference between the number of shoppers who reach the shipping page and the number of shoppers who reach the payment page.
This can be easily visualized in Google analytics if you set up a checkout goal funnel. You can also pull the data from Google Analytics and do the calculation yourself. Once you know the number of people who reached the shipping info page (num_ship) and the number of people who reached the payment page (num_pay), then the drop-off ratio at the shipping info page is (num_ship - num_pay) / num_ship.
3 - Scale
I do this kind of analysis for hundreds of thousands of merchants several times per day. The key is to build a data pipeline that gets launched by a scheduler on a regular basis. We used pySpark for the data pipeline and Oozie for the scheduler.
We basically do same thing as the above Q2, but we also
- replace the CSV file data input by ready data from HDFS (Hadoop distributed filesystem). It enables storing huge datasets distributed across many machines on a cluster;
- replace Jupyter by pySpark. It enables fast distributed computing on a cluster;
- key by Shop ID, so that we can aggregate the data shop by shop;
- add thresholds, like minimum number of orders, or minimum confidence level for a prediction, so that we don’t report statistical flukes;
More ideas for analysis
- Pages with high traffic and high bounce rate : this is where your efforts will pay out most.
- Search terms on your store and to find your store: match the words used by your customers.
- Track other funnels, like email campaigns : find and address the bottleneck in the conversion.
- Segment with care (beware of false positives) : by content type, product category, referrer, etc.
- “Exploring the Google Analytics API” by Vanessa Sabino : https://www.youtube.com/watch?v=YUqaCkEwr6g
- “Statistics: Making sense of data” : https://www.coursera.org/course/introstats
- Jupyter : http://jupyter.org
- Set up GA goal funnel : https://docs.shopify.com/manual/reports-and-analytics/google-analytics/google-analytics-goals-and-funnels