DataFest 2017 — Increasing EXPEDIA’s Bottom Line: Leveraging Packages to Increase Commission

7 min read · Apr 17, 2017

Data science is a fascinating field, and engineers need a fundamental understanding of data in order to create more impactful platforms. Last weekend, my team and I entered the DataFest 2017 hackathon here at Middlebury. DataFest was our first foray into the world of data science, and it allowed my teammates and me to exercise the skills we’ve refined in other domains (our team consisted of majors in Biology, Math, and Computer Science). Ultimately, we feel that the diverse academic interests of our team led to a more thorough analysis. This year’s data set was provided by Expedia and included data on millions of user sessions. Our goal as a team was to analyze the data and develop actionable items that Expedia could use to optimize revenue.

Converting and Cleaning the Dataset

When the dataset was released at around 6:30 p.m. on Friday, it was given to us as a tab-delimited text file. We wanted to acquaint ourselves with the data before we proceeded to clean and process it, so we converted it to CSV and opened it in Excel.

Converting the dataset from a text file to a CSV file.

Once we had the dataset converted (which was a fairly simple process in Python), we opened up the Excel file and checked out the names of the various columns as well as the types of data that we were given. The column headers and the data dictionary provided with the dataset described what each column represented. All the fields were very interesting, and they combined to form a very nuanced narrative of Expedia’s user sessions. Converting to CSV format was also very useful because it allowed us to see how the data was formatted, in particular how fields with missing or erroneous data were handled. Our initial investigations showed us that such fields had the value ‘NULL’. In order to make accurate visualizations and glean any valuable insight from the data, we needed to clean the data and correct any erroneous entries to stop anomalies from appearing in our analysis.
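As a rough illustration of the conversion step, here is a minimal sketch that uses only Python’s standard csv module. The three-row sample and its column names (user_id, adults, is_package) are made up for the example and are not the real Expedia fields; the helper name tsv_to_csv is also just an assumption.

```python
import csv
import io


def tsv_to_csv(tsv_file, csv_file):
    """Copy every row from a tab-delimited file object to a comma-delimited one."""
    reader = csv.reader(tsv_file, delimiter="\t")
    writer = csv.writer(csv_file, lineterminator="\n")
    for row in reader:
        writer.writerow(row)


# Hypothetical sample standing in for the tab-delimited text file.
sample = io.StringIO("user_id\tadults\tis_package\n1\t2\t0\n2\tNULL\t1\n")
out = io.StringIO()
tsv_to_csv(sample, out)
csv_text = out.getvalue()
print(csv_text)
```

With real files, the two StringIO objects would be replaced by files opened with open(path, newline=""). Note that the literal string NULL passes through unchanged, which is what makes it easy to spot in Excel afterwards.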

In order to automate the way we processed the data, we used the Pandas library in Python. We used Pandas data frames to clean and analyze the data. This was my first time exploring Pandas, but the documentation and online support for the library are very thorough, which made it simple to get started right away.

Cleaning the data.

In the process of cleaning the data, I removed duplicate entries and entries with ‘NULL’ columns, and then printed the shape of the data frame to confirm that the unwanted entries had been removed. A quick scan of the updated CSV also confirmed that the unwanted values had been removed.
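The cleaning step described above might look roughly like the sketch below. It assumes the file was read with every value kept as a string, so missing fields still show up as the literal text ‘NULL’; the column names and rows are invented for illustration.

```python
import pandas as pd

# Hypothetical rows: one exact duplicate and one row containing 'NULL'.
df = pd.DataFrame({
    "adults": ["2", "2", "NULL", "1"],
    "children": ["0", "0", "1", "0"],
    "is_package": ["1", "1", "0", "0"],
})
print("before:", df.shape)

# Flag any row where at least one column holds the 'NULL' placeholder.
has_null = (df == "NULL").any(axis=1)

# Drop the flagged rows, then drop exact duplicate rows.
cleaned = df[~has_null].drop_duplicates().reset_index(drop=True)
print("after:", cleaned.shape)
```

Printing the shape before and after is a cheap sanity check that the number of rows actually dropped.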

Focusing our Approach

Now that we had a cleaned-up dataset, we could focus our approach in order to paint a clearer picture of the data. We realized that the dataset included many different types of sessions, and within each session, the user had different preferences in terms of the number of adults and children travelling, the number of rooms booked, whether the booking was part of a package deal, etc. After running some initial stats using the describe() function in Pandas, we had a general idea of frequency variations between columns.
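For readers who have not used it, here is a small sketch of what describe() gives you, alongside value_counts() for a yes/no column. The data frame and its column names are hypothetical stand-ins, not the actual dataset.

```python
import pandas as pd

# Invented sessions; column names are assumptions for this example.
sessions = pd.DataFrame({
    "adults": [2, 2, 1, 2, 1],
    "children": [0, 2, 0, 0, 1],
    "rooms": [1, 1, 1, 1, 1],
    "is_package": [0, 1, 0, 1, 0],
})

# count, mean, std, min, quartiles and max for every numeric column.
summary = sessions.describe()
print(summary)

# Share of sessions with and without a package deal.
package_share = sessions["is_package"].value_counts(normalize=True)
print(package_share)
```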

Our intuition told us that the most common family compositions represented in the data would fall into 4 buckets: 2 adults with no children in 1 room (a couple), 2 adults with 1–3 children, 1 adult with no children, and 1 adult with 1–3 children. After filtering the data, these family compositions proved to be the most common, representing over 70% of the data set. We proceeded to create sub-data frames, 1 for each family composition. Now that we knew which types of customers we wanted to focus on, we had to decide which aspect of these sessions to focus on in order to extract the most useful insight.
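One way to carve out the four sub-data frames is with boolean masks, as in the sketch below. The column names (adults, children, rooms) and the six sample rows are assumptions made for the example, so the coverage figure it prints says nothing about the real data.

```python
import pandas as pd

# Invented sessions; the last two rows fit none of the four buckets.
sessions = pd.DataFrame({
    "adults":   [2, 2, 1, 1, 3, 2],
    "children": [0, 2, 0, 1, 0, 4],
    "rooms":    [1, 1, 1, 1, 2, 2],
})

no_kids = sessions["children"] == 0
kids_1_to_3 = sessions["children"].between(1, 3)  # inclusive on both ends

# One sub-data frame per family composition, keyed 1 to 4.
compositions = {
    1: sessions[(sessions["adults"] == 2) & no_kids & (sessions["rooms"] == 1)],
    2: sessions[(sessions["adults"] == 2) & kids_1_to_3],
    3: sessions[(sessions["adults"] == 1) & no_kids],
    4: sessions[(sessions["adults"] == 1) & kids_1_to_3],
}

# Fraction of all sessions that falls into one of the four buckets.
covered = sum(len(frame) for frame in compositions.values()) / len(sessions)
print(f"{covered:.0%} of sessions covered")
```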

We hypothesized that within these four family compositions, there would be variations in terms of socioeconomic status. We further hypothesized that people from lower income areas would be a) travelling less (and thus, be underrepresented in the dataset) and b) travelling to lower star rated hotels because they are cheaper. The latter was confirmed by our analysis of the data — lower star rated hotels (1–3 stars out of 5) were underrepresented across all the family compositions. The former was confirmed after we set a proxy for income as a means of analyzing the data with an eye towards socioeconomic status.

Our proxy for income was the city from which the user booked the deal (assumed to be the home city of the user). We chose city as our proxy because we could easily merge our data frames with data for the mean income of the most represented cities in the overall data set and subsequent data frames. In selecting the cities that would serve as our proxy for income, we had 3 criteria: 1) they needed to be among the most represented in each data frame (in order to maintain the integrity of our analysis), 2) they needed to vary in terms of region, and 3) they needed to vary in terms of mean income levels. The final criterion would ensure that our analysis served the entire spectrum of income levels. Admittedly, cities are not the best proxy for income/socioeconomic status, but given the time constraints of the hackathon, selecting cities was the most time-efficient option. Future work could focus more closely on specific locations in order to draw a stronger connection between location and income.

We selected 15 cities: New York, Los Angeles, Houston, Toronto, Chicago, Calgary, Brooklyn, San Francisco, Seattle, Miami, Vancouver, Montreal, San Jose, Denver, and Atlanta. Of these cities, San Jose had the highest mean income at $101,980, while Miami had the lowest income, at $46,946. We then merged this data with our data set to complete each of our data frames.

Merging income with family composition 1. Repeated with the other family compositions as well.
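A merge like this can be sketched with DataFrame.merge. The two income figures below are the ones quoted above for San Jose and Miami; the booking rows, the Boston row, and the column names user_city, is_package and mean_income are made up for illustration.

```python
import pandas as pd

# Hypothetical slice of the family composition 1 data frame.
composition_1 = pd.DataFrame({
    "user_city": ["San Jose", "Miami", "Miami", "Boston"],
    "is_package": [0, 1, 0, 1],
})

# Mean income per selected city (only the two values stated in the post).
city_income = pd.DataFrame({
    "user_city": ["San Jose", "Miami"],
    "mean_income": [101980, 46946],
})

# An inner join keeps only sessions booked from a city in the income table.
merged = composition_1.merge(city_income, on="user_city", how="inner")
print(merged)
```

Using how="inner" quietly drops sessions from cities outside the selected list (Boston here); how="left" would keep them with a missing mean_income instead.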

Analysis

With our data frames completed, we now focused our attention on the interesting insights we could extract from these particular compositions.

Using cities as a proxy for income, we can see that package deals are more popular in lower income areas: almost half the deals booked from Miami were package deals. Similarly, we see higher rates of package deals being made in Brooklyn and New York, both of which sit towards the median of our income distribution. Notably, cities like San Jose and San Francisco, towards the higher end of our income distribution, tended to close far fewer package deals. It would benefit Expedia to maximize package deals since this would allow them to make sales at a higher volume.
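A package rate per city can be computed with a groupby, since the mean of a 0/1 column is the share of ones. The sketch below uses a handful of invented rows, so the rates it prints are toy numbers and not our results; the column names are assumptions.

```python
import pandas as pd

# Invented bookings, already merged with mean income per city.
bookings = pd.DataFrame({
    "user_city": ["Miami", "Miami", "San Jose", "San Jose", "San Jose", "San Jose"],
    "mean_income": [46946, 46946, 101980, 101980, 101980, 101980],
    "is_package": [1, 0, 0, 0, 1, 0],
})

# Share of package deals per city, ordered from lowest to highest income.
package_rate = (
    bookings.groupby(["user_city", "mean_income"])["is_package"]
    .mean()
    .reset_index(name="package_rate")
    .sort_values("mean_income")
    .reset_index(drop=True)
)
print(package_rate)
```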

Family composition 1 (2 adults, 0 children). Where are people from various income levels staying?

The above highlights two important ideas: lower star rated hotels are not very popular in general, but they are most highly represented in lower income areas like Miami, Denver, and Chicago. Interestingly, we can also see that people from higher income areas are staying in higher star rated hotels that are of a lower price category. In other words, they are getting better deals.
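A breakdown like this, the share of each star rating within each city, can be tabulated with pd.crosstab. This is only a sketch on four invented rows with assumed column names; it is not the code behind the graphic.

```python
import pandas as pd

# Invented stays; star_rating is on the 1 to 5 scale.
stays = pd.DataFrame({
    "user_city": ["Miami", "Miami", "San Jose", "San Jose"],
    "star_rating": [2, 4, 4, 5],
})

# normalize="index" turns counts into shares, so each city's row sums to 1.
star_share = pd.crosstab(stays["user_city"], stays["star_rating"], normalize="index")
print(star_share)
```

Normalizing by row makes cities with very different booking volumes directly comparable.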

Family composition 3 (1 adult, no children). Where are people from various income levels within this composition staying?

Again, this graphic highlights that lower star rated hotels are not as popular, indicating that lower income people tend to travel less.

Package deals are more popular among lower income travellers.

While higher star rated hotels are more popular, it is important to note that people from lower income areas are booking less (as we hypothesized), and are also purchasing packages at a higher rate than people booking higher star rated hotels.

Conclusions

From our observations, it is clear that lower income to average income family compositions are booking packages more frequently than higher income compositions. Because of this trend, Expedia is missing out on the opportunity to market package deals at a higher volume to a significant portion of their customer base. We concluded that developing higher end “premium packages”, marketed to people from average to higher income levels, would allow Expedia to optimize their revenue function. These premium packages could come in the form of higher rated hotels, nicer car rentals, and more expensive airfare.

For future analysis, it would be interesting to note where transactions take place. Specifically, are people from higher or lower income areas booking more deals from their phones? And if so, how can we act on this knowledge to increase the volume of packages being sold? Correlating this data to our family compositions and package analysis would allow Expedia to invest more efficiently in mobile or web based marketing and technological infrastructure.

Overall, this was a very cool experience for our first foray into the data science world. We look forward to attending future hackathons and building even cooler things!

To check out our full presentation complete with all graphics, click here. For the corresponding write-up, click here.

All of our code is open sourced at aumitleon/github/datafest2017.

Written by Aumit Leon