Introduction

Raw data for real analytical use cases in many industries and companies is frequently provided in Excel files. These files usually cannot be fed directly into machine learning models; they must first be cleaned and preprocessed, a process with many potential pitfalls. This makes data preprocessing a major time factor in the daily work of a data scientist [1].

In this short project an Excel spreadsheet is presented that is closely modelled on a real case but contains only simulated figures, for reasons of data protection and confidentiality of business results. The form and structure of the file, however, correspond to the real case, and a data scientist could well encounter a file like this in a company.

The project is implemented in Python with a Jupyter Notebook. First the data is read in, then cleaned and pre-processed. Finally, the various features are examined and a simple regression model is applied to describe the underlying relationship between working hours and goods sold. This article discusses the most important aspects and results; for a detailed look at the code, please refer to the linked Jupyter Notebook and the corresponding GitHub repository.

Data set, import and pre-processing

First of all, the Excel file at hand is examined in more detail. It contains two sheets: one with sales data for various group stores, the other with information on opening hours and thus the maximum working hours at each location. A first important note for practical work concerns the cooperation with the colleagues in the departments that compiled this data:

Before starting to process the data, obtain as much information as possible about the data set from the colleagues in the relevant departments. This can contribute significantly to faster and cleaner pre-processing.

In our example, the presence of the values '#NV' and 'Err:520' in the table - shown in its Excel view in Figure 1 - can be determined this way. They can be converted directly into Python's/pandas' 'NaN' on import.

Fig. 1: Screenshot of the provided Excel sheet
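
A minimal sketch of such an import, assuming pandas and placeholder file and sheet names (the real workbook will differ):

    import pandas as pd

    # '#NV' and 'Err:520' are converted to NaN directly on import.
    na_markers = ['#NV', 'Err:520']
    sales = pd.read_excel('data.xlsx', sheet_name='Sales', na_values=na_markers)
    hours = pd.read_excel('data.xlsx', sheet_name='OpeningHours', na_values=na_markers)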

Next, the data types are checked after reading and adjusted where necessary. The same applies to implausible and missing values in the data set, which are replaced appropriately; more details can be found in the Data allocation and preparation section of the notebook (a rough sketch of these steps follows after the note below). It is also noticeable that the Excel table contains structuring elements that exist purely for visual reasons and that a single month contains no information at all. This touches on another important point in the handling of Excel raw data:

Excel is a day-to-day working tool for many departments, which means the actual work is done directly in the document. Formatting and input are therefore chosen so that the Excel user can work with the displayed information as quickly as possible and so that it looks visually appealing. This, however, conflicts with automated processing, which has to be kept in mind when working with the data outside of Excel.
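
The type checks and replacements mentioned above could, as a minimal sketch, look like the following; the column name 'SalesUnits' and the DataFrame name 'sales' are assumptions, not taken from the actual notebook:

    # Inspect the data types assigned on import.
    print(sales.dtypes)

    # Force a numeric column to a numeric dtype; unparsable entries become NaN.
    # 'SalesUnits' is a placeholder for the actual column name.
    sales['SalesUnits'] = pd.to_numeric(sales['SalesUnits'], errors='coerce')

    # Rows that only serve as visual separators in Excel are completely empty
    # after import and can simply be dropped.
    sales = sales.dropna(how='all')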

In the further pre-processing, the monthly figures are accumulated so that seasonal fluctuations are excluded. In addition, column names are changed and the working time is expressed as a ratio of the opening hours, which better reflects the effort invested in a store or department.
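
These two steps could be sketched as follows; the column names 'Store', 'Department', 'Hours' and 'OpeningHours' are assumptions for illustration:

    # Accumulate the monthly figures per store and department to remove seasonality.
    cum = sales.groupby(['Store', 'Department'], as_index=False).sum(numeric_only=True)

    # Express the worked hours relative to the opening hours of the location.
    cum = cum.merge(hours, on='Store')
    cum['HoursRatio'] = cum['Hours'] / cum['OpeningHours']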

Exclude departments based on feature distributions

By examining the distributions, i.e. histograms, of the features originally present in the data set per department, the analysis can be limited to the essential departments. Some examples and the reasoning behind them:

  • Department 3 (other) shows a very narrow distribution (see Fig. 2). The values here range only between 0.128 and 0.143, while the fluctuation in the other departments is not limited to such small decimals. A look into the data set makes clear why: the recorded hours are all identical; only the cumulative hours in the denominator produce the small variance in the feature 'HoursRatio'. This case must be passed back to the business, since it is highly doubtful that the values for department 3 are correct. The department is therefore also excluded from the further analysis.

    Fig. 2: Frequency distribution of feature 'HoursRatio' in department 3 (other)

  • Departments 15 (Admin), 16 (Customer Service) and 17 (others) are not part of the original sales areas. For this reason their distributions of goods sold - which are nearly identical (see Fig. 3) - are not very meaningful, and they are removed from the data set.

    Fig. 3: Density of goods sold in the departments 15 (Admin), 16 (Customer Service) and 17 (others)

  • Department 11 is Delivery. It is therefore plausible that no sales area is required here and that the distribution is centered on 0 (see Fig. 4). This department is also removed from the data set.

    Fig. 4: Frequency distribution of sales area in department 11 (Delivery)

These examples illustrate the importance of knowledge about the organization and infrastructure of a company.

For data analysis it is indispensable to obtain information about the general conditions in the company. This allows special characteristics to be taken into account and extraordinary effects to be eliminated. If the business context is not understood, the analysis ends up biased.

This leaves the following departments for further analysis: Dry, Food, Non Food, Frozen, Fish, Fruits & Vegetables, Meat, Clothing, Household, Hardware
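
Reducing the data set to these departments is then a simple filter; the DataFrame and column names below are the same illustrative assumptions as above:

    keep = ['Dry', 'Food', 'Non Food', 'Frozen', 'Fish', 'Fruits & Vegetables',
            'Meat', 'Clothing', 'Household', 'Hardware']
    cum = cum[cum['Department'].isin(keep)]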

Correlations between features

The next step is to look at the mutual relationships between the numerical features, grouped by department. This is done with a so-called pair plot, shown in Figure 5. The diagonal shows the distribution of each feature, color-coded by department. The scatter plots in the upper-right half show the relationship between the respective pair of features; each point represents one department of a specific store. Finally, the lower-left half contains density estimates for the different departments across all stores.

Fig. 5: Pair plot including feature distributions on the diagonal, scatter plots on the upper right side
and density estimates on the lower left side color coded by department
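
A plot with this layout can be built with seaborn's PairGrid; a rough sketch using the assumed column names from above:

    import seaborn as sns

    g = sns.PairGrid(cum[['SalesUnits', 'HoursRatio', 'Area', 'Department']],
                     hue='Department')
    g.map_diag(sns.histplot)      # feature distributions on the diagonal
    g.map_upper(sns.scatterplot)  # pairwise scatter plots, upper right
    g.map_lower(sns.kdeplot)      # density estimates, lower left
    g.add_legend()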

Several findings can be read from this pair plot:

  • The distributions of sales units and turnover in particular are dominated by a few departments with very small quantities.

  • The departments appear as (sometimes) overlapping but clearly defined clusters or areas in the relationships between the individual features.

  • The relationship between sales units and hours ratio (second column, first row) shows department clusters that are elongated and roughly ellipsoidal, which is consistent with a linear relationship. This suggests a further investigation using linear regression.

Fig. 6: Correlation between sales units (Sales),
ratio of working hours (Hours) and sales area (Area)

We can also look at the correlation coefficients between the features. Turnover is not included here because it is so closely tied to the sales units. This leaves three distributions of correlation coefficients, shown in Figure 6; each distribution results from computing the coefficient separately per department. On average, the linear relationship between sales units and working hours is the strongest. This makes modeling the relationship between sales units and working hours ratio interesting not only from the management's point of view: depending on the department, there appears to be a linear relationship between these variables within certain limits.
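
The per-department correlation coefficients behind Figure 6 can be obtained with a simple groupby; again, the column names are illustrative assumptions:

    # Pearson correlation between sales units, hours ratio and sales area,
    # computed separately for every department.
    corrs = (cum.groupby('Department')[['SalesUnits', 'HoursRatio', 'Area']]
                .corr()
                .round(2))
    print(corrs)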

Linear regression of hours ratio vs. sales units

We consider how the sales units depend on the invested working time. As seen before, this only makes sense per department, which leads to a regression analysis of goods sold as a function of hours worked in a specific department. We investigate this relation separately by department and check whether the point cloud is narrow, which would support a linear dependency. Some departments, such as Frozen or Meat, show a clearly defined linear structure (see Figure 7). Others, such as Dry and Household, are more diffuse (see Figure 8).

Fig. 7: Sales units vs. ratio of working hours for example departments with a narrow scatter
Fig. 8: Sales units vs. ratio of working hours for example departments with a broader scatter

Finally, a linear model for predicting sales figures from known working hours is trained. This is done in the awareness that the statistical significance is rather low due to the small amount of available data. The algorithm is implemented in Python using the LinearRegression() model of the scikit-learn library [2], with a quarter of the data held out as a test set. In this way, the linear models shown in Figure 9 for the Meat department and Figure 10 for Frozen are obtained.
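
A sketch of such a fit for a single department, with a 25 % test split as described; the variable and column names are assumptions:

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    dept = cum[cum['Department'] == 'Meat']
    X = dept[['HoursRatio']]   # working hours relative to opening hours
    y = dept['SalesUnits']     # goods sold

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                        random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print(model.score(X_train, y_train), model.score(X_test, y_test))  # R² on train/test
    print(model.coef_[0], model.intercept_)                            # slope and intercept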

Fig. 9: Linear regression model for department Meat,
R²(train) = 0.69, R²(test) = 0.83,
f(w) = 0.428 * w + 4.703
Fig. 10: Linear regression model for department Frozen,
R²(train) = 0.67, R²(test) = 0.72,
f(w) = 0.131 * w + 0.163

References

[1] Press, Gil. "Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says." Forbes, 23 Mar. 2016, www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#6ea73f616f63.

[2] Pedregosa, Fabian, et al. "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research 12 (2011): 2825-2830.

Title photo by Hanson Lu on Unsplash