Too Much for a Spreadsheet?
Here’s How You Can Easily Analyze Big Data
The more rows of data in a set, the more difficult it can be to evaluate with traditional methods
Data is everywhere, thanks to technological advances that have digitized just about everything. From the power consumption of a household laundry area to the behavior of a heat exchanger on the factory floor, the amount of data available for analysis can be overwhelming.
There is much more that can be learned from data that is already available. But tapping into it can be a daunting task for someone who is not a data scientist. The larger the dataset, the more intimidating it can be.
Large datasets require special solutions. For demonstration, a simulated dataset for the period of January 2019 through December 2021 was created based on data in the Household Power Consumption dataset from the UCI Machine Learning Repository. The multivariate time-series dataset has 2,075,259 rows of one-minute averages. It includes the following measurements:
- The total active power consumed (kW)
- The total reactive power consumed (kW)
- Current intensity (A)
- Kitchen area (Wh)
- Laundry area (Wh)
- Air-conditioner & heat (Wh)
Because the dataset is so large, it cannot be evaluated in a spreadsheet: the sheet runs out of rows. Microsoft Excel, for example, is limited to 1,048,576 rows per worksheet, roughly half of what this dataset requires. Traditionally, a data scientist would write code or apply algorithms to reduce the data to a manageable size. But waiting on a data scientist simply because a spreadsheet cannot handle a dataset is not always possible. Instead, you can apply advanced industrial analytics to explore this extensive data.
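For readers curious about what the traditional, code-based reduction involves, the sketch below streams a CSV file one row at a time and condenses one-minute readings into hourly means, so the full file never has to fit on a sheet or in memory. It is a minimal illustration and not the analytics platform discussed in this article. The column names `timestamp` and `global_active_power`, the timestamp layout, and the inline sample rows are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed layout: timestamp, global_active_power (kW).
SAMPLE_CSV = """timestamp,global_active_power
2019-01-01 00:00:00,1.2
2019-01-01 00:01:00,1.4
2019-01-01 00:02:00,
2019-01-01 01:00:00,2.0
2019-01-01 01:01:00,4.0
"""


def hourly_means(lines, value_column="global_active_power"):
    """Stream rows one at a time and return {"YYYY-MM-DD HH": mean value}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(lines):
        raw = row[value_column].strip()
        if not raw:  # skip missing readings instead of treating them as zero
            continue
        hour = row["timestamp"][:13]  # keep "YYYY-MM-DD HH", drop minutes and seconds
        sums[hour] += float(raw)
        counts[hour] += 1
    return {hour: sums[hour] / counts[hour] for hour in sums}


# With a real file, pass an open file handle instead of the in-memory sample.
hourly = hourly_means(io.StringIO(SAMPLE_CSV))
for hour, mean_kw in sorted(hourly.items()):
    print(hour, round(mean_kw, 3))
```

Reducing one-minute rows to hourly means shrinks the row count by a factor of 60, which would bring a two-million-row file well under a spreadsheet's limit, at the cost of losing minute-level detail.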
Overview of the Data
The plotted data in Figure 1 shows that household power consumption dropped during August in the second year. It demonstrates the intensity of power consumed and provides an engineer with a clear picture of the anomaly. Because the measured energy is consumed in three areas (kitchen, laundry, and heating and cooling), it is necessary to take a closer look at how consumption dropped in each area during that time. Figure 2 shows that consumption in the kitchen and laundry areas stopped entirely. The heating and cooling area remained active, although its consumption was lower than what would be considered normal.
Figure 1. Plotted data for June 2019 through December 2021 shows active power consumption dropped significantly during August 2020.
Figure 2. Active power consumption dropped or ended in all three areas during the anomaly.
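A drop like the one in Figures 1 and 2 can also be flagged numerically. The sketch below marks any day whose total consumption falls below a chosen fraction of the median day. The 50% threshold, the function name `flag_low_days`, and the daily totals are hypothetical values chosen for illustration and are not taken from the article's dataset.

```python
from statistics import median


def flag_low_days(daily_kwh, fraction=0.5):
    """Return the days whose consumption is below `fraction` of the median day."""
    threshold = fraction * median(daily_kwh.values())
    return [day for day, kwh in sorted(daily_kwh.items()) if kwh < threshold]


# Hypothetical daily totals in kWh (illustrative only).
daily_kwh = {
    "2020-07-30": 26.1,
    "2020-07-31": 24.8,
    "2020-08-01": 25.5,
    "2020-08-02": 6.2,
    "2020-08-03": 5.9,
    "2020-08-04": 25.0,
}

low_days = flag_low_days(daily_kwh)
print(low_days)
```

The median is used as the baseline rather than the mean because a long outage would pull the mean down and hide the very days being searched for. Running the same function on each area's series separately would show which areas stopped and which only slowed.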
Plotting every pair of variables in a multivariable dataset in a single visualization lets engineers spot positive or negative relationships at a glance. Figure 3 shows these pairwise relationships between the attributes. One of them, between “global_intensity” (A) and “global_active_power” (kW), has a coefficient of determination (R²) of 0.99 and is highlighted by the blue box. The strong positive relationship between power and current should not be a surprise: at a nearly constant supply voltage, active power is roughly proportional to current (P = V × I × power factor). In addition to the bivariate relationships, the histograms of each attribute are highlighted by red boxes. These are extracted and shown as one image in Figure 4.
Figure 3. Multi-scatterplot relationship of all attributes except “global_reactive_power.”
Figure 4. Distributions of all attributes except “global_reactive_power.”
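To make the R² value quoted for current and active power concrete, the sketch below computes the coefficient of determination for a straight-line fit between two series, which for a single predictor equals the squared Pearson correlation. The current and voltage values are hypothetical, and a power factor of 1 is assumed, so the result only illustrates why the relationship is nearly linear. It does not reproduce the 0.99 reported for the article's data.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear fit of ys on xs.

    For one predictor this equals the squared Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return (cov * cov) / (var_x * var_y)


# Hypothetical readings: current in amps, voltage sagging slightly under load.
current_a = [2.0, 4.5, 6.0, 10.2, 15.8, 20.0]
voltage_v = [241.0, 239.5, 240.8, 238.9, 237.2, 236.5]

# Assumes a power factor of 1, so active power in kW is V * I / 1000.
power_kw = [v * i / 1000 for v, i in zip(voltage_v, current_a)]

r2 = r_squared(current_a, power_kw)
print(round(r2, 4))
```

Because the voltage moves by only a few volts while the current changes tenfold, power tracks current almost perfectly and R² stays very close to 1.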
Among all attributes, voltage is the only variable with a normal distribution. The kitchen and laundry distributions peak at very low consumption, whereas the heating and cooling area has a distinctly bimodal distribution. This indicates that the HVAC (Heating, Ventilation, and Air Conditioning) system consumes much more energy than the kitchen or laundry areas.
Comparing Seasonal Differences, Daily Activity
The data can be partitioned further by month, which allows an engineer to see, for example, whether more energy is consumed during the summer or the winter. This breakdown is provided in Figures 5 through 7. They show that in all three years households consume significantly more power in the winter than in the summer. The demo data came from a region where summers are short and winters are very cold.
Figure 5. Power consumption comparison between the winter months (December and January) and summer months (July and August) for Year 1.
Figure 6. Power consumption comparison between the winter months (December and January) and summer months (July and August) for Year 2.
Figure 7. Power consumption comparison between the winter months (December and January) and summer months (July and August) for Year 3.
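The winter-versus-summer comparison in Figures 5 through 7 amounts to grouping readings by year and season and averaging each group. The sketch below shows one way to do that, using the same month definitions as the figure captions (December and January for winter, July and August for summer). The daily totals are hypothetical, and grouping December with January of the same calendar year is an assumption about how each "Year" was formed.

```python
from collections import defaultdict
from datetime import date

WINTER_MONTHS = {12, 1}  # December and January, as in the figure captions
SUMMER_MONTHS = {7, 8}   # July and August


def seasonal_means(daily_kwh):
    """Average (date, kWh) pairs per (calendar year, season); other months are ignored."""
    groups = defaultdict(list)
    for day, kwh in daily_kwh:
        if day.month in WINTER_MONTHS:
            season = "winter"
        elif day.month in SUMMER_MONTHS:
            season = "summer"
        else:
            continue
        groups[(day.year, season)].append(kwh)
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Hypothetical daily totals in kWh (illustrative only).
sample = [
    (date(2019, 1, 15), 30.0),
    (date(2019, 4, 1), 20.0),
    (date(2019, 7, 15), 12.0),
    (date(2019, 8, 15), 14.0),
    (date(2019, 12, 15), 34.0),
    (date(2020, 1, 10), 32.0),
    (date(2020, 7, 10), 11.0),
]

means = seasonal_means(sample)
for (year, season), kwh in sorted(means.items()):
    print(year, season, kwh)
```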
Figure 8 shows the daily activity during January in the second year. Energy consumption in the kitchen is high in the morning and evening hours. Laundry activity is sporadic. Cooling and heating activity is persistent and the highest among the three areas. The new views agree with the earlier observations that most of the household consumption is in the heating and cooling area, and it is constant for most of January.
Figure 9 shows daily activity during July in the second year. Kitchen and laundry activity resembles January's: higher activity in the early morning and evening hours. While the cooling and heating area shows spikes during the early and late hours of the day, its consumption stays lower than in the winter months.
Figure 8. Daily activity trends for January of Year 2 show activity in all three areas.
Figure 9. Daily activity trend for July of Year 2.
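The daily patterns in Figures 8 and 9 can be summarized as an hour-of-day profile: every reading is assigned to the hour in which it was taken, and each hour is averaged across all days. The sketch below does this for a handful of hypothetical kitchen readings and then picks the two busiest hours. The timestamp format and the values are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def hour_of_day_profile(readings):
    """Average (timestamp, value) readings by hour of day; returns {hour: mean}."""
    buckets = defaultdict(list)
    for stamp, value in readings:
        hour = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").hour
        buckets[hour].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}


# Hypothetical kitchen readings in Wh across two January days (illustrative only).
kitchen = [
    ("2020-01-06 07:15:00", 30.0),
    ("2020-01-06 13:00:00", 2.0),
    ("2020-01-06 19:30:00", 38.0),
    ("2020-01-07 07:45:00", 34.0),
    ("2020-01-07 13:10:00", 0.0),
    ("2020-01-07 19:05:00", 42.0),
]

profile = hour_of_day_profile(kitchen)
# The two hours with the highest average consumption.
peak_hours = sorted(profile, key=profile.get, reverse=True)[:2]
print(profile)
print(peak_hours)
```

Comparing the profile for a January month against the one for July is a compact way to check the observation that kitchen activity keeps the same morning and evening peaks in both seasons.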
Data continues to provide opportunities for analysis and insights, but its sheer volume can be overwhelming. The use case presented here demonstrates how advanced industrial analytics can accommodate large datasets for quick evaluation. But that is not the only advantage of applying advanced industrial analytics to find gems in big data.
Among its benefits? Advanced industrial analytics empowers engineers to:
- Reduce the energy use of a conveyor belt,
- Detect and react to pipe network anomalies,
- Find the right time to clean aeration elements,
- Eliminate unplanned pump shutdowns with real-time monitors and alerts, and