
Too Much for a Spreadsheet?

Here’s How You Can Easily Analyze Big Data 

The more rows of data in a set, the more difficult it can be to evaluate with traditional methods 

The rise in data is omnipresent thanks to technological advancements that have digitalized just about everything. From power consumption of household laundry areas to the behavior of a heat exchanger on the factory floor, the amount of data available for analysis can be overwhelming. 

There is much more that can be learned from data that is already available. But tapping into it can be a daunting task for someone who is not a data scientist. The larger the dataset, the more intimidating it can be. 

Large datasets require special solutions. For demonstration, a simulated dataset for the period of January 2019 through December 2021 was created based on data in the Household Power Consumption dataset from the UCI Machine Learning Repository. The multivariate time-series dataset has 2,075,259 rows of one-minute averages. These are the details of the dataset: 

Attribute               Meaning
date                    Date
time                    Time
global_active_power     The total active power consumed (kW)
global_reactive_power   The total reactive power consumed (kW)
voltage                 Volts (V)
global_intensity        Current intensity (A)
sub_metering_1          Kitchen area (Wh)
sub_metering_2          Laundry area (Wh)
sub_metering_3          Air-conditioner & heat (Wh)

Because the dataset is so large, it cannot be evaluated using a spreadsheet: the sheet runs out of rows. A Microsoft Excel worksheet, for example, holds at most 1,048,576 rows, roughly half of what this dataset needs. Traditionally, a data scientist would apply algorithms or computer code to reduce the data to a manageable number of rows. But waiting on a data scientist simply because a spreadsheet cannot handle a dataset is not always possible. Instead, you can apply advanced industrial analytics to explore this extensive data. 
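For readers who do want to see what the code route looks like, here is a minimal sketch of reducing one-minute rows to hourly averages. It streams the file one row at a time, so memory use does not grow with the row count. The comma-separated layout, the ISO date format, and the function name hourly_mean_power are assumptions for illustration; the demo dataset's actual file format is not described here.

```python
import csv
import io
from collections import defaultdict

EXCEL_MAX_ROWS = 1_048_576  # row limit of a single Excel worksheet


def hourly_mean_power(lines):
    """Stream CSV rows and return {(date, hour): mean global_active_power}.

    Rows are read one at a time, so memory use depends on the number of
    hours in the data, not on the number of one-minute rows.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(lines):
        try:
            power = float(row["global_active_power"])
        except (TypeError, ValueError):
            continue  # skip missing or non-numeric readings
        key = (row["date"], int(row["time"][:2]))  # assumes HH:MM:SS times
        sums[key] += power
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Tiny made-up sample standing in for the real file (hypothetical values).
sample = io.StringIO(
    "date,time,global_active_power\n"
    "2019-01-01,00:00:00,1.0\n"
    "2019-01-01,00:01:00,3.0\n"
    "2019-01-01,01:00:00,?\n"
    "2019-01-01,01:01:00,2.0\n"
)
print(hourly_mean_power(sample))
print(2_075_259 > EXCEL_MAX_ROWS)  # True: the dataset exceeds one worksheet
```

With a real file, the same function could be called as hourly_mean_power(open(path, newline="")), where path is wherever the data is stored.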

Overview of the Data 

The plotted data in Figure 1 shows that household power consumption dropped during August in the second year. It demonstrates the intensity of power consumed and provides an engineer with a clear picture of the anomaly. Because the measured energy is consumed in three areas (kitchen, laundry, and heating and cooling), it is necessary to take a closer look at how consumption dropped in each area during that time. Figure 2 shows that consumption in the kitchen and laundry areas stopped entirely. The heating and cooling area remained active, although its consumption was lower than what would be considered normal. 

Figure 1. Plotted data for June 2019 through December 2021 shows active power consumption dropped significantly during August 2020. 
Figure 2. Active power consumption dropped or ended in all three regions during the anomaly. 
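The kind of drop visible in the plots can also be flagged numerically. The sketch below averages readings by month and lists the months whose mean falls below half of the median monthly mean. The 50% cutoff, the function names, and the sample values are all illustrative assumptions, not figures taken from the demo dataset.

```python
from collections import defaultdict
from statistics import mean, median


def monthly_means(readings):
    """readings: iterable of (iso_date, value); returns {'YYYY-MM': mean}."""
    buckets = defaultdict(list)
    for iso_date, value in readings:
        buckets[iso_date[:7]].append(value)  # 'YYYY-MM' prefix of the date
    return {month: mean(values) for month, values in sorted(buckets.items())}


def low_months(monthly, fraction=0.5):
    """Months whose mean is below `fraction` of the median monthly mean."""
    cutoff = fraction * median(monthly.values())
    return [month for month, value in monthly.items() if value < cutoff]


# Made-up readings in kW, for illustration only.
readings = [
    ("2020-06-15", 1.2), ("2020-06-16", 1.0),
    ("2020-07-15", 1.1), ("2020-07-16", 0.9),
    ("2020-08-15", 0.2), ("2020-08-16", 0.1),
    ("2020-09-15", 1.0), ("2020-09-16", 1.2),
]
print(low_months(monthly_means(readings)))  # ['2020-08']
```

Running the same two functions on each sub-metering column separately would show which area is responsible for a drop.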

Engineers often find that exploring the relationship between all variables of a multivariable dataset in a single visualization allows them to detect positive or negative relationships between them. Figure 3 shows the multivariate relationship between different attributes. One such relationship, between global_intensity (A) and global_active_power (kW), has a coefficient of determination (R²) of 0.99 and is highlighted by the blue box. The strong positive relationship between power and current should not be a surprise: at a nearly constant supply voltage, active power is close to proportional to current (P = V × I × power factor). In addition to the bivariate relationships, the histograms of each attribute are highlighted by red boxes. These are extracted and shown as one image in Figure 4. 

Figure 3. Multiscatterplot relationship of all attributes except global_reactive_power.
Figure 4. Distributions of all attributes except global_reactive_power.
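As a rough illustration of how a coefficient of determination like the one quoted for current and power can be computed: for a straight-line fit with one predictor, R² equals the square of the Pearson correlation coefficient. The sketch below uses that identity. The function name r_squared and the sample readings are hypothetical; the power values are simply current times a voltage near 240 V.

```python
def r_squared(xs, ys):
    """R^2 of a simple linear fit of ys on xs (squared Pearson correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Made-up readings: current in A, voltage in V (not from the demo dataset).
current = [2, 4, 6, 8, 10]
voltage = [239, 241, 240, 242, 238]
power_kw = [v * i / 1000 for v, i in zip(voltage, current)]

print(round(r_squared(current, power_kw), 4))
```

Because the voltage barely moves, power tracks current almost perfectly and the result is very close to 1.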

Among all attributes, voltage is the only variable with a normal distribution. Kitchen and laundry areas mostly peak at exceptionally low consumption, whereas the heating and cooling area has a clearly bimodal distribution. This shows that the HVAC (Heating, Ventilation, and Air Conditioning) system consumes much more energy than the kitchen or laundry areas. 
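A distribution like the ones described above boils down to counting readings per bin. The sketch below is one simple way to do that with a fixed bin width. The function name histogram, the bin width of 5 Wh, and the sample values are assumptions chosen for illustration.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


# Made-up sub-metering readings in Wh: a cluster near 0 and one near 18.
readings = [0, 1, 1, 2, 17, 18, 18, 19, 1, 0]
bins = histogram(readings, bin_width=5)
for lower_edge, count in bins.items():
    print(f"{lower_edge:>3} Wh  {'#' * count}")
```

Two populated bins separated by empty ones, as in this sample, is the text-mode equivalent of a bimodal histogram.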

Comparing Seasonal Differences, Daily Activity 

The data can be partitioned further by month, which allows an engineer to see, for example, whether more energy is consumed during the summer or the winter. This breakdown is provided in Figures 5 through 7. They show that the household consumes significantly more power in winter than in summer in all three years. The demo data came from a region where summers are short and winters are very cold. 

Figure 5. Power consumption comparison between the winter months (December and January) and summer months (July and August) for Year 1. 
Figure 6. Power consumption comparison between the winter months (December and January) and summer months (July and August) for Year 2. 
Figure 7. Power consumption comparison between the winter months (December and January) and summer months (July and August) for Year 3. 
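The winter-versus-summer comparison can be sketched as a grouping by year and season. The month-to-season mapping follows the figure captions (December and January for winter, July and August for summer). Grouping by calendar year, the function name seasonal_means, and the sample readings are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Month numbers taken from the figure captions.
SEASONS = {12: "winter", 1: "winter", 7: "summer", 8: "summer"}


def seasonal_means(readings):
    """readings: iterable of (date, kW); returns {(year, season): mean kW}."""
    buckets = defaultdict(list)
    for day, value in readings:
        season = SEASONS.get(day.month)
        if season is not None:  # months outside the comparison are ignored
            buckets[(day.year, season)].append(value)
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Made-up readings, not taken from the demo dataset.
readings = [
    (date(2019, 1, 10), 2.0),
    (date(2019, 12, 10), 3.0),
    (date(2019, 7, 10), 1.0),
    (date(2019, 8, 10), 0.5),
    (date(2019, 4, 10), 9.0),  # April: neither season, so it is skipped
]
print(seasonal_means(readings))
```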

Figure 8 shows the daily activity during January in the second year. Energy consumption in the kitchen is high in the morning and evening hours. Laundry activity is sporadic. Cooling and heating activity is persistent and the highest among the three areas. The new views agree with the earlier observations that most of the household consumption is in the heating and cooling area, and it is constant for most of January. 

Figure 9 shows daily activity during July in the second year. Kitchen and laundry activity is similar to January, with higher activity in the early morning and evening hours. While the heating and cooling area shows spikes during the early and late hours of the day, its consumption stays lower than in the winter months. 

Figure 8. Daily activity trends for January of Year 2 show activity in all three regions.  
Figure 9. Daily activity trend for July of Year 2. 
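A daily activity profile like the ones in these figures can be approximated by averaging each area's readings by hour of day within a single month. The sketch below assumes each row has already been parsed into a timestamp plus a mapping of area name to Wh; the function name hourly_profile, the area labels, and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_profile(rows, year, month):
    """rows: iterable of (datetime, {area: Wh}).

    Returns {area: {hour_of_day: mean Wh}} for the requested month only.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for stamp, areas in rows:
        if stamp.year == year and stamp.month == month:
            for area, wh in areas.items():
                buckets[area][stamp.hour].append(wh)
    return {
        area: {hour: mean(values) for hour, values in sorted(hours.items())}
        for area, hours in buckets.items()
    }


# Made-up rows, for illustration only.
rows = [
    (datetime(2020, 1, 1, 7, 0), {"kitchen": 10.0, "hvac": 18.0}),
    (datetime(2020, 1, 2, 7, 0), {"kitchen": 20.0, "hvac": 18.0}),
    (datetime(2020, 7, 1, 7, 0), {"kitchen": 99.0, "hvac": 1.0}),
]
print(hourly_profile(rows, 2020, 1))
```

Calling the function once for January and once for July gives the two profiles to compare side by side.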

Concluding Thoughts 

Data continues to provide opportunities for analysis and insights, but its sheer volume can be overwhelming. The use case presented here demonstrates how advanced industrial analytics can accommodate large datasets for quick evaluation. But that is not the only advantage of applying advanced industrial analytics to find gems in big data. 

Among its benefits? Advanced industrial analytics empowers engineers to: 
