Are all data useful? Inferring causality to predict flows across sewer and drainage systems using directed information and boosted regression trees

Overview of attention for article published in Water Research, September 2018

Mentioned by
2 X users

Citations
19 Dimensions

Readers on
50 Mendeley
Published in
Water Research, September 2018
DOI 10.1016/j.watres.2018.09.009
Authors

Yao Hu, Donald Scavia, Branko Kerkez

Abstract

As more sensor data become available across urban water systems, it is often unclear which of these new measurements are actually useful and how they can be efficiently ingested to improve predictions. We present a data-driven approach for modeling and predicting flows across combined sewer and drainage systems, which fuses sensor measurements with output of a large numerical simulation model. Rather than adjusting the structure and parameters of the numerical model, as is commonly done when new data become available, our approach instead learns causal relationships between the numerically-modeled outputs, distributed rainfall measurements, and measured flows. By treating an existing numerical model - even one that may be outdated - as just another data stream, we illustrate how to automatically select and combine features that best explain flows for any given location. This allows for new sensor measurements to be rapidly fused with existing knowledge of the system without requiring recalibration of the underlying physics. Our approach, based on Directed Information (DI) and Boosted Regression Trees (BRT), is evaluated by fusing measurements across nearly 30 rain gages, 15 flow locations, and the outputs of a numerical sewer model in the city of Detroit, Michigan: one of the largest combined sewer systems in the world. The results illustrate that the Boosted Regression Trees provide skillful predictions of flow, especially when compared to an existing numerical model. The innovation of this paper is the use of the Directed Information step, which selects only those inputs that are causal with measurements at locations of interest. Better predictions are achieved when the Directed Information step is used because it reduces overfitting during the training phase of the predictive algorithm. 
In the age of "big water data", this finding highlights the importance of screening all available data sources before using them as inputs to data-driven models, since more may not always be better. We discuss the generalizability of the case study and the requirements of transferring the approach to other systems.
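
The two-stage idea in the abstract (screen the candidate inputs first, then train the boosted trees only on the inputs that pass) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the gauge count, lag, and coefficients are invented for the example, and lagged mutual information from scikit-learn is swapped in as a simple stand-in for Directed Information, which requires a dedicated estimator. The helper names `make_synthetic`, `screen_inputs`, and `fit_and_score` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import mutual_info_regression


def make_synthetic(n=600, seed=0):
    """Synthetic rainfall at 6 hypothetical gauges and one flow series.

    Assumption for illustration only: flow responds to gauges 0 and 1
    with a one-step lag; gauges 2-5 are unrelated to the flow.
    """
    rng = np.random.default_rng(seed)
    rain = rng.gamma(shape=2.0, scale=1.0, size=(n, 6))
    flow = np.zeros(n)
    flow[1:] = 1.5 * rain[:-1, 0] + 1.2 * rain[:-1, 1]
    flow += rng.normal(0.0, 0.1, size=n)
    return rain, flow


def screen_inputs(X_lagged, y, keep=2, seed=0):
    """Rank lagged inputs by mutual information with the target.

    Stand-in for the Directed Information step: mutual information
    between past inputs and current flow, not DI itself.
    """
    scores = mutual_info_regression(X_lagged, y, random_state=seed)
    selected = np.argsort(scores)[::-1][:keep]
    return selected, scores


def fit_and_score(X, y, split):
    """Train boosted regression trees on the first `split` samples
    and return R^2 on the remaining held-out samples."""
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X[:split], y[:split])
    return model.score(X[split:], y[split:])


rain, flow = make_synthetic()
X_lagged, y = rain[:-1], flow[1:]  # rainfall at t-1 predicts flow at t

selected, scores = screen_inputs(X_lagged, y)
split = 400
r2_all = fit_and_score(X_lagged, y, split)
r2_sel = fit_and_score(X_lagged[:, selected], y, split)

print("selected gauges:", sorted(selected.tolist()))
print("held-out R^2, all inputs:     ", round(r2_all, 3))
print("held-out R^2, screened inputs:", round(r2_sel, 3))
```

Whether screening improves the held-out score depends on the data; the sketch only shows where such a step sits in the pipeline, and it does not reproduce the paper's results.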

X Demographics

The data for this section were collected from the profiles of 2 X users who shared this research output.
Mendeley readers

The data shown below were compiled from readership statistics for 50 Mendeley readers of this research output.

Geographical breakdown

Country | Count | As %
Unknown | 50 | 100%

Demographic breakdown

Readers by professional status | Count | As %
Student > Ph. D. Student | 8 | 16%
Student > Master | 7 | 14%
Student > Doctoral Student | 7 | 14%
Student > Bachelor | 3 | 6%
Professor | 3 | 6%
Other | 6 | 12%
Unknown | 16 | 32%

Readers by discipline | Count | As %
Engineering | 11 | 22%
Environmental Science | 9 | 18%
Computer Science | 3 | 6%
Business, Management and Accounting | 2 | 4%
Earth and Planetary Sciences | 2 | 4%
Other | 3 | 6%
Unknown | 20 | 40%

Attention Score in Context

This research output has an Altmetric Attention Score of 1. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 11 September 2018.

All research outputs: #19,954,338 of 25,385,509 outputs
Outputs from Water Research: #7,639 of 11,877 outputs
Outputs of similar age: #252,947 of 345,275 outputs
Outputs of similar age from Water Research: #141 of 204 outputs

Altmetric has tracked 25,385,509 research outputs across all sources so far. This one is in the 18th percentile – i.e., 18% of other outputs scored the same or lower than it.
So far Altmetric has tracked 11,877 research outputs from this source. They receive a mean Attention Score of 5.0. This one is in the 31st percentile – i.e., 31% of its peers scored the same or lower than it.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 345,275 tracked outputs that were published within six weeks on either side of this one in any source. This one is in the 22nd percentile – i.e., 22% of its contemporaries scored the same or lower than it.
We're also able to compare this research output to 204 others from the same source and published within six weeks on either side of this one. This one is in the 24th percentile – i.e., 24% of its contemporaries scored the same or lower than it.