Chaos 22, 013111 (2012); http://dx.doi.org/10.1063/1.3675621 (25 pages)

Using time-delayed mutual information to discover and interpret temporal correlation structure in complex populations

D. J. Albers and George Hripcsak

Department of Biomedical Informatics, Columbia University, 622 West 168th Street, VC-5, New York, New York 10032, USA


(Received 15 October 2010; accepted 8 December 2011; published online 24 January 2012)

This paper addresses how to calculate and interpret the time-delayed mutual information (TDMI) for a complex, diversely and sparsely measured, possibly non-stationary population of time series of unknown composition and origin. The primary vehicle for this analysis is a comparison between the TDMI averaged over the population and the TDMI of an aggregated population (here, aggregation means that the population is conjoined before any statistical estimates are computed). Using information-theoretic tools, a sequence of practically implementable calculations is detailed that allows the average and aggregate TDMI to be interpreted. Moreover, these calculations can also be used to assess the degree of homogeneity or heterogeneity present in the population. To demonstrate their broad applicability, the methods are applied to time series of glucose measurements from two different subpopulations of individuals from the Columbia University Medical Center electronic health record repository, revealing a picture of the composition of the population as well as physiological features.
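The distinction between the average and the aggregate TDMI can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes uniformly sampled series with an integer lag (the paper treats sparse, irregular sampling with δt bins), uses a plain histogram plug-in estimator without bias correction, and the function names (`histogram_mi`, `delayed_pairs`, `average_tdmi`, `aggregate_tdmi`) are hypothetical.

```python
import numpy as np

def histogram_mi(x, y, bins=16):
    """Plug-in (histogram) mutual information estimate in nats, no bias correction."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    nz = pxy > 0                              # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def delayed_pairs(series, lag):
    """Pairs (x(t), x(t + lag)) for one uniformly sampled series (an assumption)."""
    series = np.asarray(series, dtype=float)
    return series[:-lag], series[lag:]

def average_tdmi(population, lag, bins=16):
    """Estimate the TDMI per individual, then average over the population."""
    return float(np.mean([histogram_mi(*delayed_pairs(s, lag), bins=bins)
                          for s in population]))

def aggregate_tdmi(population, lag, bins=16):
    """Pool the delayed pairs of all individuals first, then estimate once."""
    pairs = [delayed_pairs(s, lag) for s in population]
    x = np.concatenate([p[0] for p in pairs])
    y = np.concatenate([p[1] for p in pairs])
    return histogram_mi(x, y, bins=bins)

# Toy heterogeneous population: white noise around three different means.
rng = np.random.default_rng(0)
population = [rng.normal(mean, 1.0, 2000) for mean in (0.0, 5.0, 10.0)]
avg = average_tdmi(population, lag=1)
agg = aggregate_tdmi(population, lag=1)
print(f"average TDMI = {avg:.3f} nats, aggregate TDMI = {agg:.3f} nats")
```

In this toy setup each individual series is uncorrelated in time, so the average TDMI reflects little more than estimator bias, whereas the aggregate TDMI is large because a pooled pair reveals which individual it came from. A gap of this kind between the two estimates is one signature of population heterogeneity.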

© 2012 American Institute of Physics

Article Outline

  1. INTRODUCTION
    1. A reader’s guide: The outline of this paper
  2. MOTIVATING EXAMPLES
  3. INFORMATION THEORY BACKGROUND
    1. Average TDMI
    2. Aggregate TDMI
  4. TDMI-SPECIFIC ESTIMATOR BIASES
    1. Sample size dependent estimator bias effects
    2. Fixed point bias estimate for average and aggregate populations
    3. Non-estimator bias: How the TDMI calculation can act as a population filter
      1. Methods for assessing δt bin compositions
  5. POPULATION-BASED DEVIATIONS FROM THE INDIVIDUAL TDMI ESTIMATES
    1. Heterogeneity-based deviations from the individual: Average TDMI case
      1. Entropy of the averaged population
    2. Heterogeneity-based deviations from the individual: Aggregate TDMI case
      1. Entropy of the aggregated population
  6. HOW TO INTERPRET THE TDMI FOR A POPULATION, OR, TDMI-BASED METHODS FOR INTERPRETING POPULATION DIVERSITY
    1. Support dependent, graph independent, effects on the population TDMI
    2. Graph dependent, support independent, effects on the population TDMI
    3. Support dependent, graph-based effects on the population TDMI
  7. NON-TDMI-BASED METHODS FOR INTERPRETING POPULATION DIVERSITY
    1. Homogeneity in measurement composition
    2. Homogeneity in measurement distribution supports
    3. Homogeneity in the distribution of the graphs of the measurement PDFs
  8. ASSEMBLING THE PIECES: AN EXPLICIT PRESCRIPTION FOR TDMI ANALYSIS AND INTERPRETATION FOR A POPULATION OF TIME SERIES FOR A FIXED TIME SEPARATION δt
    1. Step one: Determining the computability of math(δt)
    2. Step two (A in Fig. ): Interpreting δI(δt) or math(δt)
    3. Step three (B in Fig. ): Assessing population representation
  9. QUANTITATIVE EXAMPLES FOR TDMI INTERPRETATION AND POPULATION HOMOGENEITY EVALUATION
    1. Simulated data examples: The quadratic map and the Gauss map
      1. TDMI-based analysis of the simulated data
      2. Non-TDMI-based analysis of the simulated data
      3. Quantifying small sample-size effects
    2. Real data examples: Glucose values for 100 densely sampled individuals versus 20,000 random individuals
      1. TDMI-based analysis for data set 7, the well measured population
      2. Non-TDMI-based analysis for data set 7, the well measured population
      3. TDMI-based analysis for data set 8, the random (less well measured) population
      4. Non-TDMI-based analysis for data set 8, the random (less well measured) population
      5. Analysis of the TDMI under variation of δt
  10. DISCUSSION AND COMMENTS
    1. Specific results of the interpretative framework relative to real data
    2. Using categorical billing code data to help verify the TDMI analysis
    3. How our method addresses nonstationarity
    4. Comments regarding the connection between the supports and the normalizations of the distributions
    5. Future directions regarding the use of this technique
    6. Some remaining statistical problems
  11. SUMMARY

KEYWORDS and PACS

Keywords

delays, time series

PACS

  • 05.45.Tp

    Time series analysis

  • 02.50.-r

    Probability theory, stochastic processes, and statistics

ARTICLE DATA

PUBLICATION DATA

ISSN

1054-1500 (print)  
1089-7682 (online)


Figures (6) Tables (11)

Figures

FIG.1
(Color) Graphical comparison of the average PDF and the PDF of the aggregate for three collections of Gaussian random numbers whose distributions have means 0, 2, and 4, respectively.


FIG.2
The graphical schematic for the TDMI analysis of a population; note that by TDMI Present, we mean that the relevant TDMI measure (e.g., math(δt)) is greater than bias.


FIG.3
(Color) The graphs of the quadratic map (Eq. (50)) and the Gauss map (Eq. (51)), and the invariant density (PDF of the orbit) for the quadratic map, the Gauss map, and the sum of the quadratic and Gauss maps. Note the significant difference between the graphs of the mappings and the significant differences between the relative p’s.


FIG.4
(Color) PDFs of glucose measurements for individuals within a population and for a population for two data sets, the 100 patients with the largest records and 20 000 random patients.


FIG.5
(Color) Comparisons of the supports and PDF graph variations for two data sets, the 100 patients with the largest records and 5000 random patients.


FIG.6
(Color online) The TDMI for both math and math with δt bins of 6 h for a period of a few days for D7 and D8; note that the bias estimates can be found in Tables VII and VIII. With respect to (a), note the following: for δt ≤ 6 h, δI > 0, and for δt > 6 h, δI ≈ 0; the KDE and histogram estimates are extremely similar; the diurnal (daily) periodic variation in the correlation of glucose is clearly evident in both math and math. With respect to (b), note the following: for all δt, δI is consistent and likely zero within bias; the KDE and histogram estimates differ greatly, implying the presence of small sample size effects in the average TDMI calculation; the diurnal (daily) periodic variation in the correlation of glucose is clearly evident in both math and math in all but the KDE-estimated TDMI average.


Tables

Table I. Summary of the non-TDMI-based metrics used to assess homogeneity in a population (among both the graphs and the supports) and to verify the TDMI-type analysis.

Table II. Summary of all the TDMI-based metrics used to interpret the TDMI and determine the population composition.

Table III. Complete list of the simulated data sets.

Table IV. TDMI results and homogeneity metrics for the simulated data sets one through six.

Table V. Heuristic homogeneity metrics for the simulated data sets one through six.

Table VI. Complete list of the real patient data sets.

Table VII. TDMI results and homogeneity metrics for the real patient data sets seven and eight; note all δt times are in hours.

Table VIII. TDMI results and homogeneity metrics for the real patient data sets seven and eight; note all δt times are in hours.

Table IX. Time independent TDMI results for the real patient data sets seven and eight.

Table X. Heuristic homogeneity metrics for the real patient data sets seven and eight.

Table XI. How to interpret the TDMI for a population of time series.


