Dataset Abstract:

The tap_water_OH dataset encompasses a comprehensive collection of tap water samples from seven populous metropolitan areas in the western United States, sampled between 2012 and 2015. The areas include Los Angeles-Long Beach-Santa Ana, Phoenix-Mesa-Glendale, Salt Lake City, San Diego-Carlsbad-San Marcos, San Francisco-Oakland-Fremont, San Jose-Sunnyvale-Santa Clara, and Riverside-San Bernardino-Ontario. The study focuses on measuring hydrogen (δ2H) and oxygen (δ18O) isotopes to assess the influence of water sources and management practices on the isotope and elemental geochemistry of tap waters. Precipitation, average temperature, and elevation climate data were downloaded from the NCDC’s archive, which holds a record of global historical weather and climate data. The temp_precip_7cities dataset covers the period from January 1, 2012, to December 31, 2015. Combine the tap water data with the climate data into a single dataset, which will be referred to as ‘OH_Climate_dataset’.

Content

Basic Statistical Plots:
- Single Variable Statistic Plots
- Bivariate Plots
Clustering Analysis:
Multiple Variable Plots:
Mapping:
- Basic Visualizations on Map
Geospatial analysis:
- Spatial Clustering of Observations
- Multiple Linear Regression

1 Basic Statistical Plots

Single Variable Statistic Plots

# Load necessary libraries
library(ggplot2)
library(gridExtra)
library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:gridExtra':
## 
##     combine

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(plyr)

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

library(zoo)

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(corrplot)

## corrplot 0.92 loaded

library(maps)

## 
## Attaching package: 'maps'

## The following object is masked from 'package:plyr':
## 
##     ozone

library(sf)

## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

library(dbscan)

## 
## Attaching package: 'dbscan'

## The following object is masked from 'package:stats':
## 
##     as.dendrogram

library(gstat)
library(spdep)

## Loading required package: spData

## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`

library(spatialreg)

## Loading required package: Matrix

## 
## Attaching package: 'spatialreg'

## The following objects are masked from 'package:spdep':
## 
##     get.ClusterOption, get.coresOption, get.mcOption,
##     get.VerboseOption, get.ZeroPolicyOption, set.ClusterOption,
##     set.coresOption, set.mcOption, set.VerboseOption,
##     set.ZeroPolicyOption

# Load the OH dataset
OH_Climate_dataset_path <- "/Users/zhangshiyu/Desktop/R Course/Final/OH_Climate_dataset.csv"
data <- read_csv(OH_Climate_dataset_path)

## Rows: 2122 Columns: 14

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (2): Event, Area
## dbl  (11): Sample ID, Location, δ18O H2O [‰ SMOW], δD H2O [‰ SMOW], δ18O H2O...
## date  (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

data <- data %>%
  filter(LATITUDE >= 24.396308, LATITUDE <= 49.384358,
         LONGITUDE >= -125.001651, LONGITUDE <= -66.93457)
data$DATE <- as.Date(data$DATE)
# Interpolate missing TAVG values
data$TAVG <- na.approx(data$TAVG, rule=2)

## Single Variable Statistic Plots
# Histogram of δ18O H2O [‰ SMOW]
p1 <-  ggplot(data, aes(x=`δ18O H2O [‰ SMOW]`)) +
  geom_histogram(binwidth=0.5, fill="cornflowerblue", color="black") +
  ggtitle("Histogram of δ18O H2O [‰ SMOW]") +
  xlab("δ18O H2O [‰ SMOW]") + ylab("Frequency") +
  theme_minimal()

# Histogram of δ18O H2O [‰ SMOW] with KDE
p2 <- ggplot(data, aes(x=`δ18O H2O [‰ SMOW]`)) +
  geom_histogram(aes(y=..density..), bins=20, fill="skyblue", color="black") +
  geom_density(alpha=.2, fill="skyblue") +
  ggtitle("Histogram of δ18O H2O [‰ SMOW]") +
  theme_minimal()

# Histogram of δD H2O [‰ SMOW]
p3 <- ggplot(data, aes(x=`δD H2O [‰ SMOW]`)) +
  geom_histogram(binwidth=5, fill="lightcoral", color="black") +
  ggtitle("Histogram of δD H2O [‰ SMOW]") +
  xlab("δD H2O [‰ SMOW]") + ylab("Frequency") +
  theme_minimal()

# Histogram of δD H2O [‰ SMOW] with KDE
p4 <- ggplot(data, aes(x=`δD H2O [‰ SMOW]`)) +
  geom_histogram(aes(y=..density..), bins=20, fill="lightgreen", color="black") +
  geom_density(alpha=.2, fill="lightgreen") +
  ggtitle("Histogram of δD H2O [‰ SMOW]") +
  theme_minimal()

# Line plot for Precipitation  over time
p5 <- ggplot(data, aes(x=DATE, y=PRCP)) +
  geom_line(color="tomato") +
  ggtitle("Precipitation (PRCP) Over Time") +
  xlab("Date") + ylab("Precipitation") +
  theme_minimal()

# Line plot for Average Temperature (TAVG) over time
p6 <- ggplot(data, aes(x=DATE, y=TAVG)) +
  geom_line(color="tomato") +
  ggtitle("Average Temperature (TAVG) Over Time") +
  xlab("Date") + ylab("Average Temperature (°C)") +
  theme_minimal()

# Arrange the plots into a 3x2 grid
grid.arrange(p1, p2, p3, p4, p5, p6, ncol=2)

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The single-variable statistical plots provided a foundational understanding of the distribution and central tendencies of individual variables within my dataset. Through histograms and density plots, I can identify skewness, outliers, and the general shape of data distributions.

Bivariate Plots

## Bivariate plots
# Scatter Plot of δ18O vs δD
p7 <- ggplot(data, aes(x=`δ18O H2O [‰ SMOW]`, y=`δD H2O [‰ SMOW]`, color=Area)) +
  geom_point() +
  ggtitle("δ18O vs δD by Area") +
  xlab("δ18O H2O [‰ SMOW]") + ylab("δD H2O [‰ SMOW]") +
  theme_minimal() +
  scale_color_viridis_d() # Optional: Use a different color palette
plot(p7)

The scatter plot depicts the relationship between δ18O and δD(Deuterium: doo·tee·ree·uhm) values in tap water samples, color-coded by geographic areas: Arizona, California, one-time collection, and Salt Lake Valley, showing a positive correlation indicative of regional isotopic signatures.

2 Clustering Analysis

## Clustering Analysis
# Prepare the isotopic data for clustering
isotopic_data <- data[, c("δ18O H2O [‰ SMOW]", "δD H2O [‰ SMOW]")]
isotopic_data <- na.omit(isotopic_data) # Removing rows with NA values to ensure kmeans can run

# Perform K-means clustering with 3 clusters
set.seed(123) # Set seed for reproducibility
kmeans_result <- kmeans(isotopic_data, centers = 3) 
isotopic_data$cluster <- as.factor(kmeans_result$cluster) # Add the cluster assignments to your data

# Create the scatter plot
p8 <- ggplot(isotopic_data, aes(x=`δ18O H2O [‰ SMOW]`, y=`δD H2O [‰ SMOW]`, color=cluster)) +
  geom_point(alpha=0.7, size=2) +  # Adjust point transparency and size
  scale_color_manual(values=c("skyblue", "lightgreen", "salmon")) +  # Custom cluster colors
  labs(title="Clustering of Isotopic Compositions", 
       x="δ18O H2O [‰ SMOW]", y="δD H2O [‰ SMOW]") +
  theme_minimal(base_size = 15) +  # Use a minimal theme with adjusted base font size
  theme(panel.grid.major = element_line(color = "grey", size = 0.5),  # Add major grid lines
        panel.grid.minor = element_blank(),  # Remove minor grid lines
        panel.background = element_blank(),  # Remove panel background
        axis.line = element_line(color = "black"))  # Add axis lines

## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

plot(p8)

This scatter plot shows the clustering of water samples based on their isotopic composition, with each cluster (1, 2, and 3) representing a distinct group characterized by similar δ18O and δD values.

3 Mutipule Variable Plots

PCA Plot

## Mutipule Variable Plots
# Select only the numeric columns
numeric_data <- data[, sapply(data, is.numeric)]
numeric_data <- na.omit(numeric_data)

# Perform K-means clustering
kmeans_result_2 <- kmeans(numeric_data, centers = 3)
# Apply PCA for visualization purposes
pca_result <- prcomp(numeric_data, scale. = TRUE)

# Create a dataframe for plotting that includes PCA scores and cluster assignments
plot_data <- data.frame(PC1 = pca_result$x[, 1], PC2 = pca_result$x[, 2], 
                        Cluster = as.factor(kmeans_result_2$cluster))

# PCA plot with ggplot2 showing clusters
p9 <- ggplot(plot_data, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(alpha = 0.5) +
  scale_color_manual(values = c("#1b9e77", "#d95f02", "#7570b3")) +  # Adjust colors as needed
  labs(title = "PCA of Clustered Data (All Numeric Variables)", 
       x = "Principal Component 1", y = "Principal Component 2") +
  theme_minimal()
plot(p9)

The PCA plot visualizes the distribution of clustered data points based on all numeric variables, with points grouped into three distinct clusters along the first two principal components reflecting the greatest variance.

Correlation Heatmap

# Plot of correlation heatmap
cor_matrix <- cor(numeric_data, use = "complete.obs") # Correlation matrix 
p10 <- corrplot(cor_matrix, method = "color",
               order = "hclust", tl.col = "black", tl.srt = 45, 
               # Add more customization options as needed
               addCoef.col = "black",  # Add correlation coefficients
               diag = FALSE)  # Do not show diagonal

There is a very strong positive correlation between δ18O and δD values in the water samples, indicating that these isotopes vary together and possibly have a common source or process affecting their concentrations.
Elevation shows a strong positive correlation with both δ18O and δD, suggesting that altitude may play a significant role in influencing the isotopic composition of the water.
Latitude has a moderately strong negative correlation with δ18O and δD, which may imply that geographical location latitudinally impacts the isotopic signatures.
There is a notable positive correlation between sample location and both δ18O and δD, indicating that the geographical source of the water samples is a significant factor in their isotopic composition.
Temperature (TAVG) has a moderate negative correlation with δ18O, suggesting that temperature influences isotopic fractionation.
Sample ID, longitude, and precipitation (PRCP) show low to negligible correlations with δ18O and δD, suggesting they have little to no linear relationship with the isotopic composition of the samples in this dataset.

Parallel Coordinate Plot

# Parallel coordinate plot
# Select numeric variables for the parallel coordinate plot
parallel_vars <- data[, c("Area", "δ18O H2O [‰ SMOW]", "δD H2O [‰ SMOW]", "PRCP", "TAVG", "ELEVATION")]
parallel_vars$Area <- as.factor(parallel_vars$Area)
data_imputed <- sapply(parallel_vars, is.numeric)
parallel_vars[data_imputed] <- lapply(parallel_vars[data_imputed], function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
p11 <- ggparcoord(parallel_vars, columns = 2:6, groupColumn = 1, order = "anyClass") +
  theme_minimal() +
  labs(title = "Parallel Coordinate Plot by Group") +
  scale_color_viridis_d() 
plot(p11)

The parallel coordinate plot illustrates multi-variable relationships and comparisons across different areas, with each line representing a set of related data points for Arizona, California, one-time collection, and Salt Lake Valley across various variables such as elevation, isotopic compositions, temperature, and precipitation.

4 Mapping

Basic Visualizations on Map

## Mapping
# Base map with state boundaries
usa_map <- map_data("state")

# Plotting δ18O H2O [‰ SMOW] values on the map
m1 <-ggplot() + 
  geom_polygon(data = usa_map, aes(x = long, y = lat, group = group), 
               fill = "white", color = "black") +
  geom_point(data = data, aes(x = LONGITUDE, y = LATITUDE, color = `δ18O H2O [‰ SMOW]`), 
             size = 2, alpha = 0.7) +
  scale_color_viridis_c(name = expression(delta^18*"O H2O [‰ SMOW]")) +
  labs(title = expression(delta^18*"O H2O Values Across Locations")) +
  theme_minimal() +
  coord_fixed(1.3) +
  theme(legend.position = "right")
plot(m1)

# Plotting δD H2O [‰ SMOW] values on the map
m2 <-ggplot() + 
  geom_polygon(data = usa_map, aes(x = long, y = lat, group = group), 
               fill = "white", color = "black") +
  geom_point(data = data, aes(x = LONGITUDE, y = LATITUDE, color = `δD H2O [‰ SMOW]`), 
             size = 2, alpha = 0.7) +
  scale_color_viridis_c(name = expression(delta*D*" H2O [‰ SMOW]")) +
  labs(title = expression(delta*D*" H2O Values Across Locations")) +
  theme_minimal() +
  coord_fixed(1.3)
plot(m2)

# Visualization of Precipitation (PRCP)
m3 <- ggplot() + 
  geom_polygon(data = usa_map, aes(x = long, y = lat, group = group), 
               fill = "white", color = "black") +
  geom_point(data = data, aes(x = LONGITUDE, y = LATITUDE, color = PRCP), 
             size = 2, alpha = 0.7) +
  scale_color_viridis_c(name = "Precipitation (mm)") +
  labs(title = "Precipitation Values Across Locations") +
  theme_minimal() +
  coord_fixed(1.3)
plot(m3)

# Visualization of Average Temperature (TAVG)
m4 <- ggplot() + 
  geom_polygon(data = usa_map, aes(x = long, y = lat, group = group), 
               fill = "white", color = "black") +
  geom_point(data = data, aes(x = LONGITUDE, y = LATITUDE, color = TAVG), 
             size = 2, alpha = 0.7) +
  scale_color_viridis_c(name = "Average Temperature (°C)") +
  labs(title = "Average Temperature Values Across Locations") +
  theme_minimal() +
  coord_fixed(1.3)
plot(m4)

# Visualization of Elevation (ELEVATION)
m5 <- ggplot() + 
  geom_polygon(data = usa_map, aes(x = long, y = lat, group = group), 
               fill = "white", color = "black") +
  geom_point(data = data, aes(x = LONGITUDE, y = LATITUDE, color = ELEVATION), 
             size = 2, alpha = 0.7) +
  scale_color_viridis_c(name = "Elevation (m)") +
  labs(title = "Elevation Values Across Locations") +
  theme_minimal() +
  coord_fixed(1.3)
plot(m5)

5 Geospatial analysis

Spatial Clustering of Observations

## Geospatial analysis
# Basic Geospatial Visualization
data_sf <- st_as_sf(data, coords = c("LONGITUDE", "LATITUDE"), crs = 4326) # Convert to an sf object
# Perform DBSCAN clustering (example parameters)
coords <- st_coordinates(data_sf)# Extract coordinates
geo_cluster_result <- dbscan(coords, eps = 0.1, minPts = 5)
data_sf$cluster <- geo_cluster_result$cluster # Add cluster assignments back to the data

# Plot of Spatial Clustering of Observations with US State Boundaries
m6 <- ggplot() +
  geom_polygon(data = usa_map, aes(x = long, y = lat, group = group), fill = "white", color = "black", size = 0.25) +
  geom_sf(data = data_sf, aes(color = cluster), show.legend = "point") + 
  labs(title = "Spatial Clustering of Observations with US State Boundaries") +
  theme_minimal() +
  theme(legend.position = "right")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

plot(m6)

There is a very strong positive correlation between δ18O and δD in water samples, indicating that these isotopic variables tend to increase and decrease together.
Elevation appears to have a strong positive correlation with δ18O and δD, suggesting higher isotope values at higher elevations.
Latitude shows a negative correlation with δ18O and δD, possibly indicating that isotopic values decrease as one moves northward.
There is a moderate to strong positive correlation between elevation and latitude, which could indicate that the sampling locations at higher latitudes tend to also be at higher elevations.
Average temperature (TAVG) has a moderate negative correlation with δ18O and δD standard deviations, suggesting that with higher average temperatures, the variability in isotopic composition decreases.
Precipitation (PRCP) does not show a strong correlation with the isotopic variables, implying that it may not be a direct predictor of δ18O and δD values in this dataset.

Multiple Linear Regression

# Multiple linear regression including cluster as a categorical variable
data_sf$cluster <- as.factor(data_sf$cluster)
model_with_clusters <- lm(`δ18O H2O [‰ SMOW]` ~ `δD H2O [‰ SMOW]` + PRCP + TAVG + ELEVATION + cluster, data = data_sf)
summary(model_with_clusters)

## 
## Call:
## lm(formula = `δ18O H2O [‰ SMOW]` ~ `δD H2O [‰ SMOW]` + 
##     PRCP + TAVG + ELEVATION + cluster, data = data_sf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.97702 -0.27006 -0.04904  0.24493  2.05965 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -8.573e-01  7.687e-02 -11.153  < 2e-16 ***
## `δD H2O [‰ SMOW]`  1.249e-01  5.302e-04 235.546  < 2e-16 ***
## PRCP              -7.753e-02  6.198e-02  -1.251  0.21107    
## TAVG               8.979e-03  8.171e-04  10.988  < 2e-16 ***
## ELEVATION         -6.720e-05  4.556e-05  -1.475  0.14037    
## cluster1          -1.686e-01  9.692e-02  -1.739  0.08215 .  
## cluster2          -1.800e-01  6.383e-02  -2.820  0.00485 ** 
## cluster3           8.009e-01  7.531e-02  10.634  < 2e-16 ***
## cluster4          -2.540e-01  6.319e-02  -4.020 6.01e-05 ***
## cluster5           3.271e-01  5.857e-02   5.585 2.63e-08 ***
## cluster6           1.423e-01  1.680e-01   0.847  0.39694    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4514 on 2097 degrees of freedom
## Multiple R-squared:  0.9858, Adjusted R-squared:  0.9858 
## F-statistic: 1.461e+04 on 10 and 2097 DF,  p-value: < 2.2e-16

# Find the k-nearest neighbors for each point
knn_result <- knearneigh(coords, k=4)

## Warning in knearneigh(coords, k = 4): knearneigh: identical points found

## Warning in knearneigh(coords, k = 4): knearneigh: kd_tree not available for
## identical points

# Convert the result to a neighbors list
neighbors <- knn2nb(knn_result, sym=TRUE) # sym=TRUE makes the neighbors relationship symmetric
# Convert neighbors list to a listw object
listw <- nb2listw(neighbors, style="W", zero.policy=TRUE)

# Fit a spatial lag model using spatialreg
model_lag <- spatialreg::lagsarlm(`δ18O H2O [‰ SMOW]` ~ `δD H2O [‰ SMOW]` + PRCP + TAVG + ELEVATION, data = data_sf, listw = listw)
summary(model_lag)

## 
## Call:
## spatialreg::lagsarlm(formula = `δ18O H2O [‰ SMOW]` ~ `δD H2O [‰ SMOW]` + 
##     PRCP + TAVG + ELEVATION, data = data_sf, listw = listw)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -2.091203 -0.335917 -0.072859  0.287751  2.455920 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                      Estimate  Std. Error  z value  Pr(>|z|)
## (Intercept)       -8.0419e-01  6.1189e-02 -13.1426 < 2.2e-16
## `δD H2O [‰ SMOW]`  1.1871e-01  8.9452e-04 132.7112 < 2.2e-16
## PRCP              -3.1016e-01  6.8273e-02  -4.5429 5.547e-06
## TAVG               8.6750e-03  9.0177e-04   9.6199 < 2.2e-16
## ELEVATION         -1.0273e-04  1.6341e-05  -6.2864 3.249e-10
## 
## Rho: 0.051343, LR test value: 39.743, p-value: 2.8966e-10
## Asymptotic standard error: 0.0083115
##     z-value: 6.1773, p-value: 6.5193e-10
## Wald statistic: 38.159, p-value: 6.5193e-10
## 
## Log likelihood: -1561.224 for lag model
## ML residual variance (sigma squared): 0.2574, (sigma: 0.50735)
## Number of observations: 2108 
## Number of parameters estimated: 7 
## AIC: 3136.4, (AIC for lm: 3174.2)
## LM test for residual autocorrelation
## test value: 942.08, p-value: < 2.22e-16

Multiple Linear Regression:
- δD H2O and TAVG have highly significant positive effects on δ18O H2O.
- Clusters 2, 3, 4, and 5 show significant effects, indicating that water samples from these clusters have distinct isotopic signatures.
- Precipitation and Elevation are not significant predictors in this model.
- The model explains a very high proportion of the variance (R-squared = 0.9858), indicating a good fit.
Spatial Lag Model:
- The inclusion of the spatial lag component (Rho) indicates that there is spatial dependence in the data.
- PRCP becomes significant with a negative coefficient, suggesting an inverse relationship with δ18O H2O when accounting for spatial dependence.
- Elevation also becomes significant with a negative coefficient, differing from the multiple linear regression model.
- The AIC is lower than that of the non-spatial model, suggesting a better fit with the inclusion of the spatial lag.

Conclusions:

Isotopic composition in water is heavily influenced by δD H2O levels and average temperature, and these relationships are spatially dependent.
Clustering based on isotopic composition is meaningful and should be considered in the analysis of water samples.
Spatial effects are significant and improve the model’s prediction capabilities.
Precipitation has a more complex relationship with isotopic composition than initially indicated, becoming significant when spatial correlation is considered.

Final Project: Shiyu Zhang

Introduction: