Los Angeles is a city of around 4 million¹ inhabitants from diverse backgrounds (about half² of the population is Latino) and had a crime rate of about 635³ per 100,000 inhabitants per year in 2015. The city hosts the biggest port complex in the US and is an illegal drugs hub⁴ for the country. Finding a way to foster this diversity so as to take advantage of the city’s economic potential and reduce crime would give children more room to develop, reach high levels of education and roam freely around the city.

The aim of this paper is to identify how demographic data at the zip code level in Los Angeles affect crime rates in 2010. More precisely, the following two hypotheses are formulated and tested:

the higher the number of people per household, the higher the crime rate in a given zip code, on average.
the younger the population, the higher the crime rate in a given zip code, on average.

For the purpose of the study, three data sets are used:

Crime Data from 2010 to Present⁵ A dataset reflecting incidents of crime and their latitude-longitude coordinates in the City of Los Angeles dating back to 2010 (last update: August 9, 2008). The data is gathered by the LAPD and contains the following information: date and time of occurrence and of the report of the crime, type of crime, demographic information about the victim and location of the crime.
2010 Census Populations by Zip Code⁶ A dataset coming from the 2010 Census Profile of General Population and Housing Characteristics. It covers the zip codes that fall at least partially within LA city boundaries and contains the following information: total population, median age, number of males and females, number of households and average household size.
TIGER/Line Zip Codes⁷ A shapefile containing the areas of zip codes in the US. A subset of the shapefile covering the city of Los Angeles is extracted in prep.R.
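The crime counts and the census attributes are linked through the zip code. A minimal sketch of such a join is given below, using made-up values: the data frames `crime_counts` and `census` and all their numbers are hypothetical, and only the column names `zip`, `crimes`, `Median.Age` and `Average.Household.Size` follow the script at the end of this document.

```r
# Hypothetical crime counts per zip code (toy values, not the LAPD data)
crime_counts <- data.frame(
  zip    = c("90001", "90002", "90003"),
  crimes = c(120, 45, 300)
)

# Hypothetical census attributes, deliberately stored in a different row order
census <- data.frame(
  zip                    = c("90003", "90001", "90002"),
  Median.Age             = c(27.5, 26.6, 25.5),
  Average.Household.Size = c(4.4, 4.0, 4.2)
)

# match() looks up each crime row in the census table, so the result keeps
# the row order of crime_counts (useful when rows must stay aligned with polygons)
idx    <- match(crime_counts$zip, census$zip)
joined <- cbind(crime_counts,
                census[idx, c("Median.Age", "Average.Household.Size")])
rownames(joined) <- NULL
print(joined)
```

Keeping the row order of the left-hand table matters when that table is the attribute table of a spatial object, since each row has to stay attached to its polygon.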

In the following, non-spatial regression analysis is used. This provides the basis for further exploration with spatial methods, i.e. bivariate spatial maps, centrographic statistics and LISA statistics based on Moran’s I. First, an exploratory data analysis is performed in order to assess the distribution of the variables of interest.
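For reference, global Moran's I, on which the LISA statistics mentioned above are based, can be computed by hand on a toy example. The sketch below assumes four hypothetical areas arranged in a row with binary neighbour weights; neither the values nor the weights come from the Los Angeles data, and in practice a dedicated package such as `spdep` (loaded in the script) would normally be used instead.

```r
# Toy values for four hypothetical areas lying side by side
x <- c(1, 2, 3, 4)

# Binary contiguity weights: each area neighbours the one next to it
W <- matrix(0, nrow = 4, ncol = 4)
W[cbind(1:3, 2:4)] <- 1
W[cbind(2:4, 1:3)] <- 1

z  <- x - mean(x)   # deviations from the mean
S0 <- sum(W)        # sum of all weights

# Global Moran's I: (n / S0) * (z' W z) / (z' z)
morans_i <- (length(x) / S0) * as.numeric(t(z) %*% W %*% z) / sum(z^2)
morans_i
```

A positive value indicates that neighbouring areas tend to have similar values, a negative value that they tend to differ.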

Descriptive Statistics

The histograms below (see Figure @ref(fig:histograms)) show that average household size and median age are centered around 3 and 40 respectively. The number of crimes has many values equal to 1, 2 or 3 (the underlying data has no 0 values), which might be due to a lack of data in certain areas. These low values were kept so as not to exclude zip code areas arbitrarily from the final analysis. Instead, the spatial plots use quantiles, which group all low values together and allow comparison with higher and more plausible crime counts.
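The effect of quantile classes on a variable with many very low values can be illustrated with base R. This is a sketch on hypothetical counts (the vector `crimes` below is invented for illustration), not the classification code used for the maps.

```r
# Hypothetical crime counts per zip code, with many very low values
crimes <- c(1, 1, 2, 3, 1, 250, 900, 1500, 4000, 12000)

# Quintile break points (up to five classes of roughly equal size)
breaks <- quantile(crimes, probs = seq(0, 1, by = 0.2))

# Assign each area to a class; unique() guards against duplicated breaks,
# which occur here because so many areas share the value 1
classes <- cut(crimes, breaks = unique(breaks), include.lowest = TRUE)
table(classes)
```

All areas with one or two crimes end up in the lowest class, so they remain on the map without distorting the comparison between the higher counts.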

The relationship between the explanatory variables (median age and average household size) and the dependent variable (number of crimes per zip code) can be seen in the plot below. The natural log of the number of crimes is used to reduce the skew of the data, since the raw counts are strongly right-skewed (most values are small, with a long tail of large values). At first glance no particular relationship appears to exist; this is tested by the regression below.
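The effect of the log transform on skew can be checked with a moment-based skewness measure. The sketch below uses invented counts and a small helper function `skewness()` defined here for illustration; it is not part of the original script.

```r
# Moment-based sample skewness (positive values indicate a long right tail)
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical crime counts: mostly small values plus a few very large ones
crimes <- c(1, 1, 2, 3, 5, 10, 50, 200, 1000, 5000)

skew_raw <- skewness(crimes)       # skewness of the raw counts
skew_log <- skewness(log(crimes))  # skewness after the log transform
c(raw = skew_raw, log = skew_log)
```

On data of this shape the skewness of the logged counts is much closer to zero than that of the raw counts, which is the motivation for modelling log(crimes).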

Non-spatial bivariate regressions, non-spatial correlations

Below is a regression of the log number of crimes on the two independent variables (average household size and median age). The results are disappointing in terms of explained variance (R² = 0.020) and of the explanatory value of median age, whose coefficient is not significantly different from zero. The coefficient on average household size, however, is significantly different from zero at the 5% significance level: a bigger household is associated with more crimes, as stated in the first hypothesis in the introduction.


	Dependent variable:
	log(crimes)

Average.Household.Size	0.675**
	(0.305)
Median.Age	0.005
	(0.028)
Constant	3.398**
	(1.441)

Observations	242
R²	0.020
Adjusted R²	0.012
Residual Std. Error	3.993 (df = 239)
F Statistic	2.457* (df = 2; 239)

Note:	*p<0.1; **p<0.05; ***p<0.01

Those results are not surprising given the scatter plots shown in the previous section. This relative lack of relationship, especially for median age, does not mean that no trends can be found by treating the data as spatial data. This is investigated further below.
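The coefficients in the table above can be read as follows. The sketch recomputes t statistics and two-sided p-values from the reported estimates and standard errors, and converts the estimates to multiplicative effects, which is the usual reading when the response is logged. The object names are illustrative and the numbers are copied from the table.

```r
# Estimates and standard errors copied from the regression table
est <- c(Average.Household.Size = 0.675, Median.Age = 0.005)
se  <- c(Average.Household.Size = 0.305, Median.Age = 0.028)
df_resid <- 239  # residual degrees of freedom reported in the table

# t statistics and two-sided p-values
t_stat  <- est / se
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# With log(crimes) as the response, exp(beta) is the multiplicative change
# in crimes associated with a one-unit increase in the regressor,
# holding the other regressor fixed
mult_effect <- exp(est)

round(cbind(t_stat, p_value, mult_effect), 3)
```

Since exp(0.675) is about 1.96, one additional person per household is associated with roughly twice as many recorded crimes, although with R² = 0.020 the model explains very little of the variation.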

Two-variable maps

Number of Crimes VS Average Household Size

The spatial data delivers more detailed information on certain underlying relationships. Under the first hypothesis stated in the introduction, one would expect that the redder a circle (i.e. the higher the average household size), the bluer the area around it (i.e. the higher the number of crimes). This is however not the case, except for three areas north-west of Santa Monica.

Number of Crimes VS Median Age

Similarly, for median age against crime rates, no particular relationship can be read from the plot.

Centrographic statistics and maps

The following is an attempt to explore the relationships based on centroids and standard deviation ellipses (SDE). Each independent variable is split into two groups: above (red) versus below (orange) 40 for median age, and above (red) versus below (orange) 3 for average household size.
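The mean center and the axes of a deviational ellipse can be obtained from the coordinates alone. The sketch below uses hypothetical centroid coordinates and a covariance-based ellipse (eigen-decomposition of the coordinate covariance matrix). This is one common way to describe spread and orientation, and it is not necessarily identical to what `stat_ellipse()` draws in the script, which by default fits a multivariate t-distribution ellipse at the requested level.

```r
# Hypothetical zip code centroids (x = longitude, y = latitude)
pts <- data.frame(
  x = c(-118.50, -118.40, -118.30, -118.20, -118.10),
  y = c(  34.00,   34.05,   34.02,   34.08,   34.05)
)

# Mean center: average of the coordinates
mean_center <- colMeans(pts)

# Eigen-decomposition of the covariance matrix: the eigenvectors give the
# orientation of the ellipse axes, the eigenvalues the variance along them
eig     <- eigen(cov(pts))
axis_sd <- sqrt(eig$values)  # standard deviation along major and minor axis

# Orientation of the major axis in degrees, counter-clockwise from east,
# folded into [0, 180) because the sign of an eigenvector is arbitrary
major_angle <- (atan2(eig$vectors[2, 1], eig$vectors[1, 1]) * 180 / pi) %% 180

mean_center
axis_sd
major_angle
```

Two groups whose ellipses have similar centers, axis lengths and orientations are distributed similarly across space, which is the comparison made in the plots.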

Number of Crimes VS Median Age

The plot above shows that zip codes with younger populations are spread mostly in the north and the south of the city, whereas those with older populations are spread along a west-east axis. Given that the number of crimes is heterogeneous along either of those axes, it is again hard to draw a conclusion on the effect of age on crime rates. The plot below illustrates the relationship between crimes and household size.

Number of Crimes VS Average Household Size

Households, whether big or small, seem to be spread evenly across the city, since both SDEs have a similar shape.

Conclusion

The two hypotheses stated in the introduction cannot be confirmed by the analysis above. The heterogeneity in the crime data (as can be seen in the plots above) does not appear to be well modelled by either median age or average household size (except in the linear regression, in which a higher average household size was associated with more crimes). Other variables available in the crime and zip code datasets will be investigated further in order to better model and predict crime rates.

R Script

knitr::opts_chunk$set(echo = F, message = F, warning = F)

packages <- c("rgdal", "foreign", "gdata", "ggmap", "ggplot2",
              "plyr", "rgeos", "sf", "ggrepel", "dplyr", "sp", "aspace",
              "spdep", "bookdown", "stringr", "maptools", "leaflet", "broom", "stargazer",
              "RColorBrewer")

# Install any missing package, then attach it if it is not attached yet
package.check <- lapply(packages, FUN = function(x) {
  if (!require(x, character.only = TRUE)) install.packages(x)
  if (!(x %in% .packages())) library(x, character.only = TRUE)
})


p <- read.csv("../Research Data/2010_Census_Populations_by_Zip_Code.csv")
load("../Research Data/crime.RData")
load(file = "../Research Data/zipCrimes.RData")

names(zc)[2] <- names(p)[1] <- "zip"

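zCrimes <- zc
# NOTE: base merge() sorts by key and drops unmatched rows, so the rows of
# @data may no longer line up with the polygons; the merge() method that sp
# provides for Spatial objects (merge(zc, p, by = "zip")) keeps them aligned.
zCrimes@data <- merge(zc@data, p, by = "zip")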
# zCrimes <- zCrimes[!is.na(zCrimes$crimes),]

# writeOGR(obj=zCrimes, driver="ESRI Shapefile", "../Research Data/zCrimes")



par(mfrow=c(1,3))
hist(zCrimes$crimes, xlab = "crimes since 2010", main=NULL)
hist(zCrimes$Average.Household.Size, xlab = "average household size in 2010", main=NULL)
hist(zCrimes$Median.Age, xlab = "median age in 2010", main=NULL)
par(mfrow=c(1,1))


par(mfrow=c(1,2))
plot(zCrimes$Median.Age, log(zCrimes$crimes), xlab = "median age", ylab = "ln(crimes)")
plot(zCrimes$Average.Household.Size, log(zCrimes$crimes), xlab = "average household size", ylab = "ln(crimes)")
par(mfrow=c(1,1))


regHH <- lm(log(crimes) ~ Average.Household.Size + Median.Age, data = zCrimes)

stargazer(regHH, no.space = T, type = "html")

# regMedAge <- lm(crimes ~ Median.Age, data = zCrimes)
# 
# stargazer(regMedAge, no.space = T,  type = "html")


centroids <- as.data.frame(gCentroid(zCrimes,byid=TRUE))

# pal <- colorNumeric(
#   palette = "YlGnBu",
#   domain = zCrimes$crimes
# )

qpal <- colorQuantile("YlGnBu", zCrimes$crimes, n = 5)
qHH <- colorQuantile("YlOrRd", zCrimes$Average.Household.Size, n = 5)

leaflet(zCrimes) %>% addPolygons(weight = 1, smoothFactor = 0.5,
    opacity = 1.0, fillOpacity = 0.5,
    color = ~qpal(crimes),
    highlightOptions = highlightOptions(color = "white", weight = 2,
      bringToFront = TRUE)) %>% 
                addCircles(lng = ~centroids$x, lat = ~centroids$y, weight = 1, color = ~qHH(Average.Household.Size),
                  radius = 800, popup = ~zCrimes$zip, opacity = 0.9, fillOpacity = 0.8 ) %>% 
                addTiles() %>%
    addLegend(pal = qpal, values = ~crimes, opacity = 1) %>%
    addLegend(pal = qHH, values = ~Average.Household.Size, opacity = 1)

qMA <- colorQuantile("YlOrRd", zCrimes$Median.Age, n = 5)


leaflet(zCrimes) %>% addPolygons(weight = 1, smoothFactor = 0.5,
    opacity = 1.0, fillOpacity = 0.5,
    color = ~qpal(crimes),
    highlightOptions = highlightOptions(color = "white", weight = 2,
      bringToFront = TRUE)) %>% 
                addCircles(lng = ~centroids$x, lat = ~centroids$y, weight = 1, color = ~qMA(Median.Age),
                  radius = 800, popup = ~zCrimes$zip, opacity = 0.9, fillOpacity = 0.8 ) %>% 
                addTiles() %>%
    addLegend(pal = qpal, values = ~crimes, opacity = 1) %>%
    addLegend(pal = qMA, values = ~Median.Age, opacity = 1)


zCrimes@data$id = rownames(zCrimes@data)
crimePoints = fortify(zCrimes, region="id")
crimesDf = join(crimePoints, zCrimes@data, by="id")


# c$weapon <- !c$Weapon.Description %in% c("", "STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)", "VERBAL THREAT")
# c$verbal <- c$Weapon.Description %in% "VERBAL THREAT"
# c$fists <- c$Weapon.Description %in% "STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)"
# 
# clean_coords <- gsub(pattern = '[()]', replacement = '', x = c$Location)
# split_coords <- str_split(string = clean_coords, pattern = ', ', n = 2, simplify = T)
# c$lat <- as.numeric(split_coords[,1])
# c$lon <- as.numeric(split_coords[,2])


zCrimesLoc <- cbind(zCrimes@data, centroids)

ggplot(crimesDf) + 
  geom_polygon(aes(long,lat,group=group, fill = crimes)) +
  coord_equal() + 
  stat_ellipse(data = subset(zCrimesLoc, Median.Age > 40), aes(x = x, y = y), level=0.5, color = "red") +
  geom_point(data = subset(zCrimesLoc, Median.Age > 40), aes(x = mean(x), y = mean(y)), color = "red", size = 0.5) +
  stat_ellipse(data = subset(zCrimesLoc, Median.Age <= 40), aes(x = x, y = y), level=0.5, color = "orange") +
  geom_point(data = subset(zCrimesLoc, Median.Age <= 40), aes(x = mean(x), y = mean(y)), color = "orange", size = 0.5) +
  theme_void()


ggplot(crimesDf) + 
  geom_polygon(aes(long,lat,group=group, fill = crimes)) +
  coord_equal() + 
  stat_ellipse(data = subset(zCrimesLoc, Average.Household.Size > 3), aes(x = x, y = y), level=0.5, color = "red") +
  geom_point(data = subset(zCrimesLoc, Average.Household.Size > 3), aes(x = mean(x), y = mean(y)), color = "red", size = 0.5) +
  stat_ellipse(data = subset(zCrimesLoc, Average.Household.Size <= 3), aes(x = x, y = y), level=0.5, color = "orange") +
  geom_point(data = subset(zCrimesLoc, Average.Household.Size <= 3), aes(x = mean(x), y = mean(y)), color = "orange", size = 0.5) +
  theme_void()

Spatial Statistics Research - Exploratory Spatial Data Analysis

Gabriel Benedict - gb2661@columbia.edu

2018-11-19