class: center, middle, inverse, title-slide

# Graphical Insights from Data: Basics

### Heike Hofmann

---
class: center, middle, inverse

# Data Types, Formats, and Structures

---

## Data

Before it is possible to talk about a graphical grammar, it is important to know the type and format of the data you’re working with.

<br/>

--

Why?

--

- The data contains all of the information you’re trying to convey.
- The appropriate graphical techniques depend on the kind of data you are working with.
- Working with R and ggplot2 is much easier if your data is in the right shape.

---

## Data: levels of measurement

**Quantitative**:

- Continuous (e.g. height, weight)
- Discrete (e.g. age in years)

**Qualitative**:

- Nominal: categories have no meaningful order (e.g. colors)
- Ordinal: categories have order but no meaningful distance between categories (e.g. five star ratings)

---

## Data: Dimensions, Form, and Type

Dimensions | Forms | Types
----- | ---- | ----
Univariate (1 variable) | Traditional | Count (word freq, scores)
Bivariate (2 variables) | Aggregated | Time Series
Multivariate (3 or more variables) | | Spatial
 | | Time to Event (Survival, Reliability)
 | | Categorical

---

## Exploring Relationships

Variables | Plot Type
----- | ---
Continuous vs. Continuous | scatter plot, line plot
Continuous vs. Categorical | boxplots, dotcharts, multiple density plots, violin plots
Categorical vs. Categorical | mosaic plots, side-by-side barplots
Multidimensional | it depends

---

# Tidy Data

ggplot2 assumes your data is tidy.

---

## Untidy data

*Happy families are all alike; every unhappy family is unhappy in its own way. - Leo Tolstoy*

<br/>

--

**Five main ways tables of data tend not to be tidy:**

1. Column headers are values, not variable names.
2. Multiple variables are stored in one column.
3. Variables are stored in both rows and columns.
4. Multiple types of observational units are stored in the same table.
5. A single observational unit is stored in multiple tables.

---

## Wide Format vs. Long Format

.pull-left[
**Wide format**

- some variables are spread out across columns
- typically uses less space to display
- how you would typically choose to present your data
- far less repetition of labels and row elements

<img src="images/tablewide2.png" alt="A wide table example" width="500px"/>
]

.pull-right[
**Long format**

- each variable is a column
- each observation is a row
- is likely not the data's most compact form

.center[<img src="images/tablelong2.png" alt="A long table example" height="300px"/>]
]

---

## Tidy Data

- Use the `tidyr` package (mainly `pivot_longer` and `pivot_wider`) to move between long and wide format
- Tidyr Vignette: [pivoting](https://tidyr.tidyverse.org/articles/pivot.html)
- Longer explanation and tutorial: [Pivoting data in R (and SAS)](https://srvanderplas.github.io/unl-stat850/transforming-data.html#pivot-operations)

???

We're not going to specifically cover how to move data from wide to long format, but I will use the pivot functions in some sample code in this workshop, and I've provided links to resources here. For now, what you need to know is more the picture-book version: wider and longer tables look different.

---

# ggplot2 basics

Complete the template below to build a graph

<img src="images/ggplot2-notation.png" alt="ggplot2 formula: ggplot(data = <dataset>) + <geom function>(aes(<mappings>), stat = <stat>, position = <position>) + <coordinate function> + <scales function> + <facet function> + <theme>, where anything after the geom function statement is optional." style="width:70%"/>

---

## How to build a graph

`ggplot(data = mpg, aes(x = cty, y = hwy))`

- This begins a plot that you finish by adding layers.
- You can add one geom per layer

<img src="01-ggplot-basics_files/figure-html/plots-4-1.png" />

---

## What is a geom?

In ggplot2, we use a geom function to represent data points, and use the geom's aesthetic properties to represent variables.

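
As a minimal sketch of that idea, the code below draws the `mpg` data (the dataset used throughout these slides) with two different geoms. The object names `p_points` and `p_box` and the choice of `drv` as the colour variable are illustrative assumptions, not part of the original slides.

```r
library(ggplot2)

# geom_point(): each car is one point; the variable drv is
# represented by the point's colour aesthetic.
p_points <- ggplot(data = mpg, aes(x = cty, y = hwy, colour = drv)) +
  geom_point()

# geom_boxplot(): the same hwy values, now summarised as one box per
# level of drv; drv is represented by the position on the x axis.
p_box <- ggplot(data = mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()

# Printing a plot object draws it.
p_points
p_box
```

The data stays the same in both plots; only the geom and the aesthetics that variables are mapped to change.
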
<img src="01-ggplot-basics_files/figure-html/unnamed-chunk-2-1.png" />

Once our data is formatted and we know what type of variables we are working with, we can select the correct geom for our visualization.

---

## What is a layer?

- A layer determines the physical representation of the data
- Together, the data, mappings, statistical transformation, and geometric object form a layer
- A plot may have multiple layers

<img src="01-ggplot-basics_files/figure-html/unnamed-chunk-4-1.png" />

---
class: inverse

## Your Turn

Change the code below to have the points **on top** of the boxplots.

```r
ggplot(data = mpg, aes(x = class, y = hwy)) +
  geom_jitter() +
  geom_boxplot()
```

---

## Alternative method of building layers: Stats

A stat builds a new variable to plot (e.g., count and proportion)

<img src="images/stat1.png" width="48%" /><img src="images/stat2.png" width="48%" />

---
class: inverse

## Your Turn

Add a smooth line to the following scatterplot:

```r
ggplot(data = mpg, aes(x = cty, y = hwy)) +
  geom_point()
```

---

## Faceting

A way to extract subsets of data and place them side-by-side in graphics

```r
ggplot(data = mpg, aes(x = cty, y = hwy, colour = class)) +
  geom_point()

ggplot(data = mpg, aes(x = cty, y = hwy, colour = class)) +
  geom_point() +
  facet_grid(. ~ class)
```

<img src="01-ggplot-basics_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

## Faceting Options

- `facet_grid(. ~ b)`: facet into columns based on b
- `facet_grid(a ~ .)`: facet into rows based on a
- `facet_grid(a ~ b)`: facet into both rows and columns
- `facet_wrap( ~ fl)`: wrap facets into a rectangular layout

You can set scales to let axis limits vary across facets:

- `facet_grid(y ~ x, scales = "free")`: x and y axis limits adjust to individual facets
  - `"free_x"`: only the x axis limits adjust
  - `"free_y"`: only the y axis limits adjust

---
class: inverse

## Your Turn

Find one or more sets of facets that are useful in understanding the relationship between city and highway mileage.

```r
ggplot(data = mpg, aes(x = cty, y = hwy)) +
  geom_point()
```

---

## Position Adjustments

Position adjustments determine how to arrange geoms that would otherwise occupy the same space

- **Dodge**: Arrange elements side by side
- **Fill**: Stack elements on top of one another, normalize height
- **Stack**: Stack elements on top of one another

`ggplot(mpg, aes(fl, fill = drv)) + geom_bar(position = "")`

<img src="01-ggplot-basics_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

## Position Adjustments: Jitter

- **Jitter**: Add random noise to the x and y position of each element to avoid overplotting
- There is also a jitter geom, `geom_jitter()`

<img src="01-ggplot-basics_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

## Coordinate Systems

- `coord_cartesian()`: The default Cartesian coordinate system
- `coord_fixed()`: Cartesian with fixed aspect ratio between x & y units
- `coord_flip()`: Flipped Cartesian coordinates
- `coord_polar()`: Polar coordinates
- `coord_trans()`: Transformed Cartesian coordinates
- `coord_map()`: Map projections from the mapproj package (mercator (default), azequalarea, lagrange, etc.)

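
The coordinate functions listed above are added to a plot like any other component. The following is a minimal sketch using the `mpg` data; the object names (`base`, `p_flip`, `p_polar`, `p_fixed`) and the choice of a bar chart of `class` are illustrative assumptions, not taken from the original slides.

```r
library(ggplot2)

# A bar chart of vehicle class in the default Cartesian coordinates.
base <- ggplot(data = mpg, aes(x = class)) +
  geom_bar()

# Same layer, x and y swapped: the bars become horizontal.
p_flip <- base + coord_flip()

# Same layer in polar coordinates: x is mapped to the angle by default.
p_polar <- base + coord_polar()

# Scatterplot where one unit on x is drawn as long as one unit on y.
p_fixed <- ggplot(data = mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  coord_fixed()

# Printing a plot object draws it.
p_flip
p_polar
p_fixed
```

Changing the coordinate system leaves the data and the geom untouched; only the way positions are drawn changes.
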
<img src="01-ggplot-basics_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

## ggplot2 extensions

- https://exts.ggplot2.tidyverse.org/

<iframe src="https://exts.ggplot2.tidyverse.org/gallery/" width="100%" height = "500px"/>