Concepts
Why they’re important
‘Concept Misinformation’
How to tackle the class
Agenda
What we’ll cover today
Organizing
Manipulating
Visualizing data
Before statistics & visualization…
Data must be ‘cleaned’
What to do with missing values?
How to address typos?
Levels of measurement
Statistics involves decisions
Tidy Data
Normative
Our systematically collected and organized observations should be easy to work with
Characteristics
Intuitive column names
No typos
Easy to analyze
Idea-to-output gap is small
Technical
Our systematically collected and organized observations should have certain technical characteristics
Characteristics
Each column is one variable
Each row is one observation
Each cell is one value
Tidy Data
Variable: A characteristic across which individual observations vary in expression
Row: An observation, one row per student
Cell: A value for the observation for a variable
Example
What are my students’ favorite colors?
Variables
Student unique ID (student_id)
Favorite Color (favorite_color)
Tidy Data
student_id favorite_color
1 glue
2 NA
3 blurple
4 purple
5 ted
6 NA
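A minimal sketch of the table above as tidy data in plain Python. The representation (a list of dicts, with None standing in for NA) and the helper name is_tidy are assumptions made for illustration, not part of the slides.

```python
# Each dict is one row (one student); each key is one variable (column).
# None stands in for the NA entries in the table.
COLUMNS = ["student_id", "favorite_color"]

students = [
    {"student_id": 1, "favorite_color": "glue"},
    {"student_id": 2, "favorite_color": None},
    {"student_id": 3, "favorite_color": "blurple"},
    {"student_id": 4, "favorite_color": "purple"},
    {"student_id": 5, "favorite_color": "ted"},
    {"student_id": 6, "favorite_color": None},
]


def is_tidy(rows, columns):
    """Check the three technical characteristics on a list of dict rows."""
    for row in rows:
        # Each column is one variable: every row has exactly the same keys.
        if set(row) != set(columns):
            return False
        # Each cell is one value: no lists or other containers inside a cell.
        if any(isinstance(v, (list, tuple, set, dict)) for v in row.values()):
            return False
    return True


# Students whose favorite color is missing.
missing = [row["student_id"] for row in students if row["favorite_color"] is None]
print(is_tidy(students, COLUMNS), missing)
```

The check is structural only. It says nothing about whether "glue" or "ted" is a sensible value, which is a separate decision.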
The Garden of Forking Paths
Statistics is about making decisions
Data never speaks for itself
How do you deal with missing values?
How do you deal with strange values?
How does your analysis differ when you make different choices?
The Garden of Forking Paths
Easy choices to make
glue -> blue
ted -> red
Tougher choices to make
blurple -> blue
blurple -> purple
blurple -> student messing with me
I need to make a choice
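The two easy choices can be written down as an explicit lookup, so the decision is recorded rather than made silently. A minimal sketch; the names EASY_FIXES and fix_typos are hypothetical, and blurple is deliberately left alone because that choice has not been made yet.

```python
# Raw answers for students 1 to 6; None stands in for NA.
raw = ["glue", None, "blurple", "purple", "ted", None]

# Only the corrections we are confident about.
EASY_FIXES = {"glue": "blue", "ted": "red"}


def fix_typos(values, fixes):
    """Replace values found in the lookup; leave everything else as it is."""
    return [fixes.get(value, value) for value in values]


cleaned = fix_typos(raw, EASY_FIXES)
print(cleaned)
```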
The Garden of Forking Paths
Blurple
Change the Blurples
No change to Blurples
Blurple -> Other
Blurple -> Blue
Blurple -> Purple
Discard blurples
Blurple means blurple
Implications
Change the value
Overcounting blues/purples
False sense of consolidation
Maybe blurple is a real color
Don’t change the value
What do ‘others’ have in common?
Discarding throws out information
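To see how the analysis differs by path, the same six answers can be counted once per blurple option. A rough sketch under stated assumptions: the typos glue and ted are already fixed, missing answers are left out of the counts (itself a choice), and the names PATHS and follow_path are hypothetical.

```python
from collections import Counter

# Answers after the easy fixes; None stands in for NA.
cleaned = ["blue", None, "blurple", "purple", "red", None]

KEEP = object()  # leave blurple as it is
DROP = object()  # discard blurple answers entirely

PATHS = {
    "blurple -> blue": "blue",
    "blurple -> purple": "purple",
    "blurple -> other": "other",
    "discard blurples": DROP,
    "blurple means blurple": KEEP,
}


def follow_path(values, choice):
    """Apply one blurple decision to every answer."""
    result = []
    for value in values:
        if value != "blurple" or choice is KEEP:
            result.append(value)
        elif choice is not DROP:
            result.append(choice)
        # choice is DROP: the blurple answer is left out
    return result


# One count table per path, ignoring missing answers.
counts_by_path = {
    name: Counter(v for v in follow_path(cleaned, choice) if v is not None)
    for name, choice in PATHS.items()
}

for name, counts in counts_by_path.items():
    print(name, dict(counts))
```

With only one blurple the differences are small, but each path already produces a different table from the same data.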
Alternative Measurements
Was my measurement consonant with my concept?
Multiple choice?
Free-text question?
Our choices began before we started manipulating data
Statistics sans Numbers
Why not start with numbers?
Data isn’t given
It is produced
Choices were made in the process
The order matters
Concepts
Measurement
Manipulation
Visualization
Statistics sans Numbers
Statistics is a tool
Political scientists aren’t second-rate statisticians. They’re social theorists who use statistics.
No Changes
student_id favorite_color
1 blue
2 NA
3 blurple
4 purple
5 red
6 NA
Ways to summarize: count each color
favorite_color count
blue 1
blurple 1
purple 1
red 1
NA 2
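Counting each color can be sketched with the standard library. This tallies the six rows listed under "No Changes" and treats NA as its own category; that treatment, and the variable names, are assumptions for illustration.

```python
from collections import Counter

# The favorite_color column from the "No Changes" table; None stands in for NA.
favorite_color = ["blue", None, "blurple", "purple", "red", None]

# Label missing answers "NA" so they get their own row in the summary.
labeled = ["NA" if value is None else value for value in favorite_color]
counts = Counter(labeled)

# Print colors alphabetically, with NA last.
print("favorite_color count")
for color, n in sorted(counts.items(), key=lambda item: (item[0] == "NA", item[0])):
    print(color, n)
```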
Tables vs Graphics
Reasons to use visualizations
I have the attention span of a goldfish
Tables are boring
Differences become apparent easily
Fewer cognitive resources necessary
People are familiar with the foundations of visualization
Elements of Graphics
Data: the underlying information. Graphics represent data
Aesthetics: The dimensions that represent variables and encode information e.g. x-axis, y-axis, color, etc.
Geoms: The shape we decide to use to represent our information
The Garden v2
Data
  x = color, y = count
    geom_point
    geom_bar
  x = count, y = color
    geom_point
    geom_bar
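The three elements can be made concrete with a plain-text stand-in for a plotting library. This is only a sketch: the draw function is hypothetical, it covers just the x = count, y = color mapping, and the counts are tallied from the six-row "No Changes" table.

```python
# Data: counts per color, tallied from the six-row "No Changes" table.
counts = {"blue": 1, "blurple": 1, "purple": 1, "red": 1, "NA": 2}


def draw(data, geom):
    """Aesthetics: color on the y-axis, count on the x-axis. Geom: 'bar' or 'point'."""
    width = max(len(label) for label in data)
    lines = []
    for label, n in data.items():
        if geom == "bar":
            mark = "#" * n  # a bar runs from zero up to the count
        elif geom == "point":
            mark = " " * (n - 1) + "o"  # a point marks only the count itself
        else:
            raise ValueError(f"unknown geom: {geom}")
        lines.append(f"{label:<{width}} | {mark}")
    return "\n".join(lines)


bar_chart = draw(counts, "bar")
point_chart = draw(counts, "point")
print(bar_chart)
print()
print(point_chart)
```

The data and the mapping stay the same in both charts; only the geom changes.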
Summarize
Summary
Discussed Data
Its organization
Its manipulation
Its visualization
Choices are important!