Data cleaning
Related pages: Existing datasets
Resources¶
§1. Dataframe manipulation¶
Listing variables¶
Printing and viewing¶
Logical expressions¶
Filtering¶
Creating new variables¶
Keeping, dropping, and ordering columns¶
Grouping variables¶
Summarizing data¶
Joining dataframes¶
TBD
TBD
TBD
§2. Types of variables¶
Numerical variables¶
Rounding¶
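A small sketch of the rounding behaviour in base R, in case it is useful here. The values are made up for illustration. One thing to watch: `round()` follows the IEC 60559 convention of rounding a trailing 5 to the even digit, so it may not match the "round half up" rule many people expect.

```r
# Illustrative values only; not taken from any dataset on this page.
x <- c(0.5, 1.5, 2.5)

half_even <- round(x)          # base R rounds halves to the even digit
half_up   <- floor(x + 0.5)    # one common way to round halves up instead

one_decimal <- round(123.456, digits = 1)   # keep one decimal place
two_signif  <- signif(123.456, digits = 2)  # keep two significant digits

print(half_even)
print(half_up)
print(one_decimal)
print(two_signif)
```

If results must match software that rounds halves up, the `floor(x + 0.5)` form is a simple alternative, though it is only a sketch and does not handle negative values the same way as every other package.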
Winsorization and trimming¶
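A minimal base R sketch of the two approaches, assuming percentile-based cutoffs at the 1st and 99th percentiles. The helper names `winsorize()` and `trim()` and the example vector are hypothetical, chosen only for illustration. Winsorizing replaces extreme values with the cutoff values and keeps the number of observations; trimming drops the extreme observations.

```r
# Hypothetical helpers; the cutoffs (1st and 99th percentiles) are an assumption.
winsorize <- function(x, probs = c(0.01, 0.99)) {
  bounds <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  # Cap values below the lower bound and above the upper bound; NAs stay NA.
  pmin(pmax(x, bounds[1]), bounds[2])
}

trim <- function(x, probs = c(0.01, 0.99)) {
  bounds <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  # Drop values outside the bounds (missing values are dropped as well).
  x[!is.na(x) & x >= bounds[1] & x <= bounds[2]]
}

x <- c(1:99, 1000)          # made-up data with one large outlier
winsorized <- winsorize(x)  # same length as x, outlier pulled in
trimmed    <- trim(x)       # shorter than x, outlier removed

print(summary(winsorized))
print(length(trimmed))
```

Which of the two is appropriate depends on the analysis; winsorizing preserves the sample size, while trimming changes it.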
String variables¶
library(stringr)

activities <- c("running", "dancing", "reading")
pattern <- "read"
str_subset(activities, pattern) # return the strings that match the pattern
str_detect(activities, pattern) # return a logical vector
str_which(activities, pattern) # return the indices of the matching strings

string <- "Contact: string@gmail.com or character@gmail.com"
pattern <- "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+"
str_extract(string, pattern) # extract the first match
str_extract_all(string, pattern) # extract all matches
Factor variables¶
TBD
TBD
TBD
Date and time variables¶
§3. Codes and identifiers¶
Geographic identifiers¶
FIPS codes:
- Federal Information Processing System (FIPS) Codes for States and Counties
- Federal Information Processing Standard state code - Wikipedia
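A common cleaning issue with FIPS codes is that leading zeros are lost when the codes are read as numbers. State codes have two digits and county codes have five (the two-digit state code followed by a three-digit county code). The sketch below restores the zero padding in base R; the variable names and input values are hypothetical.

```r
# Hypothetical inputs: state and county codes that were read in as integers.
state_num  <- c(1L, 6L, 36L)
county_num <- c(1L, 37L, 61L)

# Pad the state code to 2 digits and the county part to 3 digits.
state_fips  <- sprintf("%02d", state_num)
county_fips <- paste0(state_fips, sprintf("%03d", county_num))

print(state_fips)
print(county_fips)
```

Storing the result as character rather than numeric keeps the padding intact through later joins and exports.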
Industry codes¶
SIC:
Occupation codes¶
O*Net: