Data cleaning

Data manipulation

Summarizing dataset

d // equivalent to `describe`
d, s // equivalent to `describe, short`

Listing variables

ds
d, si // equivalent to `describe, simple`

Printing

di // equivalent to `display`

Logical expressions

inrange(z, a, b)     // true if a <= z <= b
inlist(z, a, b, ...) // true if z equals any of a, b, ...
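
For comparison, a minimal base R sketch of the same two tests. The vector `z` and its values are made up for illustration. In R, the comparison returns `NA` for a missing `z`, while `%in%` returns `FALSE`.

```r
# Hypothetical data for illustration
z <- c(1, 5, 10, 15)

# Like inrange(z, 5, 10): both bounds are inclusive
in_range <- z >= 5 & z <= 10

# Like inlist(z, 1, 15): membership in a set of values
in_list <- z %in% c(1, 15)
```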

Filtering

# dplyr::filter()
df |> filter(age > 30 & city == "New York City")

# bracket subsetting
df[df$age > 30 & df$city == "New York City", ]

Keeping/Dropping variables

drop <varlist>
keep <varlist>
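
A rough base R counterpart to `keep` and `drop`, sketched on a small made-up data frame (the column names `id`, `age`, and `city` are assumptions for illustration):

```r
# Hypothetical data frame for illustration
df <- data.frame(id = 1:3, age = c(25, 40, 31), city = c("A", "B", "C"))

# Like `keep id age`: select columns by name
kept <- df[, c("id", "age"), drop = FALSE]

# Like `drop city`: exclude columns by name
dropped <- df[, !(names(df) %in% "city"), drop = FALSE]
```

`drop = FALSE` keeps the result a data frame even when only one column remains.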

Grouping variables

bys <var>: <cmd>

Summarizing data

df |> 
  group_by(age) |>
  summarize(income = mean(income))
collapse (mean) income, by(age) // Stata equivalent of the pipeline above

Working with Variables

Numerical variables

ceiling(x)
floor(x)
round(x, 3)     # number of decimal places
signif(x, 3)    # number of significant digits
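
A short worked example of the four functions on one value, to make the difference between decimal places and significant digits concrete (the input value is arbitrary):

```r
x <- 3.14159

ceiling(x)    # 4
floor(x)      # 3
round(x, 3)   # 3.142 (three decimal places)
signif(x, 3)  # 3.14  (three significant digits)

# A negative digits argument rounds to the left of the decimal point
round(1234.567, -2)  # 1200
```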

String variables

activities <- c("running", "dancing", "reading")
pattern <- "read"
str_subset(activities, pattern)   # return the strings that match the pattern
str_detect(activities, pattern)   # return a logical vector, one element per string
str_which(activities, pattern)    # return the indices of the matching strings
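
If stringr is not available, base R offers close equivalents. A minimal sketch using the same `activities` and `pattern` values; the result variable names are made up for illustration:

```r
activities <- c("running", "dancing", "reading")
pattern <- "read"

matches   <- grep(pattern, activities, value = TRUE)  # like str_subset()
is_match  <- grepl(pattern, activities)               # like str_detect()
positions <- grep(pattern, activities)                # like str_which()
```

Note the argument order: the base R functions take the pattern first, while stringr takes the string first.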

string <- "Contact: string@gmail.com or character@gmail.com"
pattern <- "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+"
str_extract(string, pattern)      # extract the first match
str_extract_all(string, pattern)  # extract all matches (returns a list)
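
The same extraction can be sketched in base R with `regexpr()`, `gregexpr()`, and `regmatches()`. This assumes the `string` and `pattern` values shown above; `first_match` and `all_matches` are names chosen for illustration:

```r
string <- "Contact: string@gmail.com or character@gmail.com"
pattern <- "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+"

# First match only
first_match <- regmatches(string, regexpr(pattern, string))

# All matches; gregexpr() yields one list element per input string
all_matches <- regmatches(string, gregexpr(pattern, string))[[1]]
```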

Date and time variables

gen <var_clock> = clock(<var_str>, "hms") // e.g., "08:00:00"
format <var_clock> %tcHH:MM:SS

Data transformations

Normalization

TBD
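
Two common readings of normalization are z-score standardization and min-max scaling. A minimal base R sketch of both, assuming a numeric vector named `income` (the values are made up for illustration):

```r
# Hypothetical data for illustration
income <- c(20, 35, 50, 65, 80)

# Z-score: mean 0, standard deviation 1
z_score <- (income - mean(income, na.rm = TRUE)) / sd(income, na.rm = TRUE)

# Min-max: rescale to the interval [0, 1]
lo <- min(income, na.rm = TRUE)
hi <- max(income, na.rm = TRUE)
min_max <- (income - lo) / (hi - lo)
```

Min-max scaling divides by zero when all values are equal, so a constant variable needs separate handling.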

Winsorization

winsor income, p(0.1) gen(income_w10) // winsorize 10% in each tail; user-written command (ssc install winsor)
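
A rough R counterpart, sketched under the assumption that clamping values to the 10th and 90th percentiles is acceptable. The helper `winsorize()` is hypothetical, not a library function, and because it is quantile-based its results may differ slightly from the Stata command at the margins:

```r
# Hypothetical helper: clamp values to the p and 1 - p quantiles
winsorize <- function(x, p = 0.1) {
  bounds <- quantile(x, probs = c(p, 1 - p), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, bounds[1]), bounds[2])  # missing values stay missing
}

# Made-up data with one extreme value
income <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 1000)
income_w10 <- winsorize(income, p = 0.1)
```

Unlike trimming, this keeps every observation and only pulls the extreme values toward the bounds.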

Creating codebooks

TBD

Codes and identifiers

Geographic identifiers

FIPS codes:
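
State FIPS codes have two digits and county FIPS codes have five (the two-digit state code followed by a three-digit county code). When the codes are read in as numbers, leading zeros are lost. A minimal R sketch for restoring them, assuming county codes stored in a numeric vector (the variable names are made up for illustration):

```r
# County codes as they might arrive after a numeric import
county_fips_raw <- c(1001, 6037, 36061)

# Pad to five characters with leading zeros
county_fips <- sprintf("%05d", as.integer(county_fips_raw))

# The first two characters are the state code
state_fips <- substr(county_fips, 1, 2)
```

Storing the padded codes as character strings avoids losing the zeros again on the next import or merge.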

Industry & occupation codes

Demographic codes

Program & policy identifiers

Survey crosswalks