
Data cleaning

Related pages: Existing datasets

Resources

§1. Dataframe manipulation

Listing variables

TBD

ds
d, si // equivalent to `describe, simple`

TBD

Printing and viewing

TBD

di // equivalent to `display`

TBD

Logical expressions

TBD

inrange(z, a, b)
inlist(z,a,b,...)
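
For comparison, the two Stata functions above can be sketched in Python. This is only a rough illustration: `inrange` is treated as a closed-interval test and `inlist` as a membership test. The helper names just mirror the Stata ones, and Stata's special handling of missing values is not reproduced.

```python
def inrange(z, a, b):
    """True when a <= z <= b (both endpoints included)."""
    return a <= z <= b


def inlist(z, *values):
    """True when z equals any of the listed values."""
    return z in values


# Hypothetical data for illustration
ages = [17, 18, 30, 65, 66]
working_age = [x for x in ages if inrange(x, 18, 65)]
selected = [x for x in ages if inlist(x, 17, 66)]
```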

TBD

Filtering

# dplyr::filter()
df |> filter(age > 30 & city == "New York City")

# bracket subsetting
df[df$age > 30 & df$city == "New York City", ]
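
The same row filter can be sketched in plain Python, assuming the data is held as a list of dictionaries (one per row). The sample rows are made up for illustration.

```python
# Hypothetical rows; each dict plays the role of one dataframe row
rows = [
    {"name": "Ana", "age": 25, "city": "New York City"},
    {"name": "Ben", "age": 41, "city": "New York City"},
    {"name": "Cai", "age": 52, "city": "Boston"},
]

# Keep rows where both conditions hold, as in the R examples above
filtered = [r for r in rows if r["age"] > 30 and r["city"] == "New York City"]
```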

TBD

TBD

Creating new variables

mutate()

TBD

transform!(df, :x => mean => :x_mean)
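
A minimal Python sketch of the same idea, adding a column that holds the mean of `x` on every row. It assumes list-of-dict data; the values are hypothetical.

```python
from statistics import mean

# Hypothetical data
rows = [{"x": 1.0}, {"x": 2.0}, {"x": 6.0}]

# Compute the column mean once, then attach it to every row
x_mean = mean(r["x"] for r in rows)
for r in rows:
    r["x_mean"] = x_mean
```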

Keeping, dropping, and ordering columns

TBD

drop <varlist>
keep <varlist>
select!(df, :var)
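
Keeping, dropping, and reordering columns can be sketched in Python with dictionary comprehensions. This assumes list-of-dict data; the column names are hypothetical.

```python
# Hypothetical data
rows = [
    {"id": 1, "age": 34, "city": "Boston"},
    {"id": 2, "age": 29, "city": "Chicago"},
]

# keep: retain only the listed columns, in the listed order
keep = ["age", "id"]
kept = [{k: r[k] for k in keep} for r in rows]

# drop: retain everything except the listed columns
drop = {"city"}
dropped = [{k: v for k, v in r.items() if k not in drop} for r in rows]
```

Dictionaries preserve insertion order, so the order of `keep` also sets the column order.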

Grouping variables

bys <var>: <cmd>

Summarizing data

df |> 
  group_by(age) |>
  summarize(income = mean(income))
collapse (mean) income, by(age) // Stata counterpart of the R pipeline above
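
The grouped mean above can be sketched in Python using only the standard library. The rows are hypothetical, and the result is a plain dictionary keyed by group rather than a dataframe.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data
rows = [
    {"age": 30, "income": 40000},
    {"age": 30, "income": 60000},
    {"age": 45, "income": 80000},
]

# Collect incomes by age, then average within each group
groups = defaultdict(list)
for r in rows:
    groups[r["age"]].append(r["income"])

summary = {age: mean(incomes) for age, incomes in groups.items()}
```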

TBD

Joining dataframes

TBD

TBD

TBD

§2. Types of variables

Numerical variables

Rounding

ceiling(x)
floor(x)
round(x, 3)     # number of decimal places
signif(x, 3)    # number of significant digits
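
Rough Python counterparts of the four R functions above. Python has no built-in `signif`, so the helper below is a hypothetical stand-in that rounds to a number of significant digits.

```python
import math


def signif(x, digits):
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    # Position of the leading digit, e.g. 3 for 1234.5 and -2 for 0.0123
    magnitude = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - magnitude)


x = 1234.5678
up = math.ceil(x)        # smallest integer >= x
down = math.floor(x)     # largest integer <= x
decimals = round(x, 3)   # number of decimal places
sig = signif(x, 3)       # number of significant digits
```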

TBD

TBD

Winsorization and trimming

TBD

winsor income, p(0.1) gen(income_w10)
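
The difference between winsorizing and trimming can be sketched in Python. Both helpers are hypothetical, and the cutoff rule (the `int(p * n)` most extreme observations in each tail) is a simplifying assumption that may not match the exact rule used by `winsor`.

```python
def tail_bounds(values, p):
    """Return the lowest and highest values left after removing a share p from each tail."""
    ordered = sorted(values)
    k = int(p * len(ordered))  # observations affected in each tail
    return ordered[k], ordered[-k - 1]


def winsorize(values, p):
    """Replace values beyond the bounds with the bounds themselves."""
    low, high = tail_bounds(values, p)
    return [min(max(v, low), high) for v in values]


def trim(values, p):
    """Drop values beyond the bounds instead of replacing them."""
    low, high = tail_bounds(values, p)
    return [v for v in values if low <= v <= high]


# Hypothetical incomes with one extreme value
income = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]
income_w10 = winsorize(income, 0.1)
income_t10 = trim(income, 0.1)
```

Winsorizing keeps the sample size and pulls extremes inward; trimming shrinks the sample.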

TBD

String variables

library(stringr)

activities <- c("running", "dancing", "reading")
pattern <- "read"
str_subset(activities, pattern)   # return the strings that match the pattern
str_detect(activities, pattern)   # return a logical vector
str_which(activities, pattern)    # return the indices of matching strings

string <- "Contact: string@gmail.com or character@gmail.com"
pattern <- "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+"
str_extract(string, pattern)      # extract the first match
str_extract_all(string, pattern)  # extract all matches (returns a list)
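
The same pattern matching can be sketched with Python's `re` module. The variable names are illustrative. Python indices are 0-based, whereas R's `str_which()` returns 1-based positions.

```python
import re

activities = ["running", "dancing", "reading"]
pattern = "read"

matching = [s for s in activities if re.search(pattern, s)]      # strings that match
detected = [bool(re.search(pattern, s)) for s in activities]     # one boolean per string
positions = [i for i, s in enumerate(activities) if re.search(pattern, s)]  # 0-based indices

text = "Contact: string@gmail.com or character@gmail.com"
email = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+"

first_match = re.search(email, text).group(0)  # first match only
all_matches = re.findall(email, text)          # every non-overlapping match
```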

Factor variables

TBD

TBD

TBD

Date and time variables

gen double <var_clock> = clock(<var_str>, "hms") // e.g., "08:00:00"; store clock values as double to avoid rounding
format <var_clock> %tcHH:MM:SS
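
A Python sketch of parsing and re-formatting a time-of-day string. The milliseconds-since-midnight value is computed here on the assumption that this is how a time-only `%tc` value is stored; the variable names are hypothetical.

```python
from datetime import datetime

time_str = "08:00:00"

# Parse the string using hours, minutes, and seconds
parsed = datetime.strptime(time_str, "%H:%M:%S")

# Milliseconds elapsed since midnight
ms_since_midnight = (parsed.hour * 3600 + parsed.minute * 60 + parsed.second) * 1000

# Format back to HH:MM:SS for display
formatted = parsed.strftime("%H:%M:%S")
```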

§3. Codes and identifiers

Geographic identifiers

FIPS codes:

Industry codes

SIC:

Occupation codes

O*NET: