Searching for nulls across columns.

Data is never perfect, especially when transferring between systems. I was working with some Jira data, and \N appeared in most of the columns after extraction. I needed something that would loop through all the columns and set those values to blank; setting them to NA instead would be just as easy (see the variant after the code). Escape characters can be a nightmare.

# Replace the literal string \N with "" in every character or factor column.
# The regex needs "\\\\N": one escape for R's parser, one for the regex engine.
jira.data <- as.data.frame(lapply(jira.data, function(x)
  if (is.character(x) | is.factor(x))
    gsub("\\\\N", "", x) else x))

Cleaning up variable names.

Taking out underscores from column names.

# Drop every underscore from the column names.
names(jira.data) <- gsub("_", "", names(jira.data))
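
If you'd rather keep a word separator, the same call can swap the underscore for a dot. A quick sketch with made-up column names (issue_key, created_date, and status are hypothetical):

cols <- c("issue_key", "created_date", "status")
gsub("_", "", cols)
## [1] "issuekey"    "createddate" "status"
gsub("_", ".", cols)   # keep a separator instead of deleting it
## [1] "issue.key"    "created.date" "status"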

Dirty data vacuum cleaner

Sometimes you just have to bulldog your way through the data. It's easier to keep a list of patterns and add to it when necessary. I had a bunch of cases that I knew represented useless data; this code removes them by subsetting the data frame.

# Patterns that flag junk rows (regular expressions, joined with | below).
txt <- c('domo', 'modocorp', 'apptest', 'qa2stag', 'qastag', 'support-prod',
         'support', '220221', '^ec-7', '^ec-8', '^ec-9', 'appdev', 'prod5-',
         'dev-', 'dev.', 'publisher', 'qa-', 'ptk-', 'ckdemo', '^training',
         'standard2-test1', 'demo', 'brian-prod', 'erictest', '@', 'bohme',
         'test', 'custom', 'http', 'sandbox', 'freemium', 'verizondemo')

# Keep only rows whose var_name matches none of the patterns. grepl() is
# safer than df[-grep(...), ]: if nothing matched, -grep() would return
# integer(0) and silently drop every row. (df is your data frame; avoid
# naming it data.frame, which masks the base constructor.)
df <- df[!grepl(paste(txt, collapse = '|'), df$var_name, ignore.case = TRUE), ]
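
A quick sanity check on a toy data frame; the var_name values below are made up for illustration:

toy <- data.frame(var_name = c("acme-corp", "qa-box7", "sandbox-eu", "realclient"),
                  stringsAsFactors = FALSE)
toy[!grepl(paste(txt, collapse = '|'), toy$var_name, ignore.case = TRUE), , drop = FALSE]
##     var_name
## 1  acme-corp
## 4 realclient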

Non-ASCII characters get out!

I was cleaning up a data set imported from SQL. One of the columns contained email text, and a few entries had Japanese characters. These cases were few and far between and weren't needed for my analysis. Here is an easy way to wipe them out.

text <- c("はしとみま", "とまきはと", "とまは", "しつくきは", "そみ", "hammer", "toe-jam")
text <- iconv(text, "latin1", "ASCII", sub="")
text
## [1] ""        ""        ""        ""        ""        "hammer"  "toe-jam"