June 26, 2018 From rOpenSci (https://deploy-preview-304--ropensci.netlify.app/blog/2018/06/26/roomba/). Except where otherwise noted, content on this site is licensed under the CC-BY license.
Data == knowledge! Much of the data we use, whether it be from
government repositories, social media, GitHub, or e-commerce sites, comes
from public-facing APIs. The quantity of data available is truly
staggering, but munging JSON output into a format that is easily
analyzable in R is an equally staggering undertaking. When JSON is
turned into an R object, it usually becomes a deeply nested list riddled
with missing values that is difficult to untangle into a tidy format.
Moreover, every API presents its own challenges; code you’ve written to
clean up data from GitHub isn’t necessarily going to work on Twitter
data, as each API spews data out in its own unique, headache-inducing
nested list structure. To ease and generalize this process, Amanda
Dobbyn proposed an unconf18 project for a general API response tidier!
Welcome roomba, our first stab at easing the process of tidying nested
lists!
roomba will eventually be able to walk nested lists in a variety of
different structures from JSON output, replace NULL or .empty values
with NAs or a user-specified value, and return a tibble with names
matching a user-specified list. Of course, in two days we haven’t
fully achieved this vision, but we’re off to a promising start.
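To make the NULL-replacement idea concrete, here is a minimal base R sketch. It is not roomba's implementation: replace_empty() is a hypothetical helper and the response list is made-up data, used only to show how NULL or zero-length elements in a nested list could be swapped for NA or another fill value.

```r
# Hypothetical helper (not part of roomba): recursively replace NULL or
# zero-length elements of a nested list with a fill value.
replace_empty <- function(x, fill = NA) {
  if (is.null(x) || length(x) == 0) {
    return(fill)
  }
  if (is.list(x)) {
    # lapply() keeps names and also visits elements that are NULL
    return(lapply(x, replace_empty, fill = fill))
  }
  x
}

# Toy API-like response (made-up data, not a real API payload)
response <- list(
  list(name = "a", geo = NULL, tags = list()),
  list(name = "b", geo = list(lat = 1.5, lon = NULL), tags = list("x"))
)

cleaned <- replace_empty(response)
str(cleaned)
```

Passing a different fill, for example replace_empty(response, fill = "missing"), would mirror the user-specified value mentioned above.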
It was clear from the lively discussion in the #runconf18 issues
repository leading up to the unconf that Amanda was on to something
good. Thanks to input from Jenny Bryan, Jim Hester, Carl Boettiger,
Scott Chamberlain, Bob Rudis, and Noam Ross, we had a lot of ideas to
work with when the unconf began.
Fortunately, Jim already had a function called dfs_idx() written to
perform depth-first searches of nested lists from the GitHub GraphQL
API. With the core list-traversal code out of the way, we split our
efforts between developing a usable interface, stockpiling .JSON files
to test on, and developing a Shiny app.
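For readers unfamiliar with depth-first traversal of nested lists, the following base R sketch illustrates the general technique. It is not the package's dfs_idx(); find_paths() and the tweets data are hypothetical and only show how a recursive walk can record the index path to every element with a given name.

```r
# Hypothetical sketch (not the package's dfs_idx()): depth-first walk that
# records the index path to every element whose name matches `target`.
find_paths <- function(x, target, path = integer(0)) {
  found <- list()
  if (!is.list(x)) {
    return(found)
  }
  nms <- names(x)
  for (i in seq_along(x)) {
    here <- c(path, i)
    if (!is.null(nms) && !is.na(nms[i]) && nms[i] == target) {
      found <- c(found, list(here))
    }
    # Recurse into the child before moving on to the next sibling
    found <- c(found, find_paths(x[[i]], target, here))
  }
  found
}

# Made-up data shaped loosely like a list of tweets
tweets <- list(
  list(created_at = "Mon", user = list(name = "Ann", created_at = "Tue")),
  list(created_at = "Wed", user = list(name = "Bob", created_at = "Thu"))
)

paths <- find_paths(tweets, "name")
# Each path can be handed to `[[` for recursive indexing
user_names <- vapply(paths, function(p) tweets[[p]], character(1))
print(user_names)
```

Storing index paths rather than values keeps the traversal separate from the extraction, so the same paths can be reused to read or replace elements.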
We’ve got the basic structure of roomba sorted out, and you should
install it from GitHub to try it out! Here are a few of the examples
we’ve put together.
library(roomba)
#load twitter data example
data(twitter_data)
#roomba-fy!
roomba(twitter_data, c("created_at", "name"))
## # A tibble: 24 x 2
##    name                                created_at
##    <chr>                               <chr>
##  1 Code for America                    Mon Aug 10 18:59:29 +0000 2009
##  2 Ben Lorica <U+7F57><U+745E><U+5361> Mon Dec 22 22:06:18 +0000 2008
##  3 Dan Sholler                         Thu Apr 03 20:09:24 +0000 2014
##  4 Code for America                    Mon Aug 10 18:59:29 +0000 2009
##  5 FiveThirtyEight                     Tue Jan 21 21:39:32 +0000 2014
##  6 Digital Impact                      Wed Oct 07 21:10:53 +0000 2009
##  7 Drew Williams                       Thu Aug 07 18:41:29 +0000 2014
##  8 joe                                 Fri May 29 13:25:25 +0000 2009
##  9 Data Analysts 4 Good                Wed May 07 16:55:33 +0000 2014
## 10 Ryan Frederick                      Sun Mar 01 19:06:53 +0000 2009
## # ... with 14 more rows
A look at just the first element of the twitter_data list shows how
much roomba has simplified this process.
twitter_data[[1]]
## $created_at
## [1] "Mon May 21 17:58:09 +0000 2018"
##
## $id
## [1] 9.98624e+17
##
## $id_str
## [1] "998623997397876743"
##
## $text
## [1] "Could a program like food stamps have a Cambridge Analytica moment? How do we allow for the innovation that data pl https://t.co/7tVf1qmNmq"
##
## $truncated
## [1] TRUE
##
## $entities
## $entities$hashtags
## list()
##
## $entities$symbols
## list()
##
## $entities$user_mentions
## list()
##
## $entities$urls
## $entities$urls[[1]]
## $entities$urls[[1]]$url
## [1] "https://t.co/7tVf1qmNmq"
##
## $entities$urls[[1]]$expanded_url
## [1] "https://twitter.com/i/web/status/998623997397876743"
##
## $entities$urls[[1]]$display_url
## [1] "twitter.com/i/web/status/9"
##
## $entities$urls[[1]]$indices
## $entities$urls[[1]]$indices[[1]]
## [1] 117
##
## $entities$urls[[1]]$indices[[2]]
## [1] 140
##
##
##
##
##
## $source
## [1] "<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck</a>"
##
## $in_reply_to_status_id
## NULL
##
## $in_reply_to_status_id_str
## NULL
##
## $in_reply_to_user_id
## NULL
##
## $in_reply_to_user_id_str
## NULL
##
## $in_reply_to_screen_name
## NULL
##
## $user
## $user$id
## [1] 64482503
##
## $user$id_str
## [1] "64482503"
##
## $user$name
## [1] "Code for America"
##
## $user$screen_name
## [1] "codeforamerica"
##
## $user$location
## [1] "San Francisco, California"
##
## $user$description
## [1] "Government can work for the people, by the people, in the 21st century. Help us make it so."
##
## $user$url
## [1] "https://t.co/l9lokka0rJ"
##
## $user$entities
## $user$entities$url
## $user$entities$url$urls
## $user$entities$url$urls[[1]]
## $user$entities$url$urls[[1]]$url
## [1] "https://t.co/l9lokka0rJ"
##
## $user$entities$url$urls[[1]]$expanded_url
## [1] "http://codeforamerica.org"
##
## $user$entities$url$urls[[1]]$display_url
## [1] "codeforamerica.org"
##
## $user$entities$url$urls[[1]]$indices
## $user$entities$url$urls[[1]]$indices[[1]]
## [1] 0
##
## $user$entities$url$urls[[1]]$indices[[2]]
## [1] 23
##
##
##
##
##
## $user$entities$description
## $user$entities$description$urls
## list()
##
##
##
## $user$protected
## [1] FALSE
##
## $user$followers_count
## [1] 49202
##
## $user$friends_count
## [1] 1716
##
## $user$listed_count
## [1] 2659
##
## $user$created_at
## [1] "Mon Aug 10 18:59:29 +0000 2009"
##
## $user$favourites_count
## [1] 4490
##
## $user$utc_offset
## [1] -25200
##
## $user$time_zone
## [1] "Pacific Time (US & Canada)"
##
## $user$geo_enabled
## [1] TRUE
##
## $user$verified
## [1] TRUE
##
## $user$statuses_count
## [1] 15912
##
## $user$lang
## [1] "en"
##
## $user$contributors_enabled
## [1] FALSE
##
## $user$is_translator
## [1] FALSE
##
## $user$is_translation_enabled
## [1] FALSE
##
## $user$profile_background_color
## [1] "EBEBEB"
##
## $user$profile_background_image_url
## [1] "http://abs.twimg.com/images/themes/theme7/bg.gif"
##
## $user$profile_background_image_url_https
## [1] "https://abs.twimg.com/images/themes/theme7/bg.gif"
##
## $user$profile_background_tile
## [1] FALSE
##
## $user$profile_image_url
## [1] "http://pbs.twimg.com/profile_images/615534833645678592/iAO_Lytr_normal.jpg"
##
## $user$profile_image_url_https
## [1] "https://pbs.twimg.com/profile_images/615534833645678592/iAO_Lytr_normal.jpg"
##
## $user$profile_banner_url
## [1] "https://pbs.twimg.com/profile_banners/64482503/1497895952"
##
## $user$profile_link_color
## [1] "CF1B41"
##
## $user$profile_sidebar_border_color
## [1] "FFFFFF"
##
## $user$profile_sidebar_fill_color
## [1] "F3F3F3"
##
## $user$profile_text_color
## [1] "333333"
##
## $user$profile_use_background_image
## [1] FALSE
##
## $user$has_extended_profile
## [1] FALSE
##
## $user$default_profile
## [1] FALSE
##
## $user$default_profile_image
## [1] FALSE
##
## $user$following
## [1] TRUE
##
## $user$follow_request_sent
## [1] FALSE
##
## $user$notifications
## [1] FALSE
##
## $user$translator_type
## [1] "none"
##
##
## $geo
## NULL
##
## $coordinates
## NULL
##
## $place
## NULL
##
## $contributors
## NULL
##
## $is_quote_status
## [1] FALSE
##
## $retweet_count
## [1] 0
##
## $favorite_count
## [1] 0
##
## $favorited
## [1] FALSE
##
## $retweeted
## [1] FALSE
##
## $possibly_sensitive
## [1] FALSE
##
## $possibly_sensitive_appealable
## [1] FALSE
##
## $lang
## [1] "en"
We created a Shiny app too, which in its current state allows you to
select a .Rda or .JSON file, pick two variables, and create a
scatterplot of them.
Run the app like this:
shiny_roomba()
Of course, in two days we weren’t able to build a magical
one-size-fits-all solution to every API response data headache. Right
now, the main barrier to usability is that both the roomba() function
and the shiny_roomba() app only work on sub-list items of the same
length and same data type stored at the same depth. To illustrate with
twitter_data:
#This doesn't work because "user" has data of different types and lengths
roomba(twitter_data, c("user"))
## # A tibble: 1,007 x 1
##    user
##    <list>
##  1 <int [1]>
##  2 <chr [1]>
##  3 <chr [1]>
##  4 <chr [1]>
##  5 <chr [1]>
##  6 <chr [1]>
##  7 <chr [1]>
##  8 <list [2]>
##  9 <lgl [1]>
## 10 <int [1]>
## # ... with 997 more rows
#This doesn't work because "name" and "retweet_count" are at different depths.
roomba(twitter_data, c("name","retweet_count"))
## # A tibble: 0 x 0
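Until fields at different depths are supported, one possible workaround is to extract each field by an explicit path and assemble the columns by hand. The base R sketch below assumes made-up data shaped like a tweet; pluck_or_na() is a hypothetical helper, not a roomba function.

```r
# Made-up data mimicking the shape of a tweet: "retweet_count" sits at the
# top level of each tweet, while "name" is nested inside "user".
tweets <- list(
  list(retweet_count = 0L, user = list(name = "Ann")),
  list(retweet_count = 3L, user = list(name = "Bob")),
  list(retweet_count = NULL, user = list(name = "Cy"))
)

# Hypothetical helper: follow a path of names into a nested list and
# return NA when any step is missing or NULL.
pluck_or_na <- function(x, path) {
  for (key in path) {
    if (!is.list(x) || is.null(x[[key]])) {
      return(NA)
    }
    x <- x[[key]]
  }
  x
}

# Build one column per field, each with its own path
tidy <- data.frame(
  name = vapply(
    tweets,
    function(tw) as.character(pluck_or_na(tw, c("user", "name"))),
    character(1)
  ),
  retweet_count = vapply(
    tweets,
    function(tw) as.integer(pluck_or_na(tw, "retweet_count")),
    integer(1)
  ),
  stringsAsFactors = FALSE
)
print(tidy)
```

The trade-off is that the caller has to know each field's path up front, which is exactly the bookkeeping roomba aims to remove.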
In addition, we’ve got some features we want to add, such as handling a
larger variety of column-name inputs (e.g. passing a string for a single
column name) and keeping all values even if they are all NULL. We would
love your feedback on other things we can add (open an issue in our
GitHub repository)!
Amanda Dobbyn
Job: Data Scientist at Earlybird Software
Project contributions: initial GH issue, package name, wrapper for dfs_idx()
Jim Hester
Job: Software Engineer at RStudio
Project contributions: dfs_idx() and remove_nulls() functions, package
building, README, and debugging
Christine Stawitz
Job: Postdoctoral researcher at University of Washington’s School of Aquatic and Fishery Sciences
Project contributions: Shiny app, README and blog post writing
Laura DeCicco
Job: Data Scientist at U.S. Geological Survey
Project contributions: Fixing merge conflicts :)
Isabella Velasquez
Job: Data Analyst at the Bill & Melinda Gates Foundation
Project contributions: hex sticker!