Use rjsoncons for querying, transforming, and searching JSON, NDJSON, or R objects using JMESpath, JSONpath, or JSONpointer. rjsoncons supports JSON patch for document editing, and JSON schema validation. Link to the package for direct access to additional features in the jsoncons C++ library.
Install the released package version from CRAN
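The command for this step is not shown here; a minimal sketch, assuming the standard `install.packages()` route and the same CRAN mirror used below for remotes. The `requireNamespace()` guard is an illustrative addition so the call is skipped when the package is already present.

```r
# Install rjsoncons from CRAN only if it is not already available
if (!requireNamespace("rjsoncons", quietly = TRUE))
    install.packages("rjsoncons", repos = "https://CRAN.R-project.org")
```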
Install the development version with
if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes", repos = "https://CRAN.R-project.org")
remotes::install_github("mtmorgan/rjsoncons")
Attach the installed package to your R session, and check the version of the C++ library in use
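A short sketch of this step, assuming the package is installed; the variable name `jsoncons_version` is only for illustration.

```r
library(rjsoncons)

# Version string of the jsoncons C++ library bundled with the package
jsoncons_version <- rjsoncons::version()
jsoncons_version
```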
Functions in this package work on JSON or NDJSON character vectors, file paths and URLs to JSON or NDJSON documents, and R objects that can be transformed to a JSON string.
j_query()
Here is a simple JSON example document
json <- '{
"locations": [
{"name": "Seattle", "state": "WA"},
{"name": "New York", "state": "NY"},
{"name": "Bellevue", "state": "WA"},
{"name": "Olympia", "state": "WA"}
]
}'
There are several common use cases. Use rjsoncons to query the JSON string using JSONpath, JMESpath or JSONpointer syntax to filter larger documents to records of interest, e.g., only cities in New York state, using ‘JMESpath’ syntax.
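As an illustration of the filtering use case, the following sketch repeats the example document so that it runs on its own and applies a JMESpath filter expression; the variable name `ny` is an assumption made for the example.

```r
library(rjsoncons)

json <- '{"locations": [
    {"name": "Seattle", "state": "WA"},
    {"name": "New York", "state": "NY"},
    {"name": "Bellevue", "state": "WA"},
    {"name": "Olympia", "state": "WA"}
]}'

# JMESpath filter: keep elements of 'locations' whose 'state' is 'NY'
ny <- j_query(json, "locations[?state == 'NY']")
cat(ny, "\n")
```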
Use the as = "R" argument to extract deeply nested elements as R objects, e.g., a character vector of city names in Washington state.
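A sketch of this extraction, repeating the example document so the block is self-contained; `wa_names` is a name chosen for illustration.

```r
library(rjsoncons)

json <- '{"locations": [
    {"name": "Seattle", "state": "WA"},
    {"name": "New York", "state": "NY"},
    {"name": "Bellevue", "state": "WA"},
    {"name": "Olympia", "state": "WA"}
]}'

# Filter to Washington state, project the 'name' field, return an R vector
wa_names <- j_query(json, "locations[?state == 'WA'].name", as = "R")
wa_names
```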
The JSON Pointer specification is simpler, indexing a single object in the document. JSON arrays are 0-based.
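For instance, a JSONpointer path can pick out the state of the first location. This is a minimal sketch on a shortened copy of the example document; note the index `0` for the first array element.

```r
library(rjsoncons)

json <- '{"locations": [
    {"name": "Seattle", "state": "WA"},
    {"name": "New York", "state": "NY"}
]}'

# '/locations/0/state': the 'state' member of the first (0-based) location
first_state <- j_query(json, "/locations/0/state", as = "R")
first_state
```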
The examples above use j_query(), which automatically infers the query specification from the form of path using j_path_type(). It may be useful to indicate the query specification more explicitly using jsonpointer(), jsonpath(), or jmespath(); examples illustrating features available for each query specification are on the help pages ?jsonpointer, ?jsonpath, and ?jmespath.
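The sketch below shows how the inferred specification can be inspected and how the explicit functions can be called. The paths are illustrative, and the comments on inference reflect an assumption that the leading character of the path drives the choice.

```r
library(rjsoncons)

json <- '{"locations": [
    {"name": "Seattle", "state": "WA"},
    {"name": "New York", "state": "NY"},
    {"name": "Bellevue", "state": "WA"},
    {"name": "Olympia", "state": "WA"}
]}'

# Inspect the specification inferred from the form of each path
type_pointer <- j_path_type("/locations/0/name")    # leading '/'
type_path <- j_path_type("$.locations[0].name")     # leading '$'
type_jmes <- j_path_type("locations[0].name")       # anything else

# The same kind of query, with the specification chosen explicitly
first_pointer <- jsonpointer(json, "/locations/0/name", as = "R")
first_jmes <- jmespath(json, "locations[0].name", as = "R")
all_names <- jsonpath(json, "$.locations[*].name", as = "R")
```

Choosing the function explicitly avoids surprises when a path could plausibly be read under more than one specification.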
j_pivot()
The following transforms a nested JSON document into a format that can be incorporated directly in R as a data.frame.
path <- '{
name: locations[].name,
state: locations[].state
}'
j_query(json, path, as = "R") |>
data.frame()
## name state
## 1 Seattle WA
## 2 New York NY
## 3 Bellevue WA
## 4 Olympia WA
The transformation from JSON ‘array-of-objects’ to ‘object-of-arrays’ suitable for direct representation as a data.frame is common, and is implemented directly as j_pivot().
j_pivot(json, "locations", as = "data.frame")
## name state
## 1 Seattle WA
## 2 New York NY
## 3 Bellevue WA
## 4 Olympia WA
j_pivot() also supports as = "tibble" when the dplyr package is installed.
rjsoncons supports NDJSON (new-line delimited JSON). NDJSON consists of a file or character vector where each line / element represents a JSON record. This example uses data from the GitHub Archive project recording all actions on public GitHub repositories. The data included in the package are the first 10 lines of https://data.gharchive.org/2023-02-08-0.json.gz.
NDJSON can be read into R (ndjson <- readLines(ndjson_file)) and used in j_query() / j_pivot(), but it is often better to leave full NDJSON files on disk. Thus the first argument to j_query() or j_pivot() is usually a (text or gz-compressed) file path or URL. Two additional options are available when working with NDJSON. n_records limits the number of records processed, which can be very useful when exploring the data; for instance, n_records = 1 restricts a query to the first record of a file so that it can be viewed interactively.
The option verbose = TRUE adds a progress indicator, which provides confidence that progress is being made while parsing large files. The progress bar requires the cli package.
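To illustrate n_records without relying on a file, the following sketch uses a small, hypothetical NDJSON character vector (three made-up records) in place of the GitHub Archive data; the record contents and variable names are assumptions for the example.

```r
library(rjsoncons)

# Hypothetical NDJSON: one JSON record per element of the character vector
ndjson <- c(
    '{"id": "1", "type": "PushEvent"}',
    '{"id": "2", "type": "CreateEvent"}',
    '{"id": "3", "type": "PushEvent"}'
)

# Process only the first record
first <- j_query(ndjson, "{id: id, type: type}", n_records = 1)

# Process all records; one result per NDJSON record
all_records <- j_query(ndjson, "{id: id, type: type}")

length(first)
length(all_records)
```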
j_query() provides a one-to-one mapping of NDJSON lines / elements to the return value, e.g., j_query(ndjson_file, "@", as = "string") on an NDJSON file with 1000 lines will return a character vector of 1000 elements, or with j_query(ndjson, "@", as = "R") an R list with length 1000.
j_query(ndjson_file, "{id: id, type: type}", n_records = 5)
## [1] "{\"id\":\"26939254345\",\"type\":\"DeleteEvent\"}"
## [2] "{\"id\":\"26939254358\",\"type\":\"PushEvent\"}"
## [3] "{\"id\":\"26939254361\",\"type\":\"CreateEvent\"}"
## [4] "{\"id\":\"26939254365\",\"type\":\"CreateEvent\"}"
## [5] "{\"id\":\"26939254366\",\"type\":\"PushEvent\"}"
j_pivot() transforms an NDJSON file or character vector of objects into a format convenient for input in R. j_pivot() works particularly well with NDJSON files and JMESpath paths, because JMESpath provides flexibility in creating JSON objects to be pivoted.
j_pivot(ndjson_file, "{id: id, type: type}", as = "data.frame")
## id type
## 1 26939254345 DeleteEvent
## 2 26939254358 PushEvent
## 3 26939254361 CreateEvent
## 4 26939254365 CreateEvent
## 5 26939254366 PushEvent
## 6 26939254367 PushEvent
## 7 26939254379 PushEvent
## 8 26939254380 IssuesEvent
## 9 26939254382 PushEvent
## 10 26939254383 PushEvent
Filtering NDJSON files can require relatively more complicated paths, e.g., to filter ‘PushEvent’ types from organizations, construct a query that acts on each NDJSON record to return an array of a single object, then apply a filter to replace uninteresting elements with 0-length arrays (using as = "tibble" often transforms the R list-of-vectors to a tibble in a more pleasing and robust manner compared to as = "data.frame").
path <-
"[{id: id, type: type, org: org}]
[?@.type == 'PushEvent' && @.org != null] |
[0]"
j_pivot(ndjson_file, path, as = "data.frame")
## id type org.id org.login org.gravatar_id
## 1 26939254358 PushEvent 123667276 johnbieren-testing
## 2 26939254382 PushEvent 123667276 johnbieren-testing
## org.url
## 1 https://api.github.com/orgs/johnbieren-testing
## 2 https://api.github.com/orgs/johnbieren-testing
## org.avatar_url org.id.1 org.login.1
## 1 https://avatars.githubusercontent.com/u/123667276? 120284018 mornystannit
## 2 https://avatars.githubusercontent.com/u/123667276? 120284018 mornystannit
## org.gravatar_id.1 org.url.1
## 1 https://api.github.com/orgs/mornystannit
## 2 https://api.github.com/orgs/mornystannit
## org.avatar_url.1
## 1 https://avatars.githubusercontent.com/u/120284018?
## 2 https://avatars.githubusercontent.com/u/120284018?
A more complete example is used in the NDJSON extended vignette
rjsoncons can filter and transform R objects. These are converted to JSON using jsonlite::toJSON() before queries are made; toJSON() arguments like auto_unbox = TRUE can be added to the function call.
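A minimal sketch of querying an R object, assuming a hand-built list that mirrors part of the earlier locations document; `lst` and `wa` are names chosen for illustration.

```r
library(rjsoncons)

# An R list with the same shape as the 'locations' JSON document
lst <- list(locations = list(
    list(name = "Seattle", state = "WA"),
    list(name = "New York", state = "NY"),
    list(name = "Olympia", state = "WA")
))

# auto_unbox = TRUE is passed on to jsonlite::toJSON(), so length-1
# R vectors such as "WA" become JSON scalars rather than arrays
wa <- j_query(
    lst, "locations[?state == 'WA'].name",
    auto_unbox = TRUE, as = "R"
)
wa
```

Without auto_unbox = TRUE each field would be serialized as a length-1 array, and the comparison `state == 'WA'` in the path would not match.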
JSON Patch provides a simple way to edit or transform a JSON document using JSON commands.
j_patch_apply()
Starting with a JSON document containing an array of biscuits, one can "add" another biscuit and copy a favorite biscuit to a new location using the following patch
patch <- '[
{"op": "add", "path": "/biscuits/1", "value": { "name": "Ginger Nut" }},
{"op": "copy", "from": "/biscuits/2", "path": "/best_biscuit"}
]'
The paths are specified using JSONpointer notation; remember that JSON arrays are 0-based, compared to 1-based R arrays. Applying the patch results in a new JSON document.
j_patch_apply(json, patch)
## [1] "{\"biscuits\":[{\"name\":\"Digestive\"},{\"name\":\"Ginger Nut\"},{\"name\":\"Choco Leibniz\"}],\"best_biscuit\":{\"name\":\"Choco Leibniz\"}}"
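The starting document is not reproduced in the text above. The sketch below assumes it is the two-biscuit document implied by the patched output, and checks the result by parsing it with as_r().

```r
library(rjsoncons)

# Assumed starting document, inferred from the patched output shown above
json <- '{"biscuits": [{"name": "Digestive"}, {"name": "Choco Leibniz"}]}'

patch <- '[
    {"op": "add", "path": "/biscuits/1", "value": {"name": "Ginger Nut"}},
    {"op": "copy", "from": "/biscuits/2", "path": "/best_biscuit"}
]'

# Operations are applied in order: after the "add", index 2 refers to
# 'Choco Leibniz', which is then copied to '/best_biscuit'
patched <- j_patch_apply(json, patch) |> as_r()
str(patched)
```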
Patches can also be created from R objects with the helper function j_patch_op().
ops <- c(
j_patch_op(
"add", "/biscuits/1", value = list(name = "Ginger Nut"),
auto_unbox = TRUE
),
j_patch_op("copy", "/best_biscuit", from = "/biscuits/2")
)
identical(j_patch_apply(json, patch), j_patch_apply(json, ops))
## [1] TRUE
j_patch_op() takes care of unboxing op=, path=, and from=, but some care must be taken in ‘unboxing’ the value= argument for operations such as ‘add’; it may also be appropriate to unbox only specific fields, e.g.,
value <- list(name = jsonlite::unbox("Ginger Nut"))
j_patch_op("add", "/biscuits/1", value = value)
## [
## {"op": "add", "path": "/biscuits/1", "value": {"name": "Ginger Nut"}}
## ]
From the JSON patch web site, available operations and example JSON are:
add – add elements to an existing document.
{"op": "add", "path": "/biscuits/1", "value": {"name": "Ginger Nut"}}
remove – remove elements from a document.
{"op": "remove", "path": "/biscuits/0"}
replace – replace one element with another.
{
"op": "replace", "path": "/biscuits/0/name",
"value": "Chocolate Digestive"
}
copy – copy a path to another location.
{"op": "copy", "from": "/biscuits/0", "path": "/best_biscuit"}
move – move a path to another location.
{"op": "move", "from": "/biscuits", "path": "/cookies"}
test – test that the value at a path equals the given value; if the test fails, do not apply any of the patch.
{"op": "test", "path": "/best_biscuit/name", "value": "Choco Leibniz"}
Formal description of these operations is provided in Section 4 of RFC6902. A patch command is always an array, even when a single operation is involved.
j_patch_from()
The j_patch_from() function constructs a patch from the difference between two documents.
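A sketch of this idea, using two small hypothetical documents `x` and `y`. It assumes the first argument is the source and the second the target, so that applying the computed patch to `x` reproduces `y`.

```r
library(rjsoncons)

# Two hypothetical documents differing in the second biscuit
x <- '{"biscuits": [{"name": "Digestive"}, {"name": "Choco Leibniz"}]}'
y <- '{"biscuits": [{"name": "Digestive"}, {"name": "Ginger Nut"}]}'

# JSON patch describing how to transform 'x' into 'y'
patch <- j_patch_from(x, y)
cat(patch, "\n")

# Applying the computed patch to 'x' should give a document equal to 'y'
round_trip <- j_patch_apply(x, patch)
identical(as_r(round_trip), as_r(y))
```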
JSON schema provides structure to JSON documents. j_schema_is_valid() checks that a JSON document is valid against a specified schema, and j_schema_validate() tries to illustrate how a document deviates from the schema.
As an example consider j_patch_op(), where the operation is supposed to conform to the JSON patch schema. For convenience, a copy of this schema is available in rjsoncons.
## alternatively: schema <- "https://json.schemastore.org/json-patch"
schema <- system.file(package = "rjsoncons", "extdata", "json-patch.json")
cat(readLines(schema), sep = "\n")
## {
## "$schema": "http://json-schema.org/draft-04/schema#",
## "definitions": {
## "path": {
## "description": "A JSON Pointer path.",
## "type": "string"
## }
## },
## "id": "https://json.schemastore.org/json-patch.json",
## "items": {
## "oneOf": [
## {
## "additionalProperties": false,
## "required": ["value", "op", "path"],
## "properties": {
## "path": {
## "$ref": "#/definitions/path"
## },
## "op": {
## "description": "The operation to perform.",
## "type": "string",
## "enum": ["add", "replace", "test"]
## },
## "value": {
## "description": "The value to add, replace or test."
## }
## }
## },
## {
## "additionalProperties": false,
## "required": ["op", "path"],
## "properties": {
## "path": {
## "$ref": "#/definitions/path"
## },
## "op": {
## "description": "The operation to perform.",
## "type": "string",
## "enum": ["remove"]
## }
## }
## },
## {
## "additionalProperties": false,
## "required": ["from", "op", "path"],
## "properties": {
## "path": {
## "$ref": "#/definitions/path"
## },
## "op": {
## "description": "The operation to perform.",
## "type": "string",
## "enum": ["move", "copy"]
## },
## "from": {
## "$ref": "#/definitions/path",
## "description": "A JSON Pointer path pointing to the location to move/copy from."
## }
## }
## }
## ]
## },
## "title": "JSON schema for JSONPatch files",
## "type": "array"
## }
The well-formed ‘op’ is valid, and j_schema_validate() reports no errors (an empty JSON array)
op <- '[{
"op": "add", "path": "/biscuits/1",
"value": { "name": "Ginger Nut" }
}]'
j_schema_is_valid(op, schema)
## [1] TRUE
j_schema_validate(op, schema)
## [1] "[]"
Introduce an invalid ‘op’, "op": "invalid_op", and the document is no longer valid against the schema.
op <- '[{
"op": "invalid_op", "path": "/biscuits/1",
"value": { "name": "Ginger Nut" }
}]'
j_schema_is_valid(op, schema)
## [1] FALSE
The reason can be understood from (careful!) consideration of the output of j_schema_validate(), with reference to the schema itself.
j_schema_validate(op, schema, as = "tibble") |>
tibble::glimpse()
## Rows: 1
## Columns: 6
## $ valid <lgl> FALSE
## $ evaluationPath <chr> "/items/oneOf"
## $ schemaLocation <chr> "https://json.schemastore.org/json-patch.json#/items/…
## $ instanceLocation <chr> "/0"
## $ error <chr> "No schema matched, but exactly one of them is requir…
## $ details <list> [[FALSE, "/items/oneOf/0/properties/op/enum", "https:…
The validation indicates that the schema evaluationPath ‘/items/oneOf’ is not satisfied, because of the error ‘No schema [i.e., ‘oneOf’ elements] matched, …’.
The ‘details’ column summarizes why each of the 3 elements of /items/oneOf fails the schema specification; use as = "details" to extract this directly
j_schema_validate(op, schema, as = "details") |>
tibble::glimpse()
## Rows: 6
## Columns: 5
## $ valid <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
## $ evaluationPath <chr> "/items/oneOf/0/properties/op/enum", "/items/oneOf/1/…
## $ schemaLocation <chr> "https://json.schemastore.org/json-patch.json#/items/…
## $ instanceLocation <chr> "/0/op", "/0/op", "/0/value", "/0", "/0/op", "/0/valu…
## $ error <chr> "'invalid_op' is not a valid enum value.", "'invalid_…
This indicates that the first item in the schema is rejected because ‘invalid_op’ is not a valid enum value. Reasons for rejecting other items can be explored using similar steps.
It can sometimes be helpful to explore JSON documents by ‘flattening’ the JSON to an object of path / value pairs, where the path is the JSONpointer path to the corresponding value. It is then straight-forward to search this flattened object for, e.g., the path to a known field or value. As an example, consider the object
codes <- '{
"discards": {
"1000": "Record does not exist",
"1004": "Queue limit exceeded",
"1010": "Discarding timed-out partial msg"
},
"warnings": {
"0": "Phone number missing country code",
"1": "State code missing",
"2": "Zip code missing"
}
}'
The ‘flat’ JSON of this can be represented as a named list (using str() to provide a compact visual representation)
j_flatten(codes, as = "R") |>
str()
## List of 6
## $ /discards/1000: chr "Record does not exist"
## $ /discards/1004: chr "Queue limit exceeded"
## $ /discards/1010: chr "Discarding timed-out partial msg"
## $ /warnings/0 : chr "Phone number missing country code"
## $ /warnings/1 : chr "State code missing"
## $ /warnings/2 : chr "Zip code missing"
The names of the list are JSONpointer (default) or JSONpath paths, so they can be used in j_query() and j_pivot() as appropriate.
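As a sketch of that round trip, the names of a flattened object can be passed back to j_query(). A shortened copy of the `codes` document is used here so the block runs on its own; `flat`, `paths`, and `value` are illustrative names.

```r
library(rjsoncons)

codes <- '{
    "discards": {
        "1000": "Record does not exist",
        "1004": "Queue limit exceeded"
    },
    "warnings": {"0": "Phone number missing country code"}
}'

flat <- j_flatten(codes, as = "R")
paths <- names(flat)

# Each name is a JSONpointer path into the original document
value <- j_query(codes, paths[[1]], as = "R")
value
```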
There are two ways to find known keys and values. The first is to use exact matching to one or more keys or values, e.g.,
j_find_values(
codes, c("Record does not exist", "State code missing"),
as = "tibble"
)
## # A tibble: 2 × 2
## path value
## <chr> <chr>
## 1 /discards/1000 Record does not exist
## 2 /warnings/1 State code missing
j_find_keys(codes, "warnings", as = "tibble")
## # A tibble: 3 × 2
## path value
## <chr> <chr>
## 1 /warnings/0 Phone number missing country code
## 2 /warnings/1 State code missing
## 3 /warnings/2 Zip code missing
It is also possible to match using a regular expression.
j_find_values_grep(codes, "missing", as = "tibble")
## # A tibble: 3 × 2
## path value
## <chr> <chr>
## 1 /warnings/0 Phone number missing country code
## 2 /warnings/1 State code missing
## 3 /warnings/2 Zip code missing
j_find_keys_grep(codes, "card.*/100", as = "tibble") # span key delimiters
## # A tibble: 2 × 2
## path value
## <chr> <chr>
## 1 /discards/1000 Record does not exist
## 2 /discards/1004 Queue limit exceeded
Keys are always character vectors, but values can be of different types; j_find_values() supports searches on these.
j <- '{"x":[1,[2, 3]],"y":{"a":4}}'
j_flatten(j, as = "R") |> str()
## List of 4
## $ /x/0 : int 1
## $ /x/1/0: int 2
## $ /x/1/1: int 3
## $ /y/a : int 4
j_find_values(j, c(2, 4), as = "tibble")
## # A tibble: 2 × 2
## path value
## <chr> <int>
## 1 /x/1/0 2
## 2 /y/a 4
A common operation might be to find the path to a known value, and then to query the original JSON to find the object in which the value is contained.
j_find_values(j, 3, as = "tibble")
## # A tibble: 1 × 2
## path value
## <chr> <int>
## 1 /x/1/1 3
## path to '3' is '/x/1/1', so containing object is at '/x/1'
j_query(j, "/x/1")
## [1] "[2,3]"
j_query(j, "/x/1", as = "R")
## [1] 2 3
Both JSONpointer and JSONpath are supported; an advantage of the latter is that the path distinguishes between integer-valued (unquoted) and string-valued (quoted) keys
j_find_values(j, 3, as = "tibble", path_type = "JSONpath")
## # A tibble: 1 × 2
## path value
## <chr> <int>
## 1 $['x'][1][1] 3
The first argument to j_find_*() can be an R object, JSON or NDJSON string, file, or URL. Using j_find_values() with an R object and JSONpath path_type leads to a path that is easily converted into an R index: double the [ and ] in the path and increment each numerical index by 1:
l <- j |> as_r()
j_find_values(l, 3, auto_unbox = TRUE, path_type = "JSONpath", as = "tibble")
## # A tibble: 1 × 2
## path value
## <chr> <int>
## 1 $['x'][1][1] 3
l[['x']][[2]] # siblings
## [1] 2 3
NDJSON files are flattened into character vectors, with each element the flattened version of the corresponding NDJSON record.
The package includes a JSON parser, used with the argument as = "R" or directly with as_r(). The main rules of this transformation are outlined here. JSON arrays of a single type (boolean, integer, double, string) are transformed to R vectors of the same length and corresponding type.
as_r('[true, false, true]') # boolean -> logical
## [1] TRUE FALSE TRUE
as_r('[1, 2, 3]') # integer -> integer
## [1] 1 2 3
as_r('[1.0, 2.0, 3.0]') # double -> numeric
## [1] 1 2 3
as_r('["a", "b", "c"]') # string -> character
## [1] "a" "b" "c"
JSON arrays mixing integer and double values are transformed to R numeric vectors.
If a JSON integer array contains a value larger than R’s 32-bit integer representation, the array is transformed to an R numeric vector. NOTE that this results in loss of precision for JSON integer values greater than 2^53.
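The two numeric rules above can be checked with a short sketch; the input arrays are chosen for illustration, with 2147483648 (2^31) being one more than the largest 32-bit signed integer.

```r
library(rjsoncons)

# Integer and double values mixed in one array
mixed <- as_r('[1, 2.0]')

# An integer too large for R's 32-bit integer type
large <- as_r('[1, 2147483648]')

class(mixed)
class(large)
```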
JSON objects are transformed to R named lists.
as_r('{}')
## named list()
as_r('{"a": 1.0, "b": [2, 3, 4]}') |> str()
## List of 2
## $ a: num 1
## $ b: int [1:3] 2 3 4
There are several additional details. A JSON scalar and a JSON vector of length 1 are represented in the same way in R.
JSON arrays mixing types other than integer and double are transformed to R lists
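A brief sketch of these two details, using illustrative inputs: a scalar and a length-1 array compare as identical after parsing, and an array mixing boolean, number, and string becomes a list.

```r
library(rjsoncons)

# JSON scalar and JSON array of length 1 have the same R representation
same <- identical(as_r('3'), as_r('[3]'))
same

# Mixed types other than integer / double are kept as list elements
mixed <- as_r('[true, 1, "a"]')
str(mixed)
```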
JSON null values are represented as R NULL values; arrays of null are transformed to lists
as_r('null') # NULL
## NULL
as_r('[null]') |> str() # list(NULL)
## List of 1
## $ : NULL
as_r('[null, null]') |> str() # list(NULL, NULL)
## List of 2
## $ : NULL
## $ : NULL
Ordering of object members is controlled by the object_names= argument. The default preserves names as they appear in the JSON definition; use "sort" to sort names alphabetically. This argument is applied recursively.
json <- '{"b": 1, "a": {"d": 2, "c": 3}}'
as_r(json) |> str()
## List of 2
## $ b: int 1
## $ a:List of 2
## ..$ d: int 2
## ..$ c: int 3
as_r(json, object_names = "sort") |> str()
## List of 2
## $ a:List of 2
## ..$ c: int 3
## ..$ d: int 2
## $ b: int 1
The parser corresponds approximately to jsonlite::fromJSON() with arguments simplifyVector = TRUE, simplifyDataFrame = FALSE, simplifyMatrix = FALSE.
Unit tests (using the tinytest framework) in the package source provide additional details.
jsonlite::fromJSON()
The built-in parser can be replaced by alternative parsers by returning the query as a JSON string and parsing that string with, e.g., fromJSON() in the jsonlite package.
json <- '{
"locations": [
{"name": "Seattle", "state": "WA"},
{"name": "New York", "state": "NY"},
{"name": "Bellevue", "state": "WA"},
{"name": "Olympia", "state": "WA"}
]
}'
j_query(json, "locations[?state == 'WA']") |>
## `fromJSON()` simplifies list-of-objects to data.frame
jsonlite::fromJSON()
## name state
## 1 Seattle WA
## 2 Bellevue WA
## 3 Olympia WA
The rjsoncons package is particularly useful when accessing elements that might otherwise require complicated application of nested lapply(), purrr expressions, or tidyr unnest_*() (see R for Data Science chapter ‘Hierarchical data’).
The package includes the complete ‘jsoncons’ C++ header-only library, available to other R packages by adding
LinkingTo: rjsoncons
SystemRequirements: C++11
to the DESCRIPTION file. Typical use in an R package would also include LinkingTo: specifications for the cpp11 or Rcpp packages (this package uses cpp11) to provide a C / C++ interface between R and the C++ ‘jsoncons’ library.
This vignette was compiled using the following software versions
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rjsoncons_1.3.1.9100 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.6.5 cli_3.6.3 knitr_1.49
## [4] rlang_1.1.4 xfun_0.49 jsonlite_1.8.9
## [7] glue_1.8.0 buildtools_1.0.0 htmltools_0.5.8.1
## [10] maketools_1.3.1 sys_3.4.3 sass_0.4.9
## [13] fansi_1.0.6 rmarkdown_2.29 evaluate_1.0.1
## [16] jquerylib_0.1.4 tibble_3.2.1 fastmap_1.2.0
## [19] yaml_2.3.10 lifecycle_1.0.4 BiocManager_1.30.25
## [22] compiler_4.4.2 pkgconfig_2.0.3 digest_0.6.37
## [25] R6_2.5.1 utf8_1.2.4 pillar_1.9.0
## [28] magrittr_2.0.3 bslib_0.8.0 tools_4.4.2
## [31] cachem_1.1.0