Use glom
Dictionaries are really cool. They’re very powerful and a large part of Python: so much stuff is built on the dict type. They’re generally quite easy to work with, but what about when you have nested dictionaries that contain lists that contain dictionaries that contain certain keys only in some cases and sometimes there’s another list involved?
Use glom.
Elasticsearch Bulk API Responses
The Elasticsearch Bulk API is an efficient way to perform actions on multiple documents instead of using multiple calls.
Because it’s doing more than one action, the response from this call is a bit complex. If the request body was something that could be processed at all, the HTTP response code will be 200. Beyond that, you need to look at the JSON response for an errors key at the top level. This lets you know if any of the individual items were unsuccessful.
That items list is where things can get tricky. Here’s a stripped down example of a 200 response that includes one error and one success for two documents we sent with the index action:
{
  "errors": true,
  "items": [
    {
      "index": {
        "_id": "F6bbqHIBgo1082mzZuO3",
        "error": {
          "caused_by": {
            "reason": "For input string: \"2020-06-12T14:07:30.452649+00:00\"",
            "type": "illegal_argument_exception"
          },
          "reason": "failed to parse [usage.range.gte]",
          "type": "mapper_parsing_exception"
        },
        "status": 400
      }
    },
    {
      "index": {
        "_id": "GKbbqHIBgo1082mzZuO3",
        "result": "created",
        "status": 201
      }
    }
  ]
}
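Before digging into items, the top-level check described above can be sketched in a few lines. This is a minimal illustration, assuming the response body arrives as JSON text; the function name bulk_has_errors and the shortened sample are made up for this example.

```python
import json


def bulk_has_errors(response_text):
    """Return True when any item in a Bulk API response failed."""
    body = json.loads(response_text)
    # "errors" is a top-level boolean that summarizes every item in the response.
    return bool(body.get("errors", False))


# Shortened, hypothetical response: one failed item and one created item.
sample = (
    '{"errors": true, "items": ['
    '{"index": {"_id": "a", "status": 400}}, '
    '{"index": {"_id": "b", "status": 201}}]}'
)

if bulk_has_errors(sample):
    print("at least one item failed; inspect body['items']")
```

A True here only says that something failed. Finding out which item and why still means walking the items list.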
What I want to do with this response is turn it into a dict where the _ids are keys and their values are a dictionary of consistent keys and values regardless of the success of the action. This way I can more easily do something with the failures and move on from the successes. That’s kind of tricky! Each index includes the _id that we’ll need and a status, but beyond that it depends on how the action fared as to what else it contains.
My plan is to make all _id keys in my dict include an error which will either be None if it worked, or the relevant error details if it didn’t. That gives me one place to look to know success or failure for each item.
Write it ourselves
This isn’t initially that hard to solve. If error exists we use it, if not we use the default None from dict.get.
def parse_body(body):
    if (items := body.get("items")) is None:
        raise Exception("No items in this response")

    result = {}
    for item in items:
        index = item["index"]
        result[index["_id"]] = {
            "error": index.get("error"),  # None by default
            "status": index["status"],
        }
    return result
…which returns:
{'F6bbqHIBgo1082mzZuO3': {'error': {'caused_by': {'reason': 'For input string: "2020-06-12T14:07:30.452649+00:00"',
                                                  'type': 'illegal_argument_exception'},
                                    'reason': 'failed to parse [usage.range.gte]',
                                    'type': 'mapper_parsing_exception'},
                          'status': 400},
 'GKbbqHIBgo1082mzZuO3': {'error': None, 'status': 201}}
That’s relatively straightforward, but it kicks the problem down the road. In the error case now we have another nested dictionary which I want to flatten out. What I really want are the two reason values so I can log them and more clearly point out what’s going wrong.
Write it ourselves, but better
To get a flatter error, something like this could do it:
def parse_body(body):
    if (items := body.get("items")) is None:
        raise Exception("No items in this response")

    result = {}
    for item in items:
        index = item["index"]
        details = {"status": index["status"]}
        if (error := index.get("error")) is None:
            details["error"] = None
        else:
            details["error"] = {
                "reason": error["reason"],
                "cause": error["caused_by"]["reason"],
            }
        result[index["_id"]] = details
    return result
…which returns:
{'F6bbqHIBgo1082mzZuO3': {'error': {'cause': 'For input string: "2020-06-12T14:07:30.452649+00:00"',
                                    'reason': 'failed to parse [usage.range.gte]'},
                          'status': 400},
 'GKbbqHIBgo1082mzZuO3': {'error': None, 'status': 201}}
That’s much better! However, this is quickly becoming more complex. We still need to test this, and between the first and second versions we added more branches in the code that we’ll need to cover.
The cyclomatic complexity of our new approach went from 3 to 4 as measured by the mccabe library, named for Thomas McCabe, who coined the metric. Metrics aside, we can see this code is growing more ifs and loops and indexing the more we add to it, and we’re making a few assumptions that we won’t end up with a KeyError on any of those lookups.
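Those two numbers can be roughly reproduced with the standard library alone. The sketch below is a simplified approximation and not the mccabe library’s algorithm: it starts at one path and adds one for every if, for, while, and except found with the ast module. The names approx_complexity, FIRST, and SECOND are made up for this example, and it needs Python 3.8 or newer to parse the := operator.

```python
import ast

# Node types counted as one extra decision point each (a simplification).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler)


def approx_complexity(source):
    """Approximate cyclomatic complexity as 1 + number of branch nodes."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))


FIRST = '''
def parse_body(body):
    if (items := body.get("items")) is None:
        raise Exception("No items in this response")
    result = {}
    for item in items:
        index = item["index"]
        result[index["_id"]] = {
            "error": index.get("error"),
            "status": index["status"],
        }
    return result
'''

SECOND = '''
def parse_body(body):
    if (items := body.get("items")) is None:
        raise Exception("No items in this response")
    result = {}
    for item in items:
        index = item["index"]
        details = {"status": index["status"]}
        if (error := index.get("error")) is None:
            details["error"] = None
        else:
            details["error"] = {
                "reason": error["reason"],
                "cause": error["caused_by"]["reason"],
            }
        result[index["_id"]] = details
    return result
'''

print(approx_complexity(FIRST))   # 3: base path + if + for
print(approx_complexity(SECOND))  # 4: base path + if + for + if
```

The else branch does not add to the count because it belongs to the same if node.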
Charlie Kelly designing our third attempt at writing this.
Use glom
glom is a library for “Restructuring data, the Python way.” It was made to solve our problem.
Here’s what a solution that meets our needs looks like using glom.
It returns the exact same dict as the second parse_body function.
There’s a lot to unpack here in the glom “spec”, and I’ll walk through it below. glom has an excellent tutorial that can explain it all better than me—and it has a browser-based REPL!—and it’s how I figured a lot of this out. The rest of their docs are well written and comprehensive, so check them out.
- glom.glom takes a target nested object and a spec to transform it. Everything we want is under the "items" key, so that’s the path part of our spec.
- Nested under body["items"] is a list of dicts, all with an "index" key. Line 6 is a sub-path that tells glom to produce an iterable of the contents of each "index" within body["items"].
- Lines 7–20 are a sub-path that tells glom to produce an iterable of a dictionary comprehension where the key is the "_id" of each "index" target dictionary (glom.T accesses the target path) and the value is a dictionary with "status" and "error" keys.
  - The "status" comes directly from the "status" in the "index" target.
  - "error" is more involved and where we start to restructure things.
    - We decided earlier that we want "error" in any case, using None as a signal that there’s not actually an error. glom.Coalesce to the rescue on Lines 11–17. If it can’t create something out of the sub-spec we passed in, the default=None will become the value.
    - For our "reason" we want to take the first-level ["error"]["reason"] from the target "index" dictionary.
    - For our "cause" we want to take the ["caused_by"]["reason"] that is nested within the ["error"] in the target.
- On Line 21 we use glom.Merge() which combines all of the prior Iter specs together into one resulting object.
While it might look intimidating at first, it’s wildly powerful and this example barely scratches the surface of its capabilities. On top of that, consider how much our two hand-made implementations differed from each other: the corresponding change between two glom specs that produce the same results is smaller and adds no extra complexity.
To come back to cyclomatic complexity, the glom implementation of parse_body checks in at 1. To a caller it has no branches, no loops, none of that. It’s a function and it returns a dictionary. That’s not to say it’s not a complex piece of software, but that it takes care of the complexity for you.
Testing our manual versions of this might require a bunch of test cases to ensure we’re covering all of those branches, and will probably require we do something different about those dictionary lookups. Testing our glom implementation requires passing in a body that includes both cases we’re looking at—error and success—and seeing that we get a good result. I’m very picky about dependencies, but as of this writing glom has 97% test coverage and great documentation, so I’m comfortable letting it do the work for us.
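To make the comparison concrete, here is one way “something different about those dictionary lookups” could look in the hand-written version, together with the test cases it drags in. This is a sketch under an assumption: that an error without "caused_by" should produce a cause of None rather than a KeyError. The test names and sample bodies are made up.

```python
def parse_body(body):
    """Flatten a Bulk API response body into {_id: {"status", "error"}}."""
    if (items := body.get("items")) is None:
        raise Exception("No items in this response")
    result = {}
    for item in items:
        index = item["index"]
        error = index.get("error")
        if error is not None:
            error = {
                # .get() returns None instead of raising KeyError
                "reason": error.get("reason"),
                "cause": error.get("caused_by", {}).get("reason"),
            }
        result[index["_id"]] = {"status": index["status"], "error": error}
    return result


def test_success_item_has_no_error():
    body = {"items": [{"index": {"_id": "ok", "result": "created", "status": 201}}]}
    assert parse_body(body) == {"ok": {"status": 201, "error": None}}


def test_error_item_is_flattened():
    error = {"reason": "failed to parse", "caused_by": {"reason": "bad input"}}
    body = {"items": [{"index": {"_id": "bad", "error": error, "status": 400}}]}
    expected = {"reason": "failed to parse", "cause": "bad input"}
    assert parse_body(body) == {"bad": {"status": 400, "error": expected}}


def test_error_without_caused_by():
    body = {"items": [{"index": {"_id": "bad", "error": {"reason": "boom"}, "status": 400}}]}
    assert parse_body(body)["bad"]["error"] == {"reason": "boom", "cause": None}


def test_missing_items_raises():
    try:
        parse_body({})
    except Exception as exc:
        assert "No items" in str(exc)
    else:
        raise AssertionError("expected an exception for a body without items")
```

That is four tests for one small function, and each new key we pull out of the error adds another lookup to defend.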
Conclusion
Use glom.
Thank you Mahmoud Hashemi for creating this wonderful piece of software, and thanks as well to anyone else who’s contributed to it.
Check out the source at https://github.com/mahmoud/glom — it’s a very well done project.