Use glom

Elasticsearch Bulk API Responses

The Elasticsearch Bulk API is an efficient way to perform actions on multiple documents instead of using multiple calls.

Because it’s doing more than one action, the response from this call is a bit complex. If the request body was something that could be processed at all, the HTTP response code will be 200. Beyond that, you need to look at the JSON response for an errors key at the top level. This lets you know if any of the individual items were unsuccessful.
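As a minimal sketch of that first check, assuming the response has already been decoded into a Python dict: the helper below (failed_items is a name made up for this illustration, and the document IDs are invented) looks at the top-level errors flag and only scans the items list when the flag is set. Each entry in items is a one-key dict keyed by its action name, such as "index".

```python
def failed_items(body):
    """Return (action, _id, status) for every item that reported an error."""
    # The top-level flag tells us whether it is worth scanning the items at all.
    if not body.get("errors"):
        return []

    failures = []
    for item in body.get("items", []):
        # Each item is a one-key dict keyed by the action name, e.g. "index".
        for action, details in item.items():
            if "error" in details:
                failures.append((action, details["_id"], details["status"]))
    return failures


# Trimmed-down stand-in for a decoded Bulk API response (made-up IDs).
body = {
    "errors": True,
    "items": [
        {"index": {"_id": "doc-1", "status": 400, "error": {"reason": "failed to parse"}}},
        {"index": {"_id": "doc-2", "status": 201, "result": "created"}},
    ],
}

print(failed_items(body))  # [('index', 'doc-1', 400)]
```

This only answers "did anything fail?". It does not yet give every item a consistent shape, which is what the rest of this post is after.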

That items list is where things can get tricky. Here’s a stripped-down example of a 200 response that includes one error and one success for two documents we sent with the index action:

{"errors": true,
 "items": [
   {"index": {"_id": "F6bbqHIBgo1082mzZuO3",
              "error": {
                "caused_by": {
                  "reason": "For input string: '2020-06-12T14:07:30.452649+00:00'",
                  "type": "illegal_argument_exception"},
                "reason": "failed to parse [usage.range.gte]",
                "type": "mapper_parsing_exception"},
              "status": 400}},
   {"index": {"_id": "GKbbqHIBgo1082mzZuO3",
              "result": "created",
              "status": 201}}]}

What I want to do with this response is turn it into a dict where the _ids are keys and their values are a dictionary of consistent keys and values regardless of the success of the action. This way I can more easily do something with the failures and move on from the successes. That’s kind of tricky!

Each index includes the _id that we’ll need and a status, but beyond that it depends on how the action fared as to what else it contains. My plan is to make all _id keys in my dict include an error which will either be None if it worked, or the relevant error details if it didn’t. That gives me one place to look to know success or failure for each item.

Write it ourselves

This isn’t initially that hard to solve. If error exists we use it; if not, we use the default None from dict.get.

def parse_body(body):
    if (items := body.get("items")) is None:
        raise Exception("No items in this response")

    result = {}

    for item in items:
        index = item["index"]
        result[index["_id"]] = {
            "error": index.get("error"),  # None by default
            "status": index["status"],
        }

    return result

…which returns:

{'F6bbqHIBgo1082mzZuO3':
  {'error':
    {'caused_by':
      {'reason': 'For input string: "2020-06-12T14:07:30.452649+00:00"',
       'type': 'illegal_argument_exception'},
     'reason': 'failed to parse [usage.range.gte]',
     'type': 'mapper_parsing_exception'},
   'status': 400},
'GKbbqHIBgo1082mzZuO3': {'error': None, 'status': 201}}

That’s relatively straightforward, but it kicks the problem down the road. In the error case now we have another nested dictionary which I want to flatten out. What I really want are the two reason values so I can log them and more clearly point out what’s going wrong.

Write it ourselves, but better

To get a flatter error, something like this could do it:

def parse_body(body):
    if (items := body.get("items")) is None:
        raise Exception("No items in this response")

    result = {}

    for item in items:
        index = item["index"]
        details = {"status": index["status"]}

        if (error := index.get("error")) is None:
            details["error"] = None
        else:
            details["error"] = {
                "reason": error["reason"],
                "cause": error["caused_by"]["reason"]
            }

        result[index["_id"]] = details

    return result

…which returns:

{'F6bbqHIBgo1082mzZuO3':
  {'error': {'cause': 'For input string: "2020-06-12T14:07:30.452649+00:00"',
             'reason': 'failed to parse [usage.range.gte]'},
   'status': 400},
'GKbbqHIBgo1082mzZuO3': {'error': None, 'status': 201}}

That’s much better! However, this is quickly becoming more complex. We still need to test this, and between the first and second versions we added more branches in the code that we’ll need to cover.

The cyclomatic complexity of our new approach went from 3 to 4 as measured by the mccabe library, named for Thomas McCabe, who coined the metric. Metrics aside, we can see this code is growing more ifs and loops and indexing the more we add to it, and we’re making a few assumptions that we won’t end up with a KeyError on any of those lookups.
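To make those numbers a little more concrete, here is a rough sketch that counts branching nodes with the standard library’s ast module. This is not the mccabe library’s algorithm, only an approximation that adds one for each if, loop, and exception handler; approx_complexity is a name invented for this example, and the two source strings are the hand-written functions from above with the blank lines and comments removed. It needs Python 3.8 or newer to parse the := operator.

```python
import ast

# Node types treated as decision points in this rough count.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approx_complexity(source):
    """Return 1 plus the number of branching nodes found in the source."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))


FIRST_VERSION = '''
def parse_body(body):
    if (items := body.get("items")) is None:
        raise Exception("No items in this response")
    result = {}
    for item in items:
        index = item["index"]
        result[index["_id"]] = {
            "error": index.get("error"),
            "status": index["status"],
        }
    return result
'''

SECOND_VERSION = '''
def parse_body(body):
    if (items := body.get("items")) is None:
        raise Exception("No items in this response")
    result = {}
    for item in items:
        index = item["index"]
        details = {"status": index["status"]}
        if (error := index.get("error")) is None:
            details["error"] = None
        else:
            details["error"] = {
                "reason": error["reason"],
                "cause": error["caused_by"]["reason"],
            }
        result[index["_id"]] = details
    return result
'''

print(approx_complexity(FIRST_VERSION))   # 3
print(approx_complexity(SECOND_VERSION))  # 4
```

For these two functions the simple count agrees with the figures quoted above: one if and one loop in the first version, plus a second if in the other.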

[Image: Charlie Kelly designing our third attempt at writing this.]

Use glom

glom is a library for “Restructuring data, the Python way.” It was made to solve our problem.

Here’s what a solution that meets our needs looks like using glom, assuming the module starts with import glom. It returns the exact same dict as the second parse_body function.

def parse_body(body):
    return glom.glom(
        body,
        (
            "items",
            glom.Iter(glom.T["index"]),
            glom.Iter(
                {
                    glom.T["_id"]: {
                        "status": glom.T["status"],
                        "error": glom.Coalesce(
                            {
                                "reason": glom.T["error"]["reason"],
                                "cause": glom.T["error"]["caused_by"]["reason"]
                            },
                            default=None
                        ),
                    }
                }
            ),
            glom.Merge(),
        ),
    )

There’s a lot to unpack here in the glom “spec”, and I’ll walk through it below. glom has an excellent tutorial that can explain it all better than me—and it has a browser-based REPL!—and it’s how I figured a lot of this out. The rest of their docs are well written and comprehensive, so check them out.

glom.glom takes a target nested object and a spec to transform it. Everything we want is under the "items" key, so that’s the first path in our spec.
Nested under body["items"] is a list of dicts, all with an "index" key. Line 6 is a sub-path that tells glom to produce an iterable of the contents of each "index" within body["items"].
Lines 7–20 are a sub-path that tells glom to produce an iterable of single-key dictionaries, much like a dictionary comprehension, where the key is the "_id" of each "index" target dictionary (glom.T refers to the current target) and the value is a dictionary with "status" and "error" keys.
  1. The "status" comes directly from the "status" in the "index" target.
  2. "error" is more involved and where we start to restructure things.
    1. We decided earlier that we want "error" in any case, using None as a signal that there’s not actually an error. glom.Coalesce to the rescue on Lines 11–17. If it can’t create something out of the sub-spec we passed in, the default=None will become the value.
    2. For our "reason" we want to take the first-level ["error"]["reason"] from the target "index" dictionary.
    3. For our "cause" we want to take the ["caused_by"]["reason"] that is nested within the ["error"] in the target.
On Line 21 we use glom.Merge(), which merges the single-key dictionaries produced by the previous Iter step into one resulting dictionary.

While it might look intimidating at first, it’s wildly powerful and this example barely scratches the surface of its capabilities. On top of that, consider how much our two hand-made implementations differ from each other: the equivalent change between two glom specs is smaller and adds no complexity.

To come back to cyclomatic complexity, the glom implementation of parse_body checks in at 1. To a caller it has no branches, no loops, none of that. It’s a function and it returns a dictionary. That’s not to say it’s not a complex piece of software, but that it takes care of the complexity for you.

Testing our manual versions of this might require a bunch of test cases to ensure we’re covering all of those branches, and will probably require we do something different about those dictionary lookups. Testing our glom implementation requires passing in a body that includes both cases we’re looking at—error and success—and seeing that we get a good result. I’m very picky about dependencies, but as of this writing glom has 97% test coverage and great documentation, so I’m comfortable letting it do the work for us.
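A test along those lines might look like the sketch below. check_parse_body is a hypothetical helper name, and a real suite would more likely be a pytest or unittest test. To keep the example free of dependencies it runs against the second hand-written parse_body; since the glom version is meant to return the same dictionary, the same check should apply to it.

```python
def parse_body(body):
    # The second hand-written version from this post.
    if (items := body.get("items")) is None:
        raise Exception("No items in this response")

    result = {}
    for item in items:
        index = item["index"]
        details = {"status": index["status"]}
        if (error := index.get("error")) is None:
            details["error"] = None
        else:
            details["error"] = {
                "reason": error["reason"],
                "cause": error["caused_by"]["reason"],
            }
        result[index["_id"]] = details
    return result


def check_parse_body(parse):
    """Feed one failed and one successful item through any parse_body."""
    body = {
        "errors": True,
        "items": [
            {"index": {"_id": "F6bbqHIBgo1082mzZuO3",
                       "error": {
                           "caused_by": {
                               "reason": 'For input string: "2020-06-12T14:07:30.452649+00:00"',
                               "type": "illegal_argument_exception"},
                           "reason": "failed to parse [usage.range.gte]",
                           "type": "mapper_parsing_exception"},
                       "status": 400}},
            {"index": {"_id": "GKbbqHIBgo1082mzZuO3",
                       "result": "created",
                       "status": 201}},
        ],
    }
    assert parse(body) == {
        "F6bbqHIBgo1082mzZuO3": {
            "status": 400,
            "error": {
                "reason": "failed to parse [usage.range.gte]",
                "cause": 'For input string: "2020-06-12T14:07:30.452649+00:00"',
            },
        },
        "GKbbqHIBgo1082mzZuO3": {"status": 201, "error": None},
    }


check_parse_body(parse_body)
```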

Conclusion

Use glom.
Thank you Mahmoud Hashemi for creating this wonderful piece of software, and thanks as well to anyone else who’s contributed to it.
Check out the source at https://github.com/mahmoud/glom — it’s a very well done project.