
Use glom

Dictionaries are really cool. They’re very powerful and a large part of Python: so much stuff is built on the dict type. They’re generally quite easy to work with, but what about when you have nested dictionaries that contain lists that contain dictionaries that contain certain keys only in some cases and sometimes there’s another list involved?

Use glom.

Elasticsearch Bulk API Responses

The Elasticsearch Bulk API is an efficient way to perform actions on multiple documents in a single request instead of using multiple calls.

Because it’s doing more than one action, the response from this call is a bit complex. If the request body was something that could be processed at all, the HTTP response code will be 200. Beyond that, you need to look at the JSON response for an errors key at the top level. This lets you know if any of the individual items were unsuccessful.
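As a minimal sketch of that check, assuming the response has already been decoded into a dict: the helper name failed_items and the trimmed sample body are made up for illustration.

```python
def failed_items(body):
    """Return the per-action dicts that carry an "error" key.

    The top-level "errors" flag is checked first, so a fully successful
    response is answered without walking the items list.
    """
    if not body.get("errors"):
        return []

    failures = []
    for item in body.get("items", []):
        # Each item has a single key naming the action, e.g. "index".
        for action in item.values():
            if "error" in action:
                failures.append(action)
    return failures


# Made-up, trimmed response body: one failure and one success.
body = {
    "errors": True,
    "items": [
        {"index": {"_id": "a", "status": 400, "error": {"reason": "failed to parse"}}},
        {"index": {"_id": "b", "status": 201, "result": "created"}},
    ],
}

print([action["_id"] for action in failed_items(body)])
```

This only tells us which items failed. It doesn't yet give every item the same shape, which is the part that gets tricky.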

That items list is where things can get tricky. Here’s a stripped down example of a 200 response that includes one error and one success for two documents we sent with the index action:

{"errors": true,
 "items": [
   {"index": {"_id": "F6bbqHIBgo1082mzZuO3",
              "error": {
                "caused_by": {
                  "reason": "For input string: \"2020-06-12T14:07:30.452649+00:00\"",
                  "type": "illegal_argument_exception"},
                "reason": "failed to parse [usage.range.gte]",
                "type": "mapper_parsing_exception"},
              "status": 400}},
   {"index": {"_id": "GKbbqHIBgo1082mzZuO3",
              "result": "created",
              "status": 201}}]}

What I want to do with this response is turn it into a dict where the _ids are keys and their values are a dictionary of consistent keys and values regardless of the success of the action. This way I can more easily do something with the failures and move on from the successes. That’s kind of tricky!

Each index includes the _id that we’ll need and a status, but beyond that it depends on how the action fared as to what else it contains. My plan is to make all _id keys in my dict include an error which will either be None if it worked, or the relevant error details if it didn’t. That gives me one place to look to know success or failure for each item.

Write it ourselves

This isn’t initially that hard to solve. If error exists we use it, if not we use the default None from dict.get.

def parse_body(body):
    if (items := body.get("items")) is None:
        raise Exception("No items in this response")

    result = {}

    for item in items:
        index = item["index"]
        result[index["_id"]] = {
            "error": index.get("error"),  # None by default
            "status": index["status"],
        }

    return result

…which returns:

{'F6bbqHIBgo1082mzZuO3':
  {'error':
    {'caused_by':
      {'reason': 'For input string: "2020-06-12T14:07:30.452649+00:00"',
       'type': 'illegal_argument_exception'},
     'reason': 'failed to parse [usage.range.gte]',
     'type': 'mapper_parsing_exception'},
   'status': 400},
'GKbbqHIBgo1082mzZuO3': {'error': None, 'status': 201}}

That’s relatively straightforward, but it kicks the problem down the road. In the error case now we have another nested dictionary which I want to flatten out. What I really want are the two reason values so I can log them and more clearly point out what’s going wrong.

Write it ourselves, but better

To get a flatter error, something like this could do it:

def parse_body(body):
    if (items := body.get("items")) is None:
        raise Exception("No items in this response")

    result = {}

    for item in items:
        index = item["index"]
        details = {"status": index["status"]}

        if (error := index.get("error")) is None:
            details["error"] = None
        else:
            details["error"] = {
                "reason": error["reason"],
                "cause": error["caused_by"]["reason"]
            }

        result[index["_id"]] = details

    return result

…which returns:

{'F6bbqHIBgo1082mzZuO3':
  {'error': {'cause': 'For input string: "2020-06-12T14:07:30.452649+00:00"',
             'reason': 'failed to parse [usage.range.gte]'},
   'status': 400},
'GKbbqHIBgo1082mzZuO3': {'error': None, 'status': 201}}

That’s much better! However, this is quickly becoming more complex. We still need to test this, and between the first and second versions we added more branches in the code that we’ll need to cover.

The cyclomatic complexity of our new approach went from 3 to 4 as measured by the mccabe library, named for Thomas McCabe, who coined the metric. Metrics aside, we can see this code is growing more ifs and loops and indexing the more we add to it, and we’re making a few assumptions that we won’t end up with a KeyError on any of those lookups.
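To make that KeyError assumption concrete, here’s a small sketch. The item below is hypothetical: it assumes an error that arrives without a "caused_by" block, which the lookups in the second parse_body don’t account for. The helper name flatten_error is also made up; it’s only the error branch pulled out on its own.

```python
def flatten_error(index):
    """The error-flattening branch of the second parse_body, on its own."""
    if (error := index.get("error")) is None:
        return None
    return {
        "reason": error["reason"],
        "cause": error["caused_by"]["reason"],  # assumes "caused_by" exists
    }


# Hypothetical item: an error with no "caused_by" block.
index = {
    "_id": "hypothetical-id",
    "status": 409,
    "error": {"type": "some_exception", "reason": "something went wrong"},
}

try:
    flatten_error(index)
except KeyError as exc:
    missing_key = exc.args[0]
    print(f"lookup failed, missing key: {missing_key!r}")
```

Guarding against that means yet another branch or a chain of dict.get calls, which pushes the complexity up again.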

[Image: Charlie Kelly designing our third attempt at writing this.]

Use glom

glom is a library for “Restructuring data, the Python way.” It was made to solve our problem.

Here’s what a solution that meets our needs looks like using glom (with import glom at the top of the module). It returns exactly the same dict as the second parse_body function.

def parse_body(body):
    return glom.glom(
        body,
        (
            "items",
            glom.Iter(glom.T["index"]),
            glom.Iter(
                {
                    glom.T["_id"]: {
                        "status": glom.T["status"],
                        "error": glom.Coalesce(
                            {
                                "reason": glom.T["error"]["reason"],
                                "cause": glom.T["error"]["caused_by"]["reason"]
                            },
                            default=None
                        ),
                    }
                }
            ),
            glom.Merge(),
        ),
    )

There’s a lot to unpack here in the glom “spec”, and I’ll walk through it below. glom has an excellent tutorial that can explain it all better than I can (and it has a browser-based REPL!), and it’s how I figured a lot of this out. The rest of their docs are well written and comprehensive, so check them out.

  1. glom.glom takes a target nested object and a spec to transform it. Everything we want is under the "items" key, so that’s the path part of our spec.

  2. Nested under body["items"] is a list of dicts, all with an "index" key. Line 6 is a sub-path that tells glom to produce an iterable of the contents of each "index" within body["items"].

  3. Lines 7–20 are a sub-path that tells glom to produce an iterable of dictionaries, built much like a dictionary comprehension: the key is the "_id" of each "index" target dictionary (glom.T accesses the target path) and the value is a dictionary with "status" and "error" keys.

    1. The "status" comes directly from the "status" in the "index" target.

    2. "error" is more involved, and it’s where we start to restructure things.

      1. We decided earlier that we want "error" in any case, using None as a signal that there’s not actually an error. glom.Coalesce to the rescue on Lines 11–17. If it can’t create something out of the sub-spec we passed in, the default=None will become the value.

      2. For our "reason" we want to take the first-level ["error"]["reason"] from the target "index" dictionary.

      3. For our "cause" we want to take the ["caused_by"]["reason"] that is nested within the ["error"] in the target.

  4. On Line 21 we use glom.Merge(), which combines the dictionaries produced by the prior Iter spec into one resulting object.

While it might look intimidating at first, it’s wildly powerful and this example barely scratches the surface of its capabilities. On top of that, consider the functional difference between our two hand-made implementations: the difference between two glom implementations that produce the same results is smaller and no more complex.

To come back to cyclomatic complexity, the glom implementation of parse_body checks in at 1. To a caller it has no branches, no loops, none of that. It’s a function and it returns a dictionary. That’s not to say it’s not a complex piece of software, but that it takes care of the complexity for you.
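The numbers in this post came from the mccabe library. As a rough stand-in that needs nothing outside the standard library, the sketch below counts branching statements with ast. It is not mccabe’s algorithm, only an approximation that lands on the same numbers for functions as simple as these, and the helper name rough_complexity is made up. The sources are only parsed, never run, so glom doesn’t need to be installed.

```python
import ast

# Statement types that each add one more path through a function.
BRANCHING_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler)


def rough_complexity(source):
    """Return 1 plus the number of branching statements in the source."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCHING_NODES) for node in ast.walk(tree))


HAND_WRITTEN = '''
def parse_body(body):
    if (items := body.get("items")) is None:
        raise Exception("No items in this response")

    result = {}

    for item in items:
        index = item["index"]
        details = {"status": index["status"]}

        if (error := index.get("error")) is None:
            details["error"] = None
        else:
            details["error"] = {
                "reason": error["reason"],
                "cause": error["caused_by"]["reason"]
            }

        result[index["_id"]] = details

    return result
'''

WITH_GLOM = '''
def parse_body(body):
    return glom.glom(
        body,
        (
            "items",
            glom.Iter(glom.T["index"]),
            glom.Iter(
                {
                    glom.T["_id"]: {
                        "status": glom.T["status"],
                        "error": glom.Coalesce(
                            {
                                "reason": glom.T["error"]["reason"],
                                "cause": glom.T["error"]["caused_by"]["reason"]
                            },
                            default=None
                        ),
                    }
                }
            ),
            glom.Merge(),
        ),
    )
'''

print(rough_complexity(HAND_WRITTEN), rough_complexity(WITH_GLOM))
```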

Testing our manual versions of this might require a bunch of test cases to ensure we’re covering all of those branches, and will probably require that we do something different about those dictionary lookups. Testing our glom implementation requires passing in a body that includes both cases we’re looking at (error and success) and seeing that we get a good result. I’m very picky about dependencies, but as of this writing glom has 97% test coverage and great documentation, so I’m comfortable letting it do the work for us.

Conclusion

  1. Use glom.

  2. Thank you Mahmoud Hashemi for creating this wonderful piece of software, and thanks as well to anyone else who’s contributed to it.

  3. Check out the source at https://github.com/mahmoud/glom. It’s a very well done project.