- How do you get the data?
- How do you count the frequency of words in titles and blurbs?
- How are topic tags assigned to books?
- How are the covers arranged in the cover montage?
- What genres in the Kindle Store are available?
- Can you write newsletters for other categories?
- How do you estimate whether a blurb is in first- or third-person POV?
- How do you define whether a blurb is mostly about ‘setting and events’ or ‘characters’?
- How do you define ‘trad’ and ‘non-trad’ in the newsletter?
- What do you call a ‘genre’, and what do you call a ‘category’?
How do you get the data?
Scraping the Kindle Store is done in Python using the scrapy library, and the data is extracted to JSON format and aggregated. Text analysis is done using spacy, and image analysis using imagehash. Plots are done using seaborn. If you’d like to know more about the technical detail, or about the orchestration and infrastructure, just send me an email and I’ll be happy to talk about it.
How do you count the frequency of words in titles and blurbs?
The ‘frequent words in titles and subtitles’ section and the ‘frequent words in blurbs’ wordcloud and table only count each word once per author, and once per series. What this means is that for any single author, a common word will only be counted once for that author, and for a series with multiple authors, the series words will only be counted once for that series. I’m doing this to correct for the ‘Harry Potter effect’, where ‘harry’ and ‘potter’ are always going to be frequent words in the SF/F Top 100. This is technically correct, but in my view it’s not very useful information.
My argument for this is that I think six authors using the same word on one book each is more important than one author using the same word on six books. The latter is maybe just branding, whereas the former might indicate a bunch of people (semi-) independently coming to the same conclusions.
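As a rough illustration of the counting rule above (not the actual newsletter code), here is a minimal Python sketch. The book records, the field names (`title`, `author`, `series`) and the simple tokenizer are assumptions made for the example, and the single greedy pass is just one way to apply the ‘once per author, once per series’ rule.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def count_words_once(books):
    """Count each word at most once per author and once per series."""
    counts = Counter()
    seen = set()  # (kind, name, word) combinations that already got credit
    for book in books:
        for word in set(tokenize(book["title"])):
            keys = [("author", book["author"], word)]
            if book.get("series"):
                keys.append(("series", book["series"], word))
            already_counted = any(key in seen for key in keys)
            seen.update(keys)
            if not already_counted:
                counts[word] += 1
    return counts


# Made-up sample data: one author with two books in a series, plus one other author.
sample_books = [
    {"title": "Harry and the Stone", "author": "Author A", "series": "Harry"},
    {"title": "Harry and the Chamber", "author": "Author A", "series": "Harry"},
    {"title": "The Dragon Stone", "author": "Author B", "series": None},
]
sample_counts = count_words_once(sample_books)
print(sample_counts["harry"], sample_counts["stone"])  # 1 2
```

Here ‘harry’ appears on two books but is only counted once, because both books share an author and a series, while ‘stone’ is counted twice because two different authors used it.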
How are topic tags assigned to books?
For each book, Topic Tags are a list of the topics mentioned in the blurb, title, subtitle and series name. These might be tropes, or themes, or niches, or subgenres: anything that might give us clues to the content of the book. If you’re familiar with Netflix, it’s basically like the tags Netflix attaches to their titles. For each topic tag, I have a list of matching words or phrases. For instance, ‘Crime’ matches ‘rob*’, ‘heist’, ‘thief*’, and so on. Using this list, I scan the blurb, title, subtitle and series name for each book. If a book has any of a tag’s matching words, it gets that tag.
Of course, this kind of approach is only as accurate as the list, and it’s always going to be a bit of an average – so I put it together by talking to a bunch of authors in each genre, and trying to sort of find a consensus. If you’d like to see the list, you can download it here. (technical note: it uses regexes to match multiple terms). If you see any that you think aren’t right, or find a book that should have a particular tag but doesn’t, please let me know. Ultimately this is a matter of general agreement that specific tags tell us something meaningful about the content of specific titles – it’s not up to me or anyone else alone. I tweak it a little bit every month or so, to make sure it stays current. If there are specific things you think should be on there based on your genre knowledge, I’d love to hear about them.
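To make the matching step concrete, here is a small Python sketch of regex-based tagging. It is only an illustration under assumptions: the ‘Crime’ terms come from the example above, with the trailing ‘*’ read as ‘any word ending’ (`\w*`), the ‘Dragons’ tag is hypothetical, and the field names are made up for the example.

```python
import re

# 'Crime' follows the example in the text; 'Dragons' is a hypothetical tag.
TAG_PATTERNS = {
    "Crime": [r"rob\w*", r"heist", r"thief\w*"],
    "Dragons": [r"dragons?"],
}


def compile_tags(tag_patterns):
    """Build one case-insensitive, whole-word regex per tag."""
    return {
        tag: re.compile(r"\b(?:" + "|".join(patterns) + r")\b", re.IGNORECASE)
        for tag, patterns in tag_patterns.items()
    }


def assign_tags(book, compiled):
    """Return every tag whose pattern appears in the book's text fields."""
    fields = ("blurb", "title", "subtitle", "series")
    text = " ".join(book.get(field) or "" for field in fields)
    return [tag for tag, pattern in compiled.items() if pattern.search(text)]


COMPILED_TAGS = compile_tags(TAG_PATTERNS)
sample_book = {
    "title": "The Dragon's Vault",
    "blurb": "A master thief plans one last heist.",
}
print(assign_tags(sample_book, COMPILED_TAGS))  # ['Crime', 'Dragons']
```

Because each tag is just a list of patterns, it is always possible to see exactly which word caused a tag to be assigned.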
At the moment I think this simple approach is (conservatively) about 80% accurate – that is, it’s about 80% the same as what a human would do if you sat them down with this list of tags and a blurb, and said ‘choose the tags which represent this blurb’. We could do something way more complicated using machine learning here, and if I were gunning for venture capital money that is exactly what I’d do. But in reality I don’t think it’s worth it, because we’d get ~5-10% accuracy improvement for a big, big tradeoff – we wouldn’t know why a particular tag was being assigned to a particular book. Not being able to know the ‘why’ is a big downside for machine-learning approaches in text analysis.
Sometimes it gets stuff a bit wrong, and sometimes it doesn’t add any tags at all; this is usually when the blurb is quite abstract or does something unexpected. Again, if you see some examples where you think a book should be tagged something, please send them my way and I’ll see what I can do to make changes. Sometimes, though, I read a blurb that has no tags, and I think as a human ‘I have no idea what’s in this book!’ In that situation, I think no tags is the right behaviour.
How are the covers arranged in the cover montage?
Covers are ordered using an image-processing and clustering algorithm to put the most ‘similar’ covers together. This is always a bit of a judgement call, but it seems to work pretty well, inasmuch as styles of covers tend to go together, and you’ll see a number of situations where books in a series with very obviously the same style of cover, but different tones, are put next to each other. (technical note: it’s the ‘wavelet hash’ method in the imagehash library performed on images with zero saturation, but there are plenty of other approaches; it turns out this is actually quite a complex problem).
There’s no one perfect way of ordering covers; it used to be just 1-100, but after discussions it became clear that we had an opportunity to add some more information – we have the Best Sellers Rank in the spreadsheet already, and if you want to see the covers in order, you can just go to the Cover Gallery in the Dashboard and sort them by Best Sellers Rank.
I tried sorting by ‘major color’ as well, which in some cases was useful, but talking to people suggested pretty strongly that they grouped covers by ‘style’ when they were doing their own research; e.g. they had ‘single man, facing front’, and ‘man and woman, facing each other’, ‘floating faces with text in middle’, and so on. So that’s what I’ve tried to recreate programmatically.
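For the curious, here is a simplified Python sketch of the ‘put similar covers next to each other’ idea. It assumes each cover already has a 64-bit perceptual hash (for example from the wavelet hash in imagehash), represented here as a plain integer. The ordering step is a greedy nearest-neighbour chain on Hamming distance, which is a stand-in for illustration and not necessarily the clustering used in the newsletter. The cover names and hash values are made up.

```python
def hamming(a, b):
    """Number of bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def order_by_similarity(hashes):
    """Chain covers so that each one is followed by its nearest unused neighbour.

    hashes maps a cover name to an integer perceptual hash.
    """
    remaining = dict(hashes)
    if not remaining:
        return []
    current = next(iter(remaining))  # start from the first cover given
    current_hash = remaining.pop(current)
    order = [current]
    while remaining:
        nearest = min(remaining, key=lambda name: hamming(current_hash, remaining[name]))
        current_hash = remaining.pop(nearest)
        order.append(nearest)
    return order


# Made-up 4-bit 'hashes' to keep the example readable.
sample_hashes = {"cover_a": 0b0000, "cover_b": 0b1111, "cover_c": 0b0001, "cover_d": 0b1110}
print(order_by_similarity(sample_hashes))
# ['cover_a', 'cover_c', 'cover_b', 'cover_d']
```

A greedy chain like this is easy to reason about, but the result depends on which cover you start from.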
What genres in the Kindle Store are available?
Currently we’ve got the following:
One-Hour Romance (KSR)
Science Fiction Romance
Teen & Young Adult
Paranormal and Urban Fantasy
Mystery, Thriller & Suspense
Science Fiction & Fantasy
Can you write newsletters for other categories?
Absolutely, depending on demand. Drop me an email, and let’s talk about what would be most useful for you.
How do you estimate whether a blurb is in first- or third-person POV?
For each blurb, we get all the personal pronouns in the blurb which aren’t within quotation marks*, and sort them. If there are more first-person personal pronouns (‘I’, ‘me’, ‘my’, ‘mine’, ‘myself’) than third-person pronouns or proper names, then we estimate the blurb is written in first-person POV. If it’s the other way around, we estimate the blurb is third-person. If there aren’t any, we mark it ‘unknown’. As before, this isn’t perfect and can go wrong on blurbs which are written in an unusual way, but it’s about 80% accurate from spot-checking.
*We exclude text within quotation marks because it’s typically testimonials from readers, which tend to be in first-person, as in “I loved this great book! – A Reviewer”. Although this is part of the blurb, it isn’t necessarily consistent with the rest of it, and you do see a lot of third-person blurbs with this kind of snippet at the end.
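Here is a minimal Python sketch of that pronoun count, to show the shape of the idea. It is a simplification under stated assumptions: it uses plain regular expressions, it ignores proper names entirely (the description above also counts those on the third-person side), it only treats double quotation marks as quotes, the third-person pronoun list is my own choice for the example, and returning ‘unknown’ on a tie is an assumption.

```python
import re

FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
# Assumed list of third-person pronouns for this sketch.
THIRD_PERSON = {
    "he", "him", "his", "himself",
    "she", "her", "hers", "herself",
    "they", "them", "their", "theirs", "themselves",
}

# Straight or curly double-quoted passages, e.g. reader testimonials.
QUOTED = re.compile(r'"[^"]*"|“[^”]*”')


def estimate_pov(blurb):
    """Return 'first', 'third' or 'unknown' based on pronoun counts."""
    unquoted = QUOTED.sub(" ", blurb)
    words = re.findall(r"[a-z]+", unquoted.lower())
    first = sum(word in FIRST_PERSON for word in words)
    third = sum(word in THIRD_PERSON for word in words)
    if first > third:
        return "first"
    if third > first:
        return "third"
    return "unknown"  # no pronouns at all, or a tie (assumption)


print(estimate_pov("I never wanted the crown, but it found me anyway."))  # first
print(estimate_pov('She hunts monsters. He hides one. "I loved this book!"'))  # third
```

In the second example the testimonial in quotation marks is dropped before counting, so its ‘I’ doesn’t pull the estimate towards first person.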
How do you define whether a blurb is mostly about ‘setting and events’ or ‘characters’?
For each blurb, if a sentence has a personal pronoun in it (I, me, my, she, he, hers, his, they, etc.) then I say that sentence is ‘about a character’. If it doesn’t, then it isn’t about a character.
Then we can find the fraction of sentences in the blurb that are ‘about’ characters, out of the total number of sentences. Blurbs which are very strongly ‘I did this’, ‘will he be able to do that?’ will have a lot of sentences about characters, and we judge them to be character-centric. On the other hand, blurbs which don’t have a lot of sentences about characters and are strongly ‘In a world riven by turmoil and discord…’ etc. we judge to be setting- and event-centric.
This will clearly vary by genre, but it’s a way to try and break down ‘styles’ in different blurbs a bit. As always, neither is better or worse, but the more we understand about blurb styles, the better we can craft our own blurbs.
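The fraction described above can be sketched in a few lines of Python. Treat this as an illustration only: the sentence splitter is deliberately naive, the pronoun list is an assumed expansion of the ‘etc.’ in the description, and the function returns the raw fraction without applying any cut-off, since no specific threshold is given here.

```python
import re

# Assumed pronoun list for this sketch; the description only gives examples.
PERSONAL_PRONOUNS = {
    "i", "me", "my", "mine", "myself",
    "he", "him", "his", "himself",
    "she", "her", "hers", "herself",
    "they", "them", "their", "theirs", "themselves",
}


def split_sentences(text):
    """Naive split after '.', '!', '?' or an ellipsis character."""
    return [s for s in re.split(r"(?<=[.!?…])\s+", text.strip()) if s]


def character_fraction(blurb):
    """Fraction of sentences that contain at least one personal pronoun."""
    sentences = split_sentences(blurb)
    if not sentences:
        return 0.0
    about_characters = sum(
        1
        for sentence in sentences
        if PERSONAL_PRONOUNS & set(re.findall(r"[a-z]+", sentence.lower()))
    )
    return about_characters / len(sentences)


sample_blurb = (
    "In a world riven by turmoil, war is coming. "
    "Will he be able to stop it? She has other plans."
)
print(character_fraction(sample_blurb))  # 2 of the 3 sentences mention a character
```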
How do you define ‘trad’ and ‘non-trad’ in the newsletter?
At the moment, I mark a title as ‘trad’ if the Publisher field is an imprint of one of the Big Five (now Big Four) trade publishers. Otherwise, I call it ‘non-trad’, meaning ‘independent- or self-published’. If you want to see the current list of ~290 imprint names, you can download it here. I revise this regularly; if there are any you think should or shouldn’t be on there, please drop me a line.
This is a bit subjective, and will always be open to interpretation; the reason we’re trying to make this general distinction, though, is to look for differences between self-published and non-self-published books in a particular genre. Some genres have striking differences between the two, but in others you can’t really tell. Neither is better or worse than the other, but it’s good to be able to perceive differences if they do exist.
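In code terms, the trad/non-trad label is just a lookup against the imprint list. A tiny Python sketch, where the imprint names are placeholders (not entries from the real list) and the `publisher` argument stands for whatever the Publisher field contains:

```python
# Placeholder names for illustration only; not the real ~290-entry list.
TRAD_IMPRINTS = {"example imprint one", "example imprint two"}


def classify_publisher(publisher, imprints=TRAD_IMPRINTS):
    """Return 'trad' if the publisher is a listed imprint, else 'non-trad'."""
    name = (publisher or "").strip().casefold()
    return "trad" if name in imprints else "non-trad"


print(classify_publisher("  Example Imprint One "))  # trad
print(classify_publisher("Independently published"))  # non-trad
```

Normalising case and surrounding whitespace before the lookup is an assumption on my part; it avoids missing an imprint because of small differences in how the field is written.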
What do you call a ‘genre’, and what do you call a ‘category’?
When I say category, I mean ‘the Amazon data structure which manifests itself on the Kindle Store as an entry in the list on the left-hand side of the website when you’re browsing books, has a Top 100 etc., and on the back end has a “category ID” (also called a “Browse Node ID” in the API) that uniquely identifies it’.
Categories have a hierarchical relationship, and in a few cases they converge – that is, a single category belongs to two different hierarchies (which is a pain, by the way, because there’s no list of those cases).
When I say genre, I mean ‘the subjective notion of genre as generally understood by authors and readers’. As authors we sometimes have headaches because some genres are clearly and comprehensively represented by categories (for instance, Western Romance), and others aren’t (for instance, small town romance). Furthermore, genres are subjective, non-exclusive, fuzzy-edged (e.g. women’s fiction vs. chick-lit) and rapidly changing (e.g. litRPG and reverse harem didn’t really exist 3-4 years ago, but now they’re important).
It’s often quite hard to define a genre by a string of words, though. Some genres like ‘litRPG’ or ‘mpreg’ are more straightforward, because they are pretty much guaranteed to have those words in them somewhere (‘gamelit’ notwithstanding).
But other genres (for instance, JAFF as I understand from my colleagues) don’t have a single word or set of words you can guarantee are in the title/blurb/keywords. The important thing here is that this isn’t computer knowledge; it’s market knowledge. We still need to read and research to understand the market; it’s just that with easy access to data we can do it much more efficiently, and be more confident that we haven’t missed anything.