- How do you get the data?
- How do you count the frequency of words in titles and blurbs?
- How are the covers arranged in the cover montage?
- What genres in the Kindle Store are available?
- Can you write newsletters for other categories?
- What do you call a ‘genre’, and what do you call a ‘category’?
How do you get the data?
The Kindle Store is scraped in Python using the Scrapy library, and the data is extracted to JSON and aggregated. Text analysis uses spaCy, image analysis uses ImageHash, and plots are made with seaborn. If you’d like to know more about the technical details, or about the orchestration and infrastructure, just send me an email and I’ll be happy to talk about it.
How do you count the frequency of words in titles and blurbs?
The ‘frequent words in titles and subtitles’ section and the ‘frequent words in blurbs’ wordcloud and table count each word only once per author and once per series. This means that a word a single author uses repeatedly is counted once for that author, and for a series with multiple authors, the series words are counted once for that series. I’m doing this to correct for the ‘Harry Potter effect’, where ‘harry’ and ‘potter’ are always going to be frequent words in the SF/F Top 100 – this is technically correct, but in my view it’s not very useful information.
My argument for this is that I think six authors using the same word on one book each is more important than one author using the same word on six books. The latter is maybe just branding, whereas the former might indicate a bunch of people (semi-) independently coming to the same conclusions.
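As a rough illustration of the once-per-author, once-per-series rule, here is a minimal sketch. It is not the code behind the newsletter: the function name, the dictionary keys (`title`, `author`, `series`) and the sample titles are all made up for the example, and it splits on whitespace instead of using spaCy.

```python
from collections import Counter

def count_words_once_per_author_and_series(books):
    """Count each word at most once per author and once per series.

    `books` is a list of dicts with hypothetical keys 'title', 'author'
    and an optional 'series'. Tokenisation is a plain lowercase split,
    which is a simplification of real text analysis.
    """
    seen = set()       # (kind, name, word) triples that have already contributed
    counts = Counter()
    for book in books:
        for word in set(book["title"].lower().split()):
            keys = {("author", book["author"], word)}
            if book.get("series"):
                keys.add(("series", book["series"], word))
            # Count the word only if neither this author nor this series
            # has already contributed it.
            if seen.isdisjoint(keys):
                counts[word] += 1
            seen |= keys
    return counts

# Made-up sample data, for illustration only.
sample_books = [
    {"title": "Harry Potter and the Stone", "author": "Author A", "series": "HP"},
    {"title": "Harry Potter and the Chamber", "author": "Author A", "series": "HP"},
    {"title": "The Harry Potter Companion", "author": "Author B", "series": "HP"},
    {"title": "Dragon Stone", "author": "Author C", "series": None},
    {"title": "Dragon Fire", "author": "Author D", "series": None},
]
word_counts = count_words_once_per_author_and_series(sample_books)
```

In this sample, ‘harry’ ends up with a count of 1 even though it appears on three books, while ‘dragon’ gets 2 because two unrelated authors each used it once.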
How are the covers arranged in the cover montage?
Covers are ordered using an image-processing and clustering algorithm to put the most ‘similar’ covers together. This is always a bit of a judgement call, but it seems to work pretty well, inasmuch as styles of covers tend to go together, and you’ll see a number of situations where books in a series with very obviously the same style of cover, but different tones, are put next to each other.
There’s no one perfect way of ordering covers. It used to be just 1-100, but after discussions it became clear that we had an opportunity to add some more information – we have the sales ranks in the spreadsheet already, and if you want to see the covers in rank order, you can just go to the actual Top 100 page.
I tried sorting by ‘major color’ as well, which in some cases was useful, but talking to people suggested pretty strongly that they grouped covers by ‘style’ when they were doing their own research; e.g. they had ‘single man, facing front’, and ‘man and woman, facing each other’, ‘floating faces with text in middle’, and so on. So that’s what I’ve tried to recreate programmatically.
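To give a feel for what ‘putting similar covers together’ can look like in code, here is a minimal sketch. It assumes each cover has already been reduced to a perceptual hash (the kind of fingerprint a library like ImageHash produces), represented here as a plain integer. The greedy nearest-neighbour ordering is a stand-in chosen for brevity, not necessarily the clustering algorithm the montage actually uses, and the cover names and hash values are invented.

```python
def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")

def order_by_similarity(hashes):
    """Greedy ordering: start from the first cover, then repeatedly
    append the unplaced cover whose hash is closest to the last one placed.

    `hashes` maps a cover name to an integer hash (hypothetical input format).
    """
    remaining = dict(hashes)
    if not remaining:
        return []
    first = next(iter(remaining))
    order = [first]
    current_hash = remaining.pop(first)
    while remaining:
        nearest = min(remaining, key=lambda name: hamming(current_hash, remaining[name]))
        current_hash = remaining.pop(nearest)
        order.append(nearest)
    return order

# Invented 8-bit hashes; real perceptual hashes are usually 64 bits or more.
cover_hashes = {
    "cover_a": 0b11110000,
    "cover_b": 0b00001111,
    "cover_c": 0b11110001,
    "cover_d": 0b00001110,
}
montage_order = order_by_similarity(cover_hashes)
# montage_order == ["cover_a", "cover_c", "cover_b", "cover_d"]
```

A greedy pass like this is simple but depends on the starting cover, which is one reason the result is always a bit of a judgement call.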
What genres in the Kindle Store are available?
Currently we’ve got the following:
- One-Hour Romance (KSR)
- Science Fiction Romance
- Mystery, Thriller & Suspense
- Science Fiction & Fantasy
Can you write newsletters for other categories?
Absolutely, depending on demand. Drop me an email, and let’s talk about what would be most useful for you.
What do you call a ‘genre’, and what do you call a ‘category’?
When I say category, I mean ‘the Amazon data structure which manifests itself on the Kindle Store as an entry in the list on the left-hand side of the website when you’re browsing books, has a Top 100 etc., and on the back end has a ‘category ID’ (also called a ‘Browse Node ID’ in the API) that uniquely identifies it’.
Categories have a hierarchical relationship, and in a few cases they converge – that is, a single category belongs to two different hierarchies (which is a pain, by the way, because there’s no list of those cases).
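Since there’s no list of the converging cases, one way to find them is to walk whatever hierarchy you’ve collected and look for categories with more than one parent. The sketch below assumes the hierarchy is held as a mapping from parent ID to child IDs; that structure, the function name and the IDs are placeholders for illustration, not real Browse Node IDs.

```python
from collections import defaultdict

def find_converging_categories(children_by_parent):
    """Return categories that appear under more than one parent.

    `children_by_parent` maps a parent category ID to a list of child
    category IDs (a hypothetical representation of the scraped hierarchy).
    The result maps each converging category to its sorted parents.
    """
    parents_by_child = defaultdict(set)
    for parent, children in children_by_parent.items():
        for child in children:
            parents_by_child[child].add(parent)
    return {
        child: sorted(parents)
        for child, parents in parents_by_child.items()
        if len(parents) > 1
    }

# Placeholder IDs, invented for the example.
hierarchy = {
    "node-1": ["node-10", "node-11"],
    "node-2": ["node-11", "node-20"],
}
converging = find_converging_categories(hierarchy)
# converging == {"node-11": ["node-1", "node-2"]}
```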
When I say genre, I mean ‘the subjective notion of genre as generally understood by authors and readers’. As authors we sometimes have headaches because some genres are clearly and comprehensively represented by categories (for instance, Western Romance), and others aren’t (for instance, small town romance). Furthermore, genres are subjective, non-exclusive, fuzzy-edged (e.g. women’s fiction vs. chick-lit) and rapidly changing (e.g. litRPG and reverse harem didn’t really exist 3-4 years ago, but now they’re important).
It’s often quite hard to define a genre by a string of words, though. Some genres like ‘litRPG’ or ‘mpreg’ are more straightforward, because they are pretty much guaranteed to have those words in them somewhere (‘gamelit’ notwithstanding).
But other genres (for instance, JAFF as I understand from my colleagues) don’t have a single word or set of words you can guarantee are in the title/blurb/keywords. The important thing here is that this isn’t computer knowledge; it’s market knowledge. We still need to read and research to understand the market; it’s just that with easy access to data we can do it much more efficiently, and be more confident that we haven’t missed anything.
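For the straightforward case, where a genre reliably announces itself with a particular word, a keyword check can be sketched as below. The function name and keyword list are assumptions for the example. As noted above, this approach simply doesn’t work for genres without a guaranteed word, and choosing the keywords is market knowledge that the code can’t supply.

```python
import re

def mentions_genre_keyword(text, keywords):
    """Return True if any keyword appears in `text` as a whole word,
    ignoring case. An empty keyword list never matches."""
    if not keywords:
        return False
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Hypothetical keyword list for one genre.
litrpg_keywords = ["litrpg", "gamelit"]
is_litrpg = mentions_genre_keyword("Level Up: A LitRPG Adventure", litrpg_keywords)
is_not_litrpg = mentions_genre_keyword("A sweeping space opera", litrpg_keywords)
```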