The Yellow Word: insanity and pleasure in Old Bailey Online

Text and data mining (TDM) technologies read large amounts of digital data and are used to explore, dissect and understand texts. As technology advances and data is made available through projects such as The Old Bailey Online, readers are given more opportunity to conduct efficient research on open sets of data (Michelle Brook, Peter Murray-Rust and Charles Oppenheim, 2014). Text analysis tools use quantitative data to conduct qualitative research, measuring the text to provide evidence for subjective observations.

The Old Bailey Online is a collection of all the court proceedings of the Old Bailey from 1674 to 1913, digitised to enable researchers to see the content of trials in depth. Users range from historians of punishment and the justice system to casual family historians looking for a particular name, all of whom are able to apply data mining to the digital texts.

Considering how valuable textual analysis might be when approaching this kind of traditionally unstructured data, I conducted some searches in the collection. The Old Bailey API (below right) is a more concise version of the general Old Bailey search tool (below left).



The original search tool allows more freedom for free text searches. I searched for keyword: ‘insanity’, in murder cases, over the entire span of cases from 1674 to 1913. The list of results came back with links to individual cases, encouraging close reading of all relevant trials. The API differs in that it’s structured for more specific searches, with only one keyword field and more concise drop-down menu options. The option to search by gender was useful; it enabled me to get more accurate results (searching for keyword: ‘woman’ in the general search wouldn’t have helped me much, I’m sure).*
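A search like this can also be scripted. The sketch below only assembles a query URL; the endpoint and parameter names are assumptions for illustration, not the documented Old Bailey API interface, so check the project’s own API documentation before relying on them.

```python
from urllib.parse import urlencode

# Hypothetical base URL and parameter names -- check the Old Bailey
# API documentation for the real interface before using this.
OBAPI_BASE = "http://www.oldbaileyonline.org/obapi/ob"

def build_query(keyword, offence=None, gender=None, count=10):
    """Assemble a search URL for the (assumed) Old Bailey API."""
    params = {"text": keyword, "count": count}
    if offence:
        params["offence"] = offence          # e.g. "murder"
    if gender:
        params["defendant_gender"] = gender  # e.g. "female"
    return OBAPI_BASE + "?" + urlencode(params)

url = build_query("insanity", offence="murder", gender="female")
```

The same URL-building step would sit in front of whatever library actually fetches and parses the results.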


The results from the API search were much easier to work with. I was able to break down my results by keywords, and interestingly saw that ‘pleasure’ was rated top alongside ‘murder’. I ‘drilled’ the word ‘pleasure’ and it refined the results instantly. The results came to me in a context which easily facilitated further textual analysis, and at this point I attempted to export the results to Voyant using the ‘send to Voyant’ option. This didn’t work, however, with either the 100 results I tried at first or the 10 I tried the second time. The site didn’t seem able to handle exporting this way, but after saving the zip file using the ‘Zip URL’ option I could then upload the file into Voyant separately.


I was surprised to find ‘pleasure’ wasn’t a prominent word in the word cloud, nor was ‘insanity’, despite them being listed as top keywords in my original results via the API. I went through the same process again with the next few sets of 10 trials from my Old Bailey results and found that the words ‘mother’ and ‘child’ featured heavily, which is telling. It seems as though the use of Voyant is still limited to smaller data sets, and therefore text analysis of part of the results from a collection like Old Bailey might not take into account enough data to create a fair, or useful, visualisation.

Qualitative research methods rely on interpretation, which is definitely needed when using tools like Voyant on unstructured data. Quantitative and qualitative data work together in these circumstances, with the former supporting the latter. Observations can be made over multiple sets of data, hunches followed, and the tones of individuals in the trials analysed, all supported and facilitated by TDM technologies which depend on this kind of digitisation.


*Old Bailey Online Tweeted information about the gender search function: it does exist, but in the ‘custom search’ tool.



Every word cloud needs a silver lining: a brief assessment of text analysis tools

Text analysis is a useful way to compare, explore and understand a piece of text beyond simply reading. Geoffrey Rockwell explains that text analysis systems can ‘search large texts quickly’, ‘conduct complex searches’ and ‘present the results in ways that suit the study of texts’. Pulling out key words in quantitative terms may seem at odds with assessing a text’s quality in the traditional view we have of analysing literature (at least it does to me, a Creative Writing graduate), but Rockwell insists ‘simple text analysis tools can help with this process of asking questions of a text and retrieving passages that help one think through the questions’. The key word here is ‘help’.

Initially it’s interesting to remember where you may have seen text analysis tools at work. The simplest and most striking example is possibly the word cloud, a visualisation of the most used words in a piece of text. For example, by entering David Cameron’s speech to the Conservative conference in October 2014 into Wordle, we can see what words, and, by extension, what subjects, he concentrated on most.


I edited the colour scheme to blues and kept the font simple to create a visual representation of a Conservative speech, with words, and by extension subjects, shown in terms of their presence in it. Comparing this to similar key speeches by other main party politicians could highlight priorities, possibly using party colours. However, New York Times senior software architect Jacob Harris urges analysts: ‘Don’t confuse signifiers with what they signify’ (Word clouds considered harmful, 2011). In his essay, essentially a critique of the trend for oversimplifying data, he insists ‘word clouds support only the crudest sorts of textual analysis’, and mentions Wordle specifically. I agree to an extent: although I would still use word clouds alongside further analysis, as they provide a fun, striking visualisation, Wordle is perhaps more suited to the casual data analyst.
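Under the hood, a word cloud is little more than a frequency table with font sizes attached. A minimal sketch of the counting step (the text and stopword list below are invented stand-ins, not Cameron’s actual speech):

```python
from collections import Counter
import re

# Invented stand-in text -- a real run would use the full speech transcript.
speech = ("A Britain that everyone is proud to call home. "
          "Britain, jobs, economy, jobs, home.")

# Tokenise, lowercase, and drop common words that would dominate the cloud.
words = re.findall(r"[a-z']+", speech.lower())
stopwords = {"a", "that", "is", "to", "the", "everyone", "call"}
freq = Counter(w for w in words if w not in stopwords)

# A word cloud then scales each word's display size by its count.
top_words = freq.most_common(3)
```

Tools like Wordle do exactly this counting, then add layout and styling on top.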

I found Voyant, with its various ways of presenting results and comparative analysis, slightly more useful. I used the ‘Intersectional Feminism’ data from a previous Altmetrics session (journal articles including the keywords, shared on Twitter in the last 6 months) to create this visualisation in Voyant. However, as the body of text imported was limited, due to the nature of Altmetric recording by journal title and not body of text, I wouldn’t necessarily use Altmetric data in text analysis systems in future. Although I do really enjoy this appropriately colourful image.




The TAGS archive I created for the #FocusE15 campaign worked better with Voyant, in that I was able to pre-emptively remove common words (‘and’, ‘the’, ‘RT’, etc.) using the stopwords tool, which has predetermined lists of words usually irrelevant to analysis, and also highlight words to view in alternative ways.
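The same kind of pre-processing can also be done before the text ever reaches Voyant. A rough sketch of stripping retweet markers, handles and common words from archived tweets (the sample tweets and stopword list here are invented examples):

```python
import re

# Invented sample tweets standing in for a TAGS archive export.
tweets = [
    "RT @FocusE15: Court hearing today, support the mothers! #FocusE15",
    "Social housing left empty in Newham #FocusE15 #NewhamHomes",
]

stopwords = {"rt", "the", "in", "today", "and"}

def clean(tweet):
    """Drop handles and URLs, lowercase, then filter stopwords."""
    tweet = re.sub(r"(@\w+|https?://\S+)", "", tweet.lower())
    words = re.findall(r"[#\w']+", tweet)
    return [w for w in words if w not in stopwords]

tokens = [w for t in tweets for w in clean(t)]
```

Feeding pre-cleaned tokens in means the visualisation isn’t dominated by ‘RT’ and user handles.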


This represents the relative frequency of Tweets mentioning Russell Brand, who got involved with the campaign early on and attracted a lot of attention to it with his YouTube news show, The Trews. The nature of Tweets, with their limited characters and heavy focus on keywords through hashtags and user handles, is much more suited to text analysis tools. The collaborative behind Voyant itself admits that some of the features aren’t yet up to scratch, and I didn’t find it very fluent when working with my data. I can see the wide benefits, however, especially when analysing Twitter, of creating visualisations of text to highlight trends, and I think the more casual, user-friendly tools like Wordle at least encourage a basic exploration into text analysis.



Altmetrics for Art History

Usage data is an essential factor in any librarian’s role. As I’ve begun to look into the statistics relating to online journal access in the relatively small library I work in, I’m particularly interested in Altmetric, and how it might facilitate data collection to do with the wider circulation of research using social media.

Altmetric is a relatively new programme measuring the impact of scholarly articles using three main factors: volume, source and author. Results show the number of times an article is shared, taking into account the value of the platform (a newspaper article is worth more than a Tweet, for example), and also the relevance and value of the author, prioritising individual academics and researchers over publishers’ accounts, etc.
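The weighting idea can be illustrated with a toy calculation. The per-source weights below are invented purely for illustration; Altmetric’s real scoring is proprietary, and it also weights who shared, not just where:

```python
# Invented per-source weights -- Altmetric's actual weights differ
# and also factor in the author of each mention.
SOURCE_WEIGHTS = {"news": 8, "blog": 5, "twitter": 1, "facebook": 0.25}

def toy_attention_score(mentions):
    """Sum mention counts weighted by (assumed) source value."""
    return sum(SOURCE_WEIGHTS.get(source, 0) * count
               for source, count in mentions.items())

# A mostly-Twitter article: many shares on a low-weight platform.
score = toy_attention_score({"news": 2, "twitter": 270, "facebook": 48})
```

Even in this toy version you can see why two newspaper mentions can rival dozens of Facebook shares.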

The Altmetric donut is made up of colours relating to sources, and represents the impact of the individual article appearing in search results.

Image taken from the Altmetric blog.

For example, I searched for mentions of the keyword: ‘feminism’ and narrowed it down to include articles Tweeted in the past month. Here is one of the highest ranking, with the demographics tab open:


This shows a mostly blue donut, due to the 270 Tweets sharing the article; the second most popular sharing platform, Facebook, represented with a darker blue, had only 48 shares. The variety of users sharing the articles featured in my search results was to be expected, and led me to some really interesting open access articles. To look at data relating to art history, I searched for articles shared in the journals most used in my own library at the National Portrait Gallery (rough statistics using data acquired the old-fashioned way: measuring the circulation of physical journal copies).


The results were poor, with the highest donut of 9 coming from the Oxford Art Journal (results within the last 6 months, shared using Twitter). A great feature of Altmetric is the tool to export articles onto a saved spreadsheet, which I tried out a few times experimenting with feminist donuts as mentioned before. I haven’t yet managed to find enough data to benefit from a detailed spreadsheet, but will continue to play around with searches and features to improve accuracy and enable comparison.

There is a lot of debate over the correlation between sharing and citing, and the ‘need to look beyond the numbers, and question who is actually behind all that tweeting, citing, clicking and downloading’ (E. Priego, 2012). But the initial data seems to show a lack of traffic, which is likely to have an impact on both quantitative and qualitative analysis.

LSE made lists of academic Tweeters in 2011 in the run-up to launching their guide to Twitter for academics. The Arts list is surprisingly lacking in arts academics, mostly focusing on philosophy, history and literature. Are art historians falling behind in their use of social media, most noticeably Twitter, now that it’s so heavily used by researchers in other disciplines? Or is there some sort of barrier between Twitter as a sharing and collaborating platform and academic research in art history? As Ernesto Priego put it in his article On Metrics and Research Assessment: ‘If increased international public access and impact are to be key factors in 21st century research assessment, the adoption of metrics, and particularly article-level metrics, is essential.’ I’m interested to see if I can find out more about the current extent of public access and impact in the subject of art history using Altmetric.

Focus on TAGS and #FocusE15

Twitter, as one of the most popular social networking platforms, is a heavily used resource when it comes to social activism and commentary in the 21st century. To fully evaluate the role of Twitter in inspiring, organising and documenting activism, we need to consistently collect and compare the data it’s generating.

In August 2014, Nora Daly hosted a live #NewsHourChat via PBS to discuss ‘hashtag activism’. In her introduction to the accompanying article she sums up the arguments for and against the phenomenon:

Proponents of hashtag activism celebrate its ability to raise awareness and magnify voices that might not otherwise be heard. Critics claim that hashtags rise and then quickly fade from public consciousness, in part because they are often embraced by individuals who have little or no vested interest in the cause.

Using the hashtag #IfTheyGunnedMeDown, which relates to the shooting of Mike Brown by a police officer, Daly illustrated the brevity in circulation of sensationalist issues on Twitter.

This information comes from the certified Twitter partner Topsy, a social search and analytics company that, as of September 2013, has access to every public Tweet ever published. However, rather than that sort of all-encompassing big data, my research operates on a much smaller scale.

I’m interested in the success rates of the use of hashtags, or consistent and searchable phrases, for raising awareness of campaigns, particularly on a small, local, community-based level, and will be using TAGS (Twitter Archive Google Spreadsheet) to attempt to measure the levels of awareness for one particular campaign.

Focus E15 Mothers, a group of young mothers from Stratford who were evicted from their hostel in 2013 with nowhere to go, decided to occupy local council housing that had been left empty despite a serious need for social housing in the area. Using the hashtag #FocusE15 (and variations, e.g. #FocusE15Mothers) they were able to draw in interest across London (I first noticed the campaign when Twitter told me it was trending in my area), and significant events such as their court hearing were shared and discussed in real-time.

Read about the roots of the campaign here. This of course isn’t the first time communities have used social media to pull together current information in place of more traditional news sources. During the London riots of 2011, people were using Twitter to get real-time news about local developments. I can vividly remember sitting in my bedroom in my old Brixton flat, smelling smoke coming in from the streets and refreshing my search for #Brixton and #LondonRiots to find out exactly how trouble in my area was escalating, directly from fellow citizens. In the aftermath of the riots the site was accused by politicians and the police of enabling rioters to connect with each other and organise their crimes. However, this was disproved by a study soon after, which was used in this article as evidence that Twitter actually generated more Tweets concerned with the #riotcleanup operation, and that Tweets responding positively to the riots were generally condemned or ignored.

The study, conducted as part of Reading the Riots, the Guardian and London School of Economics investigation into the riots, was based on a database provided by Twitter. Relevant tweets were drawn from dozens of riot-related hashtags – such as #EnglandRiots or #BirminghamRiots – which were used at the time to pool tweets about the same subject.

Read more about the study’s findings here.

This research relied heavily on archiving Tweets through processes similar, but on a much larger scale, to the experimenting I’ve been doing with TAGS, a Google Spreadsheet tool created by Martin Hawksey that collects Tweets via the Twitter API. When I first encountered TAGS and projects relying on collecting Twitter data, I wondered about the differences between using archived data and relying on the search function Twitter provides. I found this post about search results on Twitter’s Engineering Blog really useful; it told me how the personalised search function works and why you might not see every possible result when you search.

There is a lot of information on Twitter — on average, more than 2,200 new Tweets every second! During large events, for example the #tsunami in Japan, this rate can increase by 3 to 4x. Often, users are interested in only the most memorable Tweets or those that other users engage with. In our new search experience, we show search results that are most relevant to a particular user. So search results are personalized, and we filter out the Tweets that do not resonate with other users.

To me, this highlights the limitations of the Twitter search. Archiving Twitter data is extremely useful in many research disciplines, because it allows you to reflect over longer periods of time, using more data, and archive APIs promote the sharing of data, ultimately ‘strengthening research networks and fostering exchange and collaboration’ (E. Priego, 2014). Tracking the way Twitter users communicate, studying trends in information sharing and geographical interests, etc., is really important to research now that approximately 284 million people use the service (1).

I began experimenting using the #FocusE15Mothers hashtag, which I recalled was popular during the time the campaign began. Interestingly, I received no results for this hashtag when creating a spreadsheet, meaning that in the past 7 days nobody had been discussing issues under this hashtag. By using the simple Twitter search I found that accounts previously using the full tag had begun to switch to the shorter version, #FocusE15, an example of how hashtags can evolve with use to adapt to the culture of convenience that social networking depends on. By creating an archive of the usage of this specific phrase from 1st November 2014, I can follow the progress of the campaign as it becomes more involved in the wider issues of social housing in London. The TAGSExplorer function allows me to see the way the campaign is fitting into other areas of discussion on Twitter, such as #NewhamHomes and #OccupyDemocracy.
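Spotting that kind of hashtag drift can be automated once the archive exists. A sketch of counting tag variants in a TAGS-style export (the column names and sample rows are invented; real TAGS archives have more columns):

```python
import csv
import io
import re
from collections import Counter

# Invented sample standing in for a TAGS spreadsheet export.
archive = io.StringIO(
    "created_at,text\n"
    "2014-11-01,Support the campaign #FocusE15Mothers\n"
    "2014-11-02,March on Saturday #FocusE15\n"
    "2014-11-03,Homes for all #FocusE15 #NewhamHomes\n"
)

# Tally every hashtag, lowercased so variants in case are merged.
tags = Counter()
for row in csv.DictReader(archive):
    tags.update(t.lower() for t in re.findall(r"#\w+", row["text"]))
```

Running this over successive weeks of an archive would show the longer tag fading as the shorter one takes over.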




It’s a shame I don’t have access to the data from the beginning of the campaign as I’d be interested to see if it would back up my suspicion that interest is waning, but this in itself highlights the importance of archiving sets of Twitter data. Using the Topsy search function, I can trace the first Tweets on the subject (9 months ago, when journalist Kate Belgrave Tweeted her piece on the emerging story), but the format of results isn’t interactive or descriptive enough. Another factor I’d be really interested in following is the location of users Tweeting about FocusE15. So far, very few Tweets include user location (due to individual user account settings), but those that do have come from within East London. I’m interested in using the data I collect to follow the progress of the campaign’s reach via Twitter, and explore the uses of Twitter archives for research in the context of small-scale activism.


(1) Figure as of 29/10/2014, taken from this really useful site.

On The Road in 2014: APIs and embedding at work

Before I knew my employer had the funding to help me attend #citylis this academic year (a lovely surprise!), I planned and booked a road trip across the southern states of America (Austin, Texas to Gainesville, Florida). I’ll be missing one day of classes, but in America the streets are paved with WiFi, so I’m going to try my best to keep up with reading.

Planning a trip like this, I’ve relied heavily on web services, placing faith in the internet community to provide me with information on a whole range of useful areas. The APIs and Embedding lecture as part of the DITA module helped me realise just how much we rely on APIs (application programming interfaces) and embedding.

an API describes how systems can work together, normally through the form of web services

Libraries Hacked
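In practice this usually means one system serving structured data (often JSON) and another consuming it. A minimal sketch of the consuming side, using an invented response rather than a live service:

```python
import json

# Invented example of the kind of JSON a web service API might return.
response_body = '{"venue": "Spice Inn", "city": "Stratford", "rating": 4.2}'

# The consuming system only needs to know the agreed structure,
# not how the other system produced it -- that contract is the API.
data = json.loads(response_body)
summary = f'{data["venue"]} ({data["city"]}): {data["rating"]}/5'
```

Embedding works the same way at a higher level: the page author includes a snippet, and the provider’s API fills in the content.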

The content on heavily populated web servers is unique for everyone, as for the most part we customise our own feeds for sites like Facebook and Twitter, and shape the content we’re presented with in sites like Amazon and ASOS, often without realising. In her book The People’s Platform: Taking Back Power and Culture in the Digital Age (Fourth Estate, 2014), Astra Taylor discusses the nature of Web 2.0 and how it trades on our sociability.

Web 2.0 is not about users buying products, rather, users are the product. We are what companies like Google and Facebook sell to advertisers.

– A. Taylor (2014)

APIs allow companies to follow you around the web, collecting data and using existing data across various platforms to draw you back to their service. Specific items you viewed on eBay appear at the side of your Facebook feed, and sites push you to ‘like’ products which link back to your Facebook. Everything we do on the internet is shaped by the nature of data sharing between systems: we rarely open up a website without it clocking our Facebook, Twitter, Google accounts, and so on.

‘Content is no longer king… connections are.’

– A. Taylor (2014)

Websites host users’ real life experiences, from the initial general searches to the post-trip two-year Timehop. If you Googled places to visit in New Orleans, LA, it’s normal to expect flight deals and hotels to pop up in advertising windows at the sides of your Facebook feed. The advertisements also come in a more subtle form, as suggestions of ‘articles’ for you to read that are actually editorial pieces on services or products.

The way Google Maps interacts with other websites via APIs is incredibly convenient. It’s really useful to read up about a place and have an embedded map on the page that I know I can click through to, enlarge, type other locations into for directions, and print off.


On a social level, friends can post a photo of a trip to Instagram and have it appear on Facebook and Twitter, using similar software to build connections between the various applications that make up their online presence. Hashtags can be used universally (especially since Facebook began supporting them), which encourages posting across multiple formats. At the end of the trip we’re heading to Fest, a music festival in Florida. Many of the bands I’m seeing at the festival I first heard, with relative ease, via embedded SoundCloud or Spotify links, either on social media or news sites. I often Tweet Spotify links, as it’s a great and easy way of recommending music.


Almost everything we do now is reflected online in the services we use and contribute to. As well as trading off our sociability, I think we depend on the sociability of web services to interact with each other, share data and allow cross-posting to complement our growing online presence.


The People’s Platform: Taking Back Power and Culture in the Digital Age / Astra Taylor

Relationship counselling: making meaningful connections and getting what we want

Information Retrieval, to the casual Google user, is all about speed, accuracy and pre-emptive results; the mark of a search engine going above and beyond the call of duty. If I type a restaurant name into a search engine such as Google, it’s likely a link for the website, a map of its location and a few pages of reviews will be returned to me, sometimes alongside an embarrassing spelling correction. Google understands what I want to find out, despite the absence of any potentially relevant keywords (‘location’, ‘contact’, ‘Lonely Planet’, perhaps). It’s as though it’s mastered the art of conducting the reference interview in 0.36 seconds, something librarians have been doing for a long time.

Reference interviews are traditionally conversations between a librarian and a library user, in which the librarian attempts to clarify exactly what the user wants to find out and the end result they need, sometimes suggesting different search terms, techniques or even alternative resources.

One of my library’s most used resources is our collection of sales catalogues, which dates back to the late 1800s. The series is mostly made up of catalogues from Bonhams, Sotheby’s and Christie’s sales, usually focusing on collections containing significant pieces of portraiture. After discussing a new project to create digital catalogue records for the sales catalogues, I used the reference interview as a method of researching the way library visitors used the resource and what they were typically hoping to find out. The results of these conversations, and the varied interests and areas of study, reflect the diverse nature of the Gallery’s remit.

At the National Portrait Gallery we’re concerned with documenting and providing an insight into the lives of important British people through the art of portraiture. The Primary Collection holds paintings, photographs, sculptures and drawings of just over 11,000 people, and the Heinz Archive and Library at the Gallery holds information on over 1 million significant sitters and artists. (Find out more about the collections here.)

In 2007, the library began to produce very basic catalogue records for incoming bound volumes on the library management system, EOS, searchable by title and date. Before this, we had no digital records for the catalogues, but relied on a card catalogue still used today to search for historic sales. The aim of the project is to create detailed entries for these bound volumes in our library management system, forming links between owners, institutions, locations and uniquely identifying our copies. This will hopefully produce more accurate results for searches conducted in the OPAC and will allow us to move away from relying on the card catalogue. Patterns will be traceable between subjects linked with Library of Congress authorities, placing them into the context of the collection as a whole using sophisticated cataloguing techniques.
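The kind of linked record described above can be sketched as a simple data structure. The field names, identifiers and headings below are invented for illustration and are not our actual EOS schema:

```python
from dataclasses import dataclass, field

@dataclass
class SalesCatalogueRecord:
    """Invented sketch of a detailed catalogue entry with linked authorities."""
    title: str
    sale_date: str
    auction_house: str
    copy_id: str  # uniquely identifies our physical copy
    lc_authorities: list = field(default_factory=list)  # linked LoC headings

record = SalesCatalogueRecord(
    title="Catalogue of Important British Portraits",
    sale_date="1898-06-14",
    auction_house="Christie's",
    copy_id="NPG-SC-0412",
    lc_authorities=["Portraits, British", "Art auctions--England--London"],
)
```

Shared authority headings like these are what let the OPAC trace patterns between records, rather than treating each volume as an isolated title and date.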

Something I’ve begun to think about in relation to the course learning is the idea of suggested searches, assumed corrections (‘did you mean..?’) and the limitations of databases and search engines such as Google. In the library we cut out and mount portraits from second copies of the sales catalogues we keep, and these contribute to a huge physical collection of image files, organised by sitter and again by artist. Probably our most used library resource, it’s curated by my department: historically we have added images we assume are what users will search for, prioritising this over recording data about the source material, the sales catalogue. What if the user is researching something more obscure, or, rather than a particular object, is looking at relationships and reflecting on patterns in a whole range of data? In terms of a casual Google searcher after a restaurant to eat at, perhaps I wasn’t looking for the critically acclaimed Spice Inn in Stratford, but rather the Spice Inn in a small East Midlands town, without an up-to-date website, entry in Lonely Planet or menu available online.

Is the increased intelligence of a Google search resulting in some information falling by the wayside? How can we, and should we, as librarians, influence and mediate search results in the way we organise and classify data? Does the recording of certain information deemed significant create a cycle of significance based on easier access, popular searches and information digitally available? I like to think that as we continue to put more records into our library management system, and attempt to organise and relate more subjects, the wealth of knowledge about our existing collections grows, and with it, potentially, the way we think about important British people and portraiture.

‘Nothing is higher than Architect’

– George Costanza, Seinfeld.

This blog will follow my progress through the Digital Information Technologies and Architecture module of the Information Science course at City University (#citylis). I began this course after three and a half years in a small art library at assistant level, where I’ve witnessed the collection’s online presence and digital access capabilities expand. I’m really looking forward to exploring information architecture and can see already how relevant and useful the knowledge is on a day to day level.

Something I often get asked is ‘what do you want to be when you graduate?’, as people aren’t sure what studying Information Science might lead to. I change my answer every time, to mix it up a bit, and because I’m really not sure. A job title I’m definitely going to start offering up is Information Architect, as it’s such a great way of describing so many aspects of my job and what I love doing.

It wasn’t until the first exercise of this module that I realised I already have strong opinions on what makes a good website. Drawing basic versions of some sample sites using squares and rectangles, I saw how important it is that information is laid out so that it can be found and connected in a logical way, leading a reader from one part to another as they put together the content you’re providing. Good information architecture creates a dot-to-dot drawing easily completed by any user, while being flexible enough to result in a picture unique to their experience.

I constructed DITA Ritz on a basic theme, with clear menu options and a Twitter widget running down the right-hand side. A few of the themes I tried at first had the widget as a footer, which to me didn’t complement the timeline layout of a Twitter feed, so in the end I opted for the very basic Skeptical Theme. Something I like about this theme is that it places the About page link in the top right corner, slightly separate from the archive of posts on the left. The About page is the context in which all visitors will be reading the content, so I put it up there with the blog title and tagline: hopefully a consistent framework. It’s likely I’ll change things around as and when I notice something isn’t quite right, as that’s in my nature, but luckily I’m finding the WordPress site easy to navigate.*

I can already tell I’ll enjoy being an Information Architect.

*EDIT 21/10/14: I’ve since changed my blog theme to Penscratch, which I prefer as the page appears cleaner and sharper. The text box is also wider, which is useful when embedding images and maps especially, as it allows you to have them closer to the original size.