The end of DITA…

I can’t believe we’re already at Week 10… In our first lecture, when the idea of blogging was introduced, I was petrified by the thought of having to write a post each week! Who would even want to read what I was writing? It was going to be a disaster! Now, ten weeks on, I’m certain that I’ll continue to blog even after the DITA module has finished!

I’ve found the blogging experience extremely useful as a reflective tool and a means of consolidating my knowledge from each week. It’s also helped to flag up areas which I want to look at in greater detail, or which need further reading to gain a fuller understanding. For this reason I hope that next term I can expand the blog to incorporate my other modules.

Being able to read and comment on my fellow DITA students’ blogs has been an interesting and helpful study aid. As everyone has a different background coming into the module, the perspectives and focuses of people’s blogs have varied greatly, offering multiple insights into each topic and area. I’m hoping that fellow students will continue to blog too, as it’s been really valuable to read their thoughts week by week.

DITA has been a truly eye-opening module, with each week bringing an area of Information Science I’d never heard of, or a concept I knew but had never really thought about. It has been hard, but enjoyable, work!

The Semantic Web

As so often with DITA, The Semantic Web was a daunting unknown to me…

The Semantic Web is a movement that aims to evolve the current web, an unstructured web of pages, into a web of data. The current web uses hyperlinked pages which are human-readable, but the Semantic Web aims to incorporate metadata which is readable by machines; this metadata will help to identify information about pages, and how pages relate to each other. Robinson and Bawden (2012) describe the Semantic Web as:

…the idea that information on the web may be structured and encoded in such a way that its content and meaning are made explicit, so that they can be ‘understood’ by search engines and other software agents.

A metadata model commonly associated with the Semantic Web is the Resource Description Framework (RDF). This is a simple system organised by ‘triples’. The triples are based on a ‘subject’, ‘predicate’ and ‘object’ relationship, for example: {Caryl Burtner} {wrote} {The Exorcism of Page Thirteen}
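To make the triple idea a bit more concrete, here is a minimal Python sketch. It uses plain tuples rather than a real RDF library, and the names `Triple` and `objects_for` are my own inventions for illustration; the only data is the example triple above.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object statement."""
    subject: str
    predicate: str
    object: str


# A tiny 'store' holding the single example triple from this post.
triples = [
    Triple("Caryl Burtner", "wrote", "The Exorcism of Page Thirteen"),
]


def objects_for(subject, predicate, store):
    """Return every object linked to `subject` by `predicate`."""
    return [
        t.object
        for t in store
        if t.subject == subject and t.predicate == predicate
    ]


print(objects_for("Caryl Burtner", "wrote", triples))
```

Because every statement has the same three-part shape, a machine can answer a question like "what did Caryl Burtner write?" by simple matching, without having to 'read' a page in the way a person would.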

We looked at the Semantic Web in reference to Artists’ Books Online. The mission statement for the project states that:

Artists’ Books Online is designed to promote critical engagement with artists books and to provide access to a digital repository of metadata, scans, and commentary.

Screenshot of the Artists' Books Online homepage.


Artists’ Books Online (ABO) uses a Document Type Definition (DTD) to organise the information from XML files. This is built around a three-level organisational hierarchy: ‘work’, ‘edition’ and ‘object’. The ‘work’ level describes the work as an overall concept or idea and may contain multiple editions; the ‘edition’ level focuses on a specific edition, which may contain several objects; the ‘object’ level is the description of an individual item which you have in front of you. By browsing ABO you can view a record by each element of its hierarchy, see the screenshot below:

Screenshot showing different hierarchical views.


I found ABO an interesting project based upon an XML architecture, and am looking forward to seeing how it develops further.
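Going back to the three-level hierarchy described above, here is a rough Python sketch of how such a record might be nested and read. The element names follow the ‘work’, ‘edition’ and ‘object’ levels, but the attributes, titles and ids are invented for illustration and are not taken from the real ABO files.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: only the three element names come from the post;
# every attribute and value here is made up for the example.
record = """
<work title="Example Work">
  <edition label="First edition">
    <object id="copy-1"/>
    <object id="copy-2"/>
  </edition>
  <edition label="Second edition">
    <object id="copy-3"/>
  </edition>
</work>
"""

work = ET.fromstring(record)


def objects_by_edition(work_element):
    """Map each edition label to the ids of the objects it contains."""
    return {
        edition.get("label"): [obj.get("id") for obj in edition.findall("object")]
        for edition in work_element.findall("edition")
    }


for label, ids in objects_by_edition(work).items():
    print(label, ids)
```

The nesting is what lets a site offer a view at each level: the same file can be summarised as one work, listed by edition, or drilled down to a single object.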

Data Mining

In last week’s DITA Lab session we experimented with the Old Bailey Online, a searchable database containing details of nearly 200,000 criminal trials held at London’s central court between 1674 and 1913. Users can access the digital copy of the original text, as well as search using keywords. The site is easy to navigate and use, with two options for searching: from within the site itself, and via the Old Bailey Online API.

The API allows you to search using keywords, and filter your results with a number of options, for example defendant gender, offence, victim gender, verdict, punishment… The API displays the results in a simple way, with the option to select a case and see the relevant text, break down your results via the options mentioned previously, and find similar cases using the ‘more like this’ function. The API also allows you to export your results to Voyant Tools to analyse the details from within your chosen search. Whilst this would be an incredibly useful tool, the function would not work for me during the Lab session, or when I tried later at home (nor as I write this now!), which is unfortunate. However, I will keep trying…

Screenshot of the Old Bailey Online API using the keyword search 'Rippon'


Having used visualisation tools for the previous two Lab sessions, I can see that there are definite advantages to using them to analyse large bodies of text. Voyant Tools highlights the most commonly occurring words in a series of texts, allowing you to identify recurrent themes and ideas. However, word clouds need to be carefully checked, and relevant stop lists applied, to prevent warped representations of the data. Another drawback appears to be the reliance on seemingly unreliable technology, as demonstrated by my multiple unsuccessful attempts to export my data to Voyant Tools.

In addition to looking at the Old Bailey Online I also explored Annotated Books Online, which is part of the Utrecht University Digital Humanities Text Mining Project. This project focuses on the history of reading, by looking at annotations in early modern books. This particularly interested me as reading is such a personal experience, and the notes that people make in their books offer a fascinating insight into their understanding and view of the text. (Although we hope they weren’t Library books!) Annotated Books Online (ABO) has digitised over sixty texts which are searchable using the ABO database. There is a basic search function using keywords, or an advanced option to select and sort your results. The search function is not as smooth as the Old Bailey Online’s, and as far as I can tell you cannot search within the texts, only for the texts themselves, although this may come later.

Screenshot of Annotated Books Online search functions.


Text Analysis

Our DITA session on Text Analysis explored the idea of ‘distant reading’, a term coined by Franco Moretti. This operates as an alternative to the traditional approach of close reading by situating a text as part of a data set rather than as an individual piece. Text Analysis is a quantitative study of a text or several texts which aims to highlight correlations and trends within the text(s). Text Analysis systems search large texts very quickly and can display the results in a number of ways. The results are often presented visually, as word clouds or charts. During the Lab session we worked with a number of different analysis systems: Wordle, Many Eyes and Voyant Tools.


Wordle:

This tool focuses on creating word clouds, generated from the most commonly occurring words in a set of texts. There are a number of options to customise the word clouds, although these are mostly superficial, altering the style and look: you can change the shape, angle, and colour.

Example of a Wordle ‘word cloud’



Voyant Tools:

Voyant was my favourite of the systems we used, as it went further than just creating word clouds. It also allows you to view a summary which highlights the number of times certain words appear in the texts you have entered. Another useful tool within Voyant is the ability to apply a ‘stop list’ of words which you don’t want to appear in your summary or word clouds. This ensures that especially common words which carry little or no meaning, such as ‘and’ or ‘it’, will not distort the results. You also have the option to edit your ‘stop list’, which I found particularly helpful when analysing Twitter data as it allowed me to exclude terms which were not relevant to the actual data, such as ‘http’ and ‘mt’. Unlike Wordle, there are no options to adjust the colour or font of the word clouds which have been created.
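The counting behind a word cloud can be sketched in a few lines of Python. This is only my own rough approximation of the idea, not how Voyant or Wordle actually work; the stop list is a tiny illustrative one, and the sample sentence is made up.

```python
import re
from collections import Counter

# A small illustrative stop list; real tools ship much longer ones.
# 'http' and 'mt' are included to mimic cleaning up Twitter data.
STOP_WORDS = {"and", "it", "the", "a", "of", "to", "http", "mt"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each word appears, ignoring case and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stop_words)


# An invented sentence standing in for a set of tweets.
sample = "MT the library and the archive: http link to the library"
print(word_frequencies(sample).most_common(2))
```

Without the stop list, ‘the’ would be the most frequent word in the sample, which is exactly the kind of warped result a stop list is meant to prevent.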

An example of Voyant before adding ‘’ to the list of stop words.

Another example of a Voyant text analysis.



Many Eyes:

Unfortunately, on the several occasions I tried, I couldn’t get this analysis tool to work. From my limited experience, an obvious drawback is having to create an account in order to use the service.


Altmetrics and Donuts

This week’s DITA lecture and lab focused on Altmetrics. Altmetrics are a non-traditional way of measuring the ‘success’ of an academic publication, offered as an alternative to the more traditional route of counting citations. They aim to capture a wider variety of impact, taking into account the traditional measurements but also areas such as mentions on blogs, social media, and new media. This is useful in seeing how widely disseminated a paper may be, but there are limitations. There is no way of knowing who is reading the material and whether it is reaching its intended audience; just because something is shared does not mean that it’s actually being read; and papers with ‘snappy’ or amusing titles are more likely to capture the attention of a non-academic audience. Therefore we may perceive Altmetrics to offer a valid view of a paper’s popularity, but not necessarily of its academic quality.

During our Lab session last week we were given access to which allowed us to delve fully into the world of altmetrics, and explore how useful they can actually be. uses the ‘altmetrics donut’ to indicate a paper’s online presence, with different colours indicating different areas of outreach (e.g. blue for Twitter, yellow for blogs…). The donut also conveys a paper’s ‘altmetric score’, which indicates the amount of online attention it has commanded. (As an aside, the constant mention of donuts in a lab session immediately before lunch was akin to torture…)

As this was my first time using anything like this I wasn’t entirely sure what to search for, and the number of search fields was slightly overwhelming, but I decided to focus on looking at articles about Libraries and social media (as this is always useful for work!). Immediately I could see that if you had a particular research focus, services such as would be incredibly helpful in seeing ‘trends’ in particular research areas, and what views and opinions may be popular at any given time.

Altmetrics certainly offer a valuable insight into an academic paper’s online activity; however, this does not necessarily correspond to gaining traditional citations. As Van Noorden writes:

the amount of buzz a paper gets on Twitter bears little resemblance to the impact it will have in terms of academic citations in later years.

Thus, I think Altmetrics have a clear place in our consideration of a paper, and are definitely a useful tool for researchers, but I think more work is needed before they can be considered a viable alternative to impact factor as a measure.


Van Noorden, R. (2013) ‘Twitter buzz about papers does not mean citations later’ Nature {accessed 17/11/2014}

TAGS and Twitter

As with much of DITA so far, before last week’s lecture I hadn’t really considered how Twitter data is collected; naively I may have even assumed it was done manually using the Twitter search function (Twitter fairies exist, right?). However, in the DITA Lab session we used Martin Hawksey’s TAGS application to collect tweets, and the metadata associated with them. This application allows tweets to be archived and analysed to highlight trends at certain times, e.g. #RemembranceSunday is trending on Twitter today.

Using the TAGS application was relatively simple (whilst following Ernesto’s very detailed instructions…!): once a relevant hashtag has been entered, data from the past seven days is displayed in a spreadsheet, allowing the user to view it in a simple and familiar way. Multiple sets of data can be archived simultaneously, allowing the user to compare the hashtags being used, and the frequency and peaks of their usage. This can then be viewed as charts and graphs or visualised, for example as tag clouds, using the Archives and Explorer functions within TAGS.
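To show what finding a ‘peak’ in usage amounts to, here is a small Python sketch. It does not use TAGS itself or any real export format; the rows below are invented stand-ins for a spreadsheet of archived tweets, and `peak_day` is a hypothetical helper of my own.

```python
from collections import Counter
from datetime import date

# Invented rows standing in for an archive spreadsheet: (day, hashtag).
rows = [
    (date(2014, 11, 7), "#example"),
    (date(2014, 11, 8), "#example"),
    (date(2014, 11, 8), "#RemembranceSunday"),
    (date(2014, 11, 9), "#RemembranceSunday"),
    (date(2014, 11, 9), "#RemembranceSunday"),
    (date(2014, 11, 9), "#RemembranceSunday"),
]


def peak_day(rows, hashtag):
    """Return (day, tweet_count) for the busiest day of a hashtag, or None."""
    per_day = Counter(day for day, tag in rows if tag == hashtag)
    if not per_day:
        return None
    return per_day.most_common(1)[0]


print(peak_day(rows, "#RemembranceSunday"))
```

Counting per day and per hashtag like this is the same comparison of frequency and peaks that the charts do visually.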

It is important to collect Twitter data as trends can be fleeting, and opinions and commentaries can change quickly on social media as events are followed live. Key examples of this are sports games, crucial TV programmes and current social issues.

In recent weeks the outbreak of Ebola led to a huge rise in interest and discussion on this topic, particularly on social media channels. For incidents such as this which have a worldwide impact, archiving Twitter data is crucial in understanding public opinion and awareness of the issue, and how this can change over time. Similarly, being able to spot when activity around a topic peaks can help to pinpoint new announcements, and critical changes to an issue. I spotted a Buzzfeed article (obviously a reputable news source) this afternoon entitled ‘Americans have stopped caring about Ebola according to Google’. This brief article explains that Google searches by Americans for Ebola appear to have declined, despite the continuation of the Ebola outbreak. Therefore, an analysis of social media, and particularly hashtags, would be useful in showing whether the pattern matches or contradicts this article, indicating whether public interaction with the issue continues despite a reduction in Google searches.

Despite having little practical experience of working with TAGS so far, I can understand and appreciate how useful it will be in collecting and storing data. Over the next few weeks I hope to gain more ‘hands-on’ experience with TAGS, and will report back then!

APIs and Embedding

I’ll be honest: before last week’s lecture I really didn’t have much of an idea about what APIs actually were. Even after reading around the subject prior to the DITA lecture I was struggling to fully understand and visualise what they were and how they worked. As it turns out I was trying to over-complicate them, and had actually already been using them (although admittedly without being aware of this).

To put it simply, an API (Application Programming Interface) hides the internal complexities of programming from a user, allowing them to actively engage with building a website in a much simpler way. This is especially useful as it allows a much wider audience to take advantage of digital technologies without having to know the intricacies of HTML etc.

I first unknowingly used an API to embed my Twitter feed within this very blog site (neatly located on the right hand side of the screen, when viewed from a desktop device). It was surprisingly simple to do: I followed a short set of instructions and there it was. Websites such as WordPress allow their users to integrate varying types of media into their sites, whilst still maintaining ultimate control over how content is being hosted within their environment. WordPress have created a set of shortcodes to allow their users to quickly embed videos, pictures, Twitter posts etc. without having to know or worry about complicated, messy code.
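To illustrate the general idea of a shortcode, here is a toy Python sketch of my own. It is not how WordPress actually implements shortcodes, and the `[video …]` shortcode and the `expand_shortcodes` function are made up for the example: the point is only that the user types something short and the system swaps in the messier HTML.

```python
import re
from html import escape


def expand_shortcodes(text):
    """Replace each made-up [video URL] shortcode with an HTML iframe."""

    def to_iframe(match):
        # Escape the URL so it is safe to place inside an HTML attribute.
        url = escape(match.group(1), quote=True)
        return f'<iframe src="{url}"></iframe>'

    return re.sub(r"\[video (\S+?)\]", to_iframe, text)


print(expand_shortcodes("Watch this: [video https://example.com/clip]"))
```

The person writing the post only ever sees the short bracketed version, which is why embedding feels so simple from the user’s side.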

Therefore, to practise using these shortcodes I thought I’d embed a recent tweet I did for work (not at all trying to attract more followers…)

And finally… a cat riding a tortoise. Because it’s Monday morning and there should always be amusing cat videos! Enjoy- it’s best watched with sound!