Chris Culy
Visualization of text data in the Humanities
University of Hamburg
14 November 2013

My interest in language and visualization


Bambara: malonyinina o malonyinina
Fulfulde: mi yi'ii ɗum e giɗum
Donno Sɔ Dogon: Oumar Anta inyemɛñ waa be gi

Source: OpenStreetMap

My interest in language and visualization

Takelma texts

Now those just scattered off, Grizzly Bear did chase the people around. Now this Coyote, for his part, did run off with the chieftainess girl. Then, 'tis said, after a little while, "Are you a female? It must be a female," he thought; Coyote now, for his part, did wish to sleep with her. Tunc nihil vulvae repperit. "What did I, for my part, (take)? That you were a woman I thought," he said to her. Coyote threw Frog into the water. "Do you think you will be a woman? Frog you will always be called," he said to Frog. Proceeding just up to there (it goes). 'Tis finished. Go gather and eat your ba|ap'-seeds.

Source: Takelma Texts

My interest in language and visualization

Recipes, letters


Bake Ø until done


EBB: Overjoyed I am

My interest in language and visualization

Visualizations as tools

Source: C. Culy, E. Chiocchetti, and N. Ralli. 2013

3 Separate goals of visualizations

For 2. Comprehension, show continuity
For 3. Communications, filter docs to 0.9 and 0.1
Remember to show original letter
Remember to mouse over for more info.

First words

Which visualizations are best suited for which of these goals?

Finding the right visualization

  1. Grinstein's Grand Challenge

  2. Pre-existing first steps

    Tableau, Spotfire

  3. Limitations

    • Limited types of data: numbers, dates, geographic, categories
    • No notion of task
    • No notion of preferences

Sources: Georges Grinstein, Tableau, Spotfire

Tackling the Grand Challenge

Person-oriented correspondence

Sources: on, on on Wikipedia

Corpora | Genre (Bakhtin)
Individual items | (Complex) Utterances
Type(s) of the items | Utterance types
Single corpus characterized by a unifying factor | Instantiation of a genre
Category of corpora characterized by an abstraction of a unifying factor | Genre: the collection of utterances used in a sphere of communication

Dataset genres

Source: M. Bakhtin “The Problem of Speech Genres”. Thanks to Yulia Svetashova and the members of the class Development of applications using NLP tools

Theoretical issues

( Is there a better name than dataset genre? )

Thanks to Prof. Hilary Nesi and the members of the class Development of applications using NLP tools

Some other possible (corpora) dataset genres

( What other dataset genres do we have? )

Thanks to the members of the class Development of applications using NLP tools

Language is different

  1. Language is not mappable

  2. Individual pieces of data are meaningful

  3. Much linguistic data is computed, not observed

Uses: DoubleTreeJS

Explain color coding
Expand left "I"
Resort by POS
Recenter on Tennyson
Select i/ as root : comment on 7300+ as pronoun, 23 as common noun then
Select i/NN as root, then show KWIC for "this" on the left
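
The demo steps above (expand one side, recenter on a new word) rest on one idea: the contexts to the left and right of a root word are merged into two trees. The following is a minimal sketch of that idea only, not DoubleTreeJS's actual code or API. The function name `context_tree`, its parameters, and the three sample sentences are made up for illustration.

```python
import re

# Hypothetical sample data, invented for this sketch.
SAMPLE = ["I am here", "I am glad", "Then I went"]


def context_tree(sentences, root, side="right", depth=2):
    """Merge the words following (or preceding) `root` into a nested dict.

    Each node has the shape {word: {"count": n, "children": {...}}}.
    """
    tree = {}
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, token in enumerate(tokens):
            if token != root:
                continue
            if side == "right":
                context = tokens[i + 1 : i + 1 + depth]
            else:
                # The left context is read outward from the root, so reverse it.
                context = tokens[max(0, i - depth) : i][::-1]
            node = tree
            for word in context:
                entry = node.setdefault(word, {"count": 0, "children": {}})
                entry["count"] += 1
                node = entry["children"]
    return tree


right = context_tree(SAMPLE, "i", side="right")
left = context_tree(SAMPLE, "i", side="left", depth=1)
```

Shared continuations collapse into one branch: "am" is stored once with a count of 2 and two children. Recentering amounts to calling the function again with a different root, for example `context_tree(SAMPLE, "am")`.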

Concerning data

( What types of data are especially relevant? )

What specialized visualizations are especially relevant to the data?

( How important are data uncertainty and data errors? )

What should we do in the visualization about uncertainty in the data?

What kinds of mismatches are there between the original data models and the visualization data models?

Most requests for changes in DoubleTreeJS are about the data model, not the visualization

An aside: Challenge!

after the end of seventy years shall Tyre sing as an harlot.
For this cause I will confess to thee among the Gentiles, and sing unto thy name.
I caused the widow's heart to sing for joy.

Source: Young's Concordance to the Bible
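
As a plain-text baseline for the challenge, the three concordance entries above can be lined up on their shared word, keyword-in-context (KWIC) style. This is only a sketch under assumptions: the function name `kwic` and the `width` parameter are hypothetical, and the alignment is character-based rather than token-based.

```python
import re

# The three entries quoted on the slide.
LINES = [
    "after the end of seventy years shall Tyre sing as an harlot.",
    "For this cause I will confess to thee among the Gentiles, and sing unto thy name.",
    "I caused the widow's heart to sing for joy.",
]


def kwic(lines, keyword, width=30):
    """Return one row per match, with the keyword starting at column `width`."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    rows = []
    for line in lines:
        for match in pattern.finditer(line):
            left = line[max(0, match.start() - width) : match.start()]
            right = line[match.end() : match.end() + width]
            # Right-align the left context so every keyword shares a column.
            rows.append(f"{left:>{width}}{match.group(0)}{right}")
    return rows


rows = kwic(LINES, "sing")
for row in rows:
    print(row)
```

Because the left context is padded to a fixed width, the eye can scan down the keyword column and compare what comes before and after it.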

Task | Proposed by
Zoom, Abstract/Elaborate
Filter, Select, Selection
Relate, Connect, Comparing
Extract, Sampling
Explore, Discovering
Encode, Representing
Referring, linking

Sources: Shneiderman 1996, Keim et al. 2006, Yi et al. 2007, Unsworth 2000; Many Eyes

Ji Soo Yi (John Stasko), John Unsworth
Relate/Connect: example is recentering tree on new word
Reconfigure: example is sorting tree
Encode: example is show KWIC from tree
Extract: subpieces, query params, for later use
Emphasize annotation?
Comparison (what else from Unsworth?)

A question about tasks

( What other relevant tasks are there? )


References: Bamman et al. 2007, Passarotti 2013

Visualizations for tasks and users

Which visualization aspects are primarily and secondarily user preferences?

How are conflicts between tasks and user preferences handled?

Uses: ProD

Building visualizations

Reusability: generalizability

Uses: DoubleTreeJS

Language visualizations as components

Uses: DoubleTreeJS

Questions about reusability

What are the data properties that make a given visualization relevant for the data?

What are the fundamental properties and actions of visualizations that form the basis for reusable components?

References: Grinstein's WEAVE, Stasko's Jigsaw

How to get going

  1. Think through the data, both the raw and the calculated

  2. Think about what you want the visualization to help you do

  3. Don't start thinking about visualizations too early

    Or the details

  4. Get a wide view of the kinds of visualizations that are possible

    Sample visualizations listed at TAPoR; others by Mike Bostock in D3 and in Protovis

  5. Don't overlook the basics, e.g. charts

Evaluating visualizations

Does it do what you need?

Final words

I'm really excited about the prospects, because —

Visualizations put ideas into our heads!

Thank You