Comparing Disciplinary Patterns

Gender and Social Networks in the Humanities through the Lens of Scholarly Communication

Daniel Burckhardt, burckhardtd@geschichte.hu-berlin.de

“don’t start with a research question”

Lev Manovich, How to do Digital Humanities Right? Herrenhausen Conference, December 2013, lab.softwarestudies.com/2013/12/how-to-do-digital-humanities-right.html

“Data Driven History”

  • Peter Haber, December 2010, weblog.hist.net/archives/4895
  • Data in the Humanities is rarely “Big Data”
  • Mindset and Toolbox of a “Data Scientist”
  • Critical Thinking of a Historian

(Initial) Data Set

  • H-Soz-Kult: over 5,000 conference reports since 1998
  • “Memory of a Discipline”
  • Expectation: Find major Topics and Trends

Technology Driven Attempts

  • Carrot2 for Clustering
  • MALLET for Topic Modelling
  • Stanford NER for Named Entity Recognition

Carrot2 for Clustering

Good: Integrated into Solr

Bad: Confusing Clusters

MALLET for Topic Modelling

Bad: Initial topics looked arbitrary

Good: Through Exploration, Tuning seems feasible

Stanford NER

Good: German Classifier

Bad: Recall too low

Sometimes a low-tech approach is more practical

H-Soz-Kult: Regular Expression

While Krupp is an example of a first mover in an infant industry, PALOMA FERNÁNDEZ PÉREZ's (Barcelona) talk—which María Fernández Moya and Hui Li wrote with her but were unable to attend—concentrated on the challenges market newcomers face when entering into the highly globalized economy
http://www.hsozkult.de/conferencereport/id/tagungsberichte-3649

Extracting Person Entities

H-Soz-Kult: Regular Expression

At least two consecutive words in all uppercase, allowing for:
  • German ß (GROßMANN)
  • hyphens (MEIER-MÜLLER)
  • apostrophes (O'BRIAN)
  • dots (CHRISTOPH H. F. MEYER)

\b(\p{Lu}[\p{Lu}\x{00df}'-]+\s[\p{Lu}\x{00df}'\s.-]+[\p{Lu}\x{00df}])s?\b
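A minimal Python sketch of the same idea, for illustration only. It assumes the standard-library `re` module, which does not support `\p{Lu}`, so uppercase letters are approximated by the uppercase code points up to U+024F. The hyphen is placed last in each character class so that it is read as a literal rather than as a range. The names `UPPER`, `NAME_RE` and `extract_names` are hypothetical and not taken from the original project.

```python
import re

# Assumption: stdlib `re` has no \p{Lu}; approximate it with every uppercase
# code point from Basic Latin through Latin Extended-B (up to U+024F).
UPPER = "".join(chr(c) for c in range(0x41, 0x250) if chr(c).isupper())

# First word: one uppercase letter plus at least one more allowed character.
# Remaining words: uppercase letters, ß, apostrophe, whitespace, dot, hyphen.
# The match must end on an uppercase letter or ß; a trailing "s" is tolerated.
NAME_RE = re.compile(
    rf"\b([{UPPER}][{UPPER}ß'-]+\s[{UPPER}ß'\s.-]+[{UPPER}ß])s?\b"
)


def extract_names(text):
    """Return all candidate person names written in capitals."""
    return NAME_RE.findall(text)


if __name__ == "__main__":
    report = ("While Krupp is an example of a first mover, "
              "PALOMA FERNÁNDEZ PÉREZ's (Barcelona) talk concentrated on "
              "market newcomers.")
    print(extract_names(report))
```

The pattern also matches capitalised non-names such as LINDE AG, so its output is a candidate list that still needs filtering.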

Normalizing Person Entities

Name Slug

Without accents, all lowercase (Van vs. van), hyphen as word separator

  • Xosé Manoel Núñez
  • Xosé-Manoel Núñez
  • Xose-Manoel Nunez

xose-manoel-nunez
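The slug rule can be sketched in a few lines of Python. This is an illustrative sketch, not the original implementation: the function name `name_slug` is hypothetical, and keeping letters such as ß unchanged is an assumption, since the slide only specifies accents, case and the separator.

```python
import re
import unicodedata


def name_slug(name):
    """Strip accents, lower-case, and join the words with hyphens."""
    # NFKD splits "é" into "e" plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: any run of non-letter, non-digit characters is one separator.
    return re.sub(r"[\W_]+", "-", no_accents.lower()).strip("-")


if __name__ == "__main__":
    for variant in ("Xosé Manoel Núñez", "Xosé-Manoel Núñez", "Xose-Manoel Nunez"):
        print(name_slug(variant))  # xose-manoel-nunez each time
```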

Misspelled Names

Even without OCR, 1-2% of names are misspelled

  • Automated detection: Urlike (instead of Ulrike)
  • Manual review: alternate spellings such as Matthias / Mathias
  • Open question: spelling error or two different persons? Höppner, Annika / Höppner, Anika
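One way such an automated check could work is to flag rare first names that are a single edit away from a frequent one. The slides do not say which method was used, so the sketch below is an assumption throughout: it uses the optimal string alignment distance (edits plus adjacent transpositions), and the thresholds and name counts are hypothetical.

```python
def osa_distance(a, b):
    """Edit distance counting insert, delete, substitute and adjacent swap."""
    prev2 = []
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            best = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            # Adjacent transposition, e.g. "rl" <-> "lr".
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                best = min(best, prev2[j - 2] + 1)
            cur.append(best)
        prev2, prev = prev, cur
    return prev[len(b)]


def suspect_spellings(counts, max_rare=1, min_common=20):
    """Map each rare name to a frequent name that is one edit away."""
    common = [name for name, n in counts.items() if n >= min_common]
    suspects = {}
    for name, n in counts.items():
        if n > max_rare:
            continue
        for candidate in common:
            if osa_distance(name, candidate) == 1:
                suspects[name] = candidate
                break
    return suspects


if __name__ == "__main__":
    # Hypothetical frequencies, for illustration only.
    first_names = {"ulrike": 50, "urlike": 1, "anna": 40}
    print(suspect_spellings(first_names))  # {'urlike': 'ulrike'}
```

A distance of 1 also separates Annika from Anika and Matthias from Mathias, so a check like this can only propose candidates; deciding between a typo and a legitimate variant remains manual.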

Non-names

Automation seems possible but turns out to be hard

  • LINDE AG is a company / LINDE NG is a person
  • BRAGE BEI DER WIEDEN / BERGE AN DER VIA REGIA

Known Issues

  • Names changing over a lifetime: Jeannette Madarasz / Jeannette Madarasz-Lebenhagen
  • Cyrillic transcriptions into German / English: -ow / -ev
  • Variant usage of initials / multiple names:
    Patel, Kiran Klaus / Patel, Kiran
    Miller, Michael / Miller, Michael B.

No attempt to differentiate persons with the exact same name (< 0.5%):

  • ANDREAS SCHNEIDER (Berlin) vs. ANDREAS SCHNEIDER (Meiningen)
  • HARALD MÜLLER (medieval history) vs. HARALD MÜLLER (law)

Extracting Person Entities

H-ArtHist: AlchemyAPI

CONF: Arts, Humanities, and Complex Networks (Copenhagen, 4 Jun 13)

Extracted (23): Ahnert, Ruth; Allen, Jamie; Amster, Pablo; Arends, Max; Barabási, László; Berchum, Van; [..]

Wrong (Precision) (3): Berchum, Van (instead of Van Berchum, Marnix); Collectivizing, Dieter Merkl (instead of Merkl, Dieter); Pablo Rodríguez (instead of Pablo Rodríguez Zivic)

Missed (Recall) (2): Gresham-Lancaster, Scot; Horvat, Emoke-Agnes
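As a worked example, precision and recall for this single announcement follow directly from the three counts above. The sketch assumes that the 3 wrong extractions count only as false positives and that the 2 missed names are the only false negatives; the slides do not state how partially correct names were scored.

```python
# Counts taken from the example announcement above.
extracted = 23  # names returned by the extractor
wrong = 3       # returned but incorrect (false positives)
missed = 2      # present in the text but not returned (false negatives)

correct = extracted - wrong            # 20 true positives
precision = correct / extracted        # 20 / 23
recall = correct / (correct + missed)  # 20 / 22

print(f"precision = {precision:.3f}, recall = {recall:.3f}")
# precision = 0.870, recall = 0.909
```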

Initial Results

Source       Period                        Messages                          Monthly Avg.
H-Soz-Kult   January 2008 - Summer 2014    3,500 conference reports          43.6
H-ArtHist    January 2011 - Summer 2015    2,000 conference announcements    35.2

Over Time

H-Soz-Kult

H-ArtHist

Persons per Message

Source       Messages   Persons   Median
H-Soz-Kult   3,537      30,502    13
H-ArtHist    2,045      22,000    14

Messages per Person

Genderizing Speakers

Motivation: Use genderize.io to detect non-person entities like BMW AG (Company Name) and BERGE AN DER VIA REGIA (Section Title in all Caps)
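A rough sketch of how such a filter might look in Python. It assumes the public genderize.io endpoint with a `name` query parameter and a JSON reply containing `gender` and `probability`, where `gender` is null for unknown names. The helper names and the 0.6 threshold are hypothetical, and the response dictionaries in the usage example are hand-written stand-ins, not real API output.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://api.genderize.io"


def genderize_url(first_name):
    """Build the request URL for one first name."""
    return f"{API_URL}?{urlencode({'name': first_name})}"


def fetch_gender(first_name):
    """Query the service (network call, not executed in this sketch)."""
    with urlopen(genderize_url(first_name), timeout=10) as response:
        return json.load(response)


def looks_like_person(result, min_probability=0.6):
    """Treat an entity as a person only if its first token has a known gender."""
    gender = result.get("gender")
    return gender is not None and result.get("probability", 0.0) >= min_probability


if __name__ == "__main__":
    # Hand-written example replies, for illustration only.
    company = {"name": "bmw", "gender": None, "probability": 0.0, "count": 0}
    speaker = {"name": "ulrike", "gender": "female", "probability": 0.99, "count": 1000}
    print(looks_like_person(company), looks_like_person(speaker))  # False True
```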

Result: Gender of the Invisible College in German History

“Is top-level research male?”

Panel discussion, Leibniz-Fest, March 25, 2015, Bonn

The drop-off comes in the postdoc phase, when family planning is on the agenda but the academic system offers only uncertain prospects. Women academics who have to scrape by on fixed-term, poorly paid third-party-funded projects until the age of forty are often forced to choose between child and career. (FAZ, April 1, 2015)

Barbara Stollberg-Rilinger

Genderizing H-ArtHist

GND-izing H-ArtHist

  • Organization
  • Popularity as Filter

SNA: From Publication to Conference Data

Derek de Solla Price: Networks of Scientific Papers, in: Science 149(3683): 510-515, July 30, 1965

Citation Analysis in the Humanities

Arts and Humanities Citation Index

Citation Analysis in the Humanities

Google Scholar

Co-authorship Analysis

Derek de Solla Price and Donald Beaver: Collaboration in an Invisible College, in: Am. Psychologist 21: 1011–1018, 1966

The basic phenomenon seems to be that in each of the more actively pursued and highly competitive specialties in the sciences there seems to exist an "ingroup."

De Stefano et al. 2011

Field               Avg. authors per paper
Physics             7.1
Art & Humanities    1.2

Co-authorship Analysis is a Means to an End

Price and Beaver 1966

The body of people meet in select conferences (usually held in rather pleasant places), they commute between one center and another, and they circulate preprints and reprints to each other and they collaborate in research.

Goal of our SNA

Uncover the Invisible College(s) through Conference Reports and Announcements

The Networks

Bipartite Network(s)

The Networks

Networks between persons appearing at least twice

Source       Nodes    Edges
H-Soz-Kult   8,361    136,972
H-ArtHist    4,185    47,884
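The step from the bipartite message-person network to a person-person network can be sketched as follows. This is an illustration under assumptions: two persons are linked when they appear in the same message, edge weights count shared messages, and the function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def person_network(messages, min_appearances=2):
    """Project {message: persons} onto weighted person-person edges."""
    # Count in how many messages each person appears.
    appearances = Counter(p for persons in messages.values() for p in set(persons))
    keep = {p for p, n in appearances.items() if n >= min_appearances}
    edges = Counter()
    for persons in messages.values():
        # Every pair of retained persons in one message shares an edge.
        for a, b in combinations(sorted(set(persons) & keep), 2):
            edges[(a, b)] += 1
    return keep, edges


if __name__ == "__main__":
    # Hypothetical toy data: three reports and the name slugs found in them.
    reports = {
        "r1": ["anna", "ben", "cara"],
        "r2": ["anna", "ben"],
        "r3": ["ben", "dora"],
    }
    nodes, edges = person_network(reports)
    print(sorted(nodes), dict(edges))  # ['anna', 'ben'] {('anna', 'ben'): 2}
```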

Eigenvector Centrality

  • Wolfgang Behringer (Early Modern, 10 Reports)
  • Jens Jäger (Visual History, 21 Reports)
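Eigenvector centrality rates a person highly when they are connected to other highly rated persons. A minimal power-iteration sketch on an unweighted, undirected graph is given below; it is not the tool used for the slides, and the identity shift (iterating with A + I) is an implementation choice made here so that the iteration also settles on bipartite graphs.

```python
def eigenvector_centrality(adjacency, iterations=200):
    """Power iteration on A + I; returns a unit-length score per node."""
    score = {node: 1.0 for node in adjacency}
    for _ in range(iterations):
        # New score = own score plus the scores of all neighbours.
        new = {
            node: score[node] + sum(score[nb] for nb in neighbours)
            for node, neighbours in adjacency.items()
        }
        norm = sum(v * v for v in new.values()) ** 0.5
        score = {node: v / norm for node, v in new.items()}
    return score


if __name__ == "__main__":
    # Hypothetical toy graph: a triangle a-b-c with a pendant node d on c.
    graph = {
        "a": ["b", "c"],
        "b": ["a", "c"],
        "c": ["a", "b", "d"],
        "d": ["c"],
    }
    ranking = eigenvector_centrality(graph)
    print(max(ranking, key=ranking.get))  # c
```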

“Elite Network”

Conference “Buddies”

“Socially Similar” conferences: Arts, Humanities, and Complex Networks (Copenhagen, 4 Jun 13)
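The slides do not define "socially similar". One plausible measure, shown here purely as an assumption, is the Jaccard overlap between the sets of persons named in two conferences. Function names and sample data are hypothetical.

```python
def jaccard(a, b):
    """Share of persons common to both sets, relative to their union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(target, conferences, top=3):
    """Rank other conferences by participant overlap with the target."""
    scored = [
        (jaccard(conferences[target], persons), name)
        for name, persons in conferences.items()
        if name != target
    ]
    return [name for score, name in sorted(scored, reverse=True)[:top] if score > 0]


if __name__ == "__main__":
    # Hypothetical participant sets, identified by name slug.
    conferences = {
        "A": {"x", "y", "z"},
        "B": {"x", "y"},
        "C": {"z", "w"},
        "D": {"q"},
    }
    print(most_similar("A", conferences))  # ['B', 'C']
```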

Outlook

Long-Term Analysis

  • Measures for Individual Careers
  • New Topics need a new Generation?

Fusion between Content and Social Analysis

  • Correlation between “social” and document similarity?
  • Characteristic speakers for certain topics

The Question of Privacy

The wish not to be measured

Quantification tendencies, [...] bibliometrics: which I consider a “plague” (Barbara Stollberg-Rilinger)

Legal and Ethical Barriers

  • Public Data
  • Aggregation and Linking
  • Transition from Quantity to Quality