Obtaining Data

Module 02 GESIS Fall Seminar “Introduction to Computational Social Science”

Johannes B. Gruber

GESIS

John McLevey

University of Waterloo

Introduction

Schedule: GESIS Fall Seminar in Computational Social Science

Course Schedule
Day    Session
Day 1 Introduction to Computational Social Science
Day 2 Obtaining Data
Day 3 Computational Network Analysis
Day 4 Computational Text Analysis
Day 5 Large Language Models in the Social Sciences

The Plan for Today

  • Learn what Web Scraping is
  • Get an understanding of the web
  • Learn how to identify patterns you can use for scraping
  • Get an overview of relevant tools
  • Learn about legal and ethical concerns (and myths)

Louis Hansel via unsplash.com

Found vs Designed Data

Designed Data

  • collected for research
  • full control of shape and form
  • problems of validity due to social desirability bias and imperfect measurement

Found Data

  • traces of human behaviour
  • comes in all shapes and forms
  • problems of validity, as the data are often not representative and access is incomplete

What is Web Scraping & Should You Learn/Use It?

What is Web Scraping

  • Used when other means are unavailable
  • Scrape the (unstructured) Data
  • A web-scraper is a program (or robot) that:
    • goes to a web page
    • downloads its content
    • extracts data from the content
    • then saves the data to a file or a database
  • Unfortunately no one-size-fits-all solution
    • Lots of different techniques, tools, tricks
    • Websites change (some more frequently than others)
    • Some websites make it hard for you (by accident or on purpose!)

Image Source: daveberesford.co.uk


Web Scraping: A Three-Step Process

  1. Send an HTTP request to the webpage -> server responds to the request by returning (HTML) content
  2. Parse the HTML content -> extract the information you want from the nested structure of (HTML) code
  3. Wrangle the data into a useful format
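
A minimal sketch of these three steps with rvest (the URL and selector here are placeholders, not part of the examples used later):

library(rvest)

# 1. Request: ask the server for the page and collect the raw HTML
html <- read_html("https://example.com")

# 2. Parse: select the elements that hold the information and extract it
headline <- html |>
  html_elements("h1") |>
  html_text2()

# 3. Wrangle: bring the extracted pieces into a useful format
data.frame(headline = headline)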

Original Image Source: prowebscraper.com


Why Should You Learn Web Scraping?

  • The internet is a data gold mine!
  • Data were not created for research, but are often traces of what people are actually doing on the internet
  • Reproducible and renewable data collection (e.g., rehydrate data that is copyrighted)
  • Web scraping lets you automate data retrieval (as opposed to tedious copy & paste on some website)
  • It’s one of the most fun ways to learn R and programming!
    • It’s engaging and satisfying to find repeating patterns that you can employ to structure data (every website becomes a little puzzle)
    • It touches on many important computational skills
    • The return is good data to further your career (unlike sudokus or video games)

What are HTML and CSS

What is HTML

  • HTML (HyperText Markup Language) is the standard markup language for documents designed to be displayed in a web browser
  • Contains the raw data (text, URLs to pictures and videos) plus defines the layout and some of the styling of text

Image Source: Wikipedia.org

Example: Simple

Code:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
</head>
<body>
    <p>This is the body of the text.</p>
</body>
</html>

Browser View:

Example: With headline and author

Code:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
</head>
<body>
    <h1>My Headline</h1>
    <p class="author"><a href="https://www.johannesbgruber.eu/">Me</a></p>
    <p>This is the body of the text.</p>
</body>
</html>

Browser View:

Example: With some data

Code:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
</head>
<body>
    <h1>My Headline</h1>
    <p class="author">Me</p>
    <p>This is the body of the text.</p>
    <p>Consider this data:</p>
    <table>
        <tr>
            <th>Name</th>
            <th>Age</th>
        </tr>
        <tr>
            <td>John</td>
            <td>25</td>
        </tr>
        <tr>
            <td>Mary</td>
            <td>26</td>
        </tr>
    </table>
</body>
</html>

Browser View:

Example: With an image

Code:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
</head>
<body>
    <h1>My Headline</h1>
    <p class="author">Me</p>
    <p>This is the body of the text.</p>
    <p>Consider this image:</p>
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0c/About_The_Dog.jpg/640px-About_The_Dog.jpg" alt="About The Dog.">
</body>
</html>

Browser View:

What is CSS

  • CSS (Cascading Style Sheets) is very often used in addition to HTML to control the presentation of a document
  • Designed to enable the separation of content from presentation, such as layout, colours, and fonts.
  • The reason it is interesting for web scraping is that the same kind of information often gets the same styling

Example: CSS

HTML:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
    <link rel="stylesheet" type="text/css" href="example.css">
</head>
<body>
  <h1 class="headline">My Headline</h1>
  <p class="author">Me</p>
  <div class="content">
    <p>This is the body of the text.</p>
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0c/About_The_Dog.jpg/640px-About_The_Dog.jpg" alt="About The Dog.">
    <p>Consider this data:</p>
    <table>
      <tr class="top-row">
          <th>Name</th>
          <th>Age</th>
      </tr>
      <tr>
          <td>John</td>
          <td>25</td>
      </tr>
      <tr>
          <td>Mary</td>
          <td>26</td>
      </tr>
    </table>
  </div>
</body>
</html>

CSS:

/* CSS file */

.headline {
  color: red;
}

.author {
  color: grey;
  font-style: italic;
  font-weight: bold;
}

.top-row {
  background-color: lightgrey;
}

.content img {
  border: 2px solid black;
}

table, th, td {
  border: 1px solid black;
}

Browser View:

HTML and CSS in Web Scraping

Using HTML tags:

You can select HTML elements by their tags

library(rvest)
read_html("data/example.html") |>  # retrieve content
  html_elements("p") |>                    # select content via css selector
  html_text2()                             # extract data you want
[1] "Me"                            "This is the body of the text."
[3] "Consider this image:"          "Consider this data:"          
  • to select them, tags are written without the <>
  • in theory, arbitrary tags are possible, but commonly people use <p> (paragraph), <br> (line break), <h1>, <h2>, <h3>, … (first, second, third, … level headline), <b> (bold), <i> (italic), <img> (image), <a> (hyperlink), and a couple more.

Using attributes

You can select elements by an attribute, including the class:

read_html("data/example.html") |> 
  html_element("[class=\"headline\"]") |> 
  html_text()
[1] "My Headline"

For class, there is also a shorthand:

read_html("data/example.html") |> 
  html_element(".headline") |> 
  html_text()
[1] "My Headline"

Another important shorthand is #, which selects the id attribute:

read_html("data/example.html") |> 
  html_element("#table-1") |> 
  html_table()                     # html_table tries to re-assemble tables 
# A tibble: 2 × 2
  Name    Age
  <chr> <int>
1 John     25
2 Mary     26

Extracting attributes

Instead of selecting by attribute, you can also extract one or all attributes:

read_html("data/example.html") |> 
  html_elements("a") |> 
  html_attr("href")
[1] "https://www.johannesbgruber.eu/"   "https://en.wikipedia.org/wiki/Dog"
read_html("data/example.html") |> 
  html_elements("a") |> 
  html_attrs()
[[1]]
                             href 
"https://www.johannesbgruber.eu/" 

[[2]]
                               href 
"https://en.wikipedia.org/wiki/Dog" 

Chaining selectors

If there is more than one element that fits your selector, but you only want one of them, see if you can make your selection more specific by chaining selectors with > (for immediate children of an element) or an empty space (for any of its descendants):

read_html("data/example.html") |> 
  html_elements(".author>a") |> 
  html_attr("href")
[1] "https://www.johannesbgruber.eu/"
read_html("data/example.html") |> 
  html_elements(".author a") |> 
  html_attr("href")
[1] "https://www.johannesbgruber.eu/"

Tip: there is also no rule against doing this instead:

read_html("data/example.html") |> 
  html_elements(".author") |> 
  html_elements("a") |> 
  html_attr("href")
[1] "https://www.johannesbgruber.eu/"

Common Selectors

There are quite a lot of CSS selectors, but often you can stick to just a few (a short demonstration follows the table):

selector       example          selects
element/tag    table            all <table> elements
class          .someTable       all elements with class="someTable"
id             #table-1         the unique element with id="table-1"
element.class  tr.headerRow     all <tr> elements with the headerRow class
class1.class2  .someTable.blue  all elements with both the someTable AND blue classes
class1 > tag   .table-1 > tr    all <tr> elements whose direct parent has the table-1 class
class1 + tag   .top-row + tr    the first <tr> element directly following a .top-row element
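
Applied to the example file from above, a few of these selectors look like this (a sketch; the exact output depends on the contents of data/example.html):

library(rvest)

html <- read_html("data/example.html")

html |> html_elements("table")            # select by tag
html |> html_elements(".top-row")         # select by class
html |> html_elements(".top-row + tr") |> # first data row after the header row
  html_text2()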

Family Relations

Each HTML tag can contain other tags. To keep track of these relations, we speak of ancestors, descendants, parents, children, and siblings.

<book>
  <chapter>
    <section>
      <subsection>
        This is a subsection.
      </subsection>
      <subsection>
        This is another subsection.
      </subsection>
    </section>
    <section>
      This is a section.
    </section>
  </chapter>
  <chapter>
    <section>
      This is a section.
    </section>
    <section>
      This is a section.
    </section>
  </chapter>
  <chapter>
    This is a chapter without sections.
  </chapter>
</book>
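
You can navigate these relations programmatically with xml2, the package rvest builds on. A small sketch using a trimmed-down version of the structure above:

library(xml2)

book <- read_xml(
  "<book>
     <chapter>
       <section>First section.</section>
       <section>Second section.</section>
     </chapter>
   </book>"
)

first_section <- xml_find_first(book, ".//section")
xml_parent(first_section)   # the enclosing <chapter>
xml_siblings(first_section) # the second <section>
xml_children(book)          # all <chapter> elements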

Scraping Static Web Pages

Example: World Happiness Report

Use your Browser to Scout

Use your Browser’s Inspect tool

Note: the Inspect tool might not be available in all browsers; use a Chromium-based browser or Firefox, or enable the Develop menu in Safari.

Use rvest to scrape

library(rvest)
library(tidyverse)

# 1. Request & collect raw html
html <- read_html("https://en.wikipedia.org/w/index.php?title=World_Happiness_Report&oldid=1165407285")

# 2. Parse
happy_table <- html |> 
  html_elements(".wikitable") |> # select the right element
  html_table() |>                # special function for tables
  pluck(3)                       # select the third table

# 3. No wrangling necessary
happy_table
# A tibble: 153 × 9
   `Overall rank` `Country or region` Score `GDP per capita` `Social support`
            <int> <chr>               <dbl>            <dbl>            <dbl>
 1              1 Finland              7.81             1.28             1.5 
 2              2 Denmark              7.65             1.33             1.50
 3              3 Switzerland          7.56             1.39             1.47
 4              4 Iceland              7.50             1.33             1.55
 5              5 Norway               7.49             1.42             1.50
 6              6 Netherlands          7.45             1.34             1.46
 7              7 Sweden               7.35             1.32             1.43
 8              8 New Zealand          7.3              1.24             1.49
 9              9 Austria              7.29             1.32             1.44
10             10 Luxembourg           7.24             1.54             1.39
# ℹ 143 more rows
# ℹ 4 more variables: `Healthy life expectancy` <dbl>,
#   `Freedom to make life choices` <dbl>, Generosity <dbl>,
#   `Perceptions of corruption` <dbl>
## Plot relationship wealth and life expectancy
ggplot(happy_table, aes(x = `GDP per capita`, y = `Healthy life expectancy`)) + 
  geom_point() + 
  geom_smooth(method = 'lm')

Example: UK prime ministers on Wikipedia

Use your Browser to Scout

Use rvest to scrape

# 1. Request & collect raw html
html <- read_html("https://en.wikipedia.org/w/index.php?title=List_of_prime_ministers_of_the_United_Kingdom&oldid=1166167337") # using an older version of the page, since the layout has since been changed

# 2. Parse
pm_table <- html |> 
  html_element(".wikitable:contains('List of prime ministers')") |>
  html_table() |> 
  as_tibble(.name_repair = "unique") |> 
  filter(!duplicated(`Prime ministerOffice(Lifespan)`))

# 3. No wrangling necessary
pm_table
# A tibble: 75 × 11
   Portrait...1 Portrait...2 Prime ministerOffice(Lifespa…¹ `Term of office...4`
   <chr>        <chr>        <chr>                          <chr>               
 1 "Portrait"   "Portrait"   Prime ministerOffice(Lifespan) start               
 2 "​"           ""           Robert Walpole[27]MP for King… 3 April1721         
 3 "​"           ""           Spencer Compton[28]1st Earl o… 16 February1742     
 4 "​"           ""           Henry Pelham[29]MP for Sussex… 27 August1743       
 5 "​"           ""           Thomas Pelham-Holles[30]1st D… 16 March1754        
 6 "​"           ""           William Cavendish[31]4th Duke… 16 November1756     
 7 "​"           ""           Thomas Pelham-Holles[32]1st D… 29 June1757         
 8 ""           ""           John Stuart[33]3rd Earl of Bu… 26 May1762          
 9 ""           ""           George Grenville[34]MP for Bu… 16 April1763        
10 ""           ""           Charles Watson-Wentworth[35]2… 13 July1765         
# ℹ 65 more rows
# ℹ abbreviated name: ¹​`Prime ministerOffice(Lifespan)`
# ℹ 7 more variables: `Term of office...5` <chr>, `Term of office...6` <chr>,
#   `Mandate[a]` <chr>, `Ministerial offices held as prime minister` <chr>,
#   Party <chr>, Government <chr>, MonarchReign <chr>
Where do the links come from? Inspecting one cell of the name column shows that each prime minister's name is a link wrapped in a <b> tag:

<td rowspan="4">
  <span class="anchor" id="18th_century"></span>
   <b>
     <a href="/wiki/Robert_Walpole" title="Robert Walpole">Robert Walpole</a>
   </b>
   <sup id="cite_ref-FOOTNOTEEccleshallWalker20021,_5EnglefieldSeatonWhite19951–5PrydeGreenwayPorterRoy199645–46_28-0" class="reference">
     <a href="#cite_note-FOOTNOTEEccleshallWalker20021,_5EnglefieldSeatonWhite19951–5PrydeGreenwayPorterRoy199645–46-28">[27]</a>
   </sup>
   <br>
   <span style="font-size:85%;">MP for <a href="/wiki/King%27s_Lynn_(UK_Parliament_constituency)" title="King's Lynn (UK Parliament constituency)">King's Lynn</a>
   <br>(1676–1745)
  </span>
</td>
This pattern can be used to select the links and names:

links <- html |> 
  html_elements(".wikitable:contains('List of prime ministers') b a") |>
  html_attr("href")
title <- html |> 
  html_elements(".wikitable:contains('List of prime ministers') b a") |>
  html_text()
tibble(name = title, link = links)
# A tibble: 90 × 2
   name                 link                                             
   <chr>                <chr>                                            
 1 Robert Walpole       /wiki/Robert_Walpole                             
 2 George I             /wiki/George_I_of_Great_Britain                  
 3 George II            /wiki/George_II_of_Great_Britain                 
 4 Spencer Compton      /wiki/Spencer_Compton,_1st_Earl_of_Wilmington    
 5 Henry Pelham         /wiki/Henry_Pelham                               
 6 Thomas Pelham-Holles /wiki/Thomas_Pelham-Holles,_1st_Duke_of_Newcastle
 7 William Cavendish    /wiki/William_Cavendish,_4th_Duke_of_Devonshire  
 8 Thomas Pelham-Holles /wiki/Thomas_Pelham-Holles,_1st_Duke_of_Newcastle
 9 George III           /wiki/George_III                                 
10 John Stuart          /wiki/John_Stuart,_3rd_Earl_of_Bute              
# ℹ 80 more rows

Note: these are relative links that need to be combined with https://en.wikipedia.org/ to work
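
A quick sketch of how to do that with url_absolute() from xml2:

tibble(
  name = title,
  link = xml2::url_absolute(links, "https://en.wikipedia.org/")
)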

Other Techniques

APIs

  • An Application Programming Interface (API) is a way for two computer programs to speak to each other
  • Commonly used to distribute data or do many other things (e.g., the defunct Twitter and Facebook APIs, NYT and Guardian APIs, MediaCloud API)
  • Good way to access APIs: httr2

API example: Guardian API

If you want to follow along:

library(httr2)
library(tidyverse, warn.conflicts = FALSE)
req <- request("https://content.guardianapis.com") |>  # start the request with the base URL
  req_url_path("search") |>                            # navigate to the endpoint you want to access
  req_method("GET") |>                                 # specify the method
  req_timeout(seconds = 60) |>                         # how long to wait for a response
  req_headers("User-Agent" = "httr2 guardian test") |> # specify request headers
  # req_body_json() |>                                 # since this is a GET request the body stays empty
  req_url_query(                                       # instead the query is added to the URL
    q = "parliament AND debate",
    "show-blocks" = "all"
  ) |>
  req_url_query(                                       # in this case, the API key is also added to the query
    "api-key" = "d187828f-9c6a-4c29-afd4-dbd43e116965"             # but httr2 also has req_auth_* functions for other
  )                                                    # authentication procedures
print(req)
<httr2_request>
GET https://content.guardianapis.com/search?q=parliament%20AND%20debate&show-blocks=all&api-key=d187828f-9c6a-4c29-afd4-dbd43e116965
Headers:
* User-Agent: "httr2 guardian test"
Body: empty
Options:
* timeout_ms    : 60000
* connecttimeout: 0

Nothing is done until you perform the request:

resp <- req |> 
  req_perform()

Then you need to parse the response:

parse_response <- function(resp) {
  # make sure response is valid
  if (resp_content_type(resp) != "application/json") {
    stop("Request was not successful!")
  }
  
  # extract articles
  results <- resp_body_json(resp) |> 
    pluck("response", "results")
  
  # parse into data.frame
  map(results, function(res) {
    tibble(
      id = res$id,
      type = res$type,
      time = lubridate::ymd_hms(res$webPublicationDate),
      headline = res$webTitle,
      text = rvest::read_html(pluck(res, "blocks", "body", 1, "bodyHtml")) |> rvest::html_text2()
    )
  }) |> 
    bind_rows()
  
}
parse_response(resp)
# A tibble: 10 × 5
   id                                   type  time                headline text 
   <chr>                                <chr> <dttm>              <chr>    <chr>
 1 australia-news/2025/aug/08/dissent-… arti… 2025-08-07 22:08:45 Dissent… "Sim…
 2 world/2024/dec/10/queensland-parlia… arti… 2024-12-10 03:48:32 Queensl… "The…
 3 australia-news/2025/aug/25/coalitio… arti… 2025-08-24 15:00:36 Coaliti… "The…
 4 politics/2025/jul/03/welfare-reform… arti… 2025-07-03 17:05:16 Welfare… "The…
 5 society/2025/may/13/assisted-dying-… arti… 2025-05-13 15:55:00 Give te… "Sco…
 6 world/2025/jun/25/irans-parliament-… arti… 2025-06-25 17:27:30 Iran’s … "Ira…
 7 world/2025/aug/05/germany-retiremen… arti… 2025-08-05 04:00:41 Pension… "The…
 8 world/2025/jan/31/german-parliament… arti… 2025-01-31 17:19:30 German … "The…
 9 politics/2025/sep/01/forget-orwells… arti… 2025-09-01 17:37:13 Forget … "Wha…
10 uk-news/2025/apr/12/uk-mps-tweet-to… arti… 2025-04-12 14:44:41 UK MPs … "For…

Special Requests

  • Some websites limit requests
  • When you run read_html from rvest, it uses a default request that fits most of the time, but not always:
html <- read_html("https://www.icahdq.org/mpage/ICA23-Program")
Error in open.connection(x, "rb"): cannot open the connection
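
To find out which status code the server returned, you can repeat the request with httr2 and turn off error throwing (a sketch; 403 is what this particular site returned at the time):

library(httr2)
resp <- request("https://www.icahdq.org/mpage/ICA23-Program") |>
  req_error(is_error = \(resp) FALSE) |> # do not throw on 4xx/5xx so we can inspect
  req_perform()
resp_status(resp) # e.g., 403 (Forbidden)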

To interpret HTTP errors, you can use this handy function:

error_cat <- function(error) {
  # fetch the cat meme matching an HTTP status code from https://http.cat
  link <- paste0("https://http.cat/images/", error, ".jpg")
  knitr::include_graphics(link)
}
error_cat(403)

So what to do next?

  • Scout the Network tab in your browser’s developer tools
  • Copy the request as cURL and translate it to R with httr2::curl_translate()
  • Build and perform the request in R

Translate the cURL call

curl_translate("curl 'https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/?event_id=JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'Pragma: no-cache' \
  -H 'Referer: https://whova.com/embedded/event/JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D/' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
  -H 'sec-ch-ua: \"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: \"Linux\"' \
  --compressed")
request("https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/") |>
  req_url_query(
    event_id = "JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4=",
  ) |>
  req_headers(
    Accept = "application/json, text/plain, */*",
    `Accept-Language` = "en-GB,en-US;q=0.9,en;q=0.8",
    `Cache-Control` = "no-cache",
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
  ) |>
  req_perform()

Make request in R

ica_programme_data <- request("https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/?event_id=JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D") |> 
  req_headers(
    Accept = "application/json, text/plain, */*",
    `Accept-Language` = "en-GB,en-US;q=0.9,en;q=0.8",
    `Cache-Control` = "no-cache",
    Connection = "keep-alive",
    Pragma = "no-cache",
    Referer = "https://whova.com/embedded/event/JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D/",
    `Sec-Fetch-Dest` = "empty",
    `Sec-Fetch-Mode` = "cors",
    `Sec-Fetch-Site` = "same-origin",
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    `sec-ch-ua` = "\"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"",
    `sec-ch-ua-mobile` = "?0",
    `sec-ch-ua-platform` = "\"Linux\"",
  ) |> 
  req_perform() |> 
  resp_body_json()
object.size(ica_programme_data) |> 
  format("MB")
[1] "6.7 Mb"

It worked!
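
To get a first impression of the nested list that resp_body_json() returned, base R’s str() is handy (a sketch; the structure of the Whova response is not documented here):

str(ica_programme_data, max.level = 2, list.len = 5) # peek at the top levels only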

Special Requests: Behind Paywall

Let’s get this cool data journalism article.

html <- read_html("https://www.zeit.de/mobilitaet/2024-04/deutschlandticket-klimaschutz-oeffentliche-verkehrsmittel-autos-verkehrswende")
html |> 
  html_elements(".article-body p") |> 
  html_text2()
[1] "Ganz Deutschland fährt Bahn. So fühlte sich das im Sommer 2022 zumindest an, als das 9-Euro-Ticket für drei Monate für überfüllte Züge sorgte. Die Bundesregierung und viele Menschen zeigten sich begeistert: So leicht war es also, Bürgerinnen und Bürger für die umweltfreundlichen öffentlichen Verkehrsmittel zu begeistern, man muss nur ein günstiges Ticket für ganz Deutschland anbieten."
[2] "Aber als die Bundesregierung den Nachfolger vorstellte, waren viele enttäuscht. 49 Euro monatlich kostet das Deutschlandticket und ist nur im Abo erhältlich. Euphorisch war nur noch die Bundesregierung. Doch jetzt, ein Jahr nach dem Start, kann man sagen: zu Recht. Zumindest, was die Fahrgastzahlen angeht."                                                                                

🤔 Wait, that’s only the first two paragraphs!

Special Requests: Behind Paywall Cookies!

library(cookiemonster)
add_cookies("cookies.txt")
html <- request("https://www.zeit.de/mobilitaet/2024-04/deutschlandticket-klimaschutz-oeffentliche-verkehrsmittel-autos-verkehrswende") |> # start a request
  req_options(cookie = get_cookies("zeit.de", as = "string")) |> # add cookies to be sent with it
  req_perform() |> 
  resp_body_html() # extract html from response

html |> 
  html_elements(".article-body p") |> 
  html_text2()

Interactive Website

static <- read_html("https://www.google.de/maps/dir/Armadale+St,+Glasgow,+UK/Lilybank+House,+Glasgow,+UK/@55.8626667,-4.2712892,14z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x48884155c8eadf03:0x8f0f8905398fcf2!2m2!1d-4.2163615!2d55.8616765!1m5!1m1!1s0x488845cddf3cffdb:0x7648f9416130bcd5!2m2!1d-4.2904601!2d55.8740368!3e0?entry=ttu")
static |> 
  html_elements(".Fk3sm") |> 
  html_text2()
character(0)

google maps commute


Interactive Website & Browser Automation

  • The new read_html_live from rvest solves this by running a real (headless) browser in the background:
# loads a real web browser
sess <- read_html_live("https://www.google.de/maps/dir/Armadale+St,+Glasgow,+UK/Lilybank+House,+Glasgow,+UK/@55.8626667,-4.2712892,14z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x48884155c8eadf03:0x8f0f8905398fcf2!2m2!1d-4.2163615!2d55.8616765!1m5!1m1!1s0x488845cddf3cffdb:0x7648f9416130bcd5!2m2!1d-4.2904601!2d55.8740368!3e0?entry=ttu")

# you can even take a look at what is happening with
# sess$view()
# cookies <- sess$session$Network$getCookies()
# saveRDS(cookies, "data/chromote_cookies.rds")
cookies <- readRDS("data/chromote_cookies.rds")
sess$session$Network$setCookies(cookies = cookies$cookies)

# the session behaves like a normal rvest html object
sess |> 
  html_elements(".Fk3sm") |> 
  html_text2() |> 
  str_extract(".+?min")

Some of my other packages that can make your life easier

paperboy: get data from news media sites

paperboy::pb_deliver("https://www.zeit.de/mobilitaet/2024-04/deutschlandticket-klimaschutz-oeffentliche-verkehrsmittel-autos-verkehrswende",
                     use_cookies = TRUE)
# A tibble: 1 × 9
  url       expanded_url domain status datetime author headline text 
  <chr>     <chr>        <chr>   <int> <dttm>   <chr>  <chr>    <chr>
1 https://… https://log… zeit.…    200 NA       NA     <NA>     ""   
# ℹ 1 more variable: misc <list>

traktok: easy access to TikTok data

library(traktok)
df <- tt_search_hidden("#rstats", max_pages = 2)
df
tt_videos_hidden(df$video_url[1])

Should You Use Web Scraping?

Are You Allowed to Use Web Scraping?

Web scraping is not a shady or illegal activity per se, but not all web scraping is unproblematic, and the data does not become yours.

  • Collecting personal data of people in the EU might violate GDPR (General Data Protection Regulation)
    • The GDPR defines personal data as “any information relating to an identified or identifiable natural person.” (Art. 4 GDPR)
    • Exceptions
      • if you get consent from the people whose data it is
      • personal data processing is legitimate when “necessary for the performance of a task carried out in the public interest” (Art. 6 GDPR)
  • Collecting copyrighted data
    • Complicated legal situation
    • Public facing content is probably okay (9th circuit ruling)
    • “there have been no lawsuits in […] major western democratic countries stemming from a researcher scraping publicly accessible data from a website for personal or academic use.” (Luscombe, Dick, and Walby 2022)
    • You will probably get in trouble if you distribute the material
  • Honouring Terms of Service and robots.txt
    • Many companies have ToS that might prohibit you from scraping (these are not laws; they might not be binding, and whether they can be enforced is a separate question)
    • /robots.txt is often where guidelines are communicated to automated crawlers

ToS and Robots.txt

Twitter ToS

User-agent: *                         # the rules apply to all user agents
Disallow: /EPiServer/CMS/             # do not crawl any URLs that start with /EPiServer/CMS/
Disallow: /Util/                      # do not crawl any URLs that start with /Util/ 
Disallow: /about/art-in-parliament/   # do not crawl any URLs that start with /about/art-in-parliament/

https://www.parliament.uk/robots.txt
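
You can also check robots.txt rules programmatically, for example with the robotstxt package (a sketch; the package is not otherwise used in this module):

library(robotstxt)
paths_allowed("https://www.parliament.uk/Util/") # FALSE: disallowed by the rules above
paths_allowed("https://www.parliament.uk/")      # TRUE, assuming no other rule applies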

Ethical

  • Are there other means available to get to the data (e.g., via an API)?
  • robots.txt might not be legally binding, but it is not nice to ignore it
  • Scraping can put a heavy load on a website (if you make 1000s of requests), which costs the hosts money and might even bring down a site (like a DDoS attack); see the throttling sketch after this list
  • Think twice before scraping personal data. You should ask yourself:
    • is it necessary for your research?
    • are you harming anyone by obtaining (or distributing) the data?
    • do you really need everything, or are parts of the data sufficient (e.g., can you preselect cases or ignore variables)?
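
For the load issue mentioned above, httr2’s req_throttle() caps how many requests you send per second (a minimal sketch with a placeholder URL; depending on your httr2 version, rate may be superseded by newer arguments):

library(httr2)

# at most one request every two seconds, to go easy on the server
req <- request("https://example.com") |>
  req_throttle(rate = 1 / 2)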

Advice?

Legal and ethical advice is rare and complicated to give. A good opinion piece about it is Freelon (2018). It is worth reading, but it can be summarised in three general pieces of advice:

  • use authorized methods whenever possible
  • do not confuse terms of service compliance with data protection
  • understand the risks of violating terms of service

Wrap Up

Save some information about the session for reproducibility.

Show Session Info
sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.12.0 
LAPACK: /usr/lib/liblapack.so.3.12.0  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] httr2_1.2.1     lubridate_1.9.4 forcats_1.0.0   stringr_1.5.1  
 [5] dplyr_1.1.4     purrr_1.1.0     readr_2.1.5     tidyr_1.3.1    
 [9] tibble_3.3.0    ggplot2_3.5.2   tidyverse_2.0.0 rvest_1.0.4    

loaded via a namespace (and not attached):
 [1] paperboy_0.0.7.9000 gtable_0.3.6        xfun_0.52          
 [4] websocket_1.4.4     processx_3.8.6      lattice_0.22-7     
 [7] callr_3.7.6         tzdb_0.5.0          vctrs_0.6.5        
[10] tools_4.5.1         ps_1.9.1            generics_0.1.4     
[13] curl_6.4.0          adaR_0.3.4          pkgconfig_2.0.3    
[16] Matrix_1.7-3        RColorBrewer_1.1-3  lifecycle_1.0.4    
[19] compiler_4.5.1      farver_2.1.2        chromote_0.5.1     
[22] codetools_0.2-20    htmltools_0.5.8.1   yaml_2.3.10        
[25] later_1.4.2         pillar_1.11.0       openssl_2.3.3      
[28] nlme_3.1-168        tidyselect_1.2.1    digest_0.6.37      
[31] stringi_1.8.7       labeling_0.4.3      splines_4.5.1      
[34] fastmap_1.2.0       grid_4.5.1          cli_3.6.5          
[37] magrittr_2.0.3      triebeard_0.4.1     utf8_1.2.6         
[40] withr_3.0.2         scales_1.4.0        promises_1.3.3     
[43] rappdirs_0.3.3      timechange_0.3.0    rmarkdown_2.29     
[46] httr_1.4.7          cookiemonster_0.0.3 askpass_1.2.1      
[49] hms_1.1.3           evaluate_1.0.4      knitr_1.50         
[52] mgcv_1.9-3          rlang_1.1.6         Rcpp_1.1.0         
[55] docopt_0.7.2        glue_1.8.0          selectr_0.4-2      
[58] xml2_1.3.8          jsonlite_2.0.0      R6_2.6.1           

References

Freelon, Deen. 2018. “Computational Research in the Post-API Age.” Political Communication 35 (4): 665–68. https://doi.org/10.1080/10584609.2018.1477506.
Luscombe, Alex, Kevin Dick, and Kevin Walby. 2022. “Algorithmic Thinking in the Public Interest: Navigating Technical, Legal, and Ethical Hurdles to Web Scraping in the Social Sciences.” Quality & Quantity 56 (3): 1023–44. https://doi.org/10.1007/s11135-021-01164-0.