A Comprehensive Collection of News Media Scrapers • paperboy

paperboy aims to be a comprehensive collection of webscraping scripts for news media sites. Many data scientists and researchers write their own code when they need to retrieve news media content from websites. Once a research project ends, this code often collects digital dust on researchers' hard drives instead of being made public for others to use. paperboy offers writers of webscraping scripts a clear path to publish their code and earn co-authorship on the package (see the For developers section). For users, the promise is simple: paperboy delivers news media data from many websites in a consistent format. Check which domains are already supported in the table below or with the command pb_available().

Installation

paperboy is not on CRAN yet. Install it via remotes (first install remotes with install.packages("remotes")):

remotes::install_github("JBGruber/paperboy")

For Users

Say you have a link to a news media article, for example, from mediacloud.org. Simply supply one or more article links to the main function, pb_deliver:

library(paperboy)
df <- pb_deliver("https://tinyurl.com/386e98k5")
df

url	expanded_url	domain	status	datetime	author	headline	text	misc
https://tinyurl.com/386e98k5	https://www.theguardian.com/tv-and-radio/2021/jul/12/should-marge-divorce-homer	theguardian.com	200	2021-07-12 12:00:13	https://www.theguardian.com/profile/stuart-heritage	’A woman trapped in an…	In the Guide’s weekly Solved!…	news , https://i.guim.co.uk/img/media/aa01cd463d4217fff7e6d7c00cd744fa3665b520/226_128_3198_1919/master/3198.jpg?width=465&dpr=1&s=none&crop=none ,

The returned data.frame contains important meta information about the news items along with their full text. Notice that the function had no problem reading the link, even though it was shortened. paperboy is an unfinished and highly experimental package at the moment. You will therefore often encounter this warning:

pb_deliver("google.com")
#> ! No parser for domain google.com yet, attempting generic approach.

url	expanded_url	domain	status	datetime	author	headline	text	misc
google.com	http://www.google.com/	google.com	200	NA	NA	Google	© 2024 - Datenschutzerklärung - Nutzungsbedingungen	NULL

The function still returns a data.frame, but important information is missing, in this case because it isn't there. When you pass several URLs, the ones with a dedicated parser are still processed normally. If you have a dead link in your url vector, its status will differ from 200 and its row will contain NAs.
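To act on the status column, you can split the result into successful and failed rows. The following is a minimal sketch in base R. The data.frame is a hand-built stand-in for a pb_deliver() result, reduced to a few columns, with invented values; only the column names are taken from the output shown above.

```r
# Mock of a pb_deliver() result (assumption: invented values, fewer columns).
# In practice this would come from: df <- pb_deliver(urls)
df <- data.frame(
  url = c("https://example.com/a", "https://example.com/dead"),
  status = c(200L, 404L),
  headline = c("An example headline", NA_character_),
  text = c("Example body text.", NA_character_),
  stringsAsFactors = FALSE
)

# Keep only rows that were retrieved successfully
ok <- df[!is.na(df$status) & df$status == 200L, ]

# Collect the URLs that failed so they can be inspected or retried
failed <- df$url[is.na(df$status) | df$status != 200L]
```

Checking for NA as well as for values other than 200 keeps rows with a missing status out of the successful set.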

If you are unhappy with results from the generic approach, you can still use the second function from the package to download raw html code and later parse it yourself:

pb_collect("google.com")

url	expanded_url	domain	status	content_raw
google.com	http://www.google.com/	google.com	200	<!doctype html><html itemscope…

pb_collect uses concurrent requests to download many pages at the same time, which makes it very quick at collecting large amounts of data. You can then experiment with rvest or another package to extract the information you want from the content_raw column of the result.
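As a rough illustration of that parsing step, the sketch below runs rvest on a small HTML string that stands in for one element of content_raw. The markup, the CSS selectors (h1, .byline, time, article p) and all values are assumptions made up for this example; real outlets will need their own selectors.

```r
library(rvest)

# Stand-in for a single value of content_raw (assumption: invented markup)
html_raw <- paste0(
  "<html><head><title>Example headline</title></head><body>",
  "<article><h1>Example headline</h1>",
  "<span class=\"byline\">Jane Doe</span>",
  "<time datetime=\"2021-07-12T12:00:13\"></time>",
  "<p>First paragraph.</p><p>Second paragraph.</p>",
  "</article></body></html>"
)

page <- read_html(html_raw)

# Pull out the pieces a scraper is expected to deliver
headline <- html_text2(html_element(page, "h1"))
author <- html_text2(html_element(page, ".byline"))
datetime <- as.POSIXct(
  html_attr(html_element(page, "time"), "datetime"),
  format = "%Y-%m-%dT%H:%M:%S",
  tz = "UTC"
)
# Join all article paragraphs into one text, one paragraph per line
text <- paste(html_text2(html_elements(page, "article p")), collapse = "\n")
```

The same calls can be applied row by row to the content_raw column once the selectors for a given outlet are known.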

For developers

If there is no scraper for a news site and you want to contribute one to this project, you can become a co-author of this package by adding it via a pull request. First check the available scrapers as well as open issues and pull requests. Open a new issue or comment on an existing one to communicate that you are working on a scraper (so that work isn't done twice). Then pull a few articles with pb_collect and start parsing the HTML code in the content_raw column (preferably with rvest).

Every webscraper should return a tibble with the following format:

url	expanded_url	domain	status	datetime	headline	author	text	misc
character	character	character	integer	POSIXct	character	character	character	list
the original url fed to the scraper	the full url	the domain	HTTP status code	publication datetime	the headline	the author	the full text	all other information that can be consistently found on a specific outlet

Some outlets provide additional information, so the misc column was included to retain it.
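While developing a scraper, it can help to check the result against this format. The sketch below builds a one-row tibble with invented values and validates it with check_scraper_output(), a hypothetical helper written for this example that is not part of paperboy. It only checks column names and types as listed in the table above.

```r
library(tibble)

# Hypothetical helper (not part of paperboy): checks names and column types
check_scraper_output <- function(x) {
  expected <- c("url", "expanded_url", "domain", "status", "datetime",
                "headline", "author", "text", "misc")
  stopifnot(
    is.data.frame(x),
    all(expected %in% names(x)),
    is.character(x$url),
    is.character(x$expanded_url),
    is.character(x$domain),
    is.integer(x$status),
    inherits(x$datetime, "POSIXct"),
    is.character(x$headline),
    is.character(x$author),
    is.character(x$text),
    is.list(x$misc)
  )
  invisible(TRUE)
}

# Example scraper result with invented values
parsed <- tibble(
  url = "https://example.com/short",
  expanded_url = "https://example.com/news/an-article",
  domain = "example.com",
  status = 200L,
  datetime = as.POSIXct("2021-07-12 12:00:13", tz = "UTC"),
  headline = "Example headline",
  author = "Jane Doe",
  text = "First paragraph.\nSecond paragraph.",
  misc = list(list(section = "news"))  # list column for outlet-specific extras
)

check_scraper_output(parsed)
```

Storing misc as a list column lets each row carry a differently shaped set of extras without adding columns to the shared format.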

Available Scrapers

domain	author	issues
3sat.de	@schochastics	#23
abendblatt.de	@schochastics	#23
abendzeitung-muenchen.de	@schochastics	#23
ac24.cz	@JBGruber
ad.nl	@JBGruber
aktualne.cz	@JBGruber
anotherangryvoice.blogspot.com	@JBGruber
augsburger-allgemeine.de	@schochastics	#23
badische-zeitung.de	@schochastics	#23
bbc.co.uk	@JBGruber
berliner-kurier.de	@schochastics	#23
berliner-zeitung.de	@schochastics	#23
bild.de	@schochastics	#23
blesk.cz	@JBGruber
bnn.de	@schochastics	#23
boston.com	@JBGruber	#1
bostonglobe.com	@JBGruber	#1
br.de	@schochastics	#23
breakingnews.ie	@JBGruber
breitbart.com	@JBGruber
businessinsider.de	@schochastics	#23
buzzfeed.com	@JBGruber
cbsnews.com	@JBGruber
ceskatelevize.cz	@JBGruber
cnet.com	@JBGruber
cnn.com	@JBGruber
courier-journal.com	@JBGruber
dailymail.co.uk	@JBGruber
decider.com		#1
democratandchronicle.com	@JBGruber
denikn.cz	@JBGruber
der-postillon.com	@schochastics	#23
derstandard.at	@schochastics	#23
derwesten.de	@schochastics	#23
deutschlandfunk.de	@schochastics	#23
deutschlandfunkkultur.de	@schochastics	#23
dnn.de	@schochastics	#23
echo24.de	@schochastics	#23
epochtimes.de	@schochastics	#23
eu.usatoday.com	@JBGruber
evolvepolitics.com	@JBGruber
express.de	@schochastics	#23
faz.net	@JBGruber
finanzen.net	@schochastics	#23
fnp.de	@schochastics	#23
focus.de	@schochastics	#23
forbes.com	@JBGruber	#2
fortune.com		#1
foxbusiness.com	@JBGruber
foxnews.com	@JBGruber
fr.de	@schochastics	#23
frankenpost.de	@schochastics	#23
freiepresse.de	@schochastics	#23
ftw.usatoday.com	@JBGruber
geenstijl.nl	@JBGruber
handelsblatt.com	@schochastics	#23
haz.de	@schochastics	#23
heidelberg24.de	@schochastics	#23
heise.de	@schochastics	#23
hn.cz	@JBGruber
hna.de	@schochastics	#23
huffingtonpost.co.uk	@JBGruber
huffpost.com	@JBGruber
idnes.cz	@JBGruber
independent.co.uk	@JBGruber
independent.ie	@JBGruber
infranken.de	@schochastics	#23
irishexaminer.com	@JBGruber
irishmirror.ie	@JBGruber
irishtimes.com	@JBGruber
irozhlas.cz	@JBGruber
joe.ie	@JBGruber
jungefreiheit.de	@schochastics	#23
kabeleins.de	@schochastics	#23
karlsruhe-insider.de	@schochastics	#23
kreiszeitung.de	@schochastics	#23
ksta.de	@schochastics	#23
kurier.at	@schochastics	#23
latimes.com	@JBGruber
lidovky.cz	@JBGruber
lvz.de	@schochastics	#23
manager-magazin.de	@schochastics	#23
marketwatch.com	@JBGruber
maz-online.de	@schochastics	#23
mdr.de	@schochastics	#23
mediacourant.nl	@JBGruber
merkur.de	@schochastics	#23
metronieuws.nl	@JBGruber
mopo.de	@schochastics	#23
morgenpost.de	@schochastics	#23
msnbc.com		#1
n-tv.de	@schochastics	#23
ndr.de	@schochastics	#23
news-und-nachrichten.de	@schochastics	#23
news.de	@schochastics	#23
newsflash24.de	@schochastics	#23
newstatesman.com	@JBGruber
newsweek.com	@JBGruber
nordkurier.de	@schochastics	#23
nos.nl	@JBGruber
novinky.cz	@JBGruber
noz.de	@schochastics	#23
nrc.nl	@JBGruber
nu.nl	@JBGruber
nw.de	@schochastics	#23
nypost.com	@JBGruber
nytimes.com	@JBGruber	#17
nzz.ch	@schochastics	#23
orf.at	@schochastics	#23
ostsee-zeitung.de	@schochastics	#23
pagesix.com		#1
parlamentnilisty.cz	@JBGruber
presseportal.de	@schochastics	#23
prosieben.de	@schochastics	#23
rbb24.de	@schochastics	#23
rnd.de	@schochastics	#23
rollingstone.de	@schochastics	#23
rp-online.de	@schochastics	#23
rte.ie	@JBGruber
rtl.de	@schochastics	#23
rtl.nl	@JBGruber
rtlnieuws.nl	@JBGruber
ruhr24.de	@schochastics	#23
ruhrnachrichten.de	@schochastics	#23
saechsische.de	@schochastics	#23
schwaebische.de	@schochastics	#23
seznamzpravy.cz	@JBGruber
sfgate.com	@JBGruber
shz.de	@schochastics	#23
skwawkbox.org	@JBGruber
sky.com	@JBGruber
spiegel.de	@schochastics	#23
srf.ch	@schochastics	#23
stern.de	@schochastics	#23
stuttgarter-zeitung.de	@schochastics	#23
sueddeutsche.de	@schochastics	#23
suedkurier.de	@schochastics	#23
swp.de	@schochastics	#23
swr.de	@schochastics	#23
swr3.de	@schochastics	#23
swrfernsehen.de	@schochastics	#23
t-online.de	@schochastics	#23
t3n.de	@schochastics	#23
tag24.de	@schochastics	#23
tagesschau.de	@schochastics	#23
tagesspiegel.de	@schochastics	#23
taz.de	@schochastics	#23
techrepublic.com	@JBGruber	#1
telegraaf.nl	@JBGruber	#17
telegraph.co.uk	@JBGruber
tennessean.com	@JBGruber
thecanary.co	@JBGruber
theguardian.com	@JBGruber
thejournal.ie	@JBGruber
thelily.com		#1
thestreet.com	@JBGruber
thesun.ie	@JBGruber
thueringer-allgemeine.de	@schochastics	#23
time.com		#1
tribpub.com		#1
tz.de	@schochastics	#23
us.cnn.com	@JBGruber
usatoday.com	@JBGruber
vice.com	@schochastics	#23
volkskrant.nl	@JBGruber
volksstimme.de	@schochastics	#23
vox.de	@schochastics	#23
wa.de	@schochastics	#23
washingtonpost.com	@JBGruber
watson.ch	@schochastics	#23
watson.de	@schochastics	#23
waz.de	@schochastics	#23
wdr.de	@schochastics	#23
welt.de	@schochastics	#23
wiwo.de	@schochastics	#23
wsj.com	@JBGruber
wz.de	@schochastics	#23
yahoo.com	@JBGruber
zdf.de	@schochastics	#23
zeit.de	@JBGruber

Each scraper has one of three statuses:
Runs without known issues
Runs with some issues
Currently not working, fix has been requested