Collect data from supplied URLs

Usage

pb_collect(
  urls,
  collect_rss = TRUE,
  timeout = 30,
  ignore_fails = FALSE,
  connections = 100L,
  host_con = 6L,
  use_cookies = FALSE,
  useragent = "paperboy",
  save_dir = NULL,
  verbose = NULL,
  ...
)

Arguments

urls

A character vector of URLs.

collect_rss

If one of the URLs points to an RSS feed, should it be parsed?

timeout

How long the function should wait for the connection (in seconds). If the query finishes earlier, results are returned immediately.

ignore_fails

Normally, the function throws an error when a URL cannot be reached due to connection issues. Setting this to TRUE ignores such failures.

connections

Maximum number of total concurrent connections.

host_con

Maximum number of concurrent connections per host.

use_cookies

If TRUE, use the cookiemonster package to handle cookies. See add_cookies for details on how to store cookies. Cookies are used to access articles behind a paywall or consent form.
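As a sketch of the cookie workflow: cookies are stored once via cookiemonster and then picked up automatically when use_cookies = TRUE. The file path and URL below are placeholders; consult add_cookies for the exact interface it expects.

```r
library(cookiemonster)
library(paperboy)

# store cookies exported from a browser (placeholder path)
add_cookies("path/to/cookies.txt")

# collect with the stored cookies enabled
res <- pb_collect(
  "https://example.com/paywalled-article",
  use_cookies = TRUE
)
```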

useragent

String to be sent in the User-Agent header.

save_dir

Store raw HTML data on disk instead of in memory by providing a path to a directory.

verbose

A logical flag indicating whether information should be printed to the screen. If NULL, the value is determined from getOption("paperboy_verbose").

...

Currently not used.

Value

A data.frame (tibble) with URL status data and the raw media text.
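A minimal usage sketch based on the arguments documented above (the URLs are placeholders; a live run requires network access):

```r
library(paperboy)

# collect raw HTML from a few example URLs (placeholders)
res <- pb_collect(
  urls = c(
    "https://example.com/article-1",
    "https://example.com/article-2"
  ),
  timeout = 30,
  ignore_fails = TRUE,  # keep going if a URL cannot be reached
  verbose = TRUE
)

# res is a tibble with URL status data and the raw media text
```

Setting ignore_fails = TRUE is useful for large URL lists, where aborting the whole collection on a single unreachable host is rarely desirable.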