Has anyone made or found a script to scrape a subreddit and import it to a Lemmy community? There are a handful of smaller subs that I’d like to mirror over to my instance (with author attribution) but haven’t found anything that works yet. https://github.com/rileynull/RedditLemmyImporter looks promising but links to a non-functioning Python script (tries to use Pushshift, which isn’t working at the moment).

  • Eskuero@lemmy.fromshado.ws
    1 year ago

    I wrote this yesterday. If you feed it a single text file of Reddit links, it should work fairly well: https://lemmy.fromshado.ws/post/46

    Migrating my own posts on a local instance

    Cloning comments and iterating over entire subreddits is coded too, though I’m still not sure whether it’s a good idea to share that portion.

  • retrolasered@lemmy.zip
    1 year ago

    That’s Kotlin. Someone did post a Python script as a GitHub gist here in the past 24 hours, though. Perhaps that’s the one you mean?

    Edit: typos

    • jon@lemmy.tfOP
      1 year ago

      The LemmyImporter repo expects you to already have all your post data in a JSON file. Its readme links to a Lemmygrad.ml comment with a Python script that seems like it would do exactly what I want, if Pushshift were working. I may be able to fiddle with it enough over the weekend to hit Reddit directly, though.

      • retrolasered@lemmy.zip
        1 year ago

        Oh right. Apologies, did a classic and skipped the readme! 10 minutes documentation 10 hours something or other 😆

  • parallax@local106.com
    1 year ago

    I would suggest that anything scraped should either link back to the original or come with a comment on Reddit linking to the new community. Ideally we attract people as opposed to just copying.

    • jon@lemmy.tfOP
      1 year ago

      Yeah, that’s definitely what I want. Anything cloned over here would ideally have both author attribution and a direct link to the original Reddit post at the very top of each post.
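
A minimal sketch of that attribution header, in Python. The function name `build_post_body` and the header wording are made up for illustration, and it assumes the permalink is a site-relative path such as `/r/<sub>/comments/<id>/<slug>/`; adjust if your source data stores full URLs.

```python
def build_post_body(author: str, permalink: str, body: str) -> str:
    """Return the mirrored post text with attribution at the very top.

    Assumption: `permalink` is a path relative to reddit.com,
    e.g. "/r/example/comments/abc123/some_title/".
    """
    source_url = f"https://www.reddit.com{permalink}"
    # Hypothetical header format; any wording works as long as it
    # names the original author and links to the original post.
    header = f"*Originally posted by u/{author} on Reddit: {source_url}*"
    # Link-only posts have an empty body, so strip trailing whitespace.
    return f"{header}\n\n{body}".rstrip()


if __name__ == "__main__":
    print(build_post_body("alice", "/r/example/comments/abc123/hello/", "Hello, world."))
```

Putting the header in the body rather than the title keeps the Lemmy post title identical to the original, which makes later de-duplication easier.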

  • phonelife@beehaw.org
    1 year ago

    You would need to scrape it using a personal API key, which does have rate limits, at least in theory.

    That would be the most efficient way. You’d need to write to both a database and some document storage for the photos/videos.

    Otherwise you could scrape it through a browser using a library like puppeteer and store it similarly. But that’s probably the worst way to do it, considering the Reddit API doesn’t charge yet. You’re really just looking for the title, the content (text, link, image, or video), and the OP. Comments are likely a waste of time to grab in most instances and would be hard to integrate back into Lemmy in its current state.
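
A rough sketch, in Python, of pulling just those fields out of one page of results. It assumes the data arrives in the shape of a Reddit listing (a `data.children` array whose post entries have `kind` set to `t3` and carry `title`, `author`, `selftext`, `url`, and `permalink`); check that against whatever your API client actually returns. The function name `extract_posts` and the sample data are hypothetical, and fetching, authentication, and rate limiting are left out on purpose.

```python
from typing import Any


def extract_posts(listing: dict[str, Any]) -> list[dict[str, str]]:
    """Keep only the fields worth mirroring from one listing page."""
    posts = []
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t3":  # assumption: "t3" marks a post
            continue
        data = child.get("data", {})
        posts.append({
            "title": data.get("title", ""),
            "author": data.get("author", "[deleted]"),
            "body": data.get("selftext", ""),    # empty for link posts
            "url": data.get("url", ""),          # link, image, or video
            "permalink": data.get("permalink", ""),
        })
    return posts


# Hand-written sample in the assumed shape, not real Reddit output.
sample = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {
                "title": "First post", "author": "alice",
                "selftext": "Some text", "url": "https://example.com/a",
                "permalink": "/r/example/comments/1/first_post/"}},
            {"kind": "t1", "data": {"body": "a comment, skipped"}},
        ]
    },
}

if __name__ == "__main__":
    for post in extract_posts(sample):
        print(post["title"], "by", post["author"])
```

Media still has to be downloaded separately from `url` and stored somewhere, as noted above; this only covers the metadata that would go in the database.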