Merge branch 'replies'

This commit is contained in:
jeancf 2023-07-17 21:01:35 +02:00
commit e512838a0e
3 changed files with 109 additions and 56 deletions


@@ -1,5 +1,18 @@
# Changelog
**14 JUL 2023** VERSION 4.2
Twoot can now handle threads. All tweets can again be uploaded on Mastodon. Tweets in a thread are
displayed in reverse chronological order in the main timeline (first tweet on top) to improve readability.
*When several toots are posted in the same run of twoot, these toots may not appear in
chronological order on the timeline. If that is the case, try setting `upload_pause` to 3-5 seconds in
your config file to slow down the rate at which toots are uploaded.*
A list of nitter instances to use can now be specified in the config file
e.g. `nitter_instances = ["nitter.nl", "nitter.fdn.fr"]`.
If none is specified, the built-in list of 2-3 known good instances is used as before.
**12 JUL 2023** VERSION 4.1
**Nitter has recently added a change that highlights tweets that are part of a thread. Twoot cannot handle this modification yet, so TWEETS THAT ARE PART OF A THREAD ARE CURRENTLY IGNORED.** A warning message is written to the log file instead.


@@ -3,18 +3,11 @@
Twoot is a python script that mirrors tweets from a twitter account to a Mastodon account.
It is simple to set up on a local machine, configurable and feature-rich.
**14 JUL 2023** VERSION 4.2
**17 JUL 2023** VERSION 4.3
Twoot can now handle threads. All tweets can again be uploaded on Mastodon. Tweets in a thread are
displayed in reverse chronological order in the main timeline (first tweet on top) to improve readability.
*When several toots are posted in the same run of twoot, these toots may not appear in
chronological order on the timeline. If that is the case, try setting `upload_pause` to 3-5 seconds in
your config file to slow down the rate at which toots are uploaded.*
A list of nitter instances to use can now be specified in the config file
e.g. `nitter_instances = ["nitter.nl", "nitter.fdn.fr"]`.
If none is specified, the built-in list of 2-3 known good instances is used as before.
* Twitter threads are replicated on Mastodon: each follow-up message in a thread is posted
as a reply to its predecessor.
* An issue with downloading videos has been fixed ("ERROR: Sorry, you are not authorized to see this status").
> Previous updates can be found in CHANGELOG.
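The `nitter_instances` and `upload_pause` options described above live in the toml config file. A hypothetical excerpt (key names are taken from this README; values and table placement are illustrative and may differ from `default.toml`):

```toml
# Hypothetical excerpt of a twoot config file.
# Instances tried instead of the built-in list:
nitter_instances = ["nitter.nl", "nitter.fdn.fr"]
# Seconds to wait between consecutive toot uploads, to keep
# thread replies in chronological order on the timeline:
upload_pause = 5
```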
@@ -22,6 +15,7 @@ If none is specified, the built-in list of 2-3 known good instances is used as b
* Fetch timeline of given user from twitter.com (through nitter instance)
* Scrape html and format tweets for posting on mastodon
* Threads (series of replies to own messages) are replicated
* Emojis supported
* Upload images from tweet to Mastodon
* Optionally upload videos from tweet to Mastodon
@@ -41,7 +35,7 @@ If none is specified, the built-in list of 2-3 known good instances is used as b
## Usage
```sh
```
usage: twoot.py [-h] [-f <.toml config file>] [-t <twitter account>] [-i <mastodon instance>]
[-m <mastodon account>] [-p <mastodon password>] [-r] [-s] [-l] [-u] [-v] [-o] [-q]
[-a <max age (in days)>] [-d <min delay (in mins)>] [-c <max # of toots to post>]
@@ -85,18 +79,19 @@ to use, all the other command-line parameters are ignored, except `-p` (password
### Removing redirected links
`-l` (or `remove_link_redirections = true` in toml file) will follow every link included in the
tweet and replace it with the URL that the resource is directly downloaded from (if applicable),
e.g. bit.ly/xxyyyzz -> example.com
`remove_link_redirections = true` in the toml file (or `-l` on the command line) will follow every link
included in the tweet and replace it with the URL that the resource is directly downloaded from
(if applicable), e.g. bit.ly/xxyyyzz -> example.com
Every link visit can take up to 5 seconds (timeout) depending on the responsiveness of the source,
so this option will slow down tweet processing.
If you are interested in tracker removal (`-u`, `remove_trackers_from_urls = true`) you should
If you are interested in tracker removal (`remove_trackers_from_urls = true`, `-u`) you should
also select redirection removal, as trackers are often hidden behind the redirection of a short URL.
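Following a link to its final destination, as described above, can be sketched with the standard library (a sketch under stated assumptions, not twoot's actual implementation; the 5-second timeout mirrors the text):

```python
import urllib.error
import urllib.request


def resolve_redirect(url, timeout=5):
    """Follow HTTP redirections and return the final URL.

    Falls back to the original URL on any network error,
    so an unreachable link is left untouched in the toot.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # geturl() reflects the URL after all redirections
            return resp.geturl()
    except (urllib.error.URLError, ValueError):
        return url
```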
### Uploading videos
When using the `-v` (`upload_videos = true`) switch consider:
When using the `upload_videos = true` (`-v`) switch consider:
* whether the copyright of the content that you want to cross-post allows it
* the storage / transfer limitations of the Mastodon instance that you are posting to
@@ -104,7 +99,7 @@ When using the `-v` (`upload_videos = true`) switch consider:
### Updating profile
If `-q` (`update_profile = true`) is specified, twoot will check if the avatar and banner pictures
If `update_profile = true` (`-q`) is specified, twoot will check if the avatar and banner pictures
have changed on the twitter page. This check compares the names of the files used by twitter with the names
of the files that have been uploaded on Mastodon; if they differ, both files are downloaded from
twitter and uploaded on Mastodon. The check is very fast if there is no update.
@@ -119,10 +114,9 @@ e.g. `tweet_time_format = "(%d %b %Y %H:%M %Z)"`
An empty or missing `tweet_time_format` disables the display of the timestamp.
By default, dates are specified in UTC time zone. To convert the timestamp to another time zone,
use the `tweet_timezone` option in configuration file. Valid time zone names are those of the Olson time
zone database (<https://en.wikipedia.org/wiki/Tz_database>)
e.g. `tweet_timezone = "Europe/Paris"`
By default, dates are displayed in the local timezone of the machine running the script. To display the
timestamp in another time zone, use the `tweet_timezone` option in the configuration file. Valid time zone
names are those of the Olson time zone database (<https://en.wikipedia.org/wiki/Tz_database>),
e.g. `tweet_timezone = "Europe/Paris"` or `tweet_timezone = "UTC"`
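The conversion and formatting described above can be sketched with the standard library `zoneinfo` module (`pytz`, which twoot installs, works the same way). The variable names and the UTC starting point are illustrative; twoot's internals may differ:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Hypothetical values mirroring the config options above
tweet_time_format = "(%d %b %Y %H:%M %Z)"
tweet_timezone = "Europe/Paris"

# A tweet timestamp, assumed here to be held as an aware UTC datetime
ts = datetime(2023, 7, 14, 12, 30, tzinfo=timezone.utc)

# Convert to the configured time zone and format for display
local = ts.astimezone(ZoneInfo(tweet_timezone))
print(local.strftime(tweet_time_format))  # (14 Jul 2023 14:30 CEST)
```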
### Rate control
@@ -132,6 +126,9 @@ Default min delay is 0 minutes.
No limitation is applied to the number of toots uploaded if `-c` is not specified.
If messages in a thread that are uploaded simultaneously appear in the wrong order, try setting
the `upload_pause` configuration variable in the configuration file to a few seconds (start with 3-5).
## Installation
Make sure python3 is installed.
@@ -151,6 +148,9 @@ pip install beautifulsoup4 Mastodon.py youtube-dl2 pytz
In your user folder, execute `git clone https://gitlab.com/jeancf/twoot.git`
to clone the repo with the twoot.py script.
If you want to use a config file to specify options (recommended), copy `default.toml` to
`[your_preferred_name].toml` and edit it to your preferences.
Add command line to crontab. For example, to run every 15 minutes starting at minute 1 of every hour
and process the tweets posted in the last 5 days but at least 15 minutes
ago:
@@ -159,6 +159,8 @@ ago:
1-59/15 * * * * /path/to/twoot.py -t SuperDuper -i masto.space -m sd@example.com -p my_Sup3r-S4f3*pw -a 5 -d 15
```
After the first successful run, you no longer need to specify the password and you can remove the `-p` switch.
## Featured Accounts
Twoot is known to be used for the following feeds (oldest first):

twoot.py

@@ -169,12 +169,17 @@ Download page with full thread of tweets and extract all replied-to tweet referen
Only used by `get_timeline()`.
:param session: Existing HTTP session with Nitter instance
:param headers: HTTP headers to use
:param url: url of the thread page to download
:return: List of tweets from the thread
:param nitter_url: url of the nitter instance to use
:param thread_url: url of the first tweet in thread
:return: list of tuples with url of tweet replied-to (or None) and content of tweet
"""
def _get_rest_of_thread(session, headers, url):
def _get_rest_of_thread(session, headers, nitter_url, thread_url, first_item):
# Add first item to timeline
timeline = [(None, first_item)]
logging.debug("Downloading tweets in thread from separate page")
# Download page with thread
url = nitter_url + thread_url
try:
thread_page = session.get(url, headers=headers, timeout=HTTPS_REQ_TIMEOUT)
except requests.exceptions.ConnectionError:
@@ -201,14 +206,29 @@ def _get_rest_of_thread(session, headers, url):
# Get all items in thread after main tweet
after_tweet = soup.find('div', 'after-tweet')
list = after_tweet.find_all('div', class_='timeline-item')
timeline = after_tweet.find_all('div', class_='timeline-item')
# Build timeline of tuples
previous_tweet_url = thread_url
for item in list:
timeline.append((previous_tweet_url, item))
# Get the url of the tweet
tweet_link_tag = item.find('a', class_='tweet-link')
if tweet_link_tag is not None:
previous_tweet_url = tweet_link_tag.get('href').strip('#m')
else:
previous_tweet_url = None
logging.error('Thread tweet is missing link tag')
# return timeline in reverse chronological order
timeline.reverse()
return timeline
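A note on the `.strip('#m')` calls this commit introduces: `str.strip` removes any run of characters from the set `{'#', 'm'}` at both ends of the string, not the literal suffix `#m`. That is safe here because Nitter hrefs start with `/` and status IDs are numeric, so only the trailing `#m` fragment is removed. A quick check with an illustrative href:

```python
# strip('#m') removes any leading/trailing '#' or 'm' characters,
# which here amounts to dropping the "#m" fragment Nitter appends.
href = '/jeancf/status/1680931842#m'
tweet_id = href.strip('#m')
print(tweet_id)  # /jeancf/status/1680931842
```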
"""
Download page with full thread of tweets. Only used by `get_timeline()`.
:param url: url of the thread page to download
:return: List of tweets from the thread
Download timeline of twitter account
:param url: url of the account page to download
:return: list of tuples with url of tweet replied-to (or None) and content of tweet
"""
def get_timeline(nitter_url):
# Define url to use
@@ -268,17 +288,18 @@ def get_timeline(nitter_url):
for item in list:
classes = item['class']
if 'timeline-item' in classes: # Individual tweet
timeline.append(item)
timeline.append((None, item))
elif 'thread-line' in classes: # First tweet of a thread
# Get the first item of thread
first_item = item.find('div', class_='timeline-item')
timeline.append(first_item)
# Get the rest of the items of the thread
# Get the url of the tweet
thread_link_tag = item.find('a', class_='tweet-link')
if thread_link_tag is not None:
thread_url = thread_link_tag.get('href')
timeline.extend(_get_rest_of_thread(session, headers, nitter_url + thread_url))
thread_url = thread_link_tag.get('href').strip('#m')
# Get the rest of the items of the thread
timeline.extend(_get_rest_of_thread(session, headers, nitter_url, thread_url, first_item))
else:
# Ignore other classes
continue
@@ -647,29 +668,32 @@ def process_attachments(nitter_url, attachments_container, status_id, author_acc
# Download twitter video
vid_in_tweet = False
vid_class = attachments_container.find('div', class_='video-container')
if vid_class is not None:
vid_container = attachments_container.find('div', class_='video-container')
if vid_container is not None:
if TOML['options']['upload_videos']:
logging.debug("downloading video from twitter")
import youtube_dl
video_path = f"{author_account}/status/{status_id}"
video_file = urljoin('https://twitter.com', video_path)
ydl_opts = {
'outtmpl': "output/" + TOML['config']['twitter_account'] + "/" + status_id + "/%(id)s.%(ext)s",
'format': "best[width<=500]",
'socket_timeout': 60,
'quiet': True,
}
video_path = vid_container.source['src']
if video_path is not None:
video_file = urljoin(nitter_url, video_path)
ydl_opts = {
'outtmpl': "output/" + TOML['config']['twitter_account'] + "/" + status_id + "/%(id)s.%(ext)s",
# 'format': "best[width<=500]",
'socket_timeout': 60,
'quiet': True,
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
try:
ydl.download([video_file])
except Exception as e:
logging.warning('Error downloading twitter video: ' + str(e))
vid_in_tweet = True
else:
logging.debug('downloaded twitter video from attachments')
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
try:
ydl.download([video_file])
except Exception as e:
logging.warning('Error downloading twitter video: ' + str(e))
vid_in_tweet = True
else:
logging.debug('downloaded twitter video from attachments')
else:
vid_in_tweet = True
return pics, vid_in_tweet
@@ -923,8 +947,9 @@ def main(argv):
mastodon_account TEXT, tweet_id TEXT, toot_id TEXT)''')
db.execute('''CREATE INDEX IF NOT EXISTS main_index ON toots (twitter_account,
mastodon_instance, mastodon_account, tweet_id)''')
db.execute('''CREATE INDEX IF NOT EXISTS tweet_id_index ON toots (tweet_id)''')
db.execute('''CREATE TABLE IF NOT EXISTS profiles (mastodon_instance TEXT, mastodon_account TEXT, avatar_url TEXT, banner_url TEXT)''')
db.execute('''CREATE INDEX IF NOT EXIsTS profile_index ON profiles (mastodon_instance, mastodon_account)''')
db.execute('''CREATE INDEX IF NOT EXISTS profile_index ON profiles (mastodon_instance, mastodon_account)''')
# Select random nitter instance to fetch updates from
nitter_url = 'https://' + TOML['options']['nitter_instances'][random.randint(0, len(TOML['options']['nitter_instances']) - 1)]
@@ -942,7 +967,7 @@ def main(argv):
tweets = []
out_date_cnt = 0
in_db_cnt = 0
for status in timeline:
for replied_to_tweet, status in timeline:
# Extract tweet ID and status ID
tweet_id = status.find('a', class_='tweet-link').get('href').strip('#m')
status_id = tweet_id.split('/')[3]
@@ -1105,6 +1130,7 @@ def main(argv):
"tweet_text": tweet_text,
"video": video_file,
"photos": photos,
"replied_to_tweet": replied_to_tweet,
}
tweets.append(tweet)
@@ -1172,13 +1198,25 @@ TypeError): # Media cannot be uploaded (invalid format, dead link, etc.)
TypeError): # Media cannot be uploaded (invalid format, dead link, etc.)
pass
# Find in database toot id of replied_to_tweet
replied_to_toot = None
if tweet['replied_to_tweet'] is not None:
logging.debug("Searching db for toot corresponding to replied-to-tweet " + tweet['replied_to_tweet'])
db.execute("SELECT toot_id FROM toots WHERE tweet_id=?", [tweet['replied_to_tweet']])
replied_to_toot = db.fetchone()
if replied_to_toot is None:
logging.warning('Replied-to tweet %s not found in database', tweet['replied_to_tweet'])
else:
logging.debug("toot %s found", replied_to_toot)
# Post toot
toot = {}
try:
if len(media_ids) == 0:
toot = mastodon.status_post(tweet['tweet_text'])
toot = mastodon.status_post(tweet['tweet_text'], replied_to_toot)
else:
toot = mastodon.status_post(tweet['tweet_text'], media_ids=media_ids)
toot = mastodon.status_post(tweet['tweet_text'], replied_to_toot, media_ids=media_ids)
except MastodonAPIError:
# Assuming this is an: