I am working with Twitter data (XML files) in Notepad++ and I'm trying to remove the retweets. Each RT starts with '<tweet id', contains '>RT @', and ends with '</tweet'. Unfortunately, due to Twitter's API terms and conditions, I can't share examples from the data, so hopefully this gives you enough info to help.

The problem I'm having is that sometimes the metadata in between '<tweet id' and '>RT @' spans multiple lines, and I can't seem to find a regex that will capture RTs on both single and multiple lines.

This is the regex I have, which captures single-line RTs:

(<tweet id).+?(>RT @).+?(/tweet>)

Does anyone have any ideas on what I can add to it so that it will also scoop up RTs (and their accompanying metadata) that span multiple lines?

Example RTs. I've altered some of the names and the content of the RTs, but the format remains the same. Note that there are two examples below; the second one, which contains an emoji, begins after the line that introduces the emoji example:

```
<tweet id='827364918734' createdAt='2011-01-16T18:13:02.000Z' language='en' authorId='673829' authorUsername='exampleuser' authorName='example' authorVerified='TRUE' authorDescription='example description' authorLocation='example location' authorCreatedAt='2009-05-10T05:02:51.000Z' authorFollowersCount='830211' authorFollowingCount='1763' authorTweetCount='34209' authorListedCount='7589' referencedTweetId='26690653563912192' referencedTweetCreatedAt='2011-01-16T17:22:02.000Z' referencedTweetText='example reference tweet text' referencedTweetRetweetCount='9' referencedTweetReplyCount='0' referencedTweetLikeCount='2' referencedTweetQuoteCount='0' referencedTweetAuthorUsername='example' referencedTweetAuthorName='example' referencedTweetAuthorVerified='TRUE' referencedTweetAuthorDescription='example description

Check out @example, our new example' referencedTweetAuthorLocation='example' referencedTweetAuthorCreatedAt='2008-08-27T15:24:02.000Z' referencedTweetAuthorFollowersCount='1380523' referencedTweetAuthorFollowingCount='1035' referencedTweetAuthorTweetCount='402492' referencedTweetAuthorListedCount='22425' retweetCount='9' replyCount='0' likeCount='0' quoteCount='0' >RT @example this is an example RT </tweet>
```



This is an example with emojis:

```
<tweet id='1783646' createdAt='2010-01-26T19:38:13.000Z' language='en' authorId='djsjchk' authorUsername='example' authorName='example' authorVerified='FALSE' authorDescription='example' authorLocation='example' authorCreatedAt='2009-06-26T19:50:16.000Z' authorFollowersCount='647' authorFollowingCount='204' authorTweetCount='6045' authorListedCount='31' referencedTweetId='8247516385' referencedTweetCreatedAt='2010-01-26T19:36:15.000Z' referencedTweetText='example' referencedTweetRetweetCount='1' referencedTweetReplyCount='0' referencedTweetLikeCount='0' referencedTweetQuoteCount='0' referencedTweetAuthorUsername='example' referencedTweetAuthorName='example ' referencedTweetAuthorVerified='FALSE' referencedTweetAuthorDescription='examples. #TCSC' referencedTweetAuthorLocation='Find me at' referencedTweetAuthorCreatedAt='2010-01-23T20:05:52.000Z' referencedTweetAuthorFollowersCount='25803' referencedTweetAuthorFollowingCount='3176' referencedTweetAuthorTweetCount='58883' referencedTweetAuthorListedCount='0' retweetCount='1' replyCount='0' likeCount='0' quoteCount='0' >RT @example: this is an example RT </tweet>
```

Tara
  • Please note - [you should not parse XML with RegEx](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – help-info.de Mar 13 '23 at 18:33
  • Can you share the twitter data example with code format(not image)? – Bench Vue Mar 13 '23 at 19:04
  • @BenchVue I've added in an example for you. – Tara Mar 14 '23 at 09:21
  • @Tara, your XML is not a code block format. Hard to find start and end tag. Can you make the correct format? This is [instruction](https://meta.stackexchange.com/questions/22186/how-do-i-format-my-code-blocks). Simply start ``` (three backticks) and CTRL+ Enter, then your XML, finally, Enter Key then ``` (three backticks) again. – Bench Vue Mar 14 '23 at 11:10
  • @BenchVue I think I've done it now - is that better? – Tara Mar 14 '23 at 11:38

2 Answers

How about using Python instead of Notepad++?

This code uses the xml.etree.ElementTree library, with the tweet embedded in the code as a string. It reads the attribute values and the RT text.

#1 No installation needed

xml.etree.ElementTree is part of the Python standard library, so there is nothing to install with pip.

#2 Save as get-tweet.py file.

import xml.etree.ElementTree as ET

xml = """\
<tweet id='827364918734' createdAt='2011-01-16T18:13:02.000Z' language='en' authorId='673829' authorUsername='exampleuser' authorName='example' authorVerified='TRUE' authorDescription='example description' authorLocation='example location' authorCreatedAt='2009-05-10T05:02:51.000Z' authorFollowersCount='830211' authorFollowingCount='1763' authorTweetCount='34209' authorListedCount='7589' referencedTweetId='26690653563912192' referencedTweetCreatedAt='2011-01-16T17:22:02.000Z' referencedTweetText='example reference tweet text' referencedTweetRetweetCount='9' referencedTweetReplyCount='0' referencedTweetLikeCount='2' referencedTweetQuoteCount='0' referencedTweetAuthorUsername='example' referencedTweetAuthorName='example' referencedTweetAuthorVerified='TRUE' referencedTweetAuthorDescription='example description

Check out @example, our new example' referencedTweetAuthorLocation='example' referencedTweetAuthorCreatedAt='2008-08-27T15:24:02.000Z' referencedTweetAuthorFollowersCount='1380523' referencedTweetAuthorFollowingCount='1035' referencedTweetAuthorTweetCount='402492' referencedTweetAuthorListedCount='22425' retweetCount='9' replyCount='0' likeCount='0' quoteCount='0' >RT @example this is an example RT </tweet>
"""

root = ET.fromstring(xml)
print("root: " + str(root))
print("root.tag: " + str(root.tag))
print("root.attrib: " + str(root.attrib))
print(type(root.attrib))
for key, value in root.attrib.items():
    print(key + ': ' + value)
print("text: " + str(root.text))

#3 run it

python get-tweet.py

#4 Result

$ python get-tweet.py
root: <Element 'tweet' at 0x000001593FED13F0>
root.tag: tweet
root.attrib: {'id': '827364918734', 'createdAt': '2011-01-16T18:13:02.000Z', 'language': 'en', 'authorId': '673829', 'authorUsername': 'exampleuser', 'authorName': 'example', 'authorVerified': 'TRUE', 'authorDescription': 'example description', 'authorLocation': 'example location', 'authorCreatedAt': '2009-05-10T05:02:51.000Z', 'authorFollowersCount': '830211', 'authorFollowingCount': '1763', 'authorTweetCount': '34209', 'authorListedCount': '7589', 'referencedTweetId': '26690653563912192', 'referencedTweetCreatedAt': '2011-01-16T17:22:02.000Z', 'referencedTweetText': 'example reference tweet text', 'referencedTweetRetweetCount': '9', 'referencedTweetReplyCount': '0', 'referencedTweetLikeCount': '2', 'referencedTweetQuoteCount': '0', 'referencedTweetAuthorUsername': 'example', 'referencedTweetAuthorName': 'example', 'referencedTweetAuthorVerified': 'TRUE', 'referencedTweetAuthorDescription': 'example description  Check out @example, our new example', 'referencedTweetAuthorLocation': 'example', 'referencedTweetAuthorCreatedAt': '2008-08-27T15:24:02.000Z', 'referencedTweetAuthorFollowersCount': '1380523', 'referencedTweetAuthorFollowingCount': '1035', 'referencedTweetAuthorTweetCount': '402492', 'referencedTweetAuthorListedCount': '22425', 'retweetCount': '9', 'replyCount': '0', 'likeCount': '0', 'quoteCount': '0'}
<class 'dict'>
id: 827364918734
createdAt: 2011-01-16T18:13:02.000Z
language: en
authorId: 673829
authorUsername: exampleuser
authorName: example
authorVerified: TRUE
authorDescription: example description
authorLocation: example location
authorCreatedAt: 2009-05-10T05:02:51.000Z
authorFollowersCount: 830211
authorFollowingCount: 1763
authorTweetCount: 34209
authorListedCount: 7589
referencedTweetId: 26690653563912192
referencedTweetCreatedAt: 2011-01-16T17:22:02.000Z
referencedTweetText: example reference tweet text
referencedTweetRetweetCount: 9
referencedTweetReplyCount: 0
referencedTweetLikeCount: 2
referencedTweetQuoteCount: 0
referencedTweetAuthorUsername: example
referencedTweetAuthorName: example
referencedTweetAuthorVerified: TRUE
referencedTweetAuthorDescription: example description  Check out @example, our new example
referencedTweetAuthorLocation: example
referencedTweetAuthorCreatedAt: 2008-08-27T15:24:02.000Z
referencedTweetAuthorFollowersCount: 1380523
referencedTweetAuthorFollowingCount: 1035
referencedTweetAuthorTweetCount: 402492
referencedTweetAuthorListedCount: 22425
retweetCount: 9
replyCount: 0
likeCount: 0
quoteCount: 0
text: RT @example this is an example RT

#Note - read tweets from a file

If you want to read from an XML file instead, the following code gives the same result.

import xml.etree.ElementTree as ET

tree = ET.parse('tweet_data.xml')
root = tree.getroot()
print("root: " + str(root))
print("root.tag: " + str(root.tag))
print("root.attrib: " + str(root.attrib))
print(type(root.attrib))
for key, value in root.attrib.items():
    print(key + ': ' + value)
print("text: " + str(root.text))

You can also modify the tree and write it back to an XML file with ElementTree.
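
For the original goal of removing retweets, a minimal sketch along the following lines might work. It assumes the real file wraps the individual tweet elements in a single root element; the root name `tweets`, the shortened sample data, and the output file name in the final comment are all hypothetical and not taken from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a <tweets> root wrapping individual <tweet> elements.
# The second tweet has an attribute value that spans several lines.
xml_text = """\
<tweets>
<tweet id='1' likeCount='0' >an ordinary tweet </tweet>
<tweet id='2' authorDescription='line one

line two' >RT @example this is an example RT </tweet>
</tweets>
"""

root = ET.fromstring(xml_text)

# findall() returns a new list, so removing children from root
# inside the loop does not disturb the iteration.
for tweet in root.findall("tweet"):
    text = tweet.text or ""
    if text.startswith("RT @"):
        root.remove(tweet)

remaining_ids = [tweet.get("id") for tweet in root.findall("tweet")]
print(remaining_ids)

# To save the result (hypothetical file name):
# ET.ElementTree(root).write("tweets_without_rts.xml", encoding="utf-8")
```

Because the parser handles the line breaks inside attribute values itself, this approach does not depend on whether a tweet sits on one line or several.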

Bench Vue

Your regex is not bad; you just forgot to enable the ". matches newline" option.

  • Ctrl+H
  • Find what: <tweet id=(?:(?!</tweet>).)+?RT @.+?</tweet>
  • Replace with: LEAVE EMPTY
  • TICK Match case
  • TICK Wrap around
  • SELECT Regular expression
  • TICK . matches newline
  • Replace all
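
The effect of the ". matches newline" option can be reproduced outside Notepad++. The sketch below uses Python's `re` module, where the `re.DOTALL` flag plays the same role; the shortened tweet is made-up sample data, and Python's regex engine is not the one Notepad++ uses, so treat it as an illustration only.

```python
import re

# The asker's original pattern, unchanged.
PATTERN = r"(<tweet id).+?(>RT @).+?(/tweet>)"

# Hypothetical, shortened retweet whose attribute value contains line breaks.
multi_line_rt = (
    "<tweet id='1' authorDescription='first line\n"
    "\n"
    "second line' >RT @example this is an example RT </tweet>"
)

# Without DOTALL, '.' stops at a newline, so the pattern cannot cross lines.
without_flag = re.search(PATTERN, multi_line_rt)

# With DOTALL, '.' also matches newlines, so the whole element is matched.
with_flag = re.search(PATTERN, multi_line_rt, flags=re.DOTALL)

print(without_flag is None)
print(with_flag is not None)
```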

Explanation:

<tweet id=      # literally
    (?:             # non capture group
        (?!             # negative lookahead, make sure what follows is not:
            </tweet>        # literally
        )               # end lookahead
        .               # any character
    )+?             # end group, may appear 1 or more times, not greedy
RT @            # literally
.+?             # 1 or more any character, not greedy
</tweet>        # literally
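
To see why the lookahead matters once "." can match newlines, here is a small Python sketch that compares the pattern above with a simpler variant that has no lookahead. The two sample tweets and the variable names are hypothetical, and `re.DOTALL` stands in for the ". matches newline" option.

```python
import re

# Hypothetical sample: one ordinary tweet followed by one multi-line retweet.
sample = (
    "<tweet id='1' likeCount='0' >an ordinary tweet </tweet>\n"
    "<tweet id='2' authorDescription='line one\n"
    "line two' >RT @example this is an example RT </tweet>\n"
)

# Simpler variant: nothing stops '.+?' from running past the first </tweet>.
naive = re.compile(r"<tweet id=.+?RT @.+?</tweet>", re.DOTALL)

# Pattern from the answer: the lookahead refuses to step over a </tweet>.
tempered = re.compile(
    r"<tweet id=(?:(?!</tweet>).)+?RT @.+?</tweet>", re.DOTALL
)

naive_result = naive.sub("", sample)
tempered_result = tempered.sub("", sample)

print(repr(naive_result))
print(repr(tempered_result))
```

In this sample the simpler variant starts at the ordinary tweet and keeps going until it finds "RT @" in the next element, so both tweets are deleted, while the pattern with the lookahead removes only the retweet.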


Toto
  • Hi, for some reason it's still not picking up all RTs, even though they follow the same format - any ideas? – Tara Mar 16 '23 at 16:14
  • @Tara: Could you show an example that doesn't match? Please create a test case at https://regex101.com/ – Toto Mar 16 '23 at 17:00
  • From what I can tell, it's the examples that contain emojis. I've added an example with an emoji to my original question for you. – Tara Mar 16 '23 at 17:42
  • [Works for me](https://regex101.com/r/Yg6PNt/1) – Toto Mar 16 '23 at 17:58