
I'm trying to import a .txt file containing scraped social media messages. The .txt file looks like this:

Screenshot of .txt file

The .txt file was created in a Python script where I extract the messages and save them as a comma-separated list to a .txt file for further manipulation in R.

In my R script, I import this .txt file using `read.table`, which seems to give me the best results so far (specifically: `read.table("output_messages_formatted.txt", sep = ",", header = FALSE, encoding = "UTF-8")`). However, the import stops partway through the file. What is strange is that the items are parsed without problems for a good long while, until R suddenly no longer recognises the start of a new list item and also cuts off another list item midway. The last item that R imports is this:

*Thanks so much all for participating in my ii Type Quiz "experiment" last week! Looking forward to sharing and discussing how many people got what archetype etc during my @interintellect_ salon this Wed (7pm CET)! Join us, it\ll be brill! ✨https:[url], @TheAnnaGat @jean_twenge Ohhhh juicy, Hey all! For a @GEMH_Lab mobile app dev project that’s part of my PhD we’re looking for a "

So two problems appear. First, the item should have ended after the URL. Instead, the import continues with an entire separate item (@TheAnnaGat @jean_twenge Ohhhh juicy) plus the first half of yet another item. Second, that last item is then cut off for no clear reason. I'm stumped.

I've tried inspecting the .txt file at the location of the break point, but I see nothing strange. Here is a screenshot of the .txt file at that specific point (I've highlighted the location):

Screenshot of .txt file at problematic location

The character immediately after the point where R cuts off the import is a #, but # symbols appear earlier in the .txt file as well, and R had no issue importing those messages.

I've also tried different import functions such as readLines and read.csv, but those seem to handle the file worse and end up importing even fewer data lines (I have no idea why either).

While no error is thrown, it's clear that something is going wrong but I'm sadly unable to figure out what. All help/tips are very welcome. (:

  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Please [do not post code or data in images](https://meta.stackoverflow.com/q/285551/2372064). The problem with merged records is usually mismatched quote symbols in the text. Note that `read.table` assumes that lines that start with "#" are comments so it will ignore them. – MrFlick Feb 27 '23 at 15:00
  • Please provide enough code so others can better understand or reproduce the problem. – Community Feb 27 '23 at 17:48
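
Following up on the quote and comment hints above, here is a minimal sketch of how one might switch off both behaviours in `read.table`. By default `read.table` uses `quote = "\"'"` and `comment.char = "#"`, so an apostrophe in a message can open a "quoted string" that swallows the following items, and a `#` outside such a string ends the line early. The sample messages and the temporary file below are invented stand-ins (the real file is only shown as a screenshot), and the sketch assumes one message per line with no commas inside a message.

```r
# Hypothetical sample data: these messages are made up for illustration
# and only mimic the apostrophes and '#' seen in the real file.
msgs <- c(
  "Join us it'll be brill!",
  "Ohhhh juicy",
  "Looking for testers #PhD",
  "Don't miss it"
)

# Write the stand-in file, one message per line (an assumption about the layout).
tmp <- tempfile(fileext = ".txt")
writeLines(msgs, tmp)

# Same call as in the question, but with quoting and comment handling disabled:
#   quote = ""        -> ' and " are treated as ordinary characters
#   comment.char = "" -> '#' no longer starts a comment
fixed <- read.table(
  tmp,
  sep = ",",
  header = FALSE,
  quote = "",
  comment.char = "",
  encoding = "UTF-8",
  stringsAsFactors = FALSE
)

print(nrow(fixed))
print(fixed$V1)
```

If the messages themselves can contain commas, disabling quoting entirely will split them into extra fields. In that case it may be safer to have the Python script quote every field with double quotes and then read with `quote = "\""` (still keeping `comment.char = ""`).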
