I'm trying to do text analysis on a data frame of 792 observations (about 400 MB of data). I merged all my .txt files into one two-column data frame (`doc_id` and `text`) and saved it as a single RDS file. However, I've noticed two odd things:

  1. The text of some files overlaps when I view it in R, as you can see in the screenshot below. I tried to isolate one such case, and R seems to read the text fine, but I wonder whether I failed to merge the files correctly to begin with, because,
  2. It takes a long time to load, read, or execute any command. Whenever I try to tokenise this data frame, R either crashes or gets stuck.

Any suggestions on what the main issue is here? I thought about cleaning the text before tokenising it, but that's not common practice, and I would be making some theoretical lapses by doing so.
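For context, a merge of the kind described above can be sketched roughly as follows. This is an assumption about the general approach, not necessarily the code that produced `merged_ptas`: the folder `ptas_demo`, the helper `read_one`, and the object `merged_demo` are hypothetical names, and two tiny files stand in for the real corpus.

```r
# Hypothetical sketch: build a two-column data frame from a folder of .txt files.
# A temporary folder with two tiny files stands in for the real corpus.
txt_dir <- file.path(tempdir(), "ptas_demo")
dir.create(txt_dir, showWarnings = FALSE)
writeLines(c("First line.", "Second line."), file.path(txt_dir, "a.txt"))
writeLines("Only line.", file.path(txt_dir, "b.txt"))

paths <- list.files(txt_dir, pattern = "\\.txt$", full.names = TRUE)

# Read each file as UTF-8 and collapse its lines into one string per document.
read_one <- function(path) {
  paste(readLines(path, encoding = "UTF-8", warn = FALSE), collapse = " ")
}

merged_demo <- data.frame(
  doc_id = basename(paths),
  text = vapply(paths, read_one, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# One row per file is expected; a mismatch would point to a merge problem.
stopifnot(nrow(merged_demo) == length(paths))

# Save the merged data frame as a single RDS file.
rds_path <- file.path(txt_dir, "merged_demo.rds")
saveRDS(merged_demo, rds_path)
```

Checking that the number of rows equals the number of input files is a cheap first test of whether the merge itself went wrong.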

[Screenshot: a sample of my dataframe]

The output of `dput(head(merged_ptas, 2))` is as follows:

```r
structure(list(doc_id = c("1_Afghanistan India_2003.txt", "10_Albania Kosovo_2003.txt"
), text = c("PREFERENTIAL TRADE AGREEMENT BETWEEN THE REPUBLIC OF INDIA AND THE TRANSITIONAL ISLAMIC STATE OF AFGHANISTAN PREAMBLE The Government of the Republic of India and The Transitional Islamic State of Afghanistan, (hereinafter referred to as the 'Contracting Parties'), CONSIDERING that the expansion of their domestic markets, through economic integration, is a vital prerequisite for accelerating their processes of economic development. BEARING in mind the desire to promote mutually beneficial bilateral trade. CONVINCED of the need to establish and promote free trade for strengthening intraregional economic cooperation and the development of national economies. FURTHER RECOGNISING that progressive reductions and elimination of obstacles to bilateral trade through a bilateral preferential trading arrangement (hereinafter referred to as 'The Agreement') would contribute to the expansion of world trade. HAVE agreed as follows: Article I Objectives 1. The Contracting Parties shall establish a Preferential Trading Arrangement in accordance with the provisions of this Agreement. 2. The objectives of this Agreement are... )), row.names = 1:2, class = "data.frame")
```

Thanks for your help!

anatrik
  • It looks like you have some sort of encoding issue with your text. Can you share a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with something like `dput(head(yourdata, 2))`? It's impossible to tell exactly what's going on from a picture of your data. – MrFlick Aug 08 '22 at 13:21
  • @MrFlick Apologies, I have edited the main post with my dput output. – anatrik Aug 08 '22 at 13:39
  • That's really the output of `dput()`? For some reason it truncated the text data, so it returned an invalid object. I don't think I've ever seen it do that before, and it kind of defeats the purpose. What if you just did `dput(head(merged_ptas$text, 2))`? Does that still use `...`? – MrFlick Aug 08 '22 at 13:45
  • @MrFlick No, it's not the whole output, because it exceeds the character limit here. But it essentially shows the entire text fine; I can't identify anything odd. Should I maybe try saving it in an RData file instead of RDS? – anatrik Aug 08 '22 at 13:55
  • I've seen that issue before, and thought of it as an `RStudio`/`View()` issue for certain types of texts. You might want to inspect your data in other ways to figure out your issues. That being said, loading your data on my end immediately crashed RStudio. – harre Aug 08 '22 at 14:03
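
Following up on the suggestion to inspect the data without `View()`, a minimal base-R sketch might look like the one below. The data frame `docs` is a hypothetical stand-in for `merged_ptas`, and the idea that carriage returns contribute to the overlapping display is only an unconfirmed assumption worth checking.

```r
# Hypothetical stand-in for merged_ptas; the real data would come from readRDS().
docs <- data.frame(
  doc_id = c("a.txt", "b.txt"),
  text = c("PREFERENTIAL TRADE AGREEMENT ...",
           "Short text\r\nwith a carriage return"),
  stringsAsFactors = FALSE
)

# Overall memory footprint of the object.
print(format(object.size(docs), units = "MB"))

# Length of each document in bytes; one extreme value can indicate
# files that were accidentally glued together during the merge.
n_chars <- nchar(docs$text, type = "bytes")
print(summary(n_chars))

# Check that every text is valid UTF-8 (encoding problems often show up here).
is_valid <- validUTF8(docs$text)
print(docs$doc_id[!is_valid])

# Flag documents containing carriage returns (assumed, not confirmed,
# to be one possible cause of odd rendering in a viewer).
has_cr <- grepl("\r", docs$text, fixed = TRUE)
print(docs$doc_id[has_cr])

# Look at the start of each valid text in the console instead of View().
print(substr(docs$text[is_valid], 1, 60))
```

Printing short substrings and summary statistics avoids rendering the full texts, which may matter when the object is several hundred MB.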

0 Answers