I'm trying to do text analysis on a data frame of 792 observations (about 400 MB of data). I merged all my txt files into a single two-column data frame and saved it as an RDS file. However, I've noticed two odd things:
- The text of some of the files appears to overlap when I view it in R, as you can see in the screenshot below. I tried to isolate one such case, and R seems to read the text fine, but I wonder whether I merged the files incorrectly to begin with, because,
- It takes a long time to load, read, or execute any command. Whenever I try to tokenise this data frame, R either crashes or gets stuck.
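For context, a merge of this kind can be sketched in base R as below. This is an illustrative sketch, not the exact script used: the folder, the two demo files, the read_one helper and the output file name are all made up so that the example runs on its own. The checks at the end are the ones that seem most relevant to the overlap question: one row per file, no duplicated doc_id, and a look at the text lengths and the in-memory size.

```r
# Hypothetical folder with two small .txt files, created here so the
# sketch is self-contained; in practice this would be the real folder.
txt_dir <- file.path(tempdir(), "ptas_demo")
dir.create(txt_dir, showWarnings = FALSE)
writeLines(c("PREFERENTIAL TRADE AGREEMENT", "Article I"),
           file.path(txt_dir, "1_A B_2003.txt"))
writeLines("FREE TRADE AGREEMENT", file.path(txt_dir, "10_C D_2003.txt"))

files <- list.files(txt_dir, pattern = "\\.txt$", full.names = TRUE)

# Read one file into a single string, joining its lines with a space.
read_one <- function(path) {
  paste(readLines(path, warn = FALSE, encoding = "UTF-8"), collapse = " ")
}

# One row per file: the file name as doc_id, the full text as text.
merged <- data.frame(
  doc_id = basename(files),
  text = vapply(files, read_one, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# Sanity checks on the merge itself.
stopifnot(nrow(merged) == length(files))   # one row per file
stopifnot(!anyDuplicated(merged$doc_id))   # no file merged twice
print(summary(nchar(merged$text)))         # unusually long texts stand out
print(format(object.size(merged), units = "MB"))

# Save and reload to confirm the RDS round trip preserves the data.
rds_path <- file.path(txt_dir, "merged_demo.rds")
saveRDS(merged, rds_path)
```

If the row count matches the number of files and no text is far longer than the rest, the merge is probably fine and the overlap is more likely a display issue in the viewer.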
Any suggestions on what the main issue might be? I thought about cleaning the text before tokenising it, but that is not common practice, and I worry it would be hard to justify theoretically.
The output when I ran dput(head(merged_ptas, 2)) is as follows:
structure(list(doc_id = c("1_Afghanistan India_2003.txt", "10_Albania Kosovo_2003.txt"
), text = c("PREFERENTIAL TRADE AGREEMENT BETWEEN THE REPUBLIC OF INDIA AND THE TRANSITIONAL ISLAMIC STATE OF AFGHANISTAN PREAMBLE The Government of the Republic of India and The Transitional Islamic State of Afghanistan, (hereinafter referred to as the 'Contracting Parties'), CONSIDERING that the expansion of their domestic markets, through economic integration, is a vital prerequisite for accelerating their processes of economic development. BEARING in mind the desire to promote mutually beneficial bilateral trade. CONVINCED of the need to establish and promote free trade for strengthening intraregional economic cooperation and the development of national economies. FURTHER RECOGNISING that progressive reductions and elimination of obstacles to bilateral trade through a bilateral preferential trading arrangement (hereinafter referred to as 'The Agreement') would contribute to the expansion of world trade. HAVE agreed as follows: Article I Objectives 1. The Contracting Parties shall establish a Preferential Trading Arrangement in accordance with the provisions of this Agreement. 2. The objectives of this Agreement are... )), row.names = 1:2, class = "data.frame")
Thanks for your help!