1

I have a text file. It contains simple text.

FileName - 234567.txt

I have build the corpus of the content present in this file.I cleaned the corpus and created string vector to store complete text in a single String vector as

FullTextstring <- "Complete text here"

I have also saved File name in a vector as

FileName <- "234567.txt"

I have a table "TextData "in database. Following are the table fields :-

FileId int primary key , where FileID is the FileName
FileContent nvarchar(Max), where FileContent is text presnt in file

I want to insert file name and its content in a database. So , my query is how do I insert value present in "FileName" and "FullTextstring" vectors into my table "TextData".

I am using windows 8.1. I have installed package "RODBC" . I have also looked various examples but could not find anything useful.

r4sn4
  • 117
  • 5
  • 14
  • Did you establish an ODBC connection? If so, just follow examples from the RODBC package documentation – David Arenburg Aug 11 '14 at 05:28
  • Yes I established connection. I am trying using RODBC package. But it has sqlSave() function to insert data frame. I want to insert FileName and FileContent vector separately. – r4sn4 Aug 11 '14 at 05:32
  • I created data frame to store FileId and FileContent. TextTable< - data.frame(t(rep(NA,2))) names(TextTable) <- c("FileID", "FileContent") Then I inserted that dataframe using SQLSave() function in R odbcChannel <- odbcConnect("MyDatabase") sqlSave(odbcChannel, TextTable, rownames=FALSE, append=TRUE) This has answered my own question. – r4sn4 Aug 11 '14 at 06:34
  • Yup, as stated below, this is probably a preferred approach! – joemienko Aug 11 '14 at 06:39
  • I forgot to mention one step : I filled data frame with required data i.e. FileId and FileContent using TextTable$FileId[1] <- FileName[1] and TextTable$FileContent[1]<- FullTextstring – r4sn4 Aug 11 '14 at 06:42
  • Glad to see that you went with the dataframe approach, but if you feel my answer adequately answered your original question, please kindly accept it. – joemienko Aug 11 '14 at 06:51

1 Answers1

2

I have also found some of the RODBC documentation unhelpful if you are new to R or databases or both. Hopefully this can get you started.

As the comment above suggests, the first thing you need to do is establish an ODBC connection to your database. Since you did not specify, I'll assume you are using some version of MS-SQL Server. Instructions for making a connection to MS-SQL server are available here.

Once you have established a connection on your machine (let's call it textSQL), you are going to want to assign that connection to an object in R. Here, I call the connection con.

require(RODBC)
con <- odbcConnect("textSQL") 

If you truly want to insert each value into your TextData from vectors (it seems like combining everything to a dataframe and then updating the db via sqlSave() would be be preferred), then you'll just need to run the SQL to do that. Before you can do that, however, you are going to need to convert your FileName vector into something that can be received by your integer database field. Assuming that all your filesnames all have a three character suffix (plus a period) something like the following should probably work.

# as stated in your question
FileName <- "234567.txt"
FullTextstring <- "Complete text here"

# convert FileName to int
require(stringr)
#filename length
l <- nchar(FileName)
#filename suffix 
s <- 4
FileName <- as.numeric(str_sub(FileName,1,l-s))

Now that you have your filename stored as a numeric in R, we can move to populating your table on the database server. For this, we can simply make use of the sqlQuery() function in conjunction with the paste() function as shown below. Note that the first parameter of the function is the connection object (con) specified above.

sqlQuery(con
         ,paste("insert into TextData values ("
                 ,FileName[1]
                 ,",\'"
                 ,FullTextstring[1]
                 ,"\')"))

If you have multiple filenames and strings in a given vector, just update the [1] accordingly.

Also, please be advised that there is an upper limit on field lengths built into RODBC (I think it is at 64k as mentioned here). If you are dealing with a large corpus of text, it's a good bet that you'll run up against this limit. If this happens, you can modify the source code (it's just one line) and recompile from source.

Update per request in comment

In order to make this change, you are going to need to first download the tar ball from CRAN here.

Next, you are going to need to navigate to the C code located in the \RODBC\src directory.

Around line 735 or so, you are going to see a comment and line of code that looks something like this:

    /* sanity check as the reports are sometimes unreliable */
    if (datalen > 65535) datalen = 65535;

You are going to want to change it to something like this (the new value may be bigger or smaller depending on your needs):

    /* sanity check as the reports are sometimes unreliable */
    if (datalen > 10000000) datalen = 10000000;

Save your changes, and then re-install your modified RODBC package using the instructions provided here.

NOTE: This solution worked for me when I was trying to bring information into R from an RODBC connection. I am a novice C programmer at best and do not completely understand all of Ripley's code. This solution will not break RODBC but it may not allow you to pipe extremely large fields from R into your database. The truncation you are experiencing may be a problem with some of your SQL server settings as opposed to the datalen cap in RODBC.

Community
  • 1
  • 1
joemienko
  • 2,220
  • 18
  • 27
  • Yes when I insert text into TextData table in database. Some text is truncated. what to do in such case? I mean do I remove terms with lower frequency or what? – r4sn4 Aug 11 '14 at 06:39
  • 1
    The only workaround I know for this is to modify the source code and then install RODBC from source as shown [here](http://stackoverflow.com/questions/1474081/how-do-i-install-an-r-package-from-source). The relevant line of code (in the RODBC.c file) is commented with "/* sanity check as the reports are sometimes unreliable */". The default value is 65535 you just need to up this value (I've found 10000000 to work), save, reinstall RODBC from source, and you should be good to go. – joemienko Aug 11 '14 at 06:48
  • joemienko .. thanks for replying :). Learnt something new from your comment. I am trying to re-install RODBC from source . If any issue arises will let you know here. – r4sn4 Aug 11 '14 at 07:13
  • Hi joemienko could you give little detail of how to increase limit from 65535 character while inserting records in SQL databse??? – r4sn4 Aug 12 '14 at 05:24
  • 1
    I have updated my response to provide some more details about how to modify the source code. – joemienko Aug 12 '14 at 15:43