2

I'm aware that similar questions have been asked here but I still believe my task is more complex.

I have a text file with information from the project geonames.org named MX.txt where data is arranged like this:

MX  20158   Villas del Cobano   Aguascalientes  AGU Aguascalientes  
001      Aguascalientes 01  21.8495 -102.3052   1
MX  20158   Hacienda el Cobano  Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01 21.8495  -102.3052   1
MX  20159   Alianza Ferrocarrilera  Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01  21.8495 -102.3052   1
MX  20159   Bosques del Prado Oriente   Aguascalientes  AGU Aguascalientes
001 Aguascalientes  01  21.8495 -102.3052   1
MX  20160   Francisco Guel Jimenez  Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01  21.7561 -102.305    1
MX  20160   Las Viñas INFONAVIT Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01  21.7561 -102.305    1
MX  20164   Santa Anita 4a Sección  Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01  21.7561 -102.305    1

the file goes on for several thousand rows.

I'd like to turn this to a data frame with 12 variables where strings like "Villas del Cobano" are only one entry like this:

V1  V2      V3                  V4              V5  V6
MX  20158   Villas del Cobano   Aguascalientes  AGU Aguascalientes  
V7  V8              V9  V10     V11         V12
001 Aguascalientes  01  21.8495 -102.3052   1
V1  V2      V3                  V4              V5  V6
MX  20158   Hacienda el Cobano  Aguascalientes  AGU Aguascalientes
V7  V8              V9  V10     V11         V12 
001 Aguascalientes  01 21.8495  -102.3052   1

I already tried previous answers posted here like: Converting text file into data frame in R , converting multiple lines of text into a data frame

Because English is no my first languaje, I'd like to answer questions if my question is not clear enough through the comment section instead of getting negative flags.

Thanks in advance!

Community
  • 1
  • 1
Alfredo Lozano
  • 290
  • 1
  • 3
  • 15
  • 1
    This may be a silly question but have you tried `read.table()` with `sep = "\t"`? Can you show the code you've tried and what you end up with? – LJW Sep 23 '15 at 22:47
  • Thanks for the quick response! No, actually I hadn't, it seemed to work just as I wanted when I accidentally deleted all the file but the first row – Alfredo Lozano Sep 23 '15 at 22:51
  • Now it says :"Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 67894 did not have 12 elements" – Alfredo Lozano Sep 23 '15 at 22:55
  • Just edited such line, worked perfect would you like to write this as an answer so I can mark it answered? @LJW – Alfredo Lozano Sep 23 '15 at 23:00
  • 1
    Just got back to this now - looks like you've got a few good answers below so I won't repeat. Glad you got it working! – LJW Sep 24 '15 at 03:58

3 Answers3

3

The separators between columns are tabs, then use

data <- read.table(file="MX.txt", sep="\t", quote="", comment.char="")

There is an issue with geonames data. Sometimes they use # inside names of places. By default read.table reads as comment discarding the rest of the line, so you need to set comment.char="".

fnd
  • 335
  • 2
  • 9
1

I produced a long winded solution that will probably get what you want. In short, I use the known distance from the start and end of each nested list to isolate the "multipart name", concatenate it, and enter it as a column among the other data.

The function splitAt is from R split numeric vector at position.

#Support functions
splitAt <- function(x, pos) unname(split(x, cumsum(seq_along(x) %in% pos)))
extractplace <- function(x) {
  len <- length(x)
  place0 <- x[-1*c(1:2,(len-8):len)]
  place <- paste(place0, collapse=" ")
}
extractother <- function(x) {
  len <- length(x)
  other <- x[c(1:2,(len-8):len)]
}

#initital data processing
elems <- scan(file="mx.txt", what="list") #creates a vector of all elements in your txt file
inds <- grep(pattern="MX", elems) #finds indices of "MX", which starts every nested list
lists <- splitAt(elems, inds) #creates a list of nested list

#create the matrix you want
placevector <- sapply(lists, function(x) extractplace(x)) #vector of multipart names
othermatrix <- t(sapply(lists, function(x) extractother(x))) #matrix of remaining data
fullmatrix <- cbind(othermatrix[,1:2],placevector,othermatrix[,3:11]) #inserts multipart names in matrix
colnames(fullmatrix) <- paste("V",1:12, sep="")

fullmatrix 
Community
  • 1
  • 1
CarlAH
  • 217
  • 1
  • 9
  • Sorry for the late feedback, the first answer actually solved my issue but yours is a very intresting/logic solution. Yhank you! – Alfredo Lozano Sep 25 '15 at 17:12
1

This assumes the rest of the data looks like this data. I had to do a lot of cleaning (i.e., gsubing):

The code:

vect <- unlist(Map(function(x, y) paste(x, y), dat[c(T, F)], dat[c(F, T)]), 
    use.names = FALSE)
read.table(text=gsub("\\s{2,}", ", ", gsub("(\\s)(\\d{2,})", "  \\2", 
    gsub("(\\d{2,}|[A-Z]+)\\s+", "\\1  ", vect))), sep=",")

The data for easy read in:

dat <- readLines(n=14)
MX  20158   Villas del Cobano   Aguascalientes  AGU Aguascalientes  
001      Aguascalientes 01  21.8495 -102.3052   1
MX  20158   Hacienda el Cobano  Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01 21.8495  -102.3052   1
MX  20159   Alianza Ferrocarrilera  Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01  21.8495 -102.3052   1
MX  20159   Bosques del Prado Oriente   Aguascalientes  AGU Aguascalientes
001 Aguascalientes  01  21.8495 -102.3052   1
MX  20160   Francisco Guel Jimenez  Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01  21.7561 -102.305    1
MX  20160   Las Viñas INFONAVIT Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01  21.7561 -102.305    1
MX  20164   Santa Anita 4a Sección  Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01  21.7561 -102.305    1
Tyler Rinker
  • 108,132
  • 65
  • 322
  • 519