[tensorflow] Using Keras datasets offline

0. Introduction

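-. We work online almost all the time these days, but now and then the work has to happen on an intranet or completely offline.

-. My current sorry state: I'm studying Keras on a server that sits on an intranet, so I had to download the datasets myself and load them locally. This is a summary of the methods I found after some digging.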

1. imdb dataset

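-. The Keras text-processing examples use IMDB movie reviews, and this dataset is easy to deal with.

-. If Google Storage cannot be reached, imdb.load_data() fails with an error because it cannot download the file.

-. The fix is simple: download the file and put it in a folder.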

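import os
from tensorflow import keras

imdb = keras.datasets.imdb

# imdb.load_data() with no arguments cannot be used offline.
# Download the file first, then pass its path: http://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz

directory_path = ""  # set this to the folder that holds the downloaded file
file_list = os.listdir(directory_path)
print(file_list)

# The downloaded file is imdb.npz; an absolute path keeps Keras from resolving it against its own cache folder.
file_path = os.path.abspath(os.path.join(directory_path, "imdb.npz"))

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(path=file_path, num_words=10000)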
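As an alternative to passing a path, the downloaded file can be copied into the folder where Keras caches datasets, so that a plain imdb.load_data() finds it there. The sketch below is only an illustration: copy_to_keras_cache is a hypothetical helper name, and it assumes the default cache location ~/.keras/datasets (a KERAS_HOME environment variable would change it) and that the downloaded file is the unmodified imdb.npz, so its hash check passes.

```python
import os
import shutil


def copy_to_keras_cache(src_file, keras_home=None):
    """Copy a downloaded dataset file into <keras_home>/datasets and return the new path."""
    if keras_home is None:
        # Assumption: the default Keras cache location, ~/.keras
        keras_home = os.path.join(os.path.expanduser("~"), ".keras")
    datasets_dir = os.path.join(keras_home, "datasets")
    os.makedirs(datasets_dir, exist_ok=True)
    dest = os.path.join(datasets_dir, os.path.basename(src_file))
    shutil.copy2(src_file, dest)
    return dest


# Hypothetical usage on the offline server:
# copy_to_keras_cache("/some/folder/imdb.npz")
# (train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
```

The keras_home argument exists mainly so the helper can be tried against a scratch folder before touching the real cache.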

2. fashion_mnist dataset

-. This one is a bit more work. You have to fetch the raw data from GitHub and put it in place yourself.

-. Decompress the gz files.

import gzip
import numpy as np

# The original code, which cannot be used offline
#(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# Files downloaded from GitHub
# https://github.com/zalandoresearch/fashion-mnist
rawdata_test_images = gzip.open("./MLdata/t10k-images-idx3-ubyte.gz", 'rb')
rawdata_test_labels = gzip.open("./MLdata/t10k-labels-idx1-ubyte.gz", 'rb')
rawdata_train_images = gzip.open("./MLdata/train-images-idx3-ubyte.gz", 'rb')
rawdata_train_labels = gzip.open("./MLdata/train-labels-idx1-ubyte.gz", 'rb')

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# Lists that will collect the parsed labels and images
test_labels = []
test_images = []
train_labels = []
train_images = []
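The four .gz files are in the IDX format: each starts with a big-endian header (a magic number followed by one 4-byte size per dimension) and then the raw bytes. Before parsing, it can be worth checking that the copied files are intact. The following is a minimal sketch of such a check; read_idx_header is a hypothetical helper name, not part of Keras or the Fashion-MNIST repository.

```python
import gzip
import struct


def read_idx_header(path):
    """Return (magic_number, dimension_sizes) from a gzip-compressed IDX file."""
    with gzip.open(path, "rb") as f:
        # Magic number: two zero bytes, a data-type code, then the number of dimensions.
        magic = struct.unpack(">I", f.read(4))[0]
        ndim = magic & 0xFF
        # One big-endian unsigned 32-bit size per dimension.
        dims = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
    return magic, dims


# Hypothetical usage with the files from this post:
# print(read_idx_header("./MLdata/t10k-labels-idx1-ubyte.gz"))
# print(read_idx_header("./MLdata/t10k-images-idx3-ubyte.gz"))
```

For an intact download, the label file should report magic number 2049 with a single dimension, and the image file 2051 with three dimensions (count, 28, 28). That is why the label header is 8 bytes long and the image header 16.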

-. Read the decompressed data and store it as numpy arrays.

# Skip the 8-byte header of the label file (magic number and item count).
rawdata_test_labels.read(8)

while True:
    buf = rawdata_test_labels.read(1)  # one label per byte
    if len(buf) == 0:
        break
    test_labels.append(np.frombuffer(buf, dtype=np.uint8).astype(np.int64))
print(len(test_labels))

image_size = 28

# Skip the 16-byte header of the image file (magic number, image count, rows, columns).
rawdata_test_images.read(16)

while True:
    buf = rawdata_test_images.read(image_size * image_size)  # one 28x28 image
    if len(buf) == 0:
        break
    data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
    image = data.reshape(image_size, image_size)
    test_images.append(image)
print(len(test_images))

# Same procedure for the training labels.
rawdata_train_labels.read(8)

while True:
    buf = rawdata_train_labels.read(1)
    if len(buf) == 0:
        break
    train_labels.append(np.frombuffer(buf, dtype=np.uint8).astype(np.int64))
print(len(train_labels))

# Same procedure for the training images.
rawdata_train_images.read(16)

while True:
    buf = rawdata_train_images.read(image_size * image_size)
    if len(buf) == 0:
        break
    data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
    image = data.reshape(image_size, image_size)
    train_images.append(image)
print(len(train_images))
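The loops above read one label or one image at a time. The same result can probably be had with a single read per file by letting np.frombuffer skip the header through its offset argument. The sketch below is one way to write that, not the method used in this post: load_idx_images and load_idx_labels are hypothetical helper names, and the float32 / int64 conversions are kept only to match the dtypes used above.

```python
import gzip

import numpy as np


def load_idx_images(path, image_size=28):
    """Read a gzip-compressed IDX image file into an array of shape (N, image_size, image_size)."""
    with gzip.open(path, "rb") as f:
        # offset=16 skips the header: magic number, image count, rows, columns.
        data = np.frombuffer(f.read(), dtype=np.uint8, offset=16)
    return data.reshape(-1, image_size, image_size).astype(np.float32)


def load_idx_labels(path):
    """Read a gzip-compressed IDX label file into a 1-D array of length N."""
    with gzip.open(path, "rb") as f:
        # offset=8 skips the header: magic number, item count.
        data = np.frombuffer(f.read(), dtype=np.uint8, offset=8)
    return data.astype(np.int64)


# Hypothetical usage with the files from this post:
# train_images = load_idx_images("./MLdata/train-images-idx3-ubyte.gz")
# train_labels = load_idx_labels("./MLdata/train-labels-idx1-ubyte.gz")
```

Unlike the lists built above, these helpers return a single numpy array per file, which is the form the model expects.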